Search Index for large # of PDF's SOLVED
Moderators: PDF-XChange Support, Daniel - PDF-XChange, Chris - PDF-XChange, Sean - PDF-XChange, Vasyl - PDF-XChange, Stefan - PDF-XChange
-
- User
- Posts: 56
- Joined: Tue Apr 27, 2021 12:42 am
Search Index for large # of PDF's
I had planned to put a large (200+) collection of magazines online via a site that specializes in search and builds PDFs from individual magazine pages. Unfortunately, the publisher and original owner has taken legal steps to stop that.
I'm left with a large number of PDFs, most of which have 50 pages; I also have about a dozen books of 150-400 pages. PDF-XChange is great at searching across these PDFs to find names, terms, etc. However, it's getting slower and slower.
Does anyone know of a tool or app that could improve the performance?
I can put the PDFs on a shared NAS and run a server, but would prefer not to break up the PDFs now. Ideas?
++Mark.
https://ctproduced.com
-
- Site Admin
- Posts: 11266
- Joined: Wed Jan 03, 2018 6:52 pm
Re: Search Index for large # of PDF's
Hello, 4mc
If you are searching a very large quantity of files and finding that it is beginning to take a very long time, my first step would be to check the files themselves. Perhaps you can "save as optimized" the files to reduce the excess data, so the search has less extraneous content to go through and runs faster.
Another possibility is that, even with indexing, this is still a process which is heavy on your storage drives and local processor. It may be time to consider some hardware upgrades to increase the speed of actions like this.
Kind regards,
Dan McIntyre - Support Technician
PDF-XChange Co. LTD
+++++++++++++++++++++++++++++++++++
Our Web site domain and email address has changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com
-
- User
- Posts: 56
- Joined: Tue Apr 27, 2021 12:42 am
Re: Search Index for large # of PDF's
Saving as optimized is possible, but would require a duplicate set of PDF's. As per this discussion, the quality is also important.
viewtopic.php?t=40365
My current processor is an 11th Gen Intel 8-core i7 with 32GB RAM, so that's not the issue.
I was hoping someone had tackled this problem before. There is some discussion about it on the Adobe forums but the solutions are only relevant to Adobe products. Adobe also allows catalogs and creating a unified index of the catalog which would be very interesting.
I'd rather find a non-Adobe solution and was somewhat obscure in my request to solicit ideas from other users of PDF-XChange. PDFMiner seems like a starting point. There are lots of suggestions on Stack Overflow: https://stackoverflow.com/questions/5725278/how-do-i-use-pdfminer-as-a-library/8325135#8325135
I was hoping for something more ready to go. I've tried numerous online solutions, but they are no use as I can't put the PDFs online (see https://worldradiohistory.com/Archive-All-Music/Down-Beat.htm).
-
- User
- Posts: 56
- Joined: Tue Apr 27, 2021 12:42 am
Re: Search Index for large # of PDF's
I would say that on Windows 11 the file manager search works noticeably (it seems) faster than PDF-XChange, but it returns lists of magazines with no context or adjacent text. The only option is to select all, open them, and then use Find.
Running a PDF-XChange search against \Book scans\ takes 7 minutes and 6 seconds to find 195 documents and 871 results. A Windows File Manager search over exactly the same folder finishes in less than 5 seconds.
-
- Site Admin
- Posts: 11266
- Joined: Wed Jan 03, 2018 6:52 pm
Re: Search Index for large # of PDF's SOLVED
Hello, 4mc
Thank you for clarifying. At the moment, given that we do not have indexing and that we search through a great deal more data than the Windows search is capable of, it is not unexpected that around 200 documents take a notable time to search through. Reducing the search breadth (such as disabling search of bookmarks) may give a notable improvement in search speed, at the cost of not including those items in the results.
Beyond that, I should note that we are beginning work on indexing functions. While I cannot offer a timeline, it is something I can say will eventually be coming down the pipeline.
Kind regards,
Dan McIntyre - Support Technician
PDF-XChange Co. LTD
+++++++++++++++++++++++++++++++++++
Our Web site domain and email address has changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com
-
- User
- Posts: 56
- Joined: Tue Apr 27, 2021 12:42 am
Re: Search Index for large # of PDF's
Thanks Daniel.
Since in most cases the books and magazines are scanned and then searched, they typically only have the OCR text and the image data. At least from my perspective. I don't add bookmarks or anything else. I don't even add to the properties.
While I'd be interested in an all-encompassing search index, my primary need is an index of the OCR data created by PDF-XChange; the other fields and information would be secondary or tertiary.
For now the Windows File Explorer does a good job on the OCR text. That said, it comes with a cost when NOT wanting the search to include .pdf OCR data. This would be primarily why I was hoping for another solution even if it wasn't part of PDF-Xchange, let alone part of the core.
-
- Site Admin
- Posts: 11266
- Joined: Wed Jan 03, 2018 6:52 pm
Re: Search Index for large # of PDF's
Hello, 4mc
If you want to prevent the page content from being searched (effectively only searching for titles), one option would be to disable the internal search terms and enable the option to only search document info (which should include the file name and title). Another would be to use our shell utility to disable our IFilter extension. This article details how to re-enable all extensions, but if you use the GUI option, you can disable just a single one of them: https://www.pdf-xchange.com/knowle ... extensions
Kind regards,
Dan McIntyre - Support Technician
PDF-XChange Co. LTD
+++++++++++++++++++++++++++++++++++
Our Web site domain and email address has changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com
-
- User
- Posts: 56
- Joined: Tue Apr 27, 2021 12:42 am
Re: Search Index for large # of PDF's
Just a quick update on this. I now have a collection of some 4,500 music magazines, all with OCR text, many but not all I have scanned in locally. They equate to some 60GB of data, now mostly static.
Searching for a simple, uncommon word or name, for example Tamiko - Deodato, on a CORE 4 Intel i7 with 32GB takes more than a whopping hour to return 753 documents and 2,565 entries.
I can split the search up because the magazines are sorted by year, each magazine in a folder with subfolders of decade and then year. I often have an approximate date to work from, 1972-1975 for example, so I either search the entire 1970s or the four year folders individually.
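As a rough illustration of narrowing a search to a date range, here is a minimal Python sketch that picks out the year folders to search. It assumes only that year folders are named with four digits (for example 1972), at any depth under the library root; the function name `year_folders` is made up for this example and is not part of any product mentioned in this thread.

```python
from pathlib import Path


def year_folders(root, first_year, last_year):
    """Return folders under root whose name is a 4-digit year in the range."""
    matches = []
    for path in Path(root).rglob("*"):
        name = path.name
        # Assumption: a year folder is any directory named with exactly 4 ASCII digits.
        if path.is_dir() and len(name) == 4 and name.isascii() and name.isdigit():
            if first_year <= int(name) <= last_year:
                matches.append(path)
    return sorted(matches)
```

Calling `year_folders(library_root, 1972, 1975)` would return the matching folders regardless of whether a given magazine also has a decade subfolder, and the resulting list could then be handed to whichever search tool is in use.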
I know some websites have a search ability that includes PDFs. I assume there must be some structured way to extract the OCR text, build a search index from it, and then search that index to find PDFs.
It would be great if PDF-XChange had an add-on, but I'd be happy to find a third-party product that could do it. Anyone?
I'm going to have a play with DocFetcher which seems to have the right sort of features.
https://docfetcher.sourceforge.io/en/index.html
FileLocator Pro may also work.
https://www.mythicsoft.com/filelocatorpro/
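For readers wondering what "extract the OCR text, build an index, then search the index" looks like in practice, here is a minimal Python sketch of an inverted index. It is only an illustration of the general idea, not how DocFetcher, FileLocator Pro, or PDF-XChange work internally. It assumes the page text has already been extracted by some other tool (PDFMiner, mentioned earlier in the thread, is one possible starting point), and the file names and sample text are invented.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of (file, page number) pairs containing it.

    `pages` is an iterable of (file_name, page_number, text) tuples, i.e. the
    OCR text already extracted from each PDF page by some other tool.
    """
    index = defaultdict(set)
    for file_name, page_number, text in pages:
        for word in tokenize(text):
            index[word].add((file_name, page_number))
    return index


def search(index, query):
    """Return the sorted (file, page) pairs that contain every query word."""
    words = tokenize(query)
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)


# Hypothetical sample data standing in for extracted OCR text.
sample_pages = [
    ("mag-a.pdf", 1, "Deodato records a new album"),
    ("mag-a.pdf", 2, "Chart listings for the week"),
    ("mag-b.pdf", 7, "Review: Deodato live in concert"),
]
index = build_index(sample_pages)
print(search(index, "deodato"))
```

The slow part (reading every page) happens once, when the index is built. Each later search is a dictionary lookup plus a set intersection, which is why an indexed search can return in well under a second where a full scan of the PDFs takes many minutes.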
-
- User
- Posts: 56
- Joined: Tue Apr 27, 2021 12:42 am
Re: Search Index for large # of PDF's
>>also FileLocatorPro may well work.
>>https://www.mythicsoft.com/filelocatorpro/
Well, solution found. I didn't try DocFetcher purely because it requires Java to be installed; I don't have it and didn't want to install it for a trial.
I installed the trial version of FileLocator Pro and repeated the same search, using the same target library of PDFs.
The result was the same, but in 53 minutes.
I used the FileLocator Pro trial to build an index of the whole PDF library, which took more than an hour.
Doing the same search for Deodato against the index took less than 1 second and produced the same result set.
Doing a more complex query:
Search: "Creed Taylor" NEAR "Clarence Avant" (finds only where Creed Taylor and Clarence Avant are on the same page)
Index: June 23 2024 Index
Search Statistics
Found: 3 items (41.95 MB)
Status: Completed (< 1 sec)
Completed: 6/24/2024 10:05:39 AM
Name | Location | Modified | Size | Type | Hits
Billboard 1967-06-10.pdf | E:\Magazines ! Serials\Billboard Magazine\60s\1967\ | 8/31/2023 11:39:47 AM | 23,071 KB | PDF Document | 4
CB-1965-08-07.pdf | E:\Magazines ! Serials\CashBox Magazine\60s\1965\ | 9/9/2023 6:29:40 PM | 13,819 KB | PDF Document | 7
RW-1967-06-10.pdf | E:\Magazines ! Serials\Record World Magazine\1967\ | 5/6/2024 10:38:01 PM | 6,068 KB | PDF Document | 2
Outstanding! This will be a real game changer for my work. A license for FileLocator Pro is $69
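To make the "both names on the same page" behaviour described above concrete, here is a small Python sketch of a page-level co-occurrence check. It only mirrors the semantics described in this post; it is not FileLocator Pro's NEAR implementation, which is not documented in this thread. The function names and the sample pages are hypothetical.

```python
import re


def normalize(text):
    """Lower-case the text and collapse punctuation/whitespace to single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def contains_phrase(page_text, phrase):
    """True if the words of `phrase` appear consecutively in `page_text`."""
    return f" {normalize(phrase)} " in f" {normalize(page_text)} "


def near_same_page(pages, phrase_a, phrase_b):
    """Return (file, page) pairs where both phrases occur on the same page."""
    return [
        (file_name, page_number)
        for file_name, page_number, text in pages
        if contains_phrase(text, phrase_a) and contains_phrase(text, phrase_b)
    ]


# Hypothetical pages: only page 3 of x.pdf mentions both names.
pages = [
    ("x.pdf", 3, "Alice Smith shared the bill with Bob Jones."),
    ("x.pdf", 4, "Alice Smith returned for an encore."),
    ("y.pdf", 1, "Bob Jones, interviewed backstage."),
]
print(near_same_page(pages, "Alice Smith", "Bob Jones"))
```

This version rescans the page text on every query for clarity. A real tool would answer the same question from a pre-built index, which is what makes the sub-second timings above possible.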
-
- Site Admin
- Posts: 11266
- Joined: Wed Jan 03, 2018 6:52 pm
Re: Search Index for large # of PDF's
Hello, 4mc
I am glad to hear you have found a solution. I do hope that one day we can improve on this front, but it will certainly not be something we can do overnight. Search speed improvements will be gradual, and in time I am sure the Dev team will come up with a solution to the indexing item we see frequent requests for.
Kind regards,
Dan McIntyre - Support Technician
PDF-XChange Co. LTD
+++++++++++++++++++++++++++++++++++
Our Web site domain and email address has changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com
-
- User
- Posts: 56
- Joined: Tue Apr 27, 2021 12:42 am
Re: Search Index for large # of PDF's
I don't expect miracles 
I'm just in the final steps of rolling out a 2TB SSD with over 7,000 PDFs: about 70 books, and the rest are magazines and serials. A typical search for names, including Boolean searches with NEAR / NOT etc., still completes in less than 5 seconds using a pre-built FileLocator Pro index.
I am including a brief document on the SSD archive in which I recommend that my 4 colleagues get a license for both PDF-XChange Editor Plus and FileLocator Pro. I will be shipping a pre-built index on the SSD. Building the index takes about 3 hours but is easily done last thing in the day. It has revolutionized my ability to find information without being at the whim of giant websites.
FileLocator Pro is a product of Mythicsoft Ltd., based in Cambridge, UK.
I've no idea what either your business model or theirs is, but I would have been very enthusiastic to purchase a plug-in for PDF-XChange that contained a subset of their product and index manager and worked with just PDFs.
Instead of building it from scratch, maybe it would be worth exploring with them.
++Mark.
https://ctproduced.com

-
- Site Admin
- Posts: 11266
- Joined: Wed Jan 03, 2018 6:52 pm
Re: Search Index for large # of PDF's
Hello, 4mc
That would be a fantastic solution if it is possible. As usual, this information has been passed up to the Dev team for review. I expect that when he has time, our Dev team lead will explore whether that suggestion would be viable for us to pursue.
Kind regards,
Dan McIntyre - Support Technician
PDF-XChange Co. LTD
+++++++++++++++++++++++++++++++++++
Our Web site domain and email address has changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com
-
- User
- Posts: 56
- Joined: Tue Apr 27, 2021 12:42 am
Re: Search Index for large # of PDF's
Thank you. The FileLocator Pro index has become so good and so useful that I've added dozens of full books and a few thousand more magazines. Searches still take less than 1 second. It has changed everything I do, seriously. I have a music library of some 30,000 MP3 and FLAC files in a huge structured set of folders. I've even created a full index of the music files; it parses the metadata and lets me search for tracks much quicker than I can navigate to the folder, even when I know where it is.
Here is my magazine and book index:
Index size: 1445.46 MB | Indexed count: 10,112 | Name only: 0 | Total: 10,112
a search: "Jimmy Smith" NEAR March 1973
results:
Found: 15 items (388.51 MB)
Status: Completed (< 1 sec)
Completed: 9/12/2024 1:15:15 PM
-
- Site Admin
- Posts: 7370
- Joined: Wed Mar 25, 2009 10:37 pm
Re: Search Index for large # of PDF's
Hi, 4mc
This sounds great. I would advise you to pop back here from time to time over the next few months to check on the progress.
The reality is that the devs all have a lot on their plates and priorities sometimes shift, and I do not expect this to be the highest.
Let's see how things progress over the next few months, and give us a bump if there is no movement.
Best regards
Paul O'Rorke
PDF-XChange Support
http://www.pdf-xchange.com