Search Index for large # of PDF's  SOLVED

4mc
User
Posts: 56
Joined: Tue Apr 27, 2021 12:42 am

Search Index for large # of PDF's

Post by 4mc »

I had planned to put a large (200+) collection of magazines online via a site that specializes in search and builds PDFs from individual magazine pages. Unfortunately, the publisher and original owner has taken legal steps to stop that.

I'm left with a large number of PDFs, most of which have 50 pages. I also have about a dozen books that have 150-400 pages. PDF-XChange is great at searching across these PDFs to find names, terms, etc. However, it's getting slower and slower.

Does anyone know of a tool or app that could improve the performance?

I can put the PDFs on a shared NAS and run a server, but would prefer not to break up the PDFs now. Ideas?

++Mark.
https://ctproduced.com
Daniel - PDF-XChange
Site Admin
Posts: 11266
Joined: Wed Jan 03, 2018 6:52 pm

Re: Search Index for large # of PDF's

Post by Daniel - PDF-XChange »

Hello, 4mc

If you are searching a very large number of files and finding that it is beginning to take a very long time, my first step would be to check the files themselves. Perhaps you can "save as optimized" the files to reduce the excess data, so that the search has less extraneous content to go through and can run faster.

Another possibility is that, even with indexing, this is still a process that is heavy on your storage drives and local processor. It may be time to consider some hardware upgrades to speed up actions like this.

Kind regards,
Dan McIntyre - Support Technician
PDF-XChange Co. LTD

+++++++++++++++++++++++++++++++++++
Our Web site domain and email address has changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com
4mc
User
Posts: 56
Joined: Tue Apr 27, 2021 12:42 am

Re: Search Index for large # of PDF's

Post by 4mc »

Saving as optimized is possible, but would require a duplicate set of PDFs. As per this discussion, the quality is also important.
viewtopic.php?t=40365

My current processor is an 11th Gen Intel 8-Core i7 with 32GB RAM, so that's not the issue.

I was hoping someone had tackled this problem before. There is some discussion about it on the Adobe forums, but the solutions are only relevant to Adobe products. Adobe also allows catalogs and creating a unified index of the catalog, which would be very interesting.

I'd rather find a non-Adobe solution, and was somewhat obscure in my request in order to solicit ideas from other users of PDF-XChange. PDFMiner seems like a starting point. There are lots of suggestions on Stack Overflow: https://stackoverflow.com/questions/5725278/how-do-i-use-pdfminer-as-a-library/8325135#8325135

I was hoping for something more ready to go. I've tried numerous online solutions, but they are no use as I can't put the PDFs online (see https://worldradiohistory.com/Archive-All-Music/Down-Beat.htm).
4mc
User
Posts: 56
Joined: Tue Apr 27, 2021 12:42 am

Re: Search Index for large # of PDF's

Post by 4mc »

I would say that, running Windows 11, the file manager search seems noticeably faster than PDF-XChange. It returns lists of magazines, but with no context or adjacent text. The only option is to select all, open them, and then find.

Running a PDF-XChange search against \Book scans\ takes 7 minutes and 6 seconds to find 195 documents and 871 results. A Windows File Manager search over exactly the same folder finishes in less than 5 seconds.
Daniel - PDF-XChange
Site Admin
Posts: 11266
Joined: Wed Jan 03, 2018 6:52 pm

Re: Search Index for large # of PDF's  SOLVED

Post by Daniel - PDF-XChange »

Hello, 4mc

Thank you for clarifying. At the moment, given that we do not have indexing and that we search through a great deal more data than the Windows search is capable of, it is not unexpected that around 200 documents take a notable time to search through. Reducing the search breadth (such as disabling search of bookmarks) may give a notable improvement in search speed, at the cost of not including those items in the results.

Beyond that, I should note that we are beginning work on indexing functions. While I cannot offer a timeline, it is looking like something that will eventually be coming down the pipeline.

Kind regards,
Dan McIntyre - Support Technician
PDF-XChange Co. LTD

4mc
User
Posts: 56
Joined: Tue Apr 27, 2021 12:42 am

Re: Search Index for large # of PDF's

Post by 4mc »

Thanks Daniel.

Since in most cases the books and magazines are scanned and then searched, they typically only have the OCR text and the image data, at least from my perspective. I don't add bookmarks or anything else. I don't even add to the properties.

While I'd be interested in an all-encompassing search index, my primary need is an index of the OCR data created by PDF-XChange. The other fields and information would be secondary or tertiary.

For now, Windows File Explorer does a good job on the OCR text. That said, it comes with a cost when NOT wanting the search to include .pdf OCR data. This is primarily why I was hoping for another solution, even if it wasn't part of PDF-XChange, let alone part of the core.
Daniel - PDF-XChange
Site Admin
Posts: 11266
Joined: Wed Jan 03, 2018 6:52 pm

Re: Search Index for large # of PDF's

Post by Daniel - PDF-XChange »

Hello, 4mc

If you want to prevent the page content from being searched (effectively only searching for titles), one option would be to disable the internal search terms and enable the option to only search document info (which should include the file name and title).

Another would be to use our shell utility to disable our iFilter extension. This article details how to re-enable all extensions, but if you use the GUI option, you can disable just a single one of them: https://www.pdf-xchange.com/knowle ... extensions

Kind regards,
Dan McIntyre - Support Technician
PDF-XChange Co. LTD

4mc
User
Posts: 56
Joined: Tue Apr 27, 2021 12:42 am

Re: Search Index for large # of PDF's

Post by 4mc »

Just a quick update on this. I now have a collection of some 4,500 music magazines, all with OCR text, many but not all of which I scanned locally. They amount to some 60GB of data, now mostly static.

Searching for a simple, uncommon word or name (for example Tamiko - Deodato) on a CORE 4 Intel i7 with 32GB takes more than an hour to return 753 documents and 2,565 entries.

I can split the search up because the magazines are sorted by year: each magazine has its own folder, with subfolders for decade and then year. I often have an approximate date to work from, 1972-1975 for example, so I search either the entire 1970s or the four year folders individually.
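For anyone wanting to script that narrowing step, here is a minimal sketch. It assumes year folders are named with exactly four digits (such as 1972) somewhere below a library root; the function name `year_folders` is hypothetical and not part of PDF-XChange or any other product mentioned here.

```python
import re
from pathlib import Path

def year_folders(root, first_year, last_year):
    """Return folders under `root` named with a four-digit year in the given range."""
    matches = []
    for path in Path(root).rglob("*"):
        # Assumption: a year folder is named with exactly four digits, e.g. "1972".
        if path.is_dir() and re.fullmatch(r"[0-9]{4}", path.name):
            if first_year <= int(path.name) <= last_year:
                matches.append(path)
    return sorted(matches)
```

Calling `year_folders(library_root, 1972, 1975)` would return only the matching year folders, which can then be searched one at a time instead of searching a whole decade.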

I know some websites have a search ability that includes PDFs, so I assume there must be some structured way to extract the OCR text, build a search index from it, and then search that index to find PDFs.
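As a rough illustration of that idea, the sketch below builds a simple inverted index (word to file and page) and queries it. It assumes the OCR text has already been extracted into (file name, page number, text) tuples, for example with a library such as PDFMiner; the extraction step is not shown. All function names and the sample data are hypothetical, and this is not how any particular product builds its index.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map each word to the set of (pdf_name, page_number) pairs containing it.

    `pages` is an iterable of (pdf_name, page_number, text) tuples; how the
    text is pulled out of each PDF is outside this sketch.
    """
    index = defaultdict(set)
    for pdf_name, page_number, text in pages:
        for word in tokenize(text):
            index[word].add((pdf_name, page_number))
    return index

def search(index, query):
    """Return the sorted (pdf_name, page_number) pairs containing every query word."""
    words = tokenize(query)
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)

# Hypothetical sample data standing in for extracted OCR text.
SAMPLE_PAGES = [
    ("magazine_a.pdf", 12, "Deodato tops the jazz chart this week"),
    ("magazine_b.pdf", 3, "New single from Deodato on CTI"),
    ("magazine_c.pdf", 8, "Creed Taylor announces a new label"),
]

index = build_index(SAMPLE_PAGES)
print(search(index, "Deodato"))
```

The expensive part (reading every PDF) happens once, when the index is built. Each later search is a dictionary lookup plus a set intersection, which is why an indexed search can be so much faster than rescanning the files.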

It would be great if PDF-XChange had an add-on, but I'd be happy to find a third-party product that could do it. Anyone?

I'm going to have a play with DocFetcher which seems to have the right sort of features.
https://docfetcher.sourceforge.io/en/index.html

FileLocator Pro may well work too.
https://www.mythicsoft.com/filelocatorpro/
4mc
User
Posts: 56
Joined: Tue Apr 27, 2021 12:42 am

Re: Search Index for large # of PDF's

Post by 4mc »

>>also FileLocatorPro may well work.
>>https://www.mythicsoft.com/filelocatorpro/

Well, solution found. I didn't try DocFetcher, purely because it requires Java to be installed; I don't have it and didn't want to install it for a trial.

I installed the trial version of FileLocator Pro and repeated the same search, using the same target library of PDFs.
The result was the same, but in 53 minutes.

I used FileLocator Pro (trial) to build an index of the whole PDF library, which took more than an hour.

Doing the same search for Deodato against the index took less than 1 second and produced the same result set.

Doing a more complex query:

Search: "Creed Taylor" NEAR "Clarence Avant" (finds only where Creed Taylor and Clarence Avant are on the same page)
Index: June 23 2024 Index


Search Statistics

Found: 3 items (41.95 MB)
Status: Completed (< 1 sec)
Completed: 6/24/2024 10:05:39 AM


Name | Location | Modified | Size | Type | Hits

Billboard 1967-06-10.pdf | E:\Magazines ! Serials\Billboard Magazine\60s\1967\ | 8/31/2023 11:39:47 AM | 23,071 KB | PDF Document | 4
CB-1965-08-07.pdf | E:\Magazines ! Serials\CashBox Magazine\60s\1965\ | 9/9/2023 6:29:40 PM | 13,819 KB | PDF Document | 7
RW-1967-06-10.pdf | E:\Magazines ! Serials\Record World Magazine\1967\ | 5/6/2024 10:38:01 PM | 6,068 KB | PDF Document | 2

Outstanding! This will be a real game changer for my work. A license for FileLocator Pro is $69
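
The same-page behaviour described for the NEAR query above can be illustrated with a small sketch. This is not how FileLocator Pro implements NEAR (its internals are not described in this thread). It only assumes per-page text is already available as (file name, page number, text) tuples, and the function names and sample data are hypothetical.

```python
def normalize(text):
    """Lower-case the text and collapse runs of whitespace to single spaces."""
    return " ".join(text.lower().split())

def near_same_page(pages, phrase_a, phrase_b):
    """Return {pdf_name: [page numbers]} for pages containing both phrases.

    Assumption: "near" here simply means "on the same page", matching the
    description in the post, not a word-distance window.
    """
    a, b = normalize(phrase_a), normalize(phrase_b)
    hits = {}
    for pdf_name, page_number, text in pages:
        page_text = normalize(text)
        if a in page_text and b in page_text:
            hits.setdefault(pdf_name, []).append(page_number)
    return hits

# Hypothetical sample pages standing in for extracted OCR text.
NEAR_SAMPLE = [
    ("magazine_a.pdf", 4, "Creed Taylor met with Clarence Avant in New York"),
    ("magazine_a.pdf", 9, "Creed Taylor produced the session"),
    ("magazine_b.pdf", 2, "Clarence  Avant announced a new venture"),
]

print(near_same_page(NEAR_SAMPLE, "Creed Taylor", "Clarence Avant"))
```

Only the page that mentions both names is reported; pages that mention just one of them are skipped.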
Daniel - PDF-XChange
Site Admin
Posts: 11266
Joined: Wed Jan 03, 2018 6:52 pm

Re: Search Index for large # of PDF's

Post by Daniel - PDF-XChange »

Hello, 4mc

I am glad to hear you have found a solution. I do hope that one day we can improve on this front, but it will certainly not be something we can do overnight. Search speed improvements will be gradual, and in time I am sure the Dev team will come up with a solution to the indexing item we see frequent requests for.

Kind regards,
Dan McIntyre - Support Technician
PDF-XChange Co. LTD

4mc
User
Posts: 56
Joined: Tue Apr 27, 2021 12:42 am

Re: Search Index for large # of PDF's

Post by 4mc »

I don't expect miracles ;-)

I'm just in the final steps of rolling out a 2TB SSD with over 7,000 PDFs: about 70 books, and the rest magazines and serials. A typical search for names, including Boolean searches with NEAR / NOT etc., still completes in less than 5 seconds using a pre-built FileLocator Pro index.

I am including a brief document on the SSD archive in which I recommend that my four colleagues get a license for both PDF-XChange Editor Plus and FileLocator Pro. I will be shipping a pre-built index on the SSD. Building the index takes about 3 hours but is easily done last thing in the day. It has revolutionized my ability to find information without being at the whim of giant websites.

FileLocator Pro is a product of Mythicsoft Ltd., based in Cambridge, UK.

I've no idea what either your business model or theirs is, but I would have been very enthusiastic to purchase a plug-in for PDF-XChange that contained a subset of their product and index manager and worked with just PDFs.

Instead of building it from scratch, maybe it would be worth exploring with them.

++Mark.
https://ctproduced.com
Daniel - PDF-XChange
Site Admin
Posts: 11266
Joined: Wed Jan 03, 2018 6:52 pm

Re: Search Index for large # of PDF's

Post by Daniel - PDF-XChange »

Hello, 4mc

That would be a fantastic solution if it is possible. As usual, this information has been passed up to the Dev team for review. I expect that when he has time, our Dev team lead will explore whether that suggestion would be viable for us to pursue.

Kind regards,
Dan McIntyre - Support Technician
PDF-XChange Co. LTD

4mc
User
Posts: 56
Joined: Tue Apr 27, 2021 12:42 am

Re: Search Index for large # of PDF's

Post by 4mc »

Thank you. The FileLocator Pro index has become so good and so useful that I've added dozens of full books and a few thousand more magazines. Searches still take less than 1 second. It has changed everything I do, seriously. I have a music library of some 30,000 MP3 and FLAC files in a huge structured set of folders. I've even created a full index of the music files; it parses the metadata and lets me search for tracks much quicker than I can navigate to the folder, even when I know where it is.

Here is my magazine and book index:
Index size: 1445.46 MB | Indexed count: 10,112 | Name only: 0 | Total: 10,112

a search: "Jimmy Smith" NEAR March 1973
results:
Found: 15 items (388.51 MB)
Status: Completed (< 1 sec)
Completed: 9/12/2024 1:15:15 PM
Paul - PDF-XChange
Site Admin
Posts: 7370
Joined: Wed Mar 25, 2009 10:37 pm

Re: Search Index for large # of PDF's

Post by Paul - PDF-XChange »

Hi, 4mc

This sounds great. I would advise you to pop back here from time to time over the next months to check on the progress.

The reality is that the devs all have a lot on their plates and priorities sometimes shift, and I do not expect this to be the highest.

Let's see how things progress over the next few months, and give us a bump if there is no movement.
Best regards

Paul O'Rorke
PDF-XChange Support
http://www.pdf-xchange.com