[suggestion] Search for documents that contain pages without text  SOLVED

Forum for the PDF-XChange Editor - Free and Licensed Versions


Jensen Head
User
Posts: 582
Joined: Mon Sep 13, 2021 8:12 am

[suggestion] Search for documents that contain pages without text

Post by Jensen Head »

In the forum section "PDF-Tools" there is a similar topic "How to find PDF-documents without "live" text?", for which I found a third-party solution. However, as far as I understand, currently in the advanced document search there is no way to get a list of documents, each of which has at least one page without text objects. In the case of scanned documents with subsequent OCR, I cannot know for sure whether the page without text is a blank page (in most cases, requiring removal), or a page with an illustration without text, or a page with text but of such poor quality that even erroneous text objects were not created after OCR. Therefore, I have to check all the documents manually. However, there are relatively few documents with pages without text, no more than 10% of the total, and manual checking would take less time if the advanced search had an option to search for pages without text (currently there is none).
Daniel - PDF-XChange
Site Admin
Posts: 11029
Joined: Wed Jan 03, 2018 6:52 pm

Re: [suggestion] Search for documents that contain pages without text

Post by Daniel - PDF-XChange »

Hello, Jensen Head

What is the core purpose of this search action? If it is to remove those pages, then this tool should serve that purpose as is:
[Attached screenshots: image.png, image(1).png]
Kind regards,
Dan McIntyre - Support Technician
PDF-XChange Co. LTD

+++++++++++++++++++++++++++++++++++
Our Web site domain and email address has changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com
Jensen Head
User
Posts: 582
Joined: Mon Sep 13, 2021 8:12 am

Re: [suggestion] Search for documents that contain pages without text  SOLVED

Post by Jensen Head »

I don't know how to rephrase it more clearly.
Daniel - PDF-XChange
Site Admin
Posts: 11029
Joined: Wed Jan 03, 2018 6:52 pm

Re: [suggestion] Search for documents that contain pages without text

Post by Daniel - PDF-XChange »

Hello, Jensen Head

Did you try the option I mentioned above? After you follow the steps I showed there, it opens a dialog where you can review the detected pages, delete them, or select them in the thumbnails pane.
[Attached screenshot: image.png]
This seems like it would fit your needs. If you set the "tolerance" level to the maximum (move the slider right), it can even consider pages full of content to be valid targets for deletion. Playing with that slider should let you find the sweet spot you need to narrow down the process, I think.

A quick update: I passed this along to the Dev team, and we will be looking into adding this functionality to PDF-Tools in a future release. I cannot offer a timeline, but I do have a ticket number:

RT#7515: FR: PDF-Tools "delete pages" update

Kind regards,
Dan McIntyre - Support Technician
PDF-XChange Co. LTD

David.P
User
Posts: 1638
Joined: Thu Feb 28, 2008 8:16 pm

Re: [suggestion] Search for documents that contain pages without text

Post by David.P »

The advanced "Select Pages" dialog or the context menu in the thumbnails pane lets you select all pages that contain text. By inverting the selection, you are left with the pages that do not contain any text.

However, as I gather from your original post, you are trying to do that not on single documents, but on a large number of documents.
While this could be done, I believe it would require some kind of scripting, for example with AutoHotkey.

This could involve opening every document, selecting all pages that contain text, deleting them, and then saving and closing the document. Since deleting all pages of a document doesn't work (in a case where all pages do contain text), the script would fail on documents that do not contain pages without text. I.e., these documents would simply be closed without having been saved beforehand.

You would have to do this on copies of all the documents.

That way, you could identify the documents that contain at least one page without text: they are the copies that were actually modified and saved.

However, it's not going to be a very simple scripting task.
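
As an alternative to driving the Editor's GUI with AutoHotkey, the same list could be sketched by reading each file's text layer directly. This is only a minimal sketch and not a PDF-XChange feature: it assumes the third-party pypdf library is installed, and the function names (`textless_pages`, `extract_page_texts`, `scan_folder`) are hypothetical. It only reads the files, so working on copies would not be needed. It reports pages with no extractable text, but it cannot tell a blank page from an illustration or a badly scanned page.

```python
from pathlib import Path


def textless_pages(page_texts):
    """Return the 1-based numbers of pages whose text is empty or whitespace."""
    return [number for number, text in enumerate(page_texts, start=1)
            if not (text or "").strip()]


def extract_page_texts(pdf_path):
    """Extract the text layer of every page (assumes pypdf is installed)."""
    from pypdf import PdfReader  # third-party library, imported lazily

    reader = PdfReader(str(pdf_path))
    return [page.extract_text() for page in reader.pages]


def scan_folder(folder, extract=extract_page_texts):
    """Map each PDF under `folder` to its list of pages without text.

    Documents in which every page has text are left out of the result.
    `extract` can be swapped for another text extractor (an assumption
    made here so the listing logic can be tried without pypdf).
    """
    report = {}
    for pdf_path in sorted(Path(folder).rglob("*.pdf")):
        pages = textless_pages(extract(pdf_path))
        if pages:
            report[str(pdf_path)] = pages
    return report
```

Calling `scan_folder` on a folder of OCRed scans would return only the documents that still need a manual look, together with the page numbers to check.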
David.P
PDF-XChange Pro
Daniel - PDF-XChange
Site Admin
Posts: 11029
Joined: Wed Jan 03, 2018 6:52 pm

Re: [suggestion] Search for documents that contain pages without text

Post by Daniel - PDF-XChange »

Hello, David.P

The way I read it, one of the key points from the initial post is this need:
Jensen Head wrote (Mon Jun 02, 2025 11:44 am):
"In the case of scanned documents with subsequent OCR, I cannot know for sure whether the page without text is a blank page (in most cases, requiring removal), or a page with an illustration without text, or a page with text but of such poor quality that even erroneous text objects were not created after OCR."
With the requirement of being able to catch pages which may have content present on them, but are "mostly" blank in appearance, the only tool which can accomplish this is the aforementioned Delete pages tool. Select pages is not able to differentiate content by "white coverage" in this way.
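
The idea of "white coverage" with a tolerance can be illustrated roughly as follows. This is only a sketch of the general technique, not the Editor's actual algorithm, which this thread does not describe. It assumes a page has already been rendered to a grid of 8-bit grayscale values; the names `white_coverage` and `looks_blank` and the threshold values are hypothetical.

```python
def white_coverage(pixels, white_level=245):
    """Fraction of pixels at or above `white_level` (0 = black, 255 = white)."""
    total = 0
    white = 0
    for row in pixels:
        for value in row:
            total += 1
            if value >= white_level:
                white += 1
    # An empty grid is treated as fully white.
    return white / total if total else 1.0


def looks_blank(pixels, tolerance=0.01, white_level=245):
    """Treat a page as blank if at most `tolerance` of its pixels are non-white.

    Raising `tolerance` toward 1.0 makes even content-heavy pages count as blank.
    """
    return 1.0 - white_coverage(pixels, white_level) <= tolerance


# A 10x10 "page" that is white except for three dark specks (scanner noise).
noisy_page = [[255] * 10 for _ in range(10)]
noisy_page[2][3] = noisy_page[5][5] = noisy_page[7][1] = 0

print(white_coverage(noisy_page))               # 0.97
print(looks_blank(noisy_page, tolerance=0.01))  # False: 3% of pixels are dark
print(looks_blank(noisy_page, tolerance=0.05))  # True
```

A text-based selection cannot make this distinction, because a page of scanner specks and a page with an illustration both simply have no text objects.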

Perhaps an AHK script could accomplish this, but yes, it would be a great deal of work and time, as well as trial and error to get it working.
Moving an already existing feature from the Editor, over to PDF-Tools, is much easier than building something new from scratch, so I do not imagine this FR will remain open for long before it is ready for implementation.

Kind regards,
Dan McIntyre - Support Technician
PDF-XChange Co. LTD
