Hello Forum and PXE Support Team,
I've noticed that the "Ignore Existing Text on Page" option in the OCR function doesn't seem to work as expected in all cases.
For example, when running OCR on bitmap-based (i.e., scanned) pages, the entire page is re-processed and an additional layer of invisible text is added for the whole page, even when this option is selected.
This happens regardless of whether the page already contains any (invisible) OCR text from previous runs or from the original file.
It would be great if PDF-XChange Editor could avoid adding another text layer to page areas where invisible OCR text already exists. While I understand this might be technically challenging, it would improve efficiency and prevent unnecessary text duplication, especially when using search functions later, as they would otherwise show/count matching text instances twice.
Here's my typical use case scenario regarding this issue:
I often work with files containing hundreds of pages assembled from diverse sources: some pages are pure text or mixed vector content, others are bitmap-based, and some already have invisible OCR text while others don't. There may also be many mixed-content pages that combine vector/text elements with embedded bitmap images, which may contain bitmapped text that has not been OCR'd yet.
I then run OCR on the entire document, intending to add OCR text only to those pages or page areas where it's still missing.
However, currently, all bitmap-based pages or page areas then get an extra layer of OCR text, even in places where OCR text already exists.
The "Skip Pages that Already Contain Text" option doesn't solve this, as it skips mixed-content pages entirely, leaving bitmap images or page areas on those pages unprocessed.
Thank you for looking into this!
Best regards
David
OCR Issue with "Ignore Existing Text on Page"
-
- User
- Posts: 1648
- Joined: Thu Feb 28, 2008 8:16 pm
OCR Issue with "Ignore Existing Text on Page"
David.P
PDF-XChange Pro
-
- Site Admin
- Posts: 2261
- Joined: Mon Jan 15, 2018 9:01 am
Re: OCR Issue with "Ignore Existing Text on Page"
Hello David,
Is there any chance you could provide us with a copy of one of the files you are experiencing this issue with?
-
- User
- Posts: 1648
- Joined: Thu Feb 28, 2008 8:16 pm
Re: OCR Issue with "Ignore Existing Text on Page"
Sure Dimitar, will do ASAIGATI
David.P
PDF-XChange Pro
-
- Site Admin
- Posts: 2261
- Joined: Mon Jan 15, 2018 9:01 am