
Trying to fix a PDF OCR encoding issue

Posted: Mon Aug 05, 2024 5:38 pm
by Loki@99
Hi,

I'm trying to fix the OCR on this PDF file which likely had an encoding issue (I don't know how the original OCR was performed).
File_OCR_encoding issue.pdf
image.png

For the "fix", I tried to Rasterize the file (so that I can OCR it with PDFXCE afterwards) with the following settings, to keep the quality good:
- Compression : JPEG
- JPEG Quality : High
- 300 DPI

The issue is that after the Rasterize action, the file size became very large (537MB).
I understand this is expected, as the PDF now consists exclusively of high-resolution images.
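
As a rough illustration of why rasterizing inflates the file, here is a minimal sketch of how much uncompressed pixel data a single page holds at 300 DPI. The US Letter page size and 24-bit RGB colour depth are assumptions (the actual page size and page count of the file are not stated in this thread), and `raw_page_bytes` is a hypothetical helper, not part of PDF-XChange.

```python
# Rough estimate of raster image data per page.
# Page size and colour depth are assumptions, not values taken from the actual file.

def raw_page_bytes(width_in: float, height_in: float, dpi: int,
                   bytes_per_pixel: int = 3) -> int:
    """Uncompressed size, in bytes, of one page rasterized at the given DPI."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel


# Assumed US Letter page (8.5 x 11 in) at 300 DPI, 24-bit RGB.
per_page = raw_page_bytes(8.5, 11, 300)
print(f"{per_page / 1_000_000:.1f} MB uncompressed per page")  # 25.2 MB uncompressed per page
```

JPEG compression shrinks this considerably, but every page still carries a full-page image. Halving the DPI cuts the pixel count to a quarter, which is why lower-DPI profiles reduce size at the cost of visible quality.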

I tried to Save as Optimized with the "Standard" profile, but unfortunately the file size is still much larger (169MB) than the original file (58MB), and the quality has slightly deteriorated.

Maybe you have a solution for my issue,

Thanks for investigating,

Re: Trying to fix a PDF OCR encoding issue

Posted: Mon Aug 05, 2024 8:35 pm
by Daniel - PDF-XChange
Hello, Loki@99

There should be no need to rasterize the page in this situation. The only time rasterizing is required before OCR is when you only want to generate "searchable" text, but wish to leave the facsimile of the original scanned pages intact.

If you are looking to enable editing for the text, you can simply uncheck the "ignore text" option to get the same effect without creating a number of large images in the file:
image.png
This will allow the OCR to replace the existing text on the page with a new editable text layer.

Kind regards,

Re: Trying to fix a PDF OCR encoding issue

Posted: Tue Aug 06, 2024 4:42 am
by Loki@99
Hi,
The only time rasterizing is required before OCR is when you only want to generate "searchable" text, but wish to leave the facsimile of the original scanned pages intact.
Well, that's exactly the reason why I'm trying to fix that PDF OCR.

Fortunately, unchecking "Ignore existing text on page" and selecting the "Searchable Image" output setting fixed the issue.

image(1).png

There's still content left from the old OCR but it's no big deal as the search feature works now.

Old content
image.png

Thanks for your assistance,

Re: Trying to fix a PDF OCR encoding issue

Posted: Tue Aug 06, 2024 11:18 am
by Stefan - PDF-XChange
Hello Loki@99,

You could delete the previous OCR result first, and then run our OCR - that way the wrong encoding text will be removed and not mess with your content pane.

Kind regards,
Stefan

Re: Trying to fix a PDF OCR encoding issue

Posted: Tue Aug 06, 2024 2:48 pm
by Loki@99
Hi Stefan,
Tracker Supp-Stefan wrote: Tue Aug 06, 2024 11:18 am You could delete the previous OCR result first, and then run our OCR - that way the wrong encoding text will be removed and not mess with your content pane.
Unfortunately, it's not possible as it will delete all data.

GIF - Delete the wrong encoding
OCR.gif

Re: Trying to fix a PDF OCR encoding issue  SOLVED

Posted: Tue Aug 06, 2024 3:01 pm
by Stefan - PDF-XChange
Hello Loki@99,

Ahh! So the original image pixels have already been removed!
Yes, in that case what you did above should help! :D

Kind regards,
Stefan