Trying to fix a PDF OCR encoding issue  SOLVED

Discussion for the End User use of OCR in PDF-XChange Editor and Viewer

Loki@99
User
Posts: 558
Joined: Sat Dec 16, 2023 11:09 am

Trying to fix a PDF OCR encoding issue

Post by Loki@99 »

Hi,

I'm trying to fix the OCR on this PDF file, which likely has an encoding issue (I don't know how the original OCR was performed).
Attachment: File_OCR_encoding issue.pdf (58.38 MiB)
Screenshot: image.png
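
A rough way to check whether a text layer suffers from this kind of encoding problem is to copy some text out of the PDF and measure how much of it looks like debris. The sketch below is only a heuristic under stated assumptions: the function name `suspicious_ratio` is hypothetical, the text is assumed to have been extracted separately (for example by copy and paste), and it only catches private-use, unassigned, control, and replacement characters. A layer that maps glyphs to valid but wrong letters would not be flagged.

```python
import unicodedata


def suspicious_ratio(text):
    """Return the fraction of non-whitespace characters that look like encoding debris.

    Hypothetical helper: flags private-use (Co), unassigned (Cn) and
    control (Cc) characters, plus the U+FFFD replacement character.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0

    def is_suspicious(c):
        return c == "\ufffd" or unicodedata.category(c) in ("Co", "Cn", "Cc")

    return sum(is_suspicious(c) for c in chars) / len(chars)


# Paste a sample of text copied from the PDF here.
sample = "Text copied from the PDF"
print(f"{suspicious_ratio(sample):.0%} of the characters look suspicious")
```

A ratio well above zero on a page that displays normal text suggests the existing text layer is unreliable and worth replacing.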

As a "fix", I tried to Rasterize the file (so that I could then OCR it with PDFXCE), using the following settings to keep good quality:
- Compression : JPEG
- JPEG Quality : High
- 300 DPI

The issue is that after the Rasterize action, the file became very large (537 MB).
I'm aware that this is expected, as the PDF now consists exclusively of high-definition images.
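
To see why a rasterized file grows so much, it helps to estimate the uncompressed size of one page at a given resolution. This is a minimal sketch with assumed inputs: the page size (US Letter) and the RGB colour depth are assumptions, since the thread does not state the page dimensions or page count, and the function name `raster_page_bytes` is hypothetical. JPEG compression reduces these figures, but by an amount that depends on the content and quality setting.

```python
def raster_page_bytes(width_in, height_in, dpi, channels=3):
    """Uncompressed size in bytes of one rasterized page at 8 bits per channel."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * channels


# Assumption: US Letter pages (8.5 x 11 inches) in RGB colour.
for dpi in (150, 300):
    size = raster_page_bytes(8.5, 11, dpi)
    print(f"{dpi} DPI: {size / 1024**2:.1f} MiB per page before compression")
```

Because the pixel count scales with the square of the resolution, halving the DPI cuts the raw image data to a quarter.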

I then tried Save as Optimized with the "Standard" profile, but unfortunately the file is still much larger (169 MB) than the original (58 MB), and the quality has deteriorated slightly.

Maybe you have a solution for my issue,

Thanks for investigating,
Major Stylus topics
- RemoveAnnotationsWithEraser T#6903
- MiniPopupMenuOnTextSelection T#6894
- AbnormalSpikes forum.pdf-xchange.com/viewtopic.php?p=179935&hilit=spikes#p179935
- ForceEraserPreview forum.pdf-xchange.com/viewtopic.php?t=42380
Daniel - PDF-XChange
Site Admin
Posts: 10921
Joined: Wed Jan 03, 2018 6:52 pm

Re: Trying to fix a PDF OCR encoding issue

Post by Daniel - PDF-XChange »

Hello, Loki@99

There should be no need to rasterize the page in this situation. The only time rasterizing is required before OCR is when you only want to generate "searchable" text, but wish to leave the facsimile of the original scanned pages intact.

If you are looking to enable editing for the text, you can simply uncheck the "ignore text" option to get the same effect without creating a number of large images in the file:
image.png
This will allow the OCR to replace the existing text on the page with a new, editable text layer.

Kind regards,
Dan McIntyre - Support Technician
PDF-XChange Co. LTD

+++++++++++++++++++++++++++++++++++
Our Web site domain and email address has changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com
Loki@99
User
Posts: 558
Joined: Sat Dec 16, 2023 11:09 am

Re: Trying to fix a PDF OCR encoding issue

Post by Loki@99 »

Hi,
Daniel - PDF-XChange wrote: "The only time rasterizing is required before OCR is when you only want to generate "searchable" text, but wish to leave the facsimile of the original scanned pages intact."
Well, that's exactly the reason why I'm trying to fix that PDF OCR.

Fortunately, unchecking "Ignore existing text on page" and selecting the "Searchable Image" output setting fixed the issue.

image(1).png

There's still content left from the old OCR but it's no big deal as the search feature works now.

Old content
image.png

Thanks for your assistance,
Stefan - PDF-XChange
Site Admin
Posts: 19807
Joined: Mon Jan 12, 2009 8:07 am

Re: Trying to fix a PDF OCR encoding issue

Post by Stefan - PDF-XChange »

Hello Loki@99,

You could delete the previous OCR result first, and then run our OCR - that way the wrong encoding text will be removed and not mess with your content pane.

Kind regards,
Stefan
Loki@99
User
Posts: 558
Joined: Sat Dec 16, 2023 11:09 am

Re: Trying to fix a PDF OCR encoding issue

Post by Loki@99 »

Hi Stefan,
Tracker Supp-Stefan wrote (Tue Aug 06, 2024 11:18 am): "You could delete the previous OCR result first, and then run our OCR - that way the wrong encoding text will be removed and not mess with your content pane."
Unfortunately, that's not possible, as deleting it removes all the data on the page.

GIF - Delete the wrong encoding
OCR.gif
Stefan - PDF-XChange
Site Admin
Posts: 19807
Joined: Mon Jan 12, 2009 8:07 am

Re: Trying to fix a PDF OCR encoding issue  SOLVED

Post by Stefan - PDF-XChange »

Hello Loki@99,

Ahh! So the original image pixels have already been removed!
Yes, in that case what you did above should help! :D

Kind regards,
Stefan