Delete and rerun OCR getting rid of residual nonsense text

Forum for the PDF-XChange Editor - Free and Licensed Versions

Post by MedBooster »

Is it possible to delete the existing OCR text of a PDF and rerun the OCR?

Even after using these settings [screenshot: image.png], I am getting nonsense when I try copying text:
"estrogen therapy" becomes "snb\stro\\nbt\\r\py"

I suspect this is because of a poorly performed OCR.
How do you get rid of the residue from previously performed OCRs?
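
A rough way to check whether copied text is affected, before deciding to redo the OCR, is to measure how many unexpected symbols it contains. The sketch below is only a heuristic of my own, not anything PDF-XChange Editor does: the function names, the punctuation whitelist and the 0.2 threshold are all assumptions, and garbling that happens to map onto ordinary letters would not be caught.

```python
# Heuristic sketch (assumed names and threshold): flag text whose share of
# unexpected symbols, such as backslashes, is unusually high.
OK_PUNCT = set(".,;:!?'\"()-/%")


def suspicious_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are neither
    alphanumeric nor common punctuation."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    odd = sum(1 for c in chars if not (c.isalnum() or c in OK_PUNCT))
    return odd / len(chars)


def looks_garbled(text: str, threshold: float = 0.2) -> bool:
    """Return True when the suspicious share exceeds the (assumed) threshold."""
    return suspicious_ratio(text) > threshold


if __name__ == "__main__":
    # The sample copied from the PDF versus the text it should have been.
    print(looks_garbled(r"snb\stro\\nbt\\r\py"))
    print(looks_garbled("estrogen therapy"))
```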

Even with the new document and high accuracy options checked, the text I copy still looks nonsensical.

In the Content pane the text also appears as strange characters. We have discussed this in the past; maybe it has something to do with unsupported fonts:
viewtopic.php?t=43671&hilit=nonsense
[screenshot: image(1).png]
Even with "fine page content" it produces the very same text output, and I get seemingly identical nonsensical characters when I try to copy text.

It looks a bit worse when the font size and type are changed, which is why I prefer using "searchable image" anyway.

Similar post
viewtopic.php?p=178873&hilit=REMOVE+OCR#p178873

Re: Delete and rerun OCR getting rid of residual nonsense text

Post by Stefan - PDF-XChange »

Hello MedBooster,

A sample page would be quite helpful here, but from the screenshots you have provided it seems that your document contains text that has been embedded in a way that makes copying it and extracting it correctly practically impossible.

If you remove that existing text, this would likely remove the text from the page itself and would not leave a scanned image in its place.

If you want to correct such files, you can, for example, export the current pages to images and then OCR the result.
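
For anyone who wants to script the same "rasterize, then OCR" idea outside PDF-XChange Editor, here is a minimal sketch. It assumes the third-party command-line tool OCRmyPDF is installed and on the PATH; its `--force-ocr` option rasterizes each page and runs OCR again, which discards the existing text. The helper names and file names are hypothetical, and this is not how the Editor itself performs the step.

```python
# Sketch only: wraps the OCRmyPDF CLI (assumed to be installed separately).
import subprocess
from typing import List, Optional


def build_command(src: str, dst: str, language: Optional[str] = None) -> List[str]:
    """Build an ocrmypdf command line that rasterizes every page and re-OCRs it."""
    cmd = ["ocrmypdf", "--force-ocr"]
    if language:
        # -l selects the OCR language, e.g. "eng"
        cmd += ["-l", language]
    cmd += [src, dst]
    return cmd


def rerun_ocr(src: str, dst: str, language: Optional[str] = None) -> None:
    """Run the command; raises CalledProcessError if ocrmypdf reports failure."""
    subprocess.run(build_command(src, dst, language), check=True)


if __name__ == "__main__":
    # Hypothetical file names; this only prints the command, it does not run it.
    print(" ".join(build_command("scan.pdf", "scan_reocr.pdf", "eng")))
```

Since the pages are turned into images first, any vector content and real text on them end up as pixels, so it is worth keeping the original file.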

Kind regards,
Stefan