
Delete and rerun OCR getting rid of residual nonsense text

Posted: Sat Nov 30, 2024 7:46 am
by MedBooster
Is it possible to delete the existing OCR text of a PDF and rerun the OCR?

Even after using these settings:
image.png

I am getting nonsense when I try copying text:
"estrogen therapy" becomes "snb\stro\\nbt\\r\py"

I suspect this is because of a poorly performed OCR.
How do you get rid of the residue from previously performed OCRs?

Even with a new document and "high accuracy" checked, the text I copy still looks nonsensical.

In the content pane the text also appears as strange characters... We have discussed this in the past; maybe it has something to do with unsupported fonts:
viewtopic.php?t=43671&hilit=nonsense
image(1).png
Even with "fine page content" it produces the very same text output, and I get seemingly identical nonsensical characters when I try to copy text.

It looks a bit worse when the font size and type are changed, which is why I prefer using "searchable image" anyway.

Similar post
viewtopic.php?p=178873&hilit=REMOVE+OCR#p178873

Re: Delete and rerun OCR getting rid of residual nonsense text

Posted: Mon Dec 02, 2024 4:00 pm
by Stefan - PDF-XChange
Hello MedBooster,

A sample page would be quite helpful here, but from the screenshots you have provided it seems that your document already contains text, and that text has been embedded in a way that makes copying and correct extraction not really possible.

If you remove that existing text, this would likely remove the text from the page and would not leave a scanned image in its place.

If you want to correct such files, you can, for example, export the current pages to images and then OCR the result.

Kind regards,
Stefan
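
Before exporting pages to images, it may help to check which pages actually have a broken text layer. The following is only a rough sketch in Python, assuming the text of a page has already been copied or extracted into a string by some other means. The function name `looks_garbled` and the 0.25 threshold are hypothetical choices for illustration and are not part of PDF-XChange Editor.

```python
def looks_garbled(text: str, threshold: float = 0.25) -> bool:
    """Guess whether copied text comes from a broken text layer.

    Heuristic (an assumption, not a PDF-XChange feature): flag the text
    when more than `threshold` of its non-whitespace characters are
    neither letters nor digits.
    """
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        # Nothing to judge; treat empty text as not garbled.
        return False
    odd = sum(1 for ch in visible if not ch.isalnum())
    return odd / len(visible) > threshold


# The two strings quoted in the first post of this thread.
expected = "estrogen therapy"
copied = r"snb\stro\\nbt\\r\py"

print(looks_garbled(expected))  # False
print(looks_garbled(copied))    # True
```

This is deliberately crude: text that is legitimately heavy on punctuation or symbols (formulas, tables of numbers with separators) could be flagged by mistake, so the threshold would need tuning per document.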