Deleting and rerunning the OCR of a PDF; is that possible?
As you can see, even after using these settings, I am getting nonsense when I try copying text:
"estrogen therapy" becomes "snb\stro\\nbt\\r\py"
I suspect this is because of a poorly performed OCR.
How do you get rid of the residue from previously performed OCRs?
Even with a new document and high accuracy checked, the text I copy still looks nonsensical.
In the content pane the text also appears as strange characters... We have discussed this in the past... maybe it has something to do with unsupported fonts...
viewtopic.php?t=43671&hilit=nonsense
Even with "fine page content" it produces the very same text output, and I get seemingly identical nonsensical characters when I try to copy text.
It looks a bit worse when the font size and type are changed, which is why I prefer using "searchable image" anyway.
Similar post
viewtopic.php?p=178873&hilit=REMOVE+OCR#p178873
Delete and rerun OCR getting rid of residual nonsense text
- User
- Posts: 1372
- Joined: Mon Nov 15, 2021 8:38 pm
Delete and rerun OCR getting rid of residual nonsense text
My wishlist https://forum.pdf-xchange.com/viewtopic.php?p=187394#p187394
Disable SPACE page navigation, fix kb shortcut for highlighting advanced search tool search field, bookmarks with numbers, toolbar small icon size, AltGr/Ctrl+Alt keyboard issues
- Stefan - PDF-XChange
- Site Admin
- Posts: 19794
- Joined: Mon Jan 12, 2009 8:07 am
Re: Delete and rerun OCR getting rid of residual nonsense text
Hello MedBooster,
A sample page would be quite helpful here, but from the screenshots you have provided, it would seem that your document contains text, and that text has been embedded in a way that makes copying it and extracting it correctly not really possible.
If you remove that existing text, this would likely remove the text from the page and would not leave a scanned image in its place.
If you want to correct such files, you can e.g. export the current pages to images, and then OCR the result.
Kind regards,
Stefan
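For readers who want to script the "export pages to images, then OCR the result" idea outside PDF-XChange, here is a minimal sketch. It assumes the separate open-source tool OCRmyPDF is installed and on the PATH; that tool is not mentioned in this thread and is not part of PDF-XChange. Its `--force-ocr` option rasterizes every page and runs OCR again, so the old, unusable text layer is discarded. The function names and file names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_force_ocr_command(src, dst, language="eng"):
    """Build the OCRmyPDF command that rasterizes each page and re-runs OCR."""
    return [
        "ocrmypdf",
        "--force-ocr",   # rasterize every page, dropping the existing text layer
        "-l", language,  # Tesseract language code, e.g. "eng"
        str(src),
        str(dst),
    ]


def rerun_ocr(src, dst, language="eng"):
    """Run OCRmyPDF on src and write the freshly OCRed PDF to dst."""
    # Assumption: the ocrmypdf executable is installed separately.
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    subprocess.run(build_force_ocr_command(src, dst, language), check=True)
    return Path(dst)
```

Calling `rerun_ocr("old_scan.pdf", "old_scan_reocr.pdf")` would write a new file and leave the original untouched. As with the manual approach, the output pages are images with a new hidden text layer, so any vector content on the original pages is flattened.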