OCR changes English font to illegible characters

Discussion for the End User use of OCR in PDF-XChange Editor and Viewer

philjv
User
Posts: 5
Joined: Wed Mar 13, 2024 8:07 pm

OCR changes English font to illegible characters

Post by philjv »

Support request for PDF-Xchange PRO Editor Plus Version: 10.1.2, build 382 (Enhanced OCR) software.

After performing OCR on a PDF document, it:
• changes characters, letters, alphabets, and font
• changes formatting of font
• changes formatting of sentences
• changes the line spacing with some lines disappearing, randomly
• changes font to illegible characters (not in English language)

This has happened on multiple documents. Please advise.
Paul - Tracker Supp
Site Admin
Posts: 7021
Joined: Wed Mar 25, 2009 10:37 pm
Location: Chemainus, Canada

Re: OCR changes English font to illegible characters

Post by Paul - Tracker Supp »

Hi, philjv

There are so many variables involved in the OCR process that it is hard to say exactly what is happening. The most likely cause is that the font in the original is not available on your system, so a "font substitution" must be done.

May we see a sample PDF from before OCR is performed, please?

Kind regards,
Paul - Tracker Supp
Best regards

Paul O'Rorke
Tracker Support North America
http://www.tracker-software.com
philjv
User
Posts: 5
Joined: Wed Mar 13, 2024 8:07 pm

Re: OCR changes English font to illegible characters

Post by philjv »

As an example, please see the attached files from before and after OCR, where the font changed after OCR.
TrackerSupp-Daniel
Site Admin
Posts: 9111
Joined: Wed Jan 03, 2018 6:52 pm

Re: OCR changes English font to illegible characters

Post by TrackerSupp-Daniel »

Hello, philjv

I cannot seem to locate the illegible characters you mention here. With the exception of a few bullet points that are not converted to more uniform objects, and some table lines that are partially removed, the OCR'ed version looks considerably more legible overall than the original does. Below are a few "blink test" GIFs for comparison:
PDFXEdit_e71wc9F6kc.gif
PDFXEdit_hu7DGWibNE.gif
PDFXEdit_WgsIQG5wqn.gif
Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

+++++++++++++++++++++++++++++++++++
Our Web site domain and email address has changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com
philjv
User
Posts: 5
Joined: Wed Mar 13, 2024 8:07 pm

Re: OCR changes English font to illegible characters

Post by philjv »

Hello Dan,

Thank you for your response. The examples I provided yesterday were meant to show only the font changes after OCR, along with some table properties that also changed. They were not meant to illustrate the other issues.
TrackerSupp-Daniel
Site Admin
Posts: 9111
Joined: Wed Jan 03, 2018 6:52 pm

Re: OCR changes English font to illegible characters

Post by TrackerSupp-Daniel »

Hello, philjv

I see. In that case, from a font perspective, this is well within an acceptable margin of error. The original document font is "stretched" in height, and in every case I compared, once that height stretch is taken into account, this does appear to be the same font. OCR is not able to apply distortions to the text (yet); it simply finds the closest font available and places characters in that location, while trying to keep the same relative position to their neighbors.
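
To illustrate what "taking the height stretch into account" can mean when comparing fonts, here is a minimal sketch. It is not the OCR engine's code; the font names, the glyph measurements, and the helper names `fit_stretch` and `residual` are all made up for illustration. It fits one vertical stretch factor per candidate font and then checks how well the glyph shapes agree once that stretch is removed.

```python
def fit_stretch(observed, reference):
    """Least-squares vertical stretch s that best maps reference ratios onto observed ones."""
    num = sum(o * r for o, r in zip(observed, reference))
    den = sum(r * r for r in reference)
    return num / den


def residual(observed, reference):
    """Shape mismatch that remains after the best-fit stretch has been applied."""
    s = fit_stretch(observed, reference)
    return sum((o - s * r) ** 2 for o, r in zip(observed, reference)) ** 0.5


# Hypothetical height/width ratios for three glyphs per font (made-up numbers).
fonts = {
    "FontA": [1.00, 1.40, 1.80],
    "FontB": [1.10, 1.20, 2.10],
}

# Hypothetical scan: FontA stretched 1.5x in height.
observed = [1.50, 2.10, 2.70]

# The candidate with the smallest remaining mismatch is treated as the closest font.
best = min(fonts, key=lambda name: residual(observed, fonts[name]))
print(best, round(fit_stretch(observed, fonts[best]), 3))
```

In this toy setup the stretched scan still maps back to the same underlying font, which is the sense in which a height-stretched original and its OCR'ed replacement can be "the same font" even though they look different on the page.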

Regarding the missing table lines, this is an issue that our Devs are working on, but it is a long-term, gradual-improvement kind of task.

Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

philjv
User
Posts: 5
Joined: Wed Mar 13, 2024 8:07 pm

Re: OCR changes English font to illegible characters

Post by philjv »

Here are examples of another original signed document before OCR, and the same document after OCR. The OCR in PDF-Xchange PRO Editor Plus Version: 10.1.2, build 382 (Enhanced OCR) changed the font, making the OCR'd document unusable because the changes were not approved by the author of the original signed document. The rules of most courts require this document to be OCR'd before it is filed in a court's electronic filing system, but a document with unauthorized changes made in any manner after its signature cannot be filed with a court.

This is standard usage expected of any OCR functionality, whether in PDF-X or other software. It is especially expected of software with "Enhanced OCR."

Please advise on how to preserve the original font and properties after OCR without making any unauthorized changes to the document.
TrackerSupp-Daniel
Site Admin
Posts: 9111
Joined: Wed Jan 03, 2018 6:52 pm

Re: OCR changes English font to illegible characters

Post by TrackerSupp-Daniel »

Hello, philjv

If you are performing OCR on a document for the purpose of submitting it to the courts, you should never be using the "editable" option, as this can and will make changes to the document content, invalidating any signatures present.

You will need to use the "searchable text" OCR option instead, which leaves the original page intact and adds invisible text content overlaid on the respective area of the page. Do note that, as I have already mentioned in this thread, OCR is not a perfect system and mistakes can be made. This document has a number of blemishes, as well as handwritten text, which can confuse OCR systems further. All of this means that even for searchable purposes, there may still be mistakes.
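
For readers curious how an invisible text overlay can work at the PDF level, here is a minimal sketch. It relies on the PDF text rendering mode 3 ("neither fill nor stroke"), which is part of the PDF specification. Whether PDF-XChange writes its searchable layer in exactly this way is an assumption, and the function name `invisible_text_ops`, the font resource name `F1`, and the coordinates are made up for illustration.

```python
def invisible_text_ops(text, x, y, font_name="F1", size=12.0):
    """Build PDF content-stream operators that place `text` invisibly at (x, y)."""
    # Escape the characters that are special inside a PDF literal string.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return "\n".join([
        "BT",                          # begin text object
        f"/{font_name} {size:g} Tf",   # select font resource and size
        "3 Tr",                        # text rendering mode 3: invisible
        f"{x:g} {y:g} Td",             # move to the position of the recognized word
        f"({escaped}) Tj",             # show (invisibly) the recognized text
        "ET",                          # end text object
    ])


ops = invisible_text_ops("Filed (copy)", 72, 700)
print(ops)
```

Operators like these are added on top of the existing page content, so the scanned image that the reader sees is not redrawn; only a selectable, searchable layer is placed over it.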

Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD

Loki@99
User
Posts: 234
Joined: Sat Dec 16, 2023 11:09 am

Re: OCR changes English font to illegible characters

Post by Loki@99 »

Hi,
TrackerSupp-Daniel wrote (Thu Mar 14, 2024 10:41 pm): "OCR is not able to apply distortions to the text (yet), it simply finds the closest font available."
I also noticed that the EOCR engine can't match the exact font, even with system fonts and high-quality images.

Here's a sample file created with Microsoft Office, then converted into an image file with the Microsoft Snipping Tool.
File sample_OCR.pdf
EOCR Settings
image(1).png

Before EOCR - Image file
image.png



After EOCR - Calibri and Segoe UI have become Arial
GIF - See Font in Text properties
issue.gif

As the engine used for OCR in PDFXCE belongs to ABBYY, I would understand if Tracker Software devs can't fix this issue.
That said, it would be great if you could pass this on to ABBYY.

Thanks for investigating,
TrackerSupp-Daniel
Site Admin
Posts: 9111
Joined: Wed Jan 03, 2018 6:52 pm

Re: OCR changes English font to illegible characters

Post by TrackerSupp-Daniel »

Hello, Loki@99

Fonts like Calibri, Segoe UI, and Arial, which are extremely similar in most respects, will likely always have this issue.
As an example, I have adjusted the font size so that all three match, and they look nearly identical, close enough that most differences could simply be caused by a good or poor quality scan.
image.png
The greatest "hint" is that the letter g is slightly different in shape. Even to the human eye, these are barely distinguishable from one another, aside from a minor variance in the default "size" scaling and character spacing, which are both items that can be modified by external settings, and could easily be unrelated to the original font in use.

OCR needs to allow a degree of "fuzziness" at all times, because it is designed to find the correct text in an imperfect situation, and assuming that an image is absolutely perfect will not go well.

While trying to find the closest match is indeed the goal, for all intents and purposes, both Segoe UI and Calibri are more than close enough to Arial to get flagged as such, and it is more important that the correct text is found (as an example, the "g" mentioned before, could be mis-recognized as an offset "S" if OCR was too strict with character set matching rules).
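
The trade-off above can be sketched with a toy tolerance check. This is only an illustration under stated assumptions, not how the OCR engine is implemented: the distance values, the preference ordering, and the function name `pick_font` are all hypothetical.

```python
def pick_font(distances, tolerance, preference):
    """Return the first preferred font whose distance is within tolerance.

    Falls back to the single nearest font when nothing is within tolerance.
    """
    for name in preference:
        if distances.get(name, float("inf")) <= tolerance:
            return name
    return min(distances, key=distances.get)


# Made-up distances between a scanned text sample and each font's glyph shapes.
distances = {"Calibri": 0.05, "Segoe UI": 0.08, "Arial": 0.07}
preference = ["Arial", "Calibri", "Segoe UI"]  # hypothetical engine ordering

loose = pick_font(distances, tolerance=0.10, preference=preference)
strict = pick_font(distances, tolerance=0.01, preference=preference)
print(loose, strict)
```

With the loose tolerance, all three fonts count as "close enough" and the preferred one is reported; with the strict tolerance, nothing qualifies and the sketch has to fall back to a nearest guess. A matcher that strict would be correspondingly fragile on a noisy scan.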

Kind regards,
Dan McIntyre - Support Technician
Tracker Software Products (Canada) LTD
