OCR changes English font to illegible characters

philjv
User
Posts: 5
Joined: Wed Mar 13, 2024 8:07 pm

OCR changes English font to illegible characters

Post by philjv »

Support request for PDF-Xchange PRO Editor Plus Version: 10.1.2, build 382 (Enhanced OCR) software.

After performing OCR on a PDF document, it:
• changes characters, letters, alphabets, and font
• changes formatting of font
• changes formatting of sentences
• changes the line spacing with some lines disappearing, randomly
• changes font to illegible characters (not in English language)

Happened on multiple documents. Please support.
Paul - PDF-XChange
Site Admin
Posts: 7356
Joined: Wed Mar 25, 2009 10:37 pm

Re: OCR changes English font to illegible characters

Post by Paul - PDF-XChange »

Hi, philjv

There are so many variables involved in the OCR process that it is hard to say exactly what is happening. The most likely cause is that the font in the original is not available on your system, so a "font substitution" must be done.

May we see a sample PDF before OCR is performed please?

Kind regards,

Paul O'Rorke
PDF-XChange Support
http://www.pdf-xchange.com
philjv
User
Posts: 5
Joined: Wed Mar 13, 2024 8:07 pm

Re: OCR changes English font to illegible characters

Post by philjv »

As an example, please see attached files before and after the OCR where the font changed after OCR.
Daniel - PDF-XChange
Site Admin
Posts: 10884
Joined: Wed Jan 03, 2018 6:52 pm

Re: OCR changes English font to illegible characters

Post by Daniel - PDF-XChange »

Hello, philjv

I cannot seem to locate the illegible characters you describe. With the exception of a few bullet points that are not converted to more uniform objects, and some table lines that are partially removed, the OCR'ed version looks considerably more legible overall than the original does. Below are a few "blink test" GIFs for comparison:
PDFXEdit_e71wc9F6kc.gif
PDFXEdit_hu7DGWibNE.gif
PDFXEdit_WgsIQG5wqn.gif
Kind regards,
Dan McIntyre - Support Technician
PDF-XChange Co. LTD

+++++++++++++++++++++++++++++++++++
Our Web site domain and email address has changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com
philjv
User
Posts: 5
Joined: Wed Mar 13, 2024 8:07 pm

Re: OCR changes English font to illegible characters

Post by philjv »

Hello Dan,

Thank you for your response. The examples I provided yesterday were meant to show only the font changes after OCR, along with some table properties that were also changed. They were not meant to illustrate the other issues.
Daniel - PDF-XChange
Site Admin
Posts: 10884
Joined: Wed Jan 03, 2018 6:52 pm

Re: OCR changes English font to illegible characters

Post by Daniel - PDF-XChange »

Hello, philjv

I see. In that case, from a font perspective, this is well within an acceptable margin of error. The original document font is "stretched" in height, and in every case I compared, taking that height stretch into account, this does appear to be the same font. OCR is not able to apply distortions to the text (yet); it simply finds the closest font available and places characters in that location, while trying to keep the same relative position to their neighbors.

Regarding the missing table lines, this is an issue that our Devs are working on, but it is a long term, gradual improvement kind of task.

Kind regards,
Dan McIntyre - Support Technician
PDF-XChange Co. LTD

philjv
User
Posts: 5
Joined: Wed Mar 13, 2024 8:07 pm

Re: OCR changes English font to illegible characters

Post by philjv »

Here are examples of another original signed document before OCR, and the same document after OCR. The OCR in PDF-Xchange PRO Editor Plus Version: 10.1.2, build 382 (Enhanced OCR) changed the font, making the OCR'd document unusable because the changes were not approved by the author of the original signed document. The rules of most courts require this document to be OCR'd before it is filed in a court's electronic filing system, but a document with unauthorized changes made in any manner after its signature cannot be filed with a court.

This is a standard use expected of any OCR functionality, whether in PDF-X or other software. It is especially expected of software with "Enhanced OCR."

Please advise how to maintain the original font and properties after OCR without making any unauthorized changes to the document.
Daniel - PDF-XChange
Site Admin
Posts: 10884
Joined: Wed Jan 03, 2018 6:52 pm

Re: OCR changes English font to illegible characters

Post by Daniel - PDF-XChange »

Hello, philjv

If you are performing OCR on a document for the purpose of submitting it to the courts, you should never be using the "editable" option, as this can and will make changes to the document content, invalidating any signatures present.

You will need to use the "searchable text" OCR option instead, which leaves the original page intact and adds invisible text content overlaid on the respective area of the page. Do note that, as I have already mentioned in this thread, OCR is not a perfect system and mistakes can be made. This document has a number of blemishes, as well as handwritten text, which can confuse OCR systems further. All of this means that even for searchable purposes, there may still be mistakes.
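
For readers wondering how text can be "invisible" yet searchable: the PDF format defines text rendering mode 3, in which glyphs are neither filled nor stroked. The sketch below builds the content-stream operators for such a text run. It is only an illustration of the general technique, not a description of how PDF-XChange writes its output; the function names and the font resource name /F1 are assumptions made for this example.

```python
def escape_pdf_string(text: str) -> str:
    """Escape backslashes and parentheses for a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_ops(text: str, x: float, y: float, font_size: float) -> str:
    """Build content-stream operators that place `text` at (x, y) invisibly.

    "3 Tr" selects text rendering mode 3: nothing is painted, so the scanned
    page image underneath stays untouched, but the text can still be found
    by search. The font resource name /F1 is an assumption for this sketch.
    """
    return (
        "BT\n"                                 # begin text object
        f"/F1 {font_size:g} Tf\n"              # select font and size
        "3 Tr\n"                               # rendering mode 3 = invisible
        f"1 0 0 1 {x:g} {y:g} Tm\n"            # position the text
        f"({escape_pdf_string(text)}) Tj\n"    # show the (invisible) string
        "ET"                                   # end text object
    )


print(invisible_text_ops("Signed on (date)", 72, 700, 12))
```

Because the visible page is still the original scan, a recognition mistake in this layer affects only search and copy/paste results, not what is displayed or printed.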

Kind regards,
Dan McIntyre - Support Technician
PDF-XChange Co. LTD

Loki@99
User
Posts: 558
Joined: Sat Dec 16, 2023 11:09 am

Re: OCR changes English font to illegible characters

Post by Loki@99 »

Hi,
TrackerSupp-Daniel wrote: Thu Mar 14, 2024 10:41 pm OCR is not able to apply distortions to the text (yet), it simply finds the closest font available.
I also noticed that the EOCR engine can't match the exact font, even with system fonts and high-quality images.

Here's a sample file created with Microsoft Office, then converted into an image file with the Microsoft Snipping Tool.
File sample_OCR.pdf
EOCR Settings
image(1).png

Before EOCR - Image file
image.png



After EOCR - Calibri and Segoe UI have become Arial
GIF - See Font in Text properties
issue.gif

As the engine used for OCR in PDFXCE is a trademark of ABBYY, I would understand if the Tracker Software devs can't fix that issue.
That said, it would be great if you could pass this on to ABBYY.

Thanks for investigating,
Major Stylus topics
- RemoveAnnotationsWithEraser T#6903
- MiniPopupMenuOnTextSelection T#6894
- AbnormalSpikes forum.pdf-xchange.com/viewtopic.php?p=179935&hilit=spikes#p179935
- ForceEraserPreview forum.pdf-xchange.com/viewtopic.php?t=42380
Daniel - PDF-XChange
Site Admin
Posts: 10884
Joined: Wed Jan 03, 2018 6:52 pm

Re: OCR changes English font to illegible characters

Post by Daniel - PDF-XChange »

Hello, Loki@99

Fonts like Calibri, Segoe UI, and Arial, which are extremely similar in most respects, will likely always have this issue...
As an example, I have adjusted the font size so the three match, and they look nearly identical, close enough that most differences could simply be caused by a good or poor quality scan.
image.png
The greatest "hint" is that the letter g is slightly different in shape. Even to the human eye, these are barely distinguishable from one another, aside from a minor variance in the default "size" scaling and character spacing, which are both items that can be modified by external settings, and could easily be unrelated to the original font in use.

OCR needs to allow a degree of "fuzziness" at all times, because it is designed to try to find the correct text in an imperfect situation, and assuming that an image is absolutely perfect will not go well.

While trying to find the closest match is indeed the goal, for all intents and purposes, both Segoe UI and Calibri are more than close enough to Arial to get flagged as such, and it is more important that the correct text is found (as an example, the "g" mentioned before, could be mis-recognized as an offset "S" if OCR was too strict with character set matching rules).
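
The trade-off described above can be sketched with a toy matcher. This is not the ABBYY engine's algorithm; the 3x3 glyph bitmaps, the font names FontA/FontB/FontC, and the pixel-difference metric are all hypothetical values chosen for illustration. It shows that a strict tolerance rejects a slightly noisy scan outright, while a looser one accepts it but can no longer tell two near-identical fonts apart.

```python
def pixel_difference(a: str, b: str) -> int:
    """Count the positions where two equal-length bitmaps differ."""
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


def candidate_fonts(observed: str, references: dict, tolerance: int) -> list:
    """Return the fonts whose glyph is within `tolerance` pixels, closest first."""
    scored = sorted(
        (pixel_difference(observed, bitmap), name)
        for name, bitmap in references.items()
    )
    return [name for distance, name in scored if distance <= tolerance]


# Hypothetical 3x3 bitmaps of the same letter, flattened row by row.
references = {
    "FontA": "111101111",
    "FontB": "111101110",  # differs from FontA by a single pixel
    "FontC": "010010010",  # a clearly different design
}
scanned = "111101011"  # FontA with one pixel lost to scan noise

print(candidate_fonts(scanned, references, tolerance=0))
print(candidate_fonts(scanned, references, tolerance=2))
```

With a tolerance of 0 nothing matches, so the glyph would not be recognized at all; with a tolerance of 2 both FontA and FontB qualify, which is the situation where similar fonts get reported as one another.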

Kind regards,
Dan McIntyre - Support Technician
PDF-XChange Co. LTD

Loki@99
User
Posts: 558
Joined: Sat Dec 16, 2023 11:09 am

Re: OCR changes English font to illegible characters

Post by Loki@99 »

Hi,

After trying PDFXCE EOCR with various scanned PDFs, I have to conclude that Editable Output and Fine Page Content are too unreliable.

Apart from the font change, the way it alters characters, formatting (bold, italic...), bullet sizes, and symbols is a major issue.

image.png

The only solution for this is to be able to create "fake distorted" fonts, as Adobe Acrobat's Editable OCR output does, which I would say matches 99% of the original layout.

Here are attached files illustrating the major issues containing Original PDF, PDFXCE EOCR and Adobe Acrobat Editable OCR Output at 600DPI.
File_sample_1.pdf
File_sample_2.pdf

Thanks for considering that "fake font" output.
theCornflower
User
Posts: 1
Joined: Mon Oct 07, 2024 4:02 pm

Re: OCR changes English font to illegible characters

Post by theCornflower »

My best guess going over the submissions, Loki_991, is that the image size is such that the Enhanced OCR cannot distinguish font from image. I've had better luck with larger images with text. My suspicion is that the OCR engine is treating this as purely a graphic image, as opposed to image text, though why it would not just leave it alone is unclear.
Loki@99
User
Posts: 558
Joined: Sat Dec 16, 2023 11:09 am

Re: OCR changes English font to illegible characters

Post by Loki@99 »

Hi,
theCornflower wrote: Mon Oct 07, 2024 4:07 pm My best guess going over the submissions, Loki_991, is that the image size is such that the Enhanced OCR cannot distinguish font from image. I've had better luck with larger images with text. My suspicion is that the OCR engine is treating this as purely a graphic image, as opposed to image text, though why it would not just leave it alone is unclear.
@theCornflower

As I said, I already did the same with various PDFs, so it's not related to size. Hence I came to that conclusion.

The "fake distorted fonts" from Adobe Acrobat is the only solution.
Daniel - PDF-XChange
Site Admin
Posts: 10884
Joined: Wed Jan 03, 2018 6:52 pm

Re: OCR changes English font to illegible characters

Post by Daniel - PDF-XChange »

Hello, everyone

@Loki, Adobe has many of the same issues we do with recognizing the actual content. Here are a few quick examples (below are two GIF animations, click to play):
PDFXEdit_8eneSXX4bn.gif
PDFXEdit_dtAc6DqA3o.gif
Both of these are from the Adobe side of your examples. While they do use a different method from ours to "emulate" retention of the original text, this is really just creating a new font, using a smoothed version of what was on the page as the glyph, and then assigning it to a recognized character (which may be incorrect). While it does "visually" appear correct to a human, it is still just as volatile; you just don't notice it right away because they are better at masking these mistakes.
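
To make that distinction concrete, here is a minimal sketch of the idea of a glyph traced from the page and paired with a recognized character. It is a conceptual model only, not Adobe's implementation; the class and function names are hypothetical, and a plain letter stands in for the traced outline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TracedGlyph:
    shape: str          # stand-in for the smoothed outline traced from the scan
    assigned_char: str  # character the recognizer attached to that outline


def visible_text(glyphs):
    """What a reader sees: the traced shapes, drawn as they appeared on the page."""
    return "".join(g.shape for g in glyphs)


def extracted_text(glyphs):
    """What search or copy/paste returns: the assigned characters."""
    return "".join(g.assigned_char for g in glyphs)


page = [
    TracedGlyph("l", "l"),
    TracedGlyph("o", "0"),  # recognizer mistook the letter o for a zero
    TracedGlyph("g", "g"),
]

print(visible_text(page))    # looks correct on screen
print(extracted_text(page))  # the recognition error only shows up here
```

The page looks right because the shape is reused as-is, while the recognition error surfaces only when the text is searched or copied.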

Going back to my earlier statement, no OCR engine is perfect; ours is no exception, and neither is Adobe's.
Regarding the color change you mentioned, I am quite colorblind (especially with the blue spectrum), so visually I see no difference. However, after confirming with a colleague whose eyes work properly, then grabbing the color picker and comparing, it does appear that the color is just a normalized value given the various colors of the rasterized text content:
image.png
image(1).png
It would be impossible to find a perfect match given the variance in color over each pixel of the original image, so this is within an acceptable margin of difference.
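
As a rough illustration of why a single "normalized" color cannot match every pixel, the sketch below averages the RGB values of a few sampled pixels. Averaging is an assumption made for this example; the actual normalization the engine uses is not stated in this thread, and the pixel values are made up.

```python
def normalized_color(pixels):
    """Return the per-channel mean of a list of (R, G, B) tuples, rounded."""
    if not pixels:
        raise ValueError("at least one pixel is required")
    count = len(pixels)
    return tuple(round(sum(p[channel] for p in pixels) / count) for channel in range(3))


# Hypothetical samples from one rasterized stroke: anti-aliasing and scan
# noise mean no two pixels share exactly the same color.
samples = [(10, 20, 30), (30, 40, 50)]
print(normalized_color(samples))
```

The result sits between the sampled values, so it will differ slightly from any individual pixel picked with a color picker.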

Finally, the missing content with "fine page content" mode. This could be an issue, but to my understanding it is expected that some (non-OCRable) content goes missing during this operation. As with the other items, I will be passing this along to the Dev team for review nonetheless. To my knowledge, Adobe does not offer a fine page content option, so comparing it to the "editable text and images" option is more accurate.

@Cornflower, These items are recognized as text because they are inline with text, nearly the same size as the text content, and the shapes are close enough to letters to be caught, even if what is recognized is the border around the letter, and not the actual content.

As with any report of trouble in the OCR, because it is not perfect, there is always room for improvement. All of this will be moving up to the Dev team, who will communicate with ABBYY's team and see what can be done to help. But as the OCR engine is provided to us by a third party, I cannot promise any timelines on fixes or feature improvements.

Kind regards,
Dan McIntyre - Support Technician
PDF-XChange Co. LTD

Loki@99
User
Posts: 558
Joined: Sat Dec 16, 2023 11:09 am

Re: OCR changes English font to illegible characters

Post by Loki@99 »

Hi Daniel,
TrackerSupp-Daniel wrote: Mon Oct 07, 2024 7:17 pm @Loki, Adobe has many of the same issues we do with recognizing the actual content, as a few quick examples
You're right. I totally agree that no OCR is perfect at recognition. But what I'm talking about here doesn't really concern recognition accuracy.

I'm fine with PDFXCE and Adobe Acrobat having wrong character recognition if at least the content is "visually" the same (which is not the case with PDFXCE Editable or Fine Page Content output). In other words, I prioritize "visual fidelity" over recognition accuracy.

Another good point of Adobe Acrobat's "fake fonts" is that they significantly reduce the file size.
TrackerSupp-Daniel wrote: Mon Oct 07, 2024 7:17 pm Regarding the color change you mentioned, I am quite colorblind (especially with the blue spectrum), so visually I see no difference, but confirming with a colleague whose eyes work properly, then grabbing the color picker and comparing, it does appear that the color is just a normalized value given the various colors of the rasterized text content
I was pointing at the color of the table border, which should be red, not the text.
TrackerSupp-Daniel wrote: Mon Oct 07, 2024 7:17 pm As with any report of trouble in the OCR, because it is not perfect, there is always room for improvement. All of this will be moving up to the Dev team, who will communicate with ABBYY's team, and see what can be done to help. But as the OCR engine is provided to us by a third party, I cannot promise any timelines on fixes or feature improvements.
Unfortunately, I don't think that ABBYY's team will be of much help with this. I had a copy of ABBYY FineReader 16 back then and had the same issue. I tried Foxit PDF OCR and got the same issue.

It looks like only Adobe Acrobat has a feature that prioritizes "visual fidelity". It was called ClearScan in versions up to Adobe Acrobat XI, but it was later renamed to Editable output.

I'm aware that I can simply use PDFXCE Searchable output but, as I said, the way Adobe Acrobat handles it drastically reduces the file size and adds some clarity as well, because the fonts are vectorized.

Thus, what I would like you to pass to the Dev team is not recognition accuracy but that "visual fidelity" with "fake fonts" output.

Having that feature in PDFXCE will be fantastic.

Thank you,
Daniel - PDF-XChange
Site Admin
Posts: 10884
Joined: Wed Jan 03, 2018 6:52 pm

Re: OCR changes English font to illegible characters

Post by Daniel - PDF-XChange »

Hello, Loki@99

Unfortunately, that is not a feature we can offer at this time. It is something the OCR engine would need to develop, after which we would need to update the version of the engine we use in order to benefit from it. Our OCR guys are well aware of the demand for it, but it is not something we can promise will be coming to the Editor in any discernible timeframe.

Kind regards,
Dan McIntyre - Support Technician
PDF-XChange Co. LTD
