OCR changes English font to illegible characters
Support request for PDF-XChange PRO Editor Plus, version 10.1.2, build 382 (Enhanced OCR).
After performing OCR on a PDF document, it:
• changes characters, letters, alphabets, and font
• changes formatting of font
• changes formatting of sentences
• changes the line spacing with some lines disappearing, randomly
• changes font to illegible characters (not in English language)
This happened on multiple documents. Please advise.
- Paul - PDF-XChange
- Site Admin
- Posts: 7356
- Joined: Wed Mar 25, 2009 10:37 pm
Re: OCR changes English font to illegible characters
Hi, philjv
There are so many variables involved in the OCR process that it is hard to say exactly what is happening. The most likely cause is that the font in the original is not available on your system, so a "font substitution" must be done.
May we see a sample PDF before OCR is performed please?
Kind regards,
Paul - Tracker Supp
Best regards
Paul O'Rorke
PDF-XChange Support
http://www.pdf-xchange.com
Re: OCR changes English font to illegible characters
As an example, please see the attached files, taken before and after OCR, which show the font changing.
- Daniel - PDF-XChange
- Site Admin
- Posts: 10884
- Joined: Wed Jan 03, 2018 6:52 pm
Re: OCR changes English font to illegible characters
Hello, philjv
I cannot seem to locate the illegible characters you mention. With the exception of a few bullet points that are not converted to more uniform objects, and some table lines that are partially removed, the OCR'ed version looks considerably more legible overall than the original does. Below are a few "blink test" GIFs for comparison.
Kind regards,
Dan McIntyre - Support Technician
PDF-XChange Co. LTD
+++++++++++++++++++++++++++++++++++
Our Web site domain and email address has changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com
Re: OCR changes English font to illegible characters
Hello Dan,
Thank you for your response. The examples I provided yesterday were meant only to show the font changes after OCR, along with some table properties that also changed. They were not meant to illustrate the other issues.
- Daniel - PDF-XChange
Re: OCR changes English font to illegible characters
Hello, philjv
I see. In that case, from a font perspective, this is well within an acceptable margin of error. The original document font is "stretched" in height, and in every case I compared, taking that height stretch into account, this does appear to be the same font. OCR is not able to apply distortions to the text (yet), it simply finds the closest font available and places characters in that location, while trying to keep the same relative position to their neighbors.
Regarding the missing table lines, this is an issue that our Devs are working on, but it is a long-term, gradual-improvement kind of task.
Kind regards,
Dan McIntyre - Support Technician
Re: OCR changes English font to illegible characters
This is standard behavior expected of any OCR functionality, whether in PDF-XChange or other products, and especially of software with "Enhanced OCR."
Please advise how to maintain the original font and properties after OCR without any unauthorized changes being made to the document.
- Daniel - PDF-XChange
Re: OCR changes English font to illegible characters
Hello, philjv
If you are performing OCR on a document for the purpose of submitting it to the courts, you should never use the "editable" option, as this can and will make changes to the document content, invalidating any signatures present.
You will need to use the "searchable text" OCR option instead, which leaves the original page intact and adds invisible text content overlaid on the respective area of the page. Do note that, as I have already mentioned in this thread, OCR is not a perfect system and mistakes can be made. This document has a number of blemishes, as well as handwritten text, which can confuse OCR systems further. All of this means that even for searchable purposes, there may still be mistakes.
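To make the "invisible text overlaid on the page" idea concrete, here is a minimal Python sketch. It is not how PDF-XChange implements it; it only illustrates the general PDF mechanism, where text rendering mode 3 (`3 Tr`) draws text that is neither filled nor stroked. The function names, the font resource name `F1`, and the sample word and coordinates are assumptions made for illustration.

```python
def escape_pdf_string(text: str) -> str:
    """Escape the characters that are special inside a PDF literal string."""
    return (
        text.replace("\\", "\\\\")
        .replace("(", "\\(")
        .replace(")", "\\)")
    )


def invisible_text_ops(word: str, x: float, y: float, font_size: float,
                       font_name: str = "F1") -> str:
    """Return PDF content-stream operators that place `word` invisibly.

    Text rendering mode 3 ("3 Tr") means neither fill nor stroke, so the
    text can be searched and selected but is never painted over the scan.
    `font_name` is assumed to exist in the page's font resources.
    """
    return (
        "BT\n"                               # begin text object
        f"/{font_name} {font_size:g} Tf\n"   # select font and size
        "3 Tr\n"                             # invisible rendering mode
        f"1 0 0 1 {x:g} {y:g} Tm\n"          # position at the word's box
        f"({escape_pdf_string(word)}) Tj\n"  # show the recognized text
        "ET"                                 # end text object
    )


# Hypothetical OCR result: a recognized word and the lower-left corner
# of its bounding box in page coordinates.
ops = invisible_text_ops("Invoice (copy)", 72, 700, 11)
print(ops)
```

Because the scanned image itself is never touched, a wrong recognition only affects search and copy results, not what the page looks like.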
Kind regards,
Dan McIntyre - Support Technician
Re: OCR changes English font to illegible characters
Hi,
TrackerSupp-Daniel wrote (Thu Mar 14, 2024 10:41 pm):
"OCR is not able to apply distortions to the text (yet), it simply finds the closest font available."
I also noticed that the EOCR engine can't match the exact font even with system fonts and high-quality images.
Here's a sample file created with Microsoft Office, then converted into an image file with the Microsoft Snipping Tool.
EOCR Settings
Before EOCR - Image file
After EOCR - Calibri and Segoe UI have become Arial
GIF - See Font in Text properties
As the engine used for OCR in PDFXCE is a trademark of ABBYY, I would understand if the Tracker Software devs can't fix that issue.
That said, it would be great if you could pass this on to ABBYY.
Thanks for investigating,
Major Stylus topics
- RemoveAnnotationsWithEraser T#6903
- MiniPopupMenuOnTextSelection T#6894
- AbnormalSpikes forum.pdf-xchange.com/viewtopic.php?p=179935&hilit=spikes#p179935
- ForceEraserPreview forum.pdf-xchange.com/viewtopic.php?t=42380
- Daniel - PDF-XChange
Re: OCR changes English font to illegible characters
Hello, Loki@99
Fonts like Calibri, Segoe UI, and Arial, which are extremely similar in most respects, will likely always have this issue...
As an example, I have adjusted the font size so the three match, and they look nearly identical, close enough that most differences could simply be caused by a good or poor quality scan. The greatest "hint" is that the letter g is slightly different in shape. Even to the human eye, these are barely distinguishable from one another, aside from a minor variance in the default "size" scaling and character spacing, both of which can be modified by external settings and could easily be unrelated to the original font in use.
OCR needs to allow a degree of "fuzziness" at all times, because it is designed to find the correct text in an imperfect situation, and assuming that an image is absolutely perfect will not go well.
While finding the closest match is indeed the goal, for all intents and purposes both Segoe UI and Calibri are more than close enough to Arial to get flagged as such, and it is more important that the correct text is found (as an example, the "g" mentioned before could be misrecognized as an offset "S" if OCR were too strict with character set matching rules).
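The trade-off described above can be sketched in a few lines of Python. This is a toy model, not the ABBYY engine's actual logic: the feature vectors are made-up numbers (not real font metrics), and the function name, the tolerance parameter, and the rule "prefer a default font when it is within tolerance of the best match" are all assumptions chosen only to illustrate why near-identical fonts can collapse to Arial.

```python
import math

# Made-up two-number descriptions of each font (for example an x-height
# ratio and a mean glyph width ratio). Illustrative values only.
FONT_FEATURES = {
    "Arial": (0.52, 0.55),
    "Calibri": (0.47, 0.49),
    "Segoe UI": (0.50, 0.54),
    "Courier New": (0.42, 0.60),
}


def pick_font(measured, tolerance, default="Arial"):
    """Pick a font for features measured from a noisy scan.

    The nearest font wins, unless the default font is within `tolerance`
    of that best distance, in which case the two are treated as
    indistinguishable and the default is returned.
    """
    distances = {
        name: math.dist(measured, features)
        for name, features in FONT_FEATURES.items()
    }
    best = min(distances, key=distances.get)
    if distances[default] - distances[best] <= tolerance:
        return default
    return best


# Hypothetical measurement taken from a scan of Segoe UI text.
scan = (0.50, 0.53)
strict = pick_font(scan, tolerance=0.0)   # exact nearest neighbour
fuzzy = pick_font(scan, tolerance=0.03)   # allows for scan noise
print(strict, "vs", fuzzy)
```

With zero tolerance the sketch returns the nearest font, but any realistic allowance for scan noise makes the similar fonts fall back to the default, while a clearly different font such as the monospaced one is still matched.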
Kind regards,
Dan McIntyre - Support Technician
Re: OCR changes English font to illegible characters
Hi,
After trying PDFXCE EOCR with various scanned PDFs, I have to conclude that Editable Output and Fine Page Content are too unreliable.
Apart from the font-changing issue, the way it changes characters, formatting (bold, italic...), bullet sizes, and symbols is a major problem.
The only solution is to be able to create "fake distorted" fonts like Adobe Acrobat's Editable OCR output, which I would say matches 99% of the original layout.
Attached are files illustrating the major issues: the original PDF, the PDFXCE EOCR output, and the Adobe Acrobat Editable OCR output at 600 DPI.
Thanks for considering that "fake font" output.
- theCornflower
- User
- Posts: 1
- Joined: Mon Oct 07, 2024 4:02 pm
Re: OCR changes English font to illegible characters
My best guess going over the submissions, Loki_991, is that the image size is such that the Enhanced OCR cannot distinguish font from image. I'd had better luck with larger images with text. My suspicion is that the OCR engine is treating this as purely a graphic image, as opposed to image text, though why it would not just leave it alone is unclear.
Re: OCR changes English font to illegible characters
Hi,
@theCornflower
theCornflower wrote (Mon Oct 07, 2024 4:07 pm):
"My best guess going over the submissions, Loki_991, is that the image size is such that the Enhanced OCR cannot distinguish font from image. I'd had better luck with larger images with text. My suspicion is that the OCR engine is treating this as purely a graphic image, as opposed to image text, though why it would not just leave it alone is unclear."
As I said, I already did the same with various PDFs, so it's not related to size. Hence I came to that conclusion.
The "fake distorted fonts" from Adobe Acrobat are the only solution.
- Daniel - PDF-XChange
Re: OCR changes English font to illegible characters
Hello, everyone
@Loki, Adobe has many of the same issues we do with recognizing the actual content, as a few quick examples show (two GIF animations). Both of these are from the Adobe side of your examples. While they do use a different method from ours to "emulate" retention of the original text, this is really just creating a new font, using a smoothed version of what was on the page as the glyph, and then assigning it to a recognized character (which may be incorrect). While it does "visually" appear correct to a human, it is still just as volatile; you just don't notice it right away because they are better at masking these mistakes.
Going back to my earlier statement, no OCR engine is perfect; ours is not an exception, and neither is Adobe's.
Regarding the color change you mentioned, I am quite colorblind (especially with the blue spectrum), so visually I see no difference, but confirming with a colleague whose eyes work properly, then grabbing the color picker and comparing, it does appear that the color is just a normalized value given the various colors of the rasterized text content. It would be impossible to find a perfect match given the variance in color over each pixel of the original image, so this is within an acceptable margin of difference.
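As a rough illustration of what a "normalized value" could mean here, the Python sketch below averages the colors of several pixels into one. This is an assumption about the general idea, not a description of what the OCR engine actually computes; the function name and the sample pixel values are hypothetical.

```python
def normalized_color(pixels):
    """Average a list of (r, g, b) tuples into one representative color."""
    if not pixels:
        raise ValueError("need at least one pixel")
    count = len(pixels)
    # Average each channel separately and round to the nearest integer.
    return tuple(
        round(sum(pixel[channel] for pixel in pixels) / count)
        for channel in range(3)
    )


# Hypothetical samples from rasterized, anti-aliased dark blue content.
samples = [(10, 20, 120), (30, 40, 140), (18, 28, 128), (22, 32, 132)]
result = normalized_color(samples)
print(result)  # (20, 30, 130)
```

Note that the averaged color does not equal any single sampled pixel, which is why a color picker can show a small difference between the original and the OCR output.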
Finally, the missing content with "fine page content" mode. This could be an issue, but to my understanding it is expected that some (non-OCRable) content goes missing during this operation. As with the other items, I will be passing this along to the Dev team for review nonetheless. To my knowledge Adobe does not offer a fine page content option, so comparing it to the "editable text and images" option is more accurate.
@Cornflower, these items are recognized as text because they are inline with text, nearly the same size as the text content, and the shapes are close enough to letters to be caught, even if what is recognized is the border around the letter and not the actual content.
As with any report of trouble in the OCR, because it is not perfect, there is always room for improvement. All of this will be moving up to the Dev team, who will communicate with ABBYY's team and see what can be done to help. But as the OCR engine is provided to us by a third party, I cannot promise any timelines on fixes or feature improvements.
Kind regards,
Dan McIntyre - Support Technician
Re: OCR changes English font to illegible characters
Hi Daniel,
TrackerSupp-Daniel wrote (Mon Oct 07, 2024 7:17 pm):
"@Loki, Adobe has many of the same issues we do with recognizing the actual content, as a few quick examples"
You're right. I totally agree that no OCR is perfect at recognition. But what I'm talking about here doesn't really concern recognition accuracy.
I'm fine with PDFXCE and Adobe Acrobat having wrong character recognition if at least the content is "visually" the same (which is not the case with PDFXCE Editable or Fine Page Content output). In other words, I prioritize "visual fidelity" over recognition accuracy.
Another good point of Adobe Acrobat's "fake fonts" is that they significantly reduce the file size.
TrackerSupp-Daniel wrote (Mon Oct 07, 2024 7:17 pm):
"Regarding the color change you mentioned, I am quite colorblind (especially with the blue spectrum), so visually I see no difference, but confirming with a colleague whose eyes work properly, then grabbing the color picker and comparing, it does appear that the color is just a normalized value given the various colors of the rasterized text content"
I was pointing out the color of the table border, which should be red, not the text.
TrackerSupp-Daniel wrote (Mon Oct 07, 2024 7:17 pm):
"As with any report of trouble in the OCR, because it is not perfect, there is always room for improvement. All of this will be moving up to the Dev team, who will communicate with ABBYY's team and see what can be done to help. But as the OCR engine is provided to us by a third party, I cannot promise any timelines on fixes or feature improvements."
Unfortunately, I don't think ABBYY's team will be of much help with this. I had a copy of ABBYY FineReader 16 back then and had the same issue. I tried Foxit PDF OCR and had the same issue.
It looks like only Adobe Acrobat has a feature that prioritizes "visual fidelity". It was called ClearScan in Adobe Acrobat XI and earlier, but they renamed it to Editable output.
I'm aware that I can simply use PDFXCE Searchable output, but as I said, the way Adobe Acrobat handles it drastically reduces the file size and adds some clarity as well, because the fonts are vectorized.
Thus, what I would like you to pass on to the Dev team is not recognition accuracy but that "visual fidelity" with "fake fonts" output.
Having that feature in PDFXCE would be fantastic.
Thank you,
- Daniel - PDF-XChange
Re: OCR changes English font to illegible characters
Hello, Loki@99
Unfortunately, that is not a feature we can offer at this time. It is something the OCR engine would need to develop, and we would then need to update the version of the engine we use in order to benefit from it. Our OCR team is well aware of the demand for it, but we cannot promise it will be coming to the Editor in any discernible timeframe.
Kind regards,
Dan McIntyre - Support Technician