Enhanced OCR quality tuning
-
- User
- Posts: 325
- Joined: Wed Feb 09, 2011 1:06 pm
Enhanced OCR quality tuning
Hi, I tried to OCR some pages and was surprised that the engine very often replaces certain common characters with special characters. Another point is that the result can contain many different font faces and font sizes within a single paragraph; maybe an additional filter could reduce the number of fonts used to a minimum.
The attached example also shows that it is not easy to decide whether the medium or high quality setting should be used. The first page (high quality) shows nearly perfect text but drops an image; page two (medium quality) shows multiple errors in text content and placement, as well as some font variations.
-
- Site Admin
- Posts: 11010
- Joined: Wed Jan 03, 2018 6:52 pm
Re: Enhanced OCR quality tuning
Hi, Puffolino
Could I ask you to send us a copy of the original, before OCR was performed on this page, so we can run a few tests here and see what is happening? Also, you mentioned specifically running this on "high" and "medium". How does the "auto" quality level work for you? And regarding the disappearing image, what happens on your end if you enable the "ignore text in graphics" option?
Kind regards,
Dan McIntyre - Support Technician
PDF-XChange Co. LTD
+++++++++++++++++++++++++++++++++++
Our Web site domain and email address has changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com
-
- User
- Posts: 325
- Joined: Wed Feb 09, 2011 1:06 pm
Re: Enhanced OCR quality tuning
Hi Daniel,
I have not found the perfect setting so far. I was also surprised that the "Fine Page Content" output created a file more than 6 times larger than the original image file.
I've uploaded the original file 'r.pdf' and the new output file 'r (auto).pdf' as well.
Cheers.
-
- Site Admin
- Posts: 11010
- Joined: Wed Jan 03, 2018 6:52 pm
Re: Enhanced OCR quality tuning
Hi, Puffolino
My apologies, but I do not see the new sample files you say you've uploaded. Where are they?
Kind regards,
Dan McIntyre - Support Technician
-
- User
- Posts: 325
- Joined: Wed Feb 09, 2011 1:06 pm
Re: Enhanced OCR quality tuning
Oh, didn't mention that - sorry.
The files are in the Temp (OCR) directory of the user server for uploads.

-
- Site Admin
- Posts: 11010
- Joined: Wed Jan 03, 2018 6:52 pm
Re: Enhanced OCR quality tuning
Hi, Puffolino
Understood, I will go look there. Out of curiosity, is there a reason you didn't simply attach them to your forum post as you did earlier? They are certainly small enough that we could have placed them here (and if you are not against it, I would like to attach them to one of these posts to keep the topic complete and self-contained).
Kind regards,
Dan McIntyre - Support Technician
-
- Site Admin
- Posts: 11010
- Joined: Wed Jan 03, 2018 6:52 pm
Re: Enhanced OCR quality tuning
Hi again,
I have finished looking into the files here. The Auto mode produced a fairly good quality result: I could not identify any errors in the text, aside from some sections of text being slightly taller than in the original image. I did, however, see what you mean about the vastly increased size, and have created a ticket on that matter for you:
RT#5515: EOCR Auto-quality increases image size substantially
Kind regards,
Dan McIntyre - Support Technician
-
- User
- Posts: 325
- Joined: Wed Feb 09, 2011 1:06 pm
Re: Enhanced OCR quality tuning
It seems that the Auto result is far better than the other settings. There's still one point I couldn't fix: all hyphenation delimiters are lost ('Zahlen- system' -> 'Zahlen system' and so on)...
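To show why the lost hyphen matters for anyone post-processing the extracted text, here is a minimal Python sketch. It is not part of PDF-XChange and says nothing about how its OCR engine works; the helper name `rejoin_hyphenated` and the naive regular expression are assumptions made for illustration. When the hyphen survives at the line break, the word can be rejoined; once it is dropped, the two halves look like separate words.

```python
import re

# Hypothetical helper (not a PDF-XChange API): rejoin a word that was split
# by a hyphen at a line break. Deliberately naive: it would also merge
# legitimate hyphens that happen to fall at a line end.
_LINE_BREAK_HYPHEN = re.compile(r"(\w)-\s*\n\s*(\w)")


def rejoin_hyphenated(text: str) -> str:
    """Remove a line-break hyphen and join the two word halves."""
    return _LINE_BREAK_HYPHEN.sub(r"\1\2", text)


with_hyphen = "Zahlen-\nsystem"      # hyphen preserved by OCR
without_hyphen = "Zahlen\nsystem"    # hyphen lost by OCR

print(rejoin_hyphenated(with_hyphen))          # Zahlensystem
print(repr(rejoin_hyphenated(without_hyphen)))  # 'Zahlen\nsystem' (unchanged)
```

With the hyphen gone there is no marker left to tell a split word from two adjacent words, so the fix has to happen in the OCR output itself.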
-
- Site Admin
- Posts: 11010
- Joined: Wed Jan 03, 2018 6:52 pm
Re: Enhanced OCR quality tuning
Hi, Puffolino
Thank you for that pointer. After looking again, I do see this happening in the file you sent me. I'll make a note of that for the Dev team to look into as well.
Kind regards,
Dan McIntyre - Support Technician
-
- User
- Posts: 325
- Joined: Wed Feb 09, 2011 1:06 pm
Re: Enhanced OCR quality tuning
Thanks for that as well 

-
- User
- Posts: 578
- Joined: Mon Sep 13, 2021 8:12 am
Re: Enhanced OCR quality tuning
With low contrast and a non-uniform background, the OCR quality in Auto mode is unsatisfactory (examples attached).
-
- Site Admin
- Posts: 11010
- Joined: Wed Jan 03, 2018 6:52 pm
Re: Enhanced OCR quality tuning
Hello, Jensen Head
Low contrast areas are notably difficult, as are items like "colored text" (especially on a similarly colored background). The DPI for the images is also 72, which is well below the range of around 200-600 DPI at which OCR operates best, and the physical page size is inflating the OCR process considerably.
With all that said, there is a process to get a good result here.
1. Use the "Colorize Document" tool to change the inherent coloring of this document to a black focus (giving it a greyscale tone, and removing one barrier for the OCR engine).
2. Then use the "Enhance scans" feature, with everything disabled except background removal (set to Auto) and descreen (set to High), to greatly improve the contrast for these paragraphs (another OCR barrier).
3. Resize and normalize the page down to Letter or A4, increasing the DPI to above 350 (yet another way to help OCR).
4. Finally, process it through OCR.
In the end, the document came out very nice.
After this testing, and some discussion with the dev team, we did identify one key issue from inspecting your document. There will be times when a document like yours is actually a decently high DPI image but, for some reason, is improperly scaled. In this case, it is ~5x the "intended" sheet size, making the DPI seem low, which confuses the OCR engine and gives a poorer result than it should otherwise be able to.
We are not sure how to address this issue, since the ratio and optimal range will vary greatly based on the range of font sizes, but we are looking into that particular aspect now.
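The scaling arithmetic above can be sketched in a few lines of Python. This only illustrates the numbers mentioned in this post (72 DPI, a ~5x oversized page, the roughly 200-600 DPI range); the function names are hypothetical and this is not how the OCR engine itself evaluates a page.

```python
# Illustrative sketch only; function names are assumptions, not a PDF-XChange API.

def dpi_after_rescale(declared_dpi: float, oversize_factor: float) -> float:
    """DPI of the same pixels once the page is shrunk by `oversize_factor`.

    The pixel count does not change; only the physical size it is spread
    over does, so the DPI scales up by the same factor.
    """
    if declared_dpi <= 0 or oversize_factor <= 0:
        raise ValueError("DPI and scale factor must be positive")
    return declared_dpi * oversize_factor


def in_ocr_range(dpi: float, low: float = 200.0, high: float = 600.0) -> bool:
    """True if `dpi` falls in the roughly 200-600 DPI range quoted above."""
    return low <= dpi <= high


declared = 72.0   # DPI reported for the oversized page
factor = 5.0      # page is ~5x the intended sheet size

corrected = dpi_after_rescale(declared, factor)
print(declared, in_ocr_range(declared))    # 72.0 False
print(corrected, in_ocr_range(corrected))  # 360.0 True
```

This matches step 3: shrinking the page to its intended size moves the same image from 72 DPI to about 360 DPI, above the 350 DPI target.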
Kind regards,
Dan McIntyre - Support Technician
-
- User
- Posts: 578
- Joined: Mon Sep 13, 2021 8:12 am
"Colorize Document" + "Convert Colors" or "Convert Colors" only?
Thank you for the detailed instructions! It worked for the rest of the pages of this document, as well as for other similar ones (for example, scanned old books with darkened pages). It's a pity that in order to preserve the original appearance of the document and recode the original images, I first have to create a copy of the document, then perform the transformations you listed, then delete all non-text objects, convert the page sizes to the previous one (in general, all pages have different sizes), and then combine the documents using the "Overlay" command. But I am satisfied with the result.
A clarifying question about converting raster images to grayscale (or in other situations when, for example, it is necessary to reduce the document size by removing excess color information): does it make sense to spend time on "Colorize Document" with the solid black color specified (before using the "Convert Colors"), or is it enough to use only the "Convert Colors" command? After all, "Convert Colors" also converts colors to grayscale (and, additionally, replaces the color space / profile).
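For readers wondering what a grayscale conversion does to colored text, here is a small Python sketch using the common Rec. 601 luma weights. This is an assumption for illustration only: nothing in this thread says which formula "Convert Colors" or "Colorize Document" actually uses, and the function names and sample colors are made up.

```python
# Illustrative sketch; Rec. 601 weights are one common choice, not
# necessarily what PDF-XChange uses internally.

def to_gray(r: int, g: int, b: int) -> int:
    """Convert an 8-bit RGB color to an 8-bit gray level (Rec. 601 luma)."""
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def gray_contrast(foreground: tuple, background: tuple) -> int:
    """Absolute difference in gray level between two RGB colors."""
    return abs(to_gray(*foreground) - to_gray(*background))


# Hypothetical colors: dark red text on a dark green background looks
# distinct in color but nearly identical once reduced to gray.
text_color = (150, 0, 0)
page_color = (0, 80, 0)

print(to_gray(*text_color), to_gray(*page_color))
print(gray_contrast(text_color, page_color))
print(gray_contrast((0, 0, 0), (255, 255, 255)))  # black on white: 255
```

A plain grayscale conversion can leave colored text on a similarly bright background with almost no contrast, which is one reason an extra contrast-oriented step may still help on difficult scans.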
-
- Site Admin
- Posts: 19853
- Joined: Mon Jan 12, 2009 8:07 am
Re: Enhanced OCR quality tuning
Hello Jensen Head,
Glad to hear that you are getting good results!
If you are happy with the result you get from "Convert Colours" alone, then "Colorize Document" can indeed be skipped. It is up to you to decide whether that step is necessary in your workflow, given the quality of your initial scans.
Kind regards,
Stefan