images with text
-
- User
- Posts: 32
- Joined: Sun Dec 25, 2022 7:28 pm
images with text
another question for OCR - we seem to be losing a lot when the OCR tries to recognize/index text and graphics inside an image.
here are some examples. is there a way to mark areas around images to exclude them from OCR, and just have the marked area carried over as an image into the OCR'd document?
-
- Site Admin
- Posts: 2432
- Joined: Mon Jan 15, 2018 9:01 am
Re: images with text
Hello, makesdocs,
May I ask you to send us the original file?
You can skip this one page from the automatic OCR and then use the manual method to recognize only the areas you want:
Regards.
-
- User
- Posts: 1825
- Joined: Sat Sep 11, 2021 5:04 am
Re: images with text
Hi Dav,
How about covering the areas you don't want to be recognized with the rectangle comment tool? If you uncheck "Ignore comments on page" in OCR dialog, the text hidden under the comments will not be recognized.
Best regards,
rakunavi
TOP desires for PDFXCE
forum.pdf-xchange.com/viewtopic.php?t=39665 LassoTool
forum.pdf-xchange.com/viewtopic.php?t=38554 CmtGarbled
forum.pdf-xchange.com/viewtopic.php?t=37353 FulScrMultiMon
forum.pdf-xchange.com/viewtopic.php?t=41002 DisableTouchSelect
-
- Site Admin
- Posts: 19913
- Joined: Mon Jan 12, 2009 8:07 am
Re: images with text
Hello rakunavi,
Thanks for the suggestion. I've not thought of this approach!
I hope @makesdocs finds both suggestions above useful!
Kind regards,
Stefan
-
- User
- Posts: 32
- Joined: Sun Dec 25, 2022 7:28 pm
Re: images with text
good idea that may well work - I'll try that
we aren't working on this reference material capture full time, but every few days scan a chapter.
now we are seeing what is the art of the possible after 3 chapters, we are going to need to update the process to allow for all the image and text editing etc...
-
- Site Admin
- Posts: 7388
- Joined: Wed Mar 25, 2009 10:37 pm
Re: images with text
Good luck and be sure to let us know if you need help. I am keen to hear how it works out when all is said and done.
Best regards
Paul O'Rorke
PDF-XChange Support
http://www.pdf-xchange.com
-
- User
- Posts: 32
- Joined: Sun Dec 25, 2022 7:28 pm
Re: images with text
not sure if I am adding comments properly as per rakunavi's suggestion, but here's what has happened so far:
I added a comment text box to cover the image.
I then OCR'd the image with the Ignore Comments box UNCHECKED, as per guidance.
I then went to the newly OCR'd file (Tab Group on the right in the image below) and looked at what was behind the comment box.
I found a few (four, I think) blank (text?) boxes seemingly inserted over the image.
I removed these 4 artifacts and the image is clear, no editing needed, so that is super helpful!
any idea as to why these empty boxes are being added?
I'm setting the comment boxes so that the OCR scans and indexes each figure name/#/description, making them searchable, while the images stay clear and unindexed, as indexing them isn't needed anyway IMO.
very cool
-
- Site Admin
- Posts: 11583
- Joined: Wed Jan 03, 2018 6:52 pm
Re: images with text
Hello, makesdocs
Could I ask you to send us a copy of this document to take a look at? It is certainly odd that new invisible text boxes are being added by the "Editable" EOCR process while a blank shape is on top. If anything, I would expect to see visible text within those boxes, but in practice, they simply shouldn't appear at all.
If the file contains sensitive information you do not want to post on the public forums, you can email us with a link to this post, via support@pdf-xchange.com
Kind regards,
Dan McIntyre - Support Technician
PDF-XChange Co. LTD
+++++++++++++++++++++++++++++++++++
Our Web site domain and email address has changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com
-
- User
- Posts: 32
- Joined: Sun Dec 25, 2022 7:28 pm
Re: images with text
attached here is 1 file - the OCR'd version of that file (see settings in my previous post - they haven't changed) WITHOUT any edits post OCR
the 2nd is the prepped PDF (comment boxes added to exclude the images from analysis by the OCR engine) - this is sent via the file upload utility
-
- Site Admin
- Posts: 11583
- Joined: Wed Jan 03, 2018 6:52 pm
Re: images with text
Hello, makesdocs
It seems you uploaded the same file twice, but from what I see here, these are not text boxes, they are "shapes". I do not believe that our OCR adds shapes to the document, are you certain that there were not already blank spaces in this file before OCR processing occurred?
Kind regards,
Dan McIntyre - Support Technician
-
- User
- Posts: 32
- Joined: Sun Dec 25, 2022 7:28 pm
Re: images with text
no, not sure at all
the files look alike, but the pre and post OCR files really are quite different... if I didn't provide you with 1 of each for a single chapter, I'll redo it if you like
we are scanning the source content (bound book) on a flatbed with NAPS2, which produces 1 TIF file per page, a single chapter at a time
running an image cleanup & cropping process with XnConvert on these files
then combining all the TIF images into a single PDF using Pixillion
then opening that PDF with PDF-XChange to rotate every other page
so there is a lot of opportunity for artifacts to get introduced... it might be the Pixillion PDF conversion, since the TIFF files are just images...
-
- Site Admin
- Posts: 11583
- Joined: Wed Jan 03, 2018 6:52 pm
Re: images with text
Hello, makesdocs
Oh no, sorry, I mean it looks like you mistakenly uploaded the exact same file twice. They have the same name, file size, etc., and when opening them side by side, both have all the same details, and the same metadata information, like modification date. It seems that you simply dragged the same file twice, even if you had two separate files that you intended to send.
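A quick way to confirm that two files really differ before uploading them is to compare their sizes and content hashes. Below is a minimal Python sketch of that check, assuming local copies of both PDFs; the function names are made up for illustration and have nothing to do with PDF-XChange itself:

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in blocks so large scanned PDFs are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True when both files hold exactly the same bytes."""
    a, b = Path(path_a), Path(path_b)
    # Different sizes can never be identical, so skip hashing in that case.
    if a.stat().st_size != b.stat().st_size:
        return False
    return file_digest(a) == file_digest(b)
```

Calling `same_content("first.pdf", "second.pdf")` (placeholder file names) returns `True` only for byte-identical files, which is a stricter test than matching names or modification dates.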
Kind regards,
Dan McIntyre - Support Technician
-
- User
- Posts: 32
- Joined: Sun Dec 25, 2022 7:28 pm
Re: images with text
oh I see what you are saying.
yeah I double dragged to the forum thread page, for 2 uploads. but isn't the file at the upload page different?
1) the double copy to the forum page was titled "Chap1-Rotated OCRhigh_noedits.pdf"
a PDF file, pages all rotated up
2) the upload file was titled "Chap1-Rotated_OCRprep.pdf"
same PDF file, pages all rotated up, but with comment blocks added to get OCR to ignore the page graphics/images...
-
- Site Admin
- Posts: 11583
- Joined: Wed Jan 03, 2018 6:52 pm
Re: images with text
Hello, makesdocs
Unfortunately, I only see the same as you do above; we seem to only have the "no-edits" version of the file. Could you try uploading the other file again, or send it to us via email with a link to this post? Our email address is support@pdf-xchange.com
Kind regards,
Dan McIntyre - Support Technician
-
- User
- Posts: 32
- Joined: Sun Dec 25, 2022 7:28 pm
Re: images with text
hey Paul - I've uploaded the post OCR file, WITH the edits.
edits are:
updating inline text as needed. a few times so far it's been easier to take a small snapshot of a formula or fraction that is inline with text and paste it in, because the image behind the OCR text had an artifact; the small snapshot hides that artifact and leaves a readable word/equation/fraction etc...
edits in our vernacular can also be a snapshot of a diagram, if it was hopelessly munged, or if we were lucky, just a removal of the comment box and any artifact white boxes that were in front of the image...
-
- Site Admin
- Posts: 7388
- Joined: Wed Mar 25, 2009 10:37 pm
Re: images with text
Hi makesdocs,
The annotation over the base content is indeed easily removed, showing the underlying content.
You can select an image and make it "Base Content", and in that case it does cover what was underneath.
Will that give you what you seek?
Best regards
Paul O'Rorke
PDF-XChange Support
http://www.pdf-xchange.com
-
- User
- Posts: 32
- Joined: Sun Dec 25, 2022 7:28 pm
Re: images with text
"Will that give you what you seek?"
I don't know, but Im going to try that, see if it works lol!
thanks again
David
-
- User
- Posts: 32
- Joined: Sun Dec 25, 2022 7:28 pm
Re: images with text
so I've had a chance to do some further investigation
the "artifacts" (empty? boxes that I think you are calling annotations) are for sure only coming up via the OCR effort. they don't exist anywhere in the capture / image refinement / PDF conversion / page rotation process chain - they only arise after the OCR. This happens both with and without comment boxes placed over images to exclude them from OCR.
some observations/questions:
1) even with a comment box covering an image completely, sometimes a portion of the image still has OCR applied to it... it doesn't seem to be consistent across images, or even within a page, without changing settings... - so far the solution, for these instances, is to just place a copy of the original image over it
2) I wasn't getting anywhere trying the "zonal OCR" to capture formulas... do I do this BEFORE I OCR the entire document? if so, how does that output get merged into the document?
3) similarly, I wasn't having any luck using image select/make base content to flatten selected comments. is this performed on the "discovered" annotations post OCR, and does it somehow bring the image to the foreground?
lastly, I purchased a license for the tool, since it's great and your support has also been terrific. one last question though:
4) if we have a document that was processed with a demo version, that has the DEMO VERSION stamps in the upper corners of the document, how do we get rid of those now that we have a fully licensed copy?
thanks again for all your help
best
David
-
- Site Admin
- Posts: 19913
- Joined: Mon Jan 12, 2009 8:07 am
Re: images with text
Hello makesdocs,
I will use the same notation as you did in the post above to keep my replies easy to follow
1) Can you prepare a sample for this - we'd like to see if we can reproduce this!
2) Yes - if you do the formula zonal OCR before OCRing the rest of the file - these portions of your page would no longer contain images, and as such would be skipped by the main OCR process that is run on the whole file (and maybe does not include the 'Mathematical Formulas' language).
3) Flattening comments into base content objects is a process of its own, and you should be able to flatten annotations at any point - but the flattened objects would then be treated differently from actual annotations when you run the OCR and it tries to determine what to skip.
4) Try the Watermarks -> Remove All... inside the Editor and see if that helps!
Kind regards,
Stefan
-
- User
- Posts: 32
- Joined: Sun Dec 25, 2022 7:28 pm
Re: images with text
you folks are so responsive! thanks for the additional clarity I'll give these a shot and report back.
-
- User
- Posts: 32
- Joined: Sun Dec 25, 2022 7:28 pm
Re: images with text
I will upload the files for you all to use to try out and replicate our issues later tonight or tomorrow
for now, here is a document with screenshots from the scanning effort that shows you step by step what was happening
1st section is an example of my attempt at "zonal" OCR - not really working
2nd section is an example of inconsistent OCR occurring on images
2a: "in-image text" not being consistently excluded from the OCR process despite being under the comment box area
2b: the OCR process also creating quite a few (empty? annotations?) artifact boxes laid over the image, which have to be removed to be able to fully view the image in the OCR'd file
-
- User
- Posts: 32
- Joined: Sun Dec 25, 2022 7:28 pm
Re: images with text
hey folks - got the example files that match the previous post's example document, so you all can try to replicate/examine what is going on
1) Chap6-Rotated
this is the scanned images, massaged for luminosity etc., pages rotated to all be in correct orientation.
2) Chap6-Rotated_OCRprep
this is the Chap6-Rotated file, with comment text boxes added throughout the document in order to try and exclude the underlying content from OCR (let the image of the actual content be included)
3) Chap6-Rotated_OCRprephigh_edited
this is the Chap6-Rotated_OCRprephighedited file, that has been OCR'd using "high" OCR setting, and then undergone the process of removing all comment boxes, "annotation artifacts" and all sorts of editing to try and clean up OCR content and/or re-insert original content as images to maintain final output integrity with input document
-
- Site Admin
- Posts: 19913
- Joined: Mon Jan 12, 2009 8:07 am
Re: images with text
Hello makesdocs,
Apologies for the delay in getting back to you on this topic!
I've had a go at your sample files - and indeed nothing seems to be the "fit all" solution.
Your files are scanned books with text and graphics in them, and there are both formulas and text over images (which you'd like to not process).
So I can't really come up with a universal solution that would allow you to process e.g. 20 such books within a few hours.
But if it's a one off - then a semi manual approach might give you the best results.
You can e.g. make copies of pages that have those graphics - and export those to a separate file.
Then OCR the original, remove 'messed up' pages where the OCR did affect your images, and then re-import the copies of those pages from the other file.
Once that is done - you can then do manual OCR on only those complex pages - e.g. you can try the "zonal OCR" - but this time on the actual text blocks, so that only they are processed.
And finally, manually type in those formulas that are in the main text if necessary.
That is the workflow I could come up with that will give the best possible result while converting as much of the scanned content as possible to actual text, but it is definitely not an automated approach.
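For keeping track of which pages go through which route, a small bookkeeping helper can be handy. The Python sketch below is only an illustration: it does not drive the Editor in any way, the function names are invented, and the "1-3,7" page-range format is an assumption about what you would type into a page range field.

```python
def plan_ocr_batches(page_count, graphic_pages):
    """Split 1-based page numbers into (automatic, manual) OCR batches."""
    # Pages with graphics are handled manually; ignore out-of-range numbers.
    manual = sorted({p for p in graphic_pages if 1 <= p <= page_count})
    manual_set = set(manual)
    automatic = [p for p in range(1, page_count + 1) if p not in manual_set]
    return automatic, manual


def to_ranges(pages):
    """Compress sorted page numbers into a string such as '1-3,7'."""
    parts = []
    start = prev = None
    for page in pages:
        if start is None:
            start = prev = page
        elif page == prev + 1:
            prev = page  # still inside a consecutive run
        else:
            parts.append(str(start) if start == prev else f"{start}-{prev}")
            start = prev = page
    if start is not None:
        parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)


# Example: a 10-page chapter where pages 4, 7 and 8 contain graphics.
automatic, manual = plan_ocr_batches(10, [4, 7, 8])
print("automatic OCR:", to_ranges(automatic))
print("manual/zonal OCR:", to_ranges(manual))
```

The list of graphic pages still has to be compiled by hand while paging through the scan; the helper only turns it into two page ranges.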
Kind regards,
Stefan