images with text

Forum for the PDF-XChange Editor - Free and Licensed Versions


makesdocs
User
Posts: 32
Joined: Sun Dec 25, 2022 7:28 pm

images with text

Post by makesdocs »

another question for OCR - we seem to be losing a lot when the OCR tries to recognize/index text and images in an image.

here are some examples:
image.png
image(1).png
is there a way to mark areas around images to exclude them from OCR, and just have the marked area translated as an image to the OCR document?
Dimitar - PDF-XChange
Site Admin
Posts: 2432
Joined: Mon Jan 15, 2018 9:01 am

Re: images with text

Post by Dimitar - PDF-XChange »

Hello, makesdocs,

May I ask you to send us the original file?

makesdocs wrote: is there a way to mark areas around images to exclude them from OCR, and just have the marked area translated as an image to the OCR document?

You can skip this one page from the automatic OCR and then use the manual method to recognize only the areas you want:
image.png




Regards.
rakunavi
User
Posts: 1825
Joined: Sat Sep 11, 2021 5:04 am

Re: images with text

Post by rakunavi »

Hi Dav,
makesdocs wrote: Wed Dec 28, 2022 7:52 am is there a way to mark areas around images to exclude them from OCR, and just have the marked area translated as an image to the OCR document?
How about covering the areas you don't want to be recognized with the rectangle comment tool? If you uncheck "Ignore comments on page" in OCR dialog, the text hidden under the comments will not be recognized.
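
For anyone who wants to sanity-check cover placement before running OCR, here is a minimal Python sketch of the geometry only. It assumes you can read off the bounds of each figure and each comment rectangle in the same page units; the Rect type, the function names, and the margin idea are all hypothetical and are not part of PDF-XChange Editor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle in page units (hypothetical coordinates)."""
    left: float
    bottom: float
    right: float
    top: float


def covers(cover: Rect, region: Rect, margin: float = 0.0) -> bool:
    """True if `cover` contains `region` with at least `margin` spare on every side."""
    return (
        cover.left <= region.left - margin
        and cover.bottom <= region.bottom - margin
        and cover.right >= region.right + margin
        and cover.top >= region.top + margin
    )


def uncovered_regions(cover_rects, regions, margin: float = 0.0):
    """Return the regions that no single cover rectangle fully contains."""
    return [
        region
        for region in regions
        if not any(covers(cover, region, margin) for cover in cover_rects)
    ]
```

A figure that comes back from uncovered_regions is one whose comment rectangle should probably be enlarged before OCR. Whether a small margin actually changes what the OCR engine picks up is an assumption that would need testing.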

Best regards,
rakunavi
TOP desires for PDFXCE
forum.pdf-xchange.com/viewtopic.php?t=39665 LassoTool
forum.pdf-xchange.com/viewtopic.php?t=38554 CmtGarbled
forum.pdf-xchange.com/viewtopic.php?t=37353 FulScrMultiMon
forum.pdf-xchange.com/viewtopic.php?t=41002 DisableTouchSelect
Stefan - PDF-XChange
Site Admin
Posts: 19913
Joined: Mon Jan 12, 2009 8:07 am

Re: images with text

Post by Stefan - PDF-XChange »

Hello rakunavi,

Thanks for the suggestion. I've not thought of this approach!
I hope @makesdocs finds both suggestions above useful!

Kind regards,
Stefan
makesdocs
User
Posts: 32
Joined: Sun Dec 25, 2022 7:28 pm

Re: images with text

Post by makesdocs »

good idea that may well work - I'll try that

we aren't working on this reference material capture full time, but every few days scan a chapter.

now that we are seeing what is the art of the possible after 3 chapters, we are going to need to update the process to allow for all the image and text editing etc...
Paul - PDF-XChange
Site Admin
Posts: 7388
Joined: Wed Mar 25, 2009 10:37 pm

Re: images with text

Post by Paul - PDF-XChange »

Good luck and be sure to let us know if you need help. I am keen to hear how it works out when all is said and done.
Best regards

Paul O'Rorke
PDF-XChange Support
http://www.pdf-xchange.com
makesdocs
User
Posts: 32
Joined: Sun Dec 25, 2022 7:28 pm

Re: images with text

Post by makesdocs »

not sure if I am adding comments properly as per rakunavi's suggestions, but here's what happened so far:

I added a comment text box to cover the image. I then OCR'd the image with the Ignore Comments box UNCHECKED as per guidance
image(1).png
I then went to the newly OCR'd file (Tab Group on the right in below image), and looked at what was behind the comment box
image.png
found a few (four, I think) blank (text?) boxes seemingly inserted over the image
image(2).png
image(3).png
image(4).png
image(5).png
removed these 4 artifacts and the image is clear, no editing needed, so that is super helpful!

any idea as to why these empty boxes are being added?

I'm setting the comment boxes such that the OCR scans and indexes each figure name/#/description, so they are searchable, but the images will be clear and not indexed, as that's not needed anyway IMO.
very cool
image(6).png
Daniel - PDF-XChange
Site Admin
Posts: 11577
Joined: Wed Jan 03, 2018 6:52 pm

Re: images with text

Post by Daniel - PDF-XChange »

Hello, makesdocs

Could I ask you to send us a copy of this document to take a look at? It is certainly odd that new invisible text boxes are being added by the "Editable" EOCR process while a blank shape is on top. If anything, I would expect to see visible text within those boxes, but in practice, they simply shouldn't appear at all.

If the file contains sensitive information you do not want to post on the public forums, you can email us with a link to this post, via support@pdf-xchange.com

Kind regards,
Dan McIntyre - Support Technician
PDF-XChange Co. LTD

+++++++++++++++++++++++++++++++++++
Our Web site domain and email address has changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com
makesdocs
User
Posts: 32
Joined: Sun Dec 25, 2022 7:28 pm

Re: images with text

Post by makesdocs »

attached is 1 file - the OCR'd version of that file (see settings in my previous post - they haven't changed) WITHOUT any edits post OCR (attached herein)

the 2nd is the prepped pdf (added comment boxes to exclude image analysis by OCR engine) - this is sent via the file upload utility
Chap1-Rotated OCRhigh_noedits.pdf
image.png
Daniel - PDF-XChange
Site Admin
Posts: 11577
Joined: Wed Jan 03, 2018 6:52 pm

Re: images with text

Post by Daniel - PDF-XChange »

Hello, makesdocs

It seems you uploaded the same file twice, but from what I see here, these are not text boxes, they are "shapes". I do not believe that our OCR adds shapes to the document, are you certain that there were not already blank spaces in this file before OCR processing occurred?

Kind regards,
Dan McIntyre - Support Technician
PDF-XChange Co. LTD

makesdocs
User
Posts: 32
Joined: Sun Dec 25, 2022 7:28 pm

Re: images with text

Post by makesdocs »

no, not sure at all

the files look alike, but the pre and post OCR files really are quite different... if I didn't provide you with 1 of each for a single chapter, I'll redo it if you like :)

we are scanning the source content (bound book) on a flatbed with NAPS2, which is 1 TIF file per page, a single chapter at a time
running an image cleanup & cropping process with XnConvert on these files
then combining all the TIF images into a single PDF using Pixillion
then opening that PDF with PDF-XChange to rotate every other page

so there is a lot of opportunity for artifacts to get introduced... it might be the Pixillion PDF conversion, since the TIFF files are just images...
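
The capture chain above (one TIF per page, then a rotation of every other page) depends on keeping the pages in scan order. Here is a minimal Python sketch of that bookkeeping. It only builds a plan and does not call NAPS2, XnConvert, Pixillion, or PDF-XChange. The file names, the 180 degree angle, and the choice of even pages are assumptions, since the thread does not state them.

```python
import re


def natural_key(name: str):
    """Sort key that orders 'page10.tif' after 'page2.tif'."""
    return [
        int(part) if part.isdigit() else part.lower()
        for part in re.split(r"(\d+)", name)
    ]


def rotation_plan(filenames, angle: int = 180, rotate_even: bool = True):
    """Return (page_number, filename, rotation) tuples in scan order.

    Every other page gets `angle`; the rest get 0. Which half of the pages
    needs rotating is an assumption controlled by `rotate_even`.
    """
    ordered = sorted(filenames, key=natural_key)
    plan = []
    for page_number, name in enumerate(ordered, start=1):
        is_even = page_number % 2 == 0
        rotation = angle if is_even == rotate_even else 0
        plan.append((page_number, name, rotation))
    return plan
```

Writing the plan down before combining the images makes it easier to tell later whether a defect was already present in a given source page or only appeared after a later step.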
Daniel - PDF-XChange
Site Admin
Posts: 11577
Joined: Wed Jan 03, 2018 6:52 pm

Re: images with text

Post by Daniel - PDF-XChange »

Hello, makesdocs
makesdocs wrote: Thu Dec 29, 2022 8:21 pm the files look alike, but the pre and post OCR files really are quite different...
Oh no sorry, I mean it looks like you mistakenly uploaded the exact same file twice, they have the same name, file size, etc, and when opening them side by side, both have all the same details:
image.png
And the same metadata information, like modification date:
PDFXEdit_bLXHeHRUM3.gif
It seems that you simply dragged the same file twice, even if you had two separate files that you intended to send.

Kind regards,
Dan McIntyre - Support Technician
PDF-XChange Co. LTD

makesdocs
User
Posts: 32
Joined: Sun Dec 25, 2022 7:28 pm

Re: images with text

Post by makesdocs »

oh I see what you are saying.

yeah I double dragged to the forum thread page, for 2 uploads. but isn't the file at the upload page different?

1) the double copy to the forum page was titled "Chap1-Rotated OCRhigh_noedits.pdf"
a PDF file, pages all rotated up

2) the upload file was titled "Chap1-Rotated_OCRprep.pdf"
same PDF file, pages all rotated up, but with comment blocks added to get OCR to ignore the page graphics/images...
Daniel - PDF-XChange
Site Admin
Posts: 11577
Joined: Wed Jan 03, 2018 6:52 pm

Re: images with text

Post by Daniel - PDF-XChange »

Hello, makesdocs

Unfortunately I only see the same as you do above, we seem to only have the "no-edits" version of the file. Could you try uploading the other file again, or send it to us via email with a link to this post? our email address is support@pdf-xchange.com

Kind regards,
Dan McIntyre - Support Technician
PDF-XChange Co. LTD

makesdocs
User
Posts: 32
Joined: Sun Dec 25, 2022 7:28 pm

Re: images with text

Post by makesdocs »

hey Paul - I've uploaded the post OCR file, WITH the edits.

edits are:

updating inline text as needed. a few times so far it's been easier to take a small snapshot of a formula or fraction that is inline with text, and paste it, as the image behind the OCR text had an artifact, so the small image snapshot would hide that artifact and allow a readable word/equation/fraction etc...

edits in our vernacular can also be a snapshot of a diagram, if it was hopelessly munged, or if we were lucky, just a removal of the comment box and any artifact white boxes that were in front of the image...
Paul - PDF-XChange
Site Admin
Posts: 7388
Joined: Wed Mar 25, 2009 10:37 pm

Re: images with text

Post by Paul - PDF-XChange »

Hi makesdocs,

the annotation over the base content is indeed easily removed, showing the underlying content.

You can select an image and make it "Base Content", and in this case it then does cover what was underneath:
image.png
Will that give you what you seek?
Best regards

Paul O'Rorke
PDF-XChange Support
http://www.pdf-xchange.com
makesdocs
User
Posts: 32
Joined: Sun Dec 25, 2022 7:28 pm

Re: images with text

Post by makesdocs »

"Will that give you what you seek?"

I don't know, but I'm going to try that, see if it works lol!

thanks again
David
Daniel - PDF-XChange
Site Admin
Posts: 11577
Joined: Wed Jan 03, 2018 6:52 pm

images with text

Post by Daniel - PDF-XChange »

:)
Dan McIntyre - Support Technician
PDF-XChange Co. LTD

makesdocs
User
Posts: 32
Joined: Sun Dec 25, 2022 7:28 pm

Re: images with text

Post by makesdocs »

so I've had a chance to do some further investigation

the "artifacts" (empty? boxes that I think you are calling annotations) are for sure only coming up via the OCR effort. they don't exist anywhere in the capture / image refinement / pdf conversion / page rotation process chain - they only arise after the OCR. This is either with or without putting comment boxes over images to exclude them from OCR.

some observations/questions:
1) even with a comment box covering an image completely, sometimes a portion of an image has OCR applied to it... it doesn't seem to be consistent across images, or even within a page, without changing settings... - so far the solution is, for these instances, to just place a copy of the original image over it
2) I wasn't getting anywhere trying the "zonal OCR" to capture formulas... do I do this BEFORE I OCR an entire document? if so, how does that output get merged into the document?
3) similarly, I wasn't having any luck using the annotation via image select/make base content to flatten selected comments. does this get performed on the "discovered" annotations post OCR, and somehow bring the image to the foreground?

lastly, I purchased a license for the tool, since it's great and your support has also been terrific. one last question though:
4) if we have a document that was processed with a demo version, that has the DEMO VERSION stamps in the upper corners of the document, how do we get rid of those now that we have a fully licensed copy?

thanks again for all your help
best
David


Stefan - PDF-XChange
Site Admin
Posts: 19913
Joined: Mon Jan 12, 2009 8:07 am

Re: images with text

Post by Stefan - PDF-XChange »

Hello makesdocs,

I will use the same notation as you did in the post above to keep my replies easy to follow
1) Can you prepare a sample for this - we'd like to see if we can reproduce this!
2) Yes - if you do the formula zonal OCR before OCRing the rest of the file - these portions of your page would no longer contain images, and as such would be skipped by the main OCR process that is run on the whole file (and maybe does not include the 'Mathematical Formulas' language).
3) Flattening comments into base content objects is a process on its own and you should be able to flatten annotations at any point - but the flattened objects would then be treated differently than actual annotations when you run the OCR and it tries to determine what to skip.
4) Try the Watermarks -> Remove All... inside the Editor and see if that helps!

Kind regards,
Stefan
makesdocs
User
Posts: 32
Joined: Sun Dec 25, 2022 7:28 pm

Re: images with text

Post by makesdocs »

you folks are so responsive! thanks for the additional clarity I'll give these a shot and report back.
makesdocs
User
Posts: 32
Joined: Sun Dec 25, 2022 7:28 pm

Re: images with text

Post by makesdocs »

I will upload the files for you all to use to try out and replicate our issues later tonight or tomorrow

for now, here is a document with screenshots from the scanning effort that shows you step by step what was happening

1st section is an example of my attempt at "zonal" OCR - not really working
PDF-X_OCRissues.pdf
2nd section is an example of an image with inconsistent OCR occurring on images
2a: "in-image text" not being consistently excluded from OCR process while being under the comment box area
2b: the OCR process also creating quite a few (empty? annotations?) artifact boxes that are laid over the image and have to be removed to be able to fully view the image in the OCR'd file
makesdocs
User
Posts: 32
Joined: Sun Dec 25, 2022 7:28 pm

Re: images with text

Post by makesdocs »

hey folks - got the example files that match the previous post's example document, so you all can try to replicate/examine what is going on

1) Chap6-Rotated
this is the scanned images, massaged for luminosity etc., pages rotated to all be in correct orientation.
2) Chap6-Rotated_OCRprep
this is the Chap6-Rotated file, with comment text boxes added throughout the document in order to try and exclude the underlying content from OCR (let the image of the actual content be included)
3) Chap6-Rotated_OCRprephigh_edited
this is the Chap6-Rotated_OCRprep file, that has been OCR'd using the "high" OCR setting, and then undergone the process of removing all comment boxes, "annotation artifacts" and all sorts of editing to try and clean up OCR content and/or re-insert original content as images to maintain final output integrity with the input document
Stefan - PDF-XChange
Site Admin
Posts: 19913
Joined: Mon Jan 12, 2009 8:07 am

Re: images with text

Post by Stefan - PDF-XChange »

Hello makesdocs,

Apologies for the delay in getting back to you on this topic!
I've had a go at your sample files - and indeed nothing seems to be the "fit all" solution.
Your files are scanned books with text and graphics in them, and there are both formulas and text over images (which you'd like to not process).
So I can't really come up with a universal solution that would allow you to process e.g. 20 such books within a few hours.
But if it's a one-off - then a semi-manual approach might give you the best results.

You can e.g. make copies of pages that have those graphics - and export those to a separate file.
Then OCR the original, remove 'messed up' pages where the OCR did affect your images, and then re-import the copies of those pages from the other file.
Once that is done - you can then do manual OCR on only those complex pages - e.g. you can try the "zonal OCR" - but this time on the actual text blocks, so that only they are processed.
And finally manually typing in those formulas that are in the main text if necessary.

That is the workflow I could come up with that will give the best possible result while converting the largest amount of scanned images to actual text, but it is definitely not an automated approach.

Kind regards,
Stefan
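
The semi-manual workflow described in the last post can be tracked per chapter with a small checklist. Here is a minimal Python sketch of that page bookkeeping, assuming pages are numbered from 1 and that you note by hand which pages contain graphics and which of those the automatic OCR damaged. The function name and dictionary keys are hypothetical, and nothing here drives PDF-XChange Editor itself.

```python
def plan_semi_manual_ocr(page_count, graphic_pages, damaged_pages):
    """Build a per-step page list for the semi-manual OCR workflow.

    graphic_pages: pages copied to a separate file before OCR.
    damaged_pages: graphic pages whose images the automatic OCR affected.
    """
    graphic = sorted(set(graphic_pages))
    damaged = sorted(set(damaged_pages))

    out_of_range = [p for p in graphic if not 1 <= p <= page_count]
    if out_of_range:
        raise ValueError(f"pages out of range: {out_of_range}")

    # Only pages that were exported beforehand can be restored afterwards.
    missing_copies = [p for p in damaged if p not in graphic]
    if missing_copies:
        raise ValueError(f"no saved copy for pages: {missing_copies}")

    return {
        # Step 1: export untouched copies of the pages with graphics.
        "export_copies": graphic,
        # Step 2: OCR the whole original and keep these pages as they are.
        "keep_auto_ocr": [p for p in range(1, page_count + 1) if p not in damaged],
        # Step 3: remove the damaged pages and re-import their saved copies.
        "restore_from_copies": damaged,
        # Step 4: run manual (zonal) OCR on the text blocks of these pages.
        "zonal_ocr": damaged,
    }
```

For example, plan_semi_manual_ocr(6, [2, 5], [5]) lists pages 2 and 5 for export, keeps the automatic OCR result for pages 1, 2, 3, 4 and 6, and marks page 5 for restoring and zonal OCR. Formulas that still need manual typing are not tracked here.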