images with text

makesdocs · Post by **makesdocs** » Wed Dec 28, 2022 7:52 am

another question for OCR - we seem to be losing a lot when the OCR tries to recognize/index text and images in an image.

here are some examples:

image.png

image(1).png

is there a way to mark areas around images to exclude them from OCR, and just have the marked area translated as an image to the OCR document?

Wed Dec 28, 2022 8:21 am

Hello, makesdocs,

May I ask you to send us the original file?

is there a way to mark areas around images to exclude them from OCR, and just have the marked area translated as an image to the OCR document?

You can skip this one page from the automatic OCR and then use the manual method to recognize only the areas you want:

image.png

Regards.

rakunavi · Post by **rakunavi** » Wed Dec 28, 2022 9:49 am

Hi Dav,

makesdocs wrote: ↑Wed Dec 28, 2022 7:52 am is there a way to mark areas around images to exclude them from OCR, and just have the marked area translated as an image to the OCR document?

How about covering the areas you don't want to be recognized with the rectangle comment tool? If you uncheck "Ignore comments on page" in OCR dialog, the text hidden under the comments will not be recognized.

Best regards,
rakunavi

Wed Dec 28, 2022 11:09 am

Hello rakunavi,

Thanks for the suggestion. I've not thought of this approach!
I hope @makesdocs finds both suggestions above useful!

Kind regards,
Stefan

makesdocs · Post by **makesdocs** » Wed Dec 28, 2022 8:03 pm

good idea that may well work - I'll try that

we aren't working on this reference material capture full time, but every few days scan a chapter.

now we are seeing what is the art of the possible after 3 chapters, we are going to need to update the process to allow for all the image and text editing etc...

Post by **Paul - PDF-XChange** » Wed Dec 28, 2022 8:43 pm

Good luck and be sure to let us know if you need help. I am keen to hear how it works out when all is said and done.

makesdocs · Post by **makesdocs** » Thu Dec 29, 2022 5:21 pm

not sure if I am adding comments properly as per rakunavi's suggestion, but here's what happened so far:

I added a comment text box to cover the image. I then OCR'd the image with the Ignore Comments box UNCHECKED as per guidance

image(1).png

I then went to the newly OCR'd file (Tab Group on the right in below image), and looked at what was behind the comment box

image.png

found a few (four, I think) blank (text?) boxes seemingly inserted over the image

image(2).png

image(3).png

image(4).png

image(5).png

removed these 4 artifacts and the image is clear, no editing needed, so that is super helpful!

any idea as to why these empty boxes are being added?

I'm setting the comment boxes such that the OCR scans and indexes each figure name/#/description, so they are searchable, but the images will be clear and not indexed, as that's not needed anyway IMO.
very cool

image(6).png

Thu Dec 29, 2022 7:47 pm

Hello, makesdocs

Could I ask you to send us a copy of this document to take a look at? It is certainly odd that new invisible text boxes are being added by the "Editable" EOCR process while a blank shape is on top. If anything, I would expect to see visible text within those boxes, but in practice, they simply shouldn't appear at all.

If the file contains sensitive information you do not want to post on the public forums, you can email us with a link to this post, via support@pdf-xchange.com

Kind regards,

makesdocs · Post by **makesdocs** » Thu Dec 29, 2022 8:05 pm

attached is 1 file - the OCR'd version of that file (see settings in my previous post - they haven't changed) WITHOUT any edits post OCR

the 2nd is the prepped pdf (added comment boxes to exclude image analysis by OCR engine) - this is sent via the file upload utility

Chap1-Rotated OCRhigh_noedits.pdf

image.png

Thu Dec 29, 2022 8:10 pm

Hello, makesdocs

It seems you uploaded the same file twice, but from what I see here, these are not text boxes, they are "shapes". I do not believe that our OCR adds shapes to the document. Are you certain that there were not already blank spaces in this file before OCR processing occurred?

Kind regards,

makesdocs · Post by **makesdocs** » Thu Dec 29, 2022 8:21 pm

no, not sure at all

the files look alike, but the pre and post OCR files really are quite different... if I didn't provide you with 1 of each for a single chapter, I'll redo it if you like

we are scanning the source content (bound book) on a flatbed with NAPS2, which is 1 TIF file per page, a single chapter at a time
running an image cleanup & cropping process with XnConvert on these files
then combining all the TIF images into a single PDF using Pixillion
then opening that PDF with PDF-XChange to rotate every other page

so there is a lot of opportunity for artifacts to get introduced... it might be the Pixillion PDF conversion, since the TIFF files are just images...
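For anyone scripting this chain instead of clicking through it, the "rotate every other page" step can be sketched as below. This is only a minimal sketch: it works out which pages need turning and by how much, and does not touch a PDF. The function name, the 180 degree angle, and the guess that the even pages are the upside-down ones are all assumptions, not something taken from the files in this thread.

```python
def rotation_plan(page_count, first_flipped=2, angle=180):
    """Return {page_number: degrees} for a scan where every other page is turned.

    Pages are 1-based. `first_flipped` is the first page that needs rotating
    (an assumption; check a few pages of your own scan before relying on it).
    """
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    if first_flipped not in (1, 2):
        raise ValueError("first_flipped must be 1 or 2")
    plan = {}
    for page in range(1, page_count + 1):
        # Every second page, counting from first_flipped, gets rotated.
        flipped = (page - first_flipped) % 2 == 0
        plan[page] = angle if flipped else 0
    return plan
```

For a four-page chapter, `rotation_plan(4)` gives `{1: 0, 2: 180, 3: 0, 4: 180}`; pass `first_flipped=1` if the odd pages are the ones that came out upside down.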

Thu Dec 29, 2022 8:42 pm

Hello, makesdocs

makesdocs wrote: ↑Thu Dec 29, 2022 8:21 pm the files look alike, but the pre and post OCR files really are quite different...

Oh no sorry, I mean it looks like you mistakenly uploaded the exact same file twice, they have the same name, file size, etc, and when opening them side by side, both have all the same details:

image.png

And the same metadata information, like modification date:

PDFXEdit_bLXHeHRUM3.gif

It seems that you simply dragged the same file twice, even if you had two separate files that you intended to send.

Kind regards,

makesdocs · Post by **makesdocs** » Sun Jan 01, 2023 11:27 pm

oh I see what you are saying.

yeah I double dragged to the forum thread page, for 2 uploads. but isn't the file at the upload page different?

1) the double copy to the forum page was titled "Chap1-Rotated OCRhigh_noedits.pdf"
a PDF file, pages all rotated up

2) the upload file was titled "Chap1-Rotated_OCRprep.pdf"
same PDF file, pages all rotated up, but with comment blocks added to get OCR to ignore the page graphics/images...

Mon Jan 02, 2023 6:37 pm

Hello, makesdocs

Unfortunately I only see the same as you do above; we seem to only have the "no-edits" version of the file. Could you try uploading the other file again, or send it to us via email with a link to this post? Our email address is support@pdf-xchange.com

Kind regards,

makesdocs · Post by **makesdocs** » Tue Jan 03, 2023 8:32 pm

hey Paul - I've uploaded the post OCR file, WITH the edits.

edits are:

updating inline text as needed. a few times so far it's been easier to take a small snapshot of a formula or fraction that is inline with text, and paste it, as the image behind the OCR text had an artifact, so the small image snapshot would hide that artifact and allow a readable word/equation/fraction etc...

edits in our vernacular can also be a snapshot of a diagram, if it was hopelessly munged, or if we were lucky, just a removal of the comment box and any artifact white boxes that were in front of the image...

Post by **Paul - PDF-XChange** » Wed Jan 04, 2023 4:50 pm

Hi makesdocs,

the annotation over the base content is indeed easily removed, showing the underlying content.

You can select an image and make it "Base Content", and in this case it then does cover what was underneath:

image.png

Will that give you what you seek?

makesdocs · Post by **makesdocs** » Thu Jan 05, 2023 5:56 pm

"Will that give you what you seek?"

I don't know, but I'm going to try that, see if it works lol!

thanks again
David

Thu Jan 05, 2023 6:47 pm

makesdocs · Post by **makesdocs** » Thu Jan 12, 2023 4:06 pm

so I've had a chance to do some further investigation

the "artifacts" (empty? boxes that I think you are calling annotations) are for sure only coming up via the OCR effort. they don't exist anywhere in the capture / image refinement / pdf conversion / page rotation process chain - they only arise after the OCR. This is either with or without putting comment boxes over images to exclude them from OCR.

some observations/questions:
1) even with a comment box covering an image completely, sometimes a portion of an image has OCR applied to it... it doesn't seem to be consistent across images, or even within a page, without changing settings... - so far the solution is, for these instances, to just place a copy of the original image over it
2) I wasn't getting anywhere trying the "zonal OCR" to capture formulas... do I do this BEFORE I OCR an entire document? if so, how does that output get merged into the document?
3) similarly, I wasn't having any luck using the annotation via image select/make base content to flatten selected comments. does this get performed on the "discovered" annotations post OCR, and somehow bring the image to the foreground?

lastly, I purchased a license for the tool, since it's great and your support has also been terrific. one last question though:
4) if we have a document that was processed with a demo version, that has the DEMO VERSION stamps in the upper corners of the document, how do we get rid of those now that we have a fully licensed copy?

thanks again for all your help
best
David

thanks again

Thu Jan 12, 2023 4:53 pm

Hello makesdocs,

I will use the same notation as you did in the post above to keep my replies easy to follow
1) Can you prepare a sample for this - we'd like to see if we can reproduce this!
2) Yes - if you do the formula zonal OCR before OCRing the rest of the file - these portions of your page would no longer contain images, and as such would be skipped by the main OCR process that is run on the whole file (and maybe does not include the 'Mathematical Formulas' language).
3) Flattening comments into base content objects is a process on its own and you should be able to flatten annotations at any point - but the flattened objects would then be treated differently than actual annotations when you run the OCR and it tries to determine what to skip.
4) Try the Watermarks -> Remove All... inside the Editor and see if that helps!
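
For item 1), a sample is easier to put together if you can show which recognized text boxes ended up inside a masked area. A minimal sketch of that check, assuming you already have the comment-box and text-box rectangles as (left, bottom, right, top) values in page units. How those coordinates are exported is not covered here, and every name below is hypothetical, not part of the Editor.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    left: float
    bottom: float
    right: float
    top: float


def area(r):
    """Area of a rectangle; degenerate rectangles count as zero."""
    return max(0.0, r.right - r.left) * max(0.0, r.top - r.bottom)


def overlap_area(a, b):
    """Area shared by two rectangles, or 0.0 if they do not intersect."""
    width = min(a.right, b.right) - max(a.left, b.left)
    height = min(a.top, b.top) - max(a.bottom, b.bottom)
    return width * height if width > 0 and height > 0 else 0.0


def leaked_boxes(masks, text_boxes, min_fraction=0.5):
    """Return the text boxes that are mostly covered by a single mask.

    A box counts as leaked when at least `min_fraction` of its own area
    lies under one comment box that was supposed to hide it from OCR.
    """
    leaked = []
    for box in text_boxes:
        box_area = area(box)
        if box_area == 0:
            continue
        if any(overlap_area(box, m) / box_area >= min_fraction for m in masks):
            leaked.append(box)
    return leaked
```

With one mask `Rect(100, 100, 300, 300)`, a text box at `Rect(120, 120, 180, 140)` is reported as leaked, while one at `Rect(320, 120, 380, 140)` is not. The 0.5 threshold is an arbitrary starting point.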

Kind regards,
Stefan

makesdocs · Post by **makesdocs** » Sat Jan 14, 2023 5:06 pm

you folks are so responsive! thanks for the additional clarity I'll give these a shot and report back.

makesdocs · Post by **makesdocs** » Sat Jan 14, 2023 7:06 pm

I will upload the files for you all to use to try out and replicate our issues later tonight or tomorrow

for now, here is a document with screenshots from the scanning effort that shows you step by step what was happening

1st section is an example of my attempt at "zonal" OCR - not really working

PDF-X_OCRissues.pdf

2nd section is an example of an image with inconsistent OCR occurring on images
2a: "in-image text" not being consistently excluded from OCR process while being under the comment box area
2b: the OCR process also creating quite a few (empty? annotations?) artifact boxes that are laid over the image and have to be removed to be able to fully view the image in the OCR'd file

makesdocs · Post by **makesdocs** » Sun Jan 15, 2023 3:38 pm

hey folks - got the example files that match the previous post's example document, so you all can try to replicate/examine what is going on

1) Chap6-Rotated
this is the scanned images, massaged for luminosity etc., pages rotated to all be in correct orientation.
2) Chap6-Rotated_OCRprep
this is the Chap6-Rotated file, with comment text boxes added throughout the document in order to try and exclude the underlying content from OCR (let the image of the actual content be included)
3) Chap6-Rotated_OCRprephigh_edited
this is the Chap6-Rotated_OCRprep file, that has been OCR'd using "high" OCR setting, and then undergone the process of removing all comment boxes, "annotation artifacts" and all sorts of editing to try and clean up OCR content and/or re-insert original content as images to maintain final output integrity with input document

Tue Jan 17, 2023 6:12 pm

Hello makesdocs,

Apologies for the delay in getting back to you on this topic!
I've had a go at your sample files - and indeed nothing seems to be the "fit all" solution.
Your files are scanned books with text and graphics in them, and there are both formulas and text over images (which you'd like to not process).
So I can't really come up with a universal solution that would allow you to process e.g. 20 such books within a few hours.
But if it's a one off - then a semi manual approach might give you the best results.

You can e.g. make copies of pages that have those graphics - and export those to a separate file.
Then OCR the original, remove 'messed up' pages where the OCR did affect your images, and then re-import the copies of those pages from the other file.
Once that is done - you can then do manual OCR on only those complex pages - e.g. you can try the "zonal OCR" - but this time on the actual text blocks, so that only they are processed.
And finally manually typing in those formulas that are in the main text if necessary.

That is the workflow I could come up with that will give the best possible result while converting the largest amount of scanned images to actual text, but it is definitely not an automated approach.
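
The page shuffling in the steps above can be planned ahead of time. A minimal sketch, assuming you note down which pages contain graphics: it lists, page by page, whether the final document takes the page from the OCR'd file or from the untouched copy, and which pages then still need manual (zonal) OCR. The function names and the "ocr" / "original_copy" labels are hypothetical; the actual export, delete and import are still done by hand in the Editor.

```python
def merge_plan(page_count, graphic_pages):
    """Return [(page_number, source)] for the final document.

    Pages are 1-based. Pages listed in `graphic_pages` come from the
    untouched copy; all other pages come from the automatically OCR'd file.
    """
    graphic = set(graphic_pages)
    out_of_range = sorted(p for p in graphic if not 1 <= p <= page_count)
    if out_of_range:
        raise ValueError(f"pages outside 1..{page_count}: {out_of_range}")
    return [
        (page, "original_copy" if page in graphic else "ocr")
        for page in range(1, page_count + 1)
    ]


def pages_needing_manual_ocr(plan):
    """Pages re-imported from the copy, which still need zonal OCR."""
    return [page for page, source in plan if source == "original_copy"]
```

For a five-page chapter with graphics on pages 2 and 4, `merge_plan(5, [2, 4])` marks pages 1, 3 and 5 as "ocr" and pages 2 and 4 as "original_copy", and `pages_needing_manual_ocr` returns `[2, 4]`.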

Kind regards,
Stefan
