OCR isn't consistently picking up hierarchical formatting

Forum for the PDF-XChange Editor - Free and Licensed Versions

makesdocs
User
Posts: 32
Joined: Sun Dec 25, 2022 7:28 pm

Post by makesdocs »

Here's an example of a source book scan (an older manufacturing reference book), side by side with the OCR'd version.
The OCR conversion deskewed fairly well and did a pretty good job on character recognition (especially at the margins, where a bound book on a flatbed scanner is a challenge).
It didn't recognize the bold sentence that opens this paragraph. It has picked up most of this formatting, but not all of it, in the 20 or so pages we scanned. This book is written with a lot of hierarchical formatting like this, and losing it is not ideal.
image.png
Is there a way to tune the OCR to capture more of this or, if not, to edit the font for selected text in the document?

Sorry if this is a newbie question. I tried searching for posts on this but didn't find anything that seemed to relate to the issue. We just downloaded the demo and are trying to get through evaluation (we are pleased so far, but do have questions!).
Stefan - PDF-XChange
Site Admin
Posts: 19913
Joined: Mon Jan 12, 2009 8:07 am

Post by Stefan - PDF-XChange »

Hello makesdocs,

Recognizing font weight and applying it to the recognized text is up to the OCR engine (our Enhanced OCR uses ABBYY's FineReader engine). Apparently, with your sample, the engine does not recognize the bold there and outputs all the text at standard weight. There isn't really a way to tune or train the recognition engine, on your end or ours, so the option I see here is for you to manually make that text bold after the OCR has processed your initial file.
Here's how you can edit text inside a PDF file with the Editor:
https://www.pdf-xchange.com/knowle ... -documents

Kind regards,
Stefan
makesdocs
User
Posts: 32
Joined: Sun Dec 25, 2022 7:28 pm

Post by makesdocs »

Copy that.

Can PDF-XChange handle formulae built in, say, MS Word, or do I need to create each one as a graphic and insert it?
Paul - PDF-XChange
Site Admin
Posts: 7388
Joined: Wed Mar 25, 2009 10:37 pm

Post by Paul - PDF-XChange »

Hi makesdocs,

I'm not sure I completely understand what you mean. If you want to send a sample page with the content you are referring to, we can investigate whether there are any issues.

Best regards

Paul O'Rorke
PDF-XChange Support
http://www.pdf-xchange.com
makesdocs
User
Posts: 32
Joined: Sun Dec 25, 2022 7:28 pm

Post by makesdocs »

Hey Paul - I was thinking perhaps I can re-create the formula using the MS Word equation editor (which I think produces formatted text rather than an image), or another math editor,
and then insert that into the PDF file using the text editing functions, IFF PDF-XChange supports that.

Otherwise I can create the formula and insert it as an image, but that's a ton of work lol.
Daniel - PDF-XChange
Site Admin
Posts: 11577
Joined: Wed Jan 03, 2018 6:52 pm

Post by Daniel - PDF-XChange »

Hello, makesdocs

If you mean something like a mathematical formula built from symbols, that is, a matter of how the text appears, then yes, we should be able to handle it: our software has access to the same font libraries that are available to MS Word. That being said, you will likely want to use the image approach anyway, because if someone without access to that font opens the file, they may be unable to see the text properly.

I should also note that taking a quick screenshot anywhere within Windows is easy these days. Simply press Win+Shift+S to take a snip of a section of the screen, then use Ctrl+V to paste that image into our Editor. There is no need to use the insert images function specifically.

Kind regards,
Dan McIntyre - Support Technician
PDF-XChange Co. LTD

+++++++++++++++++++++++++++++++++++
Our Web site domain and email address have changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com
makesdocs
User
Posts: 32
Joined: Sun Dec 25, 2022 7:28 pm

Post by makesdocs »

Copy, yeah, I'm figuring out how to work with it all.

Thanks so much for all your help. The forum has been very helpful and responsive.
Daniel - PDF-XChange
Site Admin
Posts: 11577
Joined: Wed Jan 03, 2018 6:52 pm

Post by Daniel - PDF-XChange »

:)
Dan McIntyre - Support Technician
PDF-XChange Co. LTD