Editor SDK OCR issues

Post by **kyo** » Sat Mar 04, 2017 3:11 am

Dear all,
Our company has purchased your PDF-XChange Editor SDK, and now I need your assistance.

When I run OCR using the "op.document.OCRPages" operation, I encounter some confusing issues, as described below:
1. If I execute the "op.document.OCRPages" operation several times, it adds a corresponding number of layers to the document (IPXC_Document). If I then convert this document to a Word document (.doc/.docx), the resulting file contains many duplicated layers.

So I want to know what causes this behavior and how I can avoid it.

2. If I convert an OCRed PDF (originally an image-only PDF) to Word (.doc/.docx), the resulting file has two layers, with the image layer on top and the text layer behind it. The problem is that the recognized text in the text layer is hidden, and I expect it to be shown.
So I want to know how to make the recognized text visible instead of hidden.

Thanks in advance.

Sat Mar 04, 2017 8:31 am

Hello Kyo,

When you perform the OCR process with the correct settings, it takes the current file contents and runs OCR on them.
It then adds the new OCR text layer on top of anything existing, without removing anything. That is why you end up with multiple layers of invisible text: each OCR operation adds its own layer on top of all the content already in the file.
The OCR process cannot recognize whether any of the existing content is already an OCR layer, so it just adds its own on top.

The conversion from PDF to Word is handled by Word APIs, so why they put the image on top of the text is beyond me. Also, the text will normally be invisible, so you need to make it black inside the Editor before converting to Word (you can also remove the image if desired).

Regards,
Stefan

Post by **kyo** » Mon Mar 06, 2017 4:01 am

Dear Stefan,
Thank you for your reply.
I mostly understand what you mean.
So, do you have a workaround to ensure that only one OCR layer is added on top of the PDF?

Please give me some advice about how to handle this in my program.
Any advice will be greatly appreciated.

Thanks.

Post by **Will - Tracker Supp** » Mon Mar 06, 2017 9:12 am

Hi kyo,

If you're running the OCR operation multiple times, there isn't any way to ensure that only one text layer, in total, is added. As Stefan said, it is impossible to differentiate between standard text and OCR'd text. The only way to avoid this is to only OCR a document once, or delete the text layers before running another OCR operation.
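
As a rough illustration of the "only OCR a document once" approach, the sketch below keeps track of which documents have already been processed and skips repeat requests. It is written in Python purely for illustration and is not SDK code: `OcrOnceGuard`, `run_ocr`, and the document key are hypothetical names, and `run_ocr` stands in for whatever code in your application actually invokes the "op.document.OCRPages" operation.

```python
from typing import Callable, List, Set


class OcrOnceGuard:
    """Runs an OCR callback at most once per document key."""

    def __init__(self, run_ocr: Callable[[str], None]) -> None:
        # run_ocr is a hypothetical stand-in for the application code
        # that invokes the "op.document.OCRPages" operation.
        self._run_ocr = run_ocr
        self._done: Set[str] = set()

    def ocr(self, doc_key: str) -> bool:
        """Return True if OCR ran, False if it was skipped as a repeat."""
        if doc_key in self._done:
            return False
        self._run_ocr(doc_key)
        # Mark the document only after the callback returns without error.
        self._done.add(doc_key)
        return True


# Usage: record the calls instead of performing real OCR.
calls: List[str] = []
guard = OcrOnceGuard(calls.append)
first = guard.ocr("scan-001.pdf")
second = guard.ocr("scan-001.pdf")
print(first, second, len(calls))
```

The guard only remembers documents within one run of the program. A file that was OCRed in an earlier session, or by another tool, would still need to be tracked by your own application, since the OCR process itself cannot tell that an OCR layer already exists.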

Thanks,
