Issue with text selection and text blocks

Posted: Tue Jun 15, 2021 3:32 pm
by SMan
Hello all,

in the software that our company makes, PDF-XChange Editor SDK is used to display PDF files, generally to everyone's satisfaction.

Recently, customer feedback has made us aware of an issue that we have displaying certain auto-generated documents with the Editor SDK (also reproducible with the standalone Editor). These PDF documents result from scans processed by OCR software. A sample document (showing no customer data but that of our own company) is attached.

In these documents, the OCR software has turned some table borders (vertical ones, in particular) into very tall text blocks. Because of that, attempting to manually select the text of certain lines with the text selection tool results in highlighted areas much taller than the actual text. In some cases you cannot avoid selecting other, unwanted text along with it, or you are prevented from selecting the text you actually want because it is "hidden behind" another text block. Other PDF display software we tested (Adobe, Chrome, Firefox...) doesn't show this behavior.

In the example file: Try selecting the line containing the word "Gesamtbetrag". Then try just selecting the total "1703,05 €".

If we manually edit the file with the PDF-XChange Editor (via the content pane), we can access the various text blocks of the text layer that the OCR software has added. Some of these have no (printable, non-whitespace) text content, and selecting these in the content pane, we can see that these each align with one of the table's grid lines. If we delete these manually (in particular the ones aligning with vertical grid lines), then afterwards selecting text works as expected. But manually editing the PDF files is of course not a viable solution for our customers.

Is there a fix for this behavior, or maybe an option to disregard text blocks without printable content for manual text selection?
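
To illustrate the kind of option we have in mind, here is a minimal sketch in Python. It is not Editor SDK code: the `TextBlock` type, its fields, and the coordinates are hypothetical stand-ins for the text blocks visible in the content pane, and the filter only shows the rule "ignore blocks without printable, non-whitespace content" that we would like text selection to apply.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    """Hypothetical stand-in for one text block of the OCR text layer."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def has_printable_content(block: TextBlock) -> bool:
    """Return True if the block holds at least one printable, non-whitespace character."""
    return any(ch.isprintable() and not ch.isspace() for ch in block.text)


def selectable_blocks(blocks: list[TextBlock]) -> list[TextBlock]:
    """Keep only the blocks that manual text selection should consider."""
    return [b for b in blocks if has_printable_content(b)]


# Illustrative data only; the coordinates are made up.
blocks = [
    TextBlock("Gesamtbetrag", 50, 700, 140, 712),
    TextBlock("1703,05 €", 400, 700, 460, 712),
    TextBlock("   ", 395, 300, 397, 720),  # tall, empty block along a vertical grid line
]
kept = selectable_blocks(blocks)
```

With a rule like this, the tall whitespace-only block along the grid line would no longer take part in selection, while the two real text blocks stay selectable.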

Thanks for any assistance.

example.pdf

Re: Issue with text selection and text blocks

Posted: Thu Jun 24, 2021 12:31 am
by Vasyl - PDF-XChange
Hi SMan.

Sorry for the delay in answering.

As I understand it, the example document was OCR'ed without using the Editor's OCR? With our OCR we couldn't reproduce the same wrong text content as in your example.pdf...

Anyway, it seems we have trouble with our text selection mechanism on some 'strange' text contents. We will try to fix it in the near future.

Cheers.

Re: Issue with text selection and text blocks

Posted: Thu Jun 24, 2021 1:48 pm
by SMan
Hello Vasyl,

Thank you for your reply. Yes, the OCR was done by third-party software, not the Editor's OCR.

Looking forward to the fix in the XChange Editor (SDK).

Best regards,
Sven (SMan)

Re: Issue with text selection and text blocks

Posted: Tue Jun 29, 2021 7:50 am
by Sasha - PDF-XChange
Hello SMan,

Well, the problem lies in the incorrect text blocks that were created by the third-party OCR software, which is why text selection does not work as intended. Please try using our OCR engine for this; I'm sure it will give better results.
Other than that, we'll have to wait for Vasyl on this one.

Cheers,
Alex