Issue with text selection and text blocks
Posted: Tue Jun 15, 2021 3:32 pm
Hello all,
In the software that our company makes, the PDF-XChange Editor SDK is used to display PDF files, generally to everyone's satisfaction.
Recently, customer feedback has made us aware of an issue when displaying certain auto-generated documents with the Editor SDK (also reproducible with the standalone Editor). These PDF documents result from scans processed by OCR software. A sample document (containing no customer data, only data of our own company) is attached.
In these documents, the OCR software has turned some table borders (vertical ones, in particular) into very tall text blocks. Because of that, attempting to manually select the text of certain lines with the text selection tool results in highlighted areas much taller than the actual text. In some cases you can't avoid selecting other, unwanted text along with it, and in others you are prevented from selecting the text you actually want because it is "hidden behind" another text block. The other PDF viewers we tested (Adobe, Chrome, Firefox...) don't show this behavior.
In the example file: Try selecting the line containing the word "Gesamtbetrag". Then try just selecting the total "1703,05 €".
If we manually edit the file with the PDF-XChange Editor (via the content pane), we can access the various text blocks of the text layer that the OCR software has added. Some of these have no printable, non-whitespace text content, and when we select them in the content pane, we can see that each one aligns with one of the table's grid lines. If we delete these manually (in particular the ones aligning with vertical grid lines), selecting text afterwards works as expected. But manually editing the PDF files is of course not a viable solution for our customers.
Is there a fix for this behavior, or maybe an option to disregard text blocks without printable content for manual text selection?
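To make clearer what kind of option we have in mind, here is a minimal sketch in plain Python. It does not use the Editor SDK at all: the TextBlock type, its fields, and the coordinates are hypothetical stand-ins for the text objects we see in the content pane, and the rule "ignore blocks without printable, non-whitespace characters" is only our assumption of what such an option could do.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextBlock:
    """Hypothetical stand-in for one text object of the OCR text layer."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def has_printable_text(block: TextBlock) -> bool:
    # True if at least one character is printable and not whitespace.
    return any(ch.isprintable() and not ch.isspace() for ch in block.text)


def selectable_blocks(blocks: List[TextBlock]) -> List[TextBlock]:
    # Keep only the blocks a manual text selection should consider.
    return [b for b in blocks if has_printable_text(b)]


# Illustrative data only; the coordinates are made up.
blocks = [
    TextBlock("Gesamtbetrag", 50, 100, 130, 112),
    TextBlock("1703,05 €", 400, 100, 460, 112),
    # Tall, whitespace-only block lying on a vertical grid line.
    TextBlock("   ", 390, 40, 392, 300),
]
kept = selectable_blocks(blocks)
```

With a filter like this applied before hit-testing, the tall empty block would no longer take part in the selection, which matches what we see after deleting those blocks by hand.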
Thanks for any assistance.