Problems with spaces in extracted texts

PDF-XChange Viewer SDK for Developer's
(ActiveX and Simple DLL Versions)

cew
User
Posts: 213
Joined: Tue Feb 01, 2011 8:14 am

Problems with spaces in extracted texts

Post by cew »

Hi,

I have a big problem using text extraction.
Problem 1: I get words with spaces between each character.
Problem 2: Words are unreadable because the characters are shuffled.

In the attached file you can see the original PDF, the resulting text file, and the result created by Adobe Acrobat.

I already tried setting MergeDistanceX in PXP_TETextComposeOptions, but the result is no better.

What's wrong?

Best regards
cew3
Lzcat - Tracker Supp
Site Admin
Posts: 677
Joined: Thu Jun 28, 2007 8:42 am

Re: Problems with spaces in extracted texts

Post by Lzcat - Tracker Supp »

The problem is that your file contains fonts with non-standard encodings, so xcpro cannot correctly determine character widths (especially the width of the space) and therefore cannot compose the letters correctly. (Yes, in this document each letter is drawn separately, so composing the text is not that simple.)
I'm afraid I cannot say when this will be fixed, but you can work around it yourself: use the PXCp_ET_GetElement function instead of PXCp_ET_GetPageContentAsTextW and compose the text as you wish.
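To make "compose the text as you wish" more concrete, here is a minimal sketch of one possible composition heuristic, written in Python only to show the idea. It does not call the SDK: `Glyph`, `compose_text`, `space_factor` and `line_tolerance` are hypothetical names, and it assumes each extracted element yields a character plus an x/y origin. Because the reported widths are unreliable for this kind of file, the sketch ignores widths and compares the distance between neighbouring origins with the typical distance on the same line.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Glyph:
    """Hypothetical stand-in for one extracted text element (not an SDK type)."""
    text: str
    x: float  # horizontal origin of the glyph
    y: float  # baseline; larger y is assumed to be higher on the page


def compose_text(glyphs, space_factor=1.6, line_tolerance=2.0):
    """Rebuild reading order and word breaks from per-character elements."""
    # Group glyphs into lines by baseline, top of the page first.
    lines = []
    for g in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(lines[-1][0] - g.y) <= line_tolerance:
            lines[-1][1].append(g)
        else:
            lines.append((g.y, [g]))

    out = []
    for _, row in lines:
        # Sorting by x undoes "shuffled" characters within a line.
        row.sort(key=lambda g: g.x)
        steps = [b.x - a.x for a, b in zip(row, row[1:])]
        typical = median(steps) if steps else 0.0
        parts = [row[0].text]
        for cur, step in zip(row[1:], steps):
            # A step much larger than the typical one is treated as a word gap.
            if typical > 0 and step > space_factor * typical:
                parts.append(" ")
            parts.append(cur.text)
        out.append("".join(parts))
    return "\n".join(out)
```

The median-based threshold is only a rough rule: it works best for fairly regular spacing and will need tuning (or a font-size based threshold) for strongly proportional fonts or very short lines.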
HTH.
Victor
Tracker Software
Project manager

Please archive any files posted to a ZIP, 7z or RAR file or they will be removed and not posted.
cew
User
Posts: 213
Joined: Tue Feb 01, 2011 8:14 am

Re: Problems with spaces in extracted texts

Post by cew »

Lzcat - Tracker Supp wrote:I'm afraid I cannot say when this will be fixed, but you can work around it yourself: use the PXCp_ET_GetElement function instead of PXCp_ET_GetPageContentAsTextW and compose the text as you wish.
How do I use PXCp_ET_GetElement to get the complete text of my document?
I couldn't find any C# demo source that uses this function.

Best regards
cew
Corwin - Tracker Sup
User
Posts: 664
Joined: Tue Nov 14, 2006 12:23 pm

Re: Problems with spaces in extracted texts

Post by Corwin - Tracker Sup »

Hi cew,

Please check the attached example (frmTextExtraction.cs -> btnSimple_Click).

HTH.
cew
User
Posts: 213
Joined: Tue Feb 01, 2011 8:14 am

Re: Problems with spaces in extracted texts

Post by cew »

Corwin - Tracker Sup wrote: Please check attached example (frmTextExtraction.cs -> btnSimple_Click).
Thank you for the code example.

But this approach is incredibly slow:
Extracting a PDF of about 10 MB takes over 7 minutes.
Using PXCp_ET_GetPageContentAsTextW takes about 15 seconds.

Best
cew
Lzcat - Tracker Supp
Site Admin
Posts: 677
Joined: Thu Jun 28, 2007 8:42 am

Re: Problems with spaces in extracted texts

Post by Lzcat - Tracker Supp »

Then you will need to optimize it. This is just a sample without any optimizations. As far as I can see, it performs a lot of memory allocations and deallocations (two for each text element), which may be the bottleneck.
Also, xcpro composes text internally without making copies as PXCp_ET_GetElement does, so your code will be a little slower, sorry.
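One common way to cut per-element allocations is to keep a single buffer alive for the whole extraction and grow it only when an element needs more room. The sketch below illustrates that pattern in Python with `ctypes`; it is not taken from the attached sample and makes no SDK calls. `ReusableTextBuffer`, `simulate_extraction` and the allocation counter are hypothetical names used for illustration.

```python
import ctypes


class ReusableTextBuffer:
    """Holds one wide-character buffer and grows it only when needed."""

    def __init__(self, initial_chars=256):
        self.capacity = initial_chars
        self.buffer = ctypes.create_unicode_buffer(self.capacity)
        self.allocations = 1  # counted so the saving can be observed

    def acquire(self, needed_chars):
        """Return a buffer with room for at least needed_chars characters."""
        if needed_chars > self.capacity:
            # Grow geometrically so small increases do not each reallocate.
            while self.capacity < needed_chars:
                self.capacity *= 2
            self.buffer = ctypes.create_unicode_buffer(self.capacity)
            self.allocations += 1
        return self.buffer


def simulate_extraction(element_lengths):
    """Stand-in for the per-element loop. Each length plays the role of the
    character count that a (hypothetical) element query would report."""
    pool = ReusableTextBuffer()
    for length in element_lengths:
        buf = pool.acquire(length + 1)  # +1 for the terminating null
        # A real loop would hand `buf` to the element-reading call here.
    return pool.allocations
```

With this pattern the number of allocations depends on how often the largest element size doubles, not on the number of elements, so a page with thousands of single-character elements would typically reuse the initial buffer throughout.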
Victor
Tracker Software
Project manager
