Hi,
I have a big problem using text extraction.
Problem 1: I get words with spaces between each character.
Problem 2: Words are unreadable because the characters are shuffled.
In the attached file you can see the original PDF, the resulting text file, and the result created by Adobe Acrobat.
I already tried setting the PXP_TETextComposeOptions MergeDistanceX, but it did not help.
What's wrong?
Best regards
cew3
Problems with spaces in extracted texts
-
- User
- Posts: 213
- Joined: Tue Feb 01, 2011 8:14 am
Problems with spaces in extracted texts
-
- Site Admin
- Posts: 677
- Joined: Thu Jun 28, 2007 8:42 am
Re: Problems with spaces in extracted texts
The problem is that your file contains fonts with non-standard encodings, so xcpro cannot correctly determine character widths (especially the space width) and therefore cannot compose the letters correctly. In this document each letter is drawn separately, so composing the text is not simple.
I'm afraid I cannot say when this can be fixed, but you can correct it yourself. Please use the PXCp_ET_GetElement function instead of PXCp_ET_GetPageContentAsTextW and then compose the text as you wish.
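To illustrate what "compose the text yourself" can mean, here is a minimal, language-neutral sketch in Python. It is not the xcpro API: the `Glyph` record and the `compose_text` function are hypothetical names, and the sketch assumes each extracted element gives you its text, its left edge, its baseline and a usable width. Elements are grouped into lines by baseline, sorted left to right, and a space is inserted only when the horizontal gap between two neighbours is large compared with their widths.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical stand-in for one extracted text element."""
    text: str
    x: float      # left edge of the element
    y: float      # baseline (PDF coordinates, larger y is higher on the page)
    width: float  # horizontal extent of the element


def compose_text(glyphs, gap_ratio=0.3, line_tolerance=2.0):
    """Rebuild readable text from separately positioned elements."""
    # Group elements into lines: walk from the top of the page down and
    # start a new line whenever the baseline moves by more than the tolerance.
    lines = []
    for g in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(lines[-1][0].y - g.y) <= line_tolerance:
            lines[-1].append(g)
        else:
            lines.append([g])

    out = []
    for line in lines:
        # Restore reading order inside the line.
        line.sort(key=lambda g: g.x)
        parts = [line[0].text]
        for prev, cur in zip(line, line[1:]):
            gap = cur.x - (prev.x + prev.width)
            # Emit a space only when the gap is large relative to the glyphs.
            if gap > gap_ratio * max(prev.width, cur.width):
                parts.append(" ")
            parts.append(cur.text)
        out.append("".join(parts))
    return "\n".join(out)
```

The values of `gap_ratio` and `line_tolerance` are guesses and would need tuning per document. If the reported widths are unreliable, the gap test would have to rely on something else, for example the typical distance between neighbouring left edges.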
HTH.
Victor
Tracker Software
Project manager
Please archive any files posted to a ZIP, 7z or RAR file or they will be removed and not posted.
-
- User
- Posts: 213
- Joined: Tue Feb 01, 2011 8:14 am
Re: Problems with spaces in extracted texts
Lzcat - Tracker Supp wrote: I'm afraid I cannot say when this can be fixed, but you can correct it yourself. Please use the PXCp_ET_GetElement function instead of PXCp_ET_GetPageContentAsTextW and then compose the text as you wish.
How do I use PXCp_ET_GetElement to get the complete text of my document?
I couldn't find any C# demo source where this function is used.
Best regards
cew
-
- User
- Posts: 664
- Joined: Tue Nov 14, 2006 12:23 pm
Re: Problems with spaces in extracted texts
Hi cew,
Please check the attached example (frmTextExtraction.cs -> btnSimple_Click).
HTH.
-
- User
- Posts: 213
- Joined: Tue Feb 01, 2011 8:14 am
Re: Problems with spaces in extracted texts
Corwin - Tracker Sup wrote: Please check attached example (frmTextExtraction.cs -> btnSimple_Click).
Thank you for the code example.
But this approach is incredibly slow:
Extracting a PDF of about 10 MB takes over 7 minutes.
Using PXCp_ET_GetPageContentAsTextW takes about 15 seconds.
Best
cew
-
- Site Admin
- Posts: 677
- Joined: Thu Jun 28, 2007 8:42 am
Re: Problems with spaces in extracted texts
Then you will need to optimize it. This is just a sample, without any optimizations. As far as I can see, it performs a lot of memory allocation/deallocation operations (two per text element), and this may be the bottleneck.
Also, xcpro composes text internally, without making copies as PXCp_ET_GetElement does, so your code will be a little slower, sorry.
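As a rough illustration of the allocation point, the sketch below keeps one buffer alive for the whole run and only grows it when an element does not fit, instead of allocating and freeing memory for every element. It is written in Python purely to show the pattern. `FakeExtractor`, `element_size` and `read_into` are hypothetical stand-ins for a "query the size, then fill the buffer" style of API and are not the real xcpro signatures.

```python
class FakeExtractor:
    """Hypothetical stand-in for a size-then-fill text extraction API."""

    def __init__(self, elements):
        self._data = [e.encode("utf-16-le") for e in elements]

    def count(self):
        return len(self._data)

    def element_size(self, index):
        # Number of bytes needed to hold element `index`.
        return len(self._data[index])

    def read_into(self, index, buffer):
        # Copy the element into the caller's buffer, return bytes written.
        data = self._data[index]
        buffer[:len(data)] = data
        return len(data)


def extract_all(extractor, initial_size=64):
    """Read every element through one reusable, growing buffer."""
    buffer = bytearray(initial_size)
    parts = []
    for i in range(extractor.count()):
        needed = extractor.element_size(i)
        if needed > len(buffer):
            # Grow geometrically so reallocation stays rare.
            buffer = bytearray(max(needed, 2 * len(buffer)))
        written = extractor.read_into(i, buffer)
        parts.append(buffer[:written].decode("utf-16-le"))
    # Join once at the end instead of concatenating strings in the loop.
    return "".join(parts)
```

In C# the same idea would mean allocating the element buffer once outside the loop and appending to a `StringBuilder`. Whether this removes the whole slowdown is not something the sketch can show; profiling the real code would be needed.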
Victor
Tracker Software
Project manager
Please archive any files posted to a ZIP, 7z or RAR file or they will be removed and not posted.