convert to TXT : sticky or cut out lines?

The PDF-XChange Viewer for End Users

jakesp
User
Posts: 7
Joined: Thu Dec 24, 2009 9:15 pm

convert to TXT : sticky or cut out lines?

Post by jakesp »

The original file contains cases, each made of a relatively stable header (an identifier) followed by a variable number of references (e.g. page numbers).
In the original PDF, a case with many references may run over two lines.
If I check "Detect paragraph", the multi-line cases are restored correctly (each on one line), BUT there are many "sticky" lines (2 or 3 of the original cases glued together).
If I uncheck "Detect paragraph", there are no more sticky lines, but the long cases are cut into lines as they are laid out in the document.
I wish I could get the best of both options.
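One possible workaround is to export with "Detect paragraph" unchecked and rejoin the wrapped cases afterwards. The sketch below is only an illustration under a loud assumption that is not confirmed by the files in this thread: a new case starts with a letter (its identifier), while a wrapped continuation line starts with a digit (more page references). The function name and the sample data are hypothetical.

```python
import re

# ASSUMPTION (not confirmed by the thread): a continuation line starts with
# a digit, because it only carries further page references.
CONTINUATION = re.compile(r"^\d")


def rejoin_cases(lines):
    """Merge continuation lines into the preceding case and drop blank lines."""
    cases = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines (e.g. those emitted at page breaks) are skipped
        if cases and CONTINUATION.match(line):
            cases[-1] += " " + line  # glue the wrapped references back on
        else:
            cases.append(line)  # anything else starts a new case
    return cases


# Hypothetical sample, laid out like the output without "Detect paragraph".
sample = ["Achon Ozanne-Anne 12, 15, 18,", "22, 31", "", "Belou Jacques 4"]
for case in rejoin_cases(sample):
    print(case)
```

If identifiers in the real file can also start with a digit, the pattern would need to be adapted; the heuristic is the weak point of this approach, not the joining loop.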
Will - Tracker Supp
Site Admin
Posts: 6815
Joined: Mon Oct 15, 2012 9:21 pm

Re: convert to TXT : sticky or cut out lines?

Post by Will - Tracker Supp »

Hi jakesp,

Thanks for the post. However, I'm not entirely sure what is meant here. Could you send a sample file and a screen-shot that illustrates the issue?

Cheers, let me know!
If posting files to this forum, you must archive the files to a ZIP, RAR or 7z file or they will not be uploaded.
Thank you.

Best regards

Will Travaglini
Tracker Support (Europe)
Tracker Software Products Ltd.
http://www.tracker-software.com
jakesp
User
Posts: 7
Joined: Thu Dec 24, 2009 9:15 pm

Re: convert to TXT : sticky or cut out lines?

Post by jakesp »

1 - Original file: Stef_index.pdf
2 - Stef_index_txt.txt
"long" cases are correct, e.g. Achon Ozanne-Anne, around the 10th line
"sticky" lines are present, e.g. Bélanger Marie Françoise Charlotte and Belou Jacques (before/after the second page break)
3 - Stef_index_txt1.txt
"long" cases are "cut" into lines, e.g. Achon Ozanne-Anne, around the 10th line
no "sticky" lines

N.B. 2 and 3 are in the zip file.

Page ends come out as blank lines (not much appreciated).
Will - Tracker Supp
Site Admin
Posts: 6815
Joined: Mon Oct 15, 2012 9:21 pm

Re: convert to TXT : sticky or cut out lines?

Post by Will - Tracker Supp »

Hi jakesp,

Thanks for that. However, I'm afraid that I still don't understand the issue, as the document appears to render just as the text files suggest it should. Could you please send some screen-shots that show what the issue is? Also, could you please provide clearer instructions to reproduce the issue?

Cheers, let me know!
jakesp
User
Posts: 7
Joined: Thu Dec 24, 2009 9:15 pm

Re: convert to TXT : sticky or cut out lines?

Post by jakesp »

I forgot to zip the PDF file. Here it is.
Will - Tracker Supp
Site Admin
Posts: 6815
Joined: Mon Oct 15, 2012 9:21 pm

Re: convert to TXT : sticky or cut out lines?

Post by Will - Tracker Supp »

Hi Jake,

I'll actually need a screen-shot demonstrating the issue, as I'm not seeing any inconsistencies between the text file and the PDF itself. Could you send in a screen-shot that clearly demonstrates the issue that you're referring to?

Cheers,
jakesp
User
Posts: 7
Joined: Thu Dec 24, 2009 9:15 pm

Re: convert to TXT : sticky or cut out lines?

Post by jakesp »

In the attached Word file I have documented occurrences of the 3 "problems" ("long cases", "sticky lines" and "page breaks") in the original PDF and in TXT and TXT1 (the difference is due to the "Detect paragraph" option). You can find many other examples by simply comparing the 3 files.
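The comparison of the two text outputs can also be done mechanically. The following is a minimal sketch, assuming that both outputs contain the same characters apart from whitespace and line breaks; the function names and the sample lines are hypothetical. For each line of the "Detect paragraph" output (TXT) it lists the lines of the other output (TXT1) it was built from, so a "sticky" line shows up as a TXT line whose parts contain more than one case header.

```python
def _squash(text):
    """Remove all whitespace so only the characters themselves are compared."""
    return "".join(text.split())


def align(merged_lines, physical_lines):
    """For each merged (TXT) line, list the physical (TXT1) lines it spans."""
    physical = [line for line in physical_lines if line.strip()]
    result = []
    i = 0  # index of the next physical line not yet consumed
    for merged in merged_lines:
        if not merged.strip():
            continue  # ignore blank lines such as those at page breaks
        target, built, parts = _squash(merged), "", []
        while i < len(physical) and built != target:
            candidate = built + _squash(physical[i])
            if not target.startswith(candidate):
                break  # the two outputs diverge here; stop rather than guess
            built = candidate
            parts.append(physical[i].strip())
            i += 1
        result.append((merged.strip(), parts))
    return result


# Hypothetical sample: the second merged line glues two cases together.
txt = ["A 1, 2, 3", "B 4 C 5"]
txt1 = ["A 1, 2,", "3", "B 4", "C 5"]
for merged, parts in align(txt, txt1):
    print(len(parts), merged, parts)
```

A line that spans several parts is not necessarily wrong (a long case legitimately spans two), so the output still has to be checked against the headers by eye or with an additional rule.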
Will - Tracker Supp
Site Admin
Posts: 6815
Joined: Mon Oct 15, 2012 9:21 pm

Re: convert to TXT : sticky or cut out lines?

Post by Will - Tracker Supp »

Hi Jake,

Thanks for that - I understand now. We'll definitely consider the suggestion when it comes time to implement new features, though I cannot give a definite promise that the feature will be implemented.

Cheers, hope that helps!
jakesp
User
Posts: 7
Joined: Thu Dec 24, 2009 9:15 pm

Re: convert to TXT : sticky or cut out lines?

Post by jakesp »

I would love to know which new "feature" you are referring to. If it is about the blank line added at page breaks, I would understand (a year ago I already pointed out that "feature" and got ... the same kind of answer).

But as for the "sticky" lines, I would think it is a bug, unless you have found that the original PDF was at fault, which you do not mention. In that case, the ball would remain with Tracker Software, because the PDF was generated by "PDFXchange for geneatique".
Will - Tracker Supp
Site Admin
Posts: 6815
Joined: Mon Oct 15, 2012 9:21 pm

Re: convert to TXT : sticky or cut out lines?

Post by Will - Tracker Supp »

Hi jakesp,

Perhaps I still need further clarification on what you mean by "sticky lines", etc. From my understanding of the issue, the driver is doing everything that it should be, which is why I believed that you were asking for a new feature or an addition to an existing feature.

Please clarify what exactly is meant by "sticky or cut out lines", and provide images that clearly demonstrate what is sticking that shouldn't be, or what is cut out and shouldn't be. I'm afraid that neither I nor any of my colleagues quite understand what you mean at this point.

Thanks, I look forward to hearing back from you.
jakesp
User
Posts: 7
Joined: Thu Dec 24, 2009 9:15 pm

Re: convert to TXT : sticky or cut out lines?

Post by jakesp »

Let me explain again what I have shown in PDFTOOLS_example1.docx (in the zip file sent some time ago).

"sticky lines" are two (sometimes 3) "pdf original logically distinct" lines glued together as a continuous line in the txt output. They exist if I check "Detect paragraph" in Text save type set-up (I refer to that type of output as TXT).

The advantage of that option is that it respects the "original logically continuous" lines (logical records extending over several pdf lines).

If I do not check "Detect paragraph", I do not find "sticky lines"; each pdf line gives the corresponding txt line, even with the "original logically continuous" lines.

I do not ask for a new feature; I just noticed the effects of the "detect paragraph" option. Either it is some kind of PDFTools malfunctioning (end of paragraph are occasionally forgotten or misinterpreted), or the pdf original file does not contain systematically the proper code (end of paragraph).

If you opt for the second possibility, and that you find that the pdf "misses" some end of paragraph marks, then you will have to look at the way this pdf was generated : using the "PDFChange 5.0 pour Généatique"

To document this case, I am attaching the original PDF file here.
jakesp
User
Posts: 7
Joined: Thu Dec 24, 2009 9:15 pm

Re: convert to TXT : sticky or cut out lines?

Post by jakesp »

Can I expect some kind of an answer or should I find another tree to bark at ...???
Stefan - PDF-XChange
Site Admin
Posts: 19930
Joined: Mon Jan 12, 2009 8:07 am

Re: convert to TXT : sticky or cut out lines?

Post by Stefan - PDF-XChange »

Hi jakesp,

Sorry we've missed replying to your post! I will ask Will to take a look at it and give a detailed reply when he comes to work a bit later today!

Regards,
Stefan
Paul - PDF-XChange
Site Admin
Posts: 7445
Joined: Wed Mar 25, 2009 10:37 pm

Re: convert to TXT : sticky or cut out lines?

Post by Paul - PDF-XChange »

Hi jakesp,

Will asked me to take a look at this to get some clarification. My apologies if we are missing something obvious here.
jakesp wrote: Either it is some kind of PDFTools malfunctioning (end of paragraph are occasionally forgotten or misinterpreted), or the pdf original file does not contain systematically the proper code (end of paragraph).
I think this is the crux of the issue. I will have one of my engineers take a look at stef_index.pdf to see if there are issues with the formatting at the end of a line/paragraph.
jakesp wrote: If you opt for the second possibility, and that you find that the pdf "misses" some end of paragraph marks, then you will have to look at the way this pdf was generated : using the "PDFChange 5.0 pour Généatique"
Indeed, if there are issues with the PDF itself and it was created using PDF-XChange 5.0, we need to look into that. From what format was the conversion done, and would we be able to have access to the source document, please?

To summarize (please confirm): we first need to ascertain whether the PDF itself has a problem. If it does, we look at the driver to find out why; if it does not, we look at why PDF-Tools handles those cases inconsistently.

Does that accurately describe the issue?

Best regards

Paul O'Rorke
PDF-XChange Support
http://www.pdf-xchange.com