[Bug] Extra Spaces in Copied Text

The PDF-XChange Viewer for End Users

lbdyck
User
Posts: 58
Joined: Thu Feb 07, 2008 3:34 pm

Post by lbdyck »

When I copy some text (a code example), I see extra spaces when I paste it. See the attached zip file for the sample page from the PDF and a text file showing the pasted result.
----------
Lionel B. Dyck <><
quant
User
Posts: 151
Joined: Fri Jan 18, 2008 2:48 pm

Post by quant »

This in fact happens in quite a lot of PDFs. I had similar problems with another PDF program. The developer's reply was something like: sometimes there is no space character between words (in the PDF file structure), so the program has to guess how many spaces should appear in the extracted text, and whether there should be any space at all. Not easy.
Podhorny
User
Posts: 88
Joined: Tue Oct 09, 2007 8:03 am

Post by Podhorny »

It looks correct: in the PDF there are always 2 spaces. Try copying the text above as well; it has only one space between words:

Code:

You must update the PROFILE EXEC for any user ID that will be running
GOMMAIN. The default user ID is OPMGRM1. The PROFILE EXEC for these user
IDs should include the following statements:
/*  Sample  lines  to  include  in  OPMGRM1  PROFILE  EXEC  */
’CP  SET  RUN  ON’
’ACCESS  194  D’
Ivan - Tracker Software
Site Admin
Posts: 3603
Joined: Thu Jul 08, 2004 10:36 pm

Post by Ivan - Tracker Software »

To be honest, the text in your PDF sample does contain two spaces between words. Try the Text selection tool and you will see this.
PDF-XChange Co Ltd. (Project Director)

When attaching files to any message - please ensure they are archived and posted as a .ZIP, .RAR or .7z format - or they will not be posted - thanks.
quant
User
Posts: 151
Joined: Fri Jan 18, 2008 2:48 pm

Post by quant »

Ivan - Tracker Software wrote: To be honest, the text in your PDF sample does contain two spaces between words. Try the Text selection tool and you will see this.
That is exactly the point. How do you know that there are two spaces? Visually it looks that way, but the line could simply have been justified to "fill the line" (as opposed to being left-aligned), so there is still one space between the words; it is just a bit wider.

See the example I attached. Here is a comparison of the text extracted by Adobe and by PDF-XChange:

... no comparison after all, because this forum editor removes double spaces, which is funny. OK, extract this sentence instead:

"Recently, the characterization of q-optimal equivalent martingale mea-
sures in market models with jumps has been studied in several papers."

PDF-XChange extracts double spaces here, but there are no double spaces in the original text, nor in what Adobe extracts; the line is simply justified.
Ivan - Tracker Software
Site Admin
Posts: 3603
Joined: Thu Jul 08, 2004 10:36 pm

Post by Ivan - Tracker Software »

There are no spaces at all in the text you sent. The text is specified in the PDF in the following way:

Code:

[(Recen)27(tly)84(,)-495(the)-464(c)27(haracterization)-462(of)]TJ
/F2 10.91 Tf
164.08 0 TD
[(q)]TJ
/F7 10.91 Tf
5.26 0 TD
[(-optimal)-462(equiv)54(alen)28(t)-462(martingale)-464(mea-)]TJ
Please note the numbers like -464, -462, etc. They specify the distance between pieces of text. The Viewer analyzes these and converts them to spaces. The number of spaces depends on the font's metrics.
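
As an illustration of the analysis described above, here is a minimal Python sketch. It is not the Viewer's actual code. It splits a TJ array into strings and numeric adjustments, then emits one space wherever an adjustment opens a large enough gap. In a TJ array the numbers are in thousandths of a text-space unit and are subtracted from the current position, so negative values widen the gap and small positive values (such as 27) are kerning. The names parse_tj and extract_text and the gap_threshold of 200 are assumptions made for this example.

```python
import re

# Matches either a literal string "(...)" or a number inside a TJ array.
# Assumption: no escaped or nested parentheses, which holds for the sample above.
TOKEN = re.compile(r"\(([^()]*)\)|(-?\d+(?:\.\d+)?)")


def parse_tj(array_src):
    """Return the TJ items in order: str for shown text, float for adjustments."""
    items = []
    for text, number in TOKEN.findall(array_src):
        if number:
            items.append(float(number))
        else:
            items.append(text)
    return items


def extract_text(items, gap_threshold=200):
    """Join the strings, adding ONE space per gap wider than gap_threshold.

    gap_threshold is a hypothetical value in thousandths of a text-space unit;
    a real extractor would derive it from the font's metrics.
    """
    out = []
    for item in items:
        if isinstance(item, str):
            out.append(item)
        elif -item > gap_threshold:  # negative adjustment = wider gap
            out.append(" ")
    return "".join(out)


line = "[(Recen)27(tly)84(,)-495(the)-464(c)27(haracterization)-462(of)]TJ"
print(extract_text(parse_tj(line)))  # Recently, the characterization of
```

Small adjustments are ignored entirely, and a wide gap never produces more than one space, which is one possible way to treat justified lines.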
quant
User
Posts: 151
Joined: Fri Jan 18, 2008 2:48 pm

Post by quant »

Ivan - Tracker Software wrote: Please note the numbers like -464, -462, etc. They specify the distance between pieces of text. The Viewer analyzes these and converts them to spaces. The number of spaces depends on the font's metrics.
OK, but you can see for yourself that

"The Viewer analyzes these and converts them to spaces. The number of spaces depends on the font's metrics."

is probably not the best way to go about this. Clearly, the original author did not put 2 spaces between words; the line was just justified. I would expect an "intelligent viewer/text extractor" to take this into account, and not merely apply the hardcoded formula

number of spaces = distance / (font metric)
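
To make the difference concrete, here is a small Python sketch comparing the formula above with a rule that emits at most one space per gap. Both functions are hypothetical illustrations, not what any viewer actually implements. The space width of 250 (in thousandths of a text-space unit) is an assumed value chosen for the example, not the metric of the font in the sample PDF.

```python
def spaces_by_formula(distance, space_width):
    """number of spaces = distance / (font metric), rounded to a whole number."""
    return max(0, round(distance / space_width))


def spaces_capped(distance, space_width, min_fraction=0.5):
    """Emit at most one space, and only if the gap is a fair share of a space.

    min_fraction is an assumed tuning knob: gaps narrower than this fraction
    of the space width are treated as kerning, not as a word break.
    """
    return 1 if distance >= min_fraction * space_width else 0


SPACE_WIDTH = 250  # assumption: hypothetical font metric

# Gap widths taken from the TJ array quoted earlier in the thread
# (-495, -464, -462 widen the gap; 27 tightens it, so its distance is -27).
for distance in (495, 464, 462, -27):
    print(distance,
          spaces_by_formula(distance, SPACE_WIDTH),
          spaces_capped(distance, SPACE_WIDTH))
```

With this assumed metric, the formula yields two spaces for each stretched word gap in the justified line, while the capped rule yields one. The capped rule would, however, also collapse text that really does contain two space characters, such as the code sample in the first post.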
lbdyck
User
Posts: 58
Joined: Thu Feb 07, 2008 3:34 pm

spacing

Post by lbdyck »

Just comparing the Viewer to Acrobat Reader: the Reader does handle the spaces correctly (but fails with the new line).
John - Tracker Supp
Site Admin
Posts: 5225
Joined: Tue Jun 29, 2004 10:34 am

Post by John - Tracker Supp »

Hi,

We are looking at adding an optional feature that will handle both cases, so that fixing the space duplication does not create issues in other situations, which it very easily could.

HTH
If posting files to this forum - you must archive the files to a ZIP, RAR or 7z file or they will not be uploaded - thank you.

Best regards
Tracker Support
http://www.tracker-software.com