Issue inserting space when merging multi text elements inside app for languages that don't include space between words

rakunavi
User
Posts: 1825
Joined: Sat Sep 11, 2021 5:04 am

Post by rakunavi »

Hello all,

Languages can be divided into two groups: those that put spaces between words, such as English, and those that do not, such as Japanese and Chinese. When PDF-XChange Editor treats multiple text elements in a PDF as flow text without line breaks, it appears to simply join them with a space between each element, without taking these differences between languages into account. This is fine for most languages, but it causes problems for languages such as Japanese and Chinese, which do not put spaces between words. I report the details below.

Each issue is independent of the others. I compared PDF files created from Japanese, Traditional Chinese and Latin Alphabet text files using the "Convert Text Files to PDF" feature in PDF-XChange Editor. For Japanese and Latin Alphabet, the PDF was output with default settings. For Traditional Chinese, "New Paragraph Mode" was set to "Each newline character starts a new paragraph".

  • Sample Files.zip
  • summary.png
  • Source Text.png
  • convertedPDF.png

  • Issue 1: Saving as plain text

    When saving as a plain text file, even though the "Insert line breaks" option is disabled and the "Insert breaks after each paragraph" option is enabled, spaces are inserted on all lines and line breaks are not inserted after each paragraph. If the same settings are tried for Latin Alphabet, the output text will not contain unnecessary spaces, and line breaks will be inserted correctly for each paragraph.

    • TextConverterOption.png
    The following figure compares the three results, from left to right: Japanese, Traditional Chinese, and Latin Alphabet. The highlighted orange areas for Japanese and Traditional Chinese are spaces that are not needed. In contrast, the spaces highlighted in yellow for Latin Alphabet are necessary and properly placed. Please also note that the line breaks in each paragraph are not included in Japanese and Traditional Chinese, but are included in Latin Alphabet.

    • TextConvertComparing.png
    Given the process of converting from text to PDF and then back to text again, I would like the original text and the last generated text to be basically equivalent.

  • Issue 2: Export to Word document

    In the options for exporting to a Word document, even though the layout setting is set to "Retain Flowing Text", spaces are inserted on every line, and line breaks are not inserted after each paragraph. In contrast, with the same settings for Latin Alphabet text, the output file does not contain any unnecessary spaces, and line breaks are inserted correctly after each paragraph. Although a text file and a Word document differ, essentially the same description as for the text file above applies, so I will not go into details.

    • WordExportOption.png

  • Issue 3: Read Selected Text Out Loud

    When reading selected text out loud from the Accessibility tab, some SAPI text-to-speech engines produce unnatural silence intervals at each line. With Latin Alphabet text, the text is played back naturally as flow text, even across line breaks. Some text-to-speech engines appear to remove the spaces themselves, and only in those cases is the speech played back naturally without unnatural silence intervals.

    In the verification shown in the waveform diagrams below, the same Voice engine (VW Misaki) was specified in Acrobat Reader DC and PDF-XChange Editor, and the first and second paragraphs of a Japanese PDF file were read out loud. The yellow markers in the PDF-XChange Editor waveform indicate unnatural silence intervals, which correspond to the red line in the Acrobat Reader DC waveform. At the same time, it also corresponds to the same numbered section shown in the Japanese sample.

    • ReadOutLoud.png

  • Issue 4: Advanced Search

    If a newline is included in the result of a search, it is displayed as a space. The effect is less severe than the above three, since only the results are displayed, but the search results are slightly more difficult to read. In Latin Alphabet, search results do not contain unnecessary spaces and are easy to read.

    In the video, I selected the appropriate characters in the Japanese and Traditional Chinese texts, respectively, and performed an advanced search on them. Note the red box in each search result. You will notice that there is an unnecessary space corresponding to a line break in the original PDF file.

    • CapturedVideo.zip
    • SearchResult.png

The four issues above are reported separately, but the root cause might be the same. Incidentally, if you activate edit mode, select a block of text, and copy and paste it via the clipboard, the text does not contain any unnecessary spaces, even in the current build 368.
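
For illustration, the space-free result that the edit-mode copy produces can be approximated with a script-aware join. The following is only a minimal Python sketch and not PDF-XChange Editor's actual code. The names `is_unspaced` and `join_lines` and the chosen Unicode ranges are assumptions made for this example. A space is inserted at a line boundary only when the characters on both sides belong to a script that uses spaces.

```python
# Hypothetical sketch: join the lines of one paragraph into flow text,
# inserting a space only where the neighbouring characters need one.

# Unicode blocks treated here as "written without spaces between words".
# This list is an assumption for the example and is not exhaustive.
_UNSPACED_RANGES = (
    (0x3000, 0x303F),  # CJK Symbols and Punctuation
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xFF00, 0xFFEF),  # Halfwidth and Fullwidth Forms
)


def is_unspaced(ch):
    """Return True if ch falls in one of the ranges listed above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in _UNSPACED_RANGES)


def join_lines(lines):
    """Join lines, adding a space only between two spaced-script characters."""
    result = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        # No space is added if either side of the boundary is unspaced text.
        if result and not (is_unspaced(result[-1]) or is_unspaced(line[0])):
            result += " "
        result += line
    return result


print(join_lines(["これは", "テストです"]))  # Japanese lines, joined without a space
print(join_lines(["Hello", "world"]))        # Latin lines, joined with a space
```

A real implementation would probably also need to handle hyphenation, Korean (which uses spaces), and mixed-script boundaries more carefully than this sketch does.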

Hoping that the above information will be of some help to you.
Thank you so much for your continued support.

Best regards,
rakunavi

- PDF-XChange Editor Plus Version: 9.5 build 368.0
- OS Version: Windows 11 Home 22H2 Build 22621.1555
- PC Model: Lenovo IdeaPad C340-15IWL, HP All-in-One 22-c0xx
TOP desires for PDFXCE
forum.pdf-xchange.com/viewtopic.php?t=39665 LassoTool
forum.pdf-xchange.com/viewtopic.php?t=38554 CmtGarbled
forum.pdf-xchange.com/viewtopic.php?t=37353 FulScrMultiMon
forum.pdf-xchange.com/viewtopic.php?t=41002 DisableTouchSelect
Daniel - PDF-XChange
Site Admin
Posts: 11586
Joined: Wed Jan 03, 2018 6:52 pm

Re: Issue inserting space when merging multi text elements inside app for languages that don't include space between wor

Post by Daniel - PDF-XChange »

Hello, rakunavi

Thank you very much for the detailed post. I have consolidated this into a single bug report. Hopefully these items can all be addressed, but I cannot speak to how high a priority it will be.

RT#6440: Bug: Issues with un-spaced languages.

Kind regards,
Dan McIntyre - Support Technician
PDF-XChange Co. LTD

+++++++++++++++++++++++++++++++++++
Our Web site domain and email address has changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com
rakunavi
User
Posts: 1825
Joined: Sat Sep 11, 2021 5:04 am

Re: Issue inserting space when merging multi text elements inside app for languages that don't include space between wor

Post by rakunavi »

Hi Daniel, thank you for creating the ticket.
TrackerSupp-Daniel wrote: Mon Apr 17, 2023 6:03 pm RT#6440: Bug: Issues with vertical/un-spaced languages.
Regarding the title of the ticket, everything in the above report assumes text written horizontally, the same as the Latin alphabet.

This is by no means to say that issues do not occur in vertical text. The same issues do occur in vertical text, and in fact they can be rather complicated. However, considering the number of tickets you currently have, there is no merit in devoting a few elite development resources to it. I hope you will work on improving issues on horizontal text. I am not that greedy. :wink:

Please give my best regards to the developer.

Best regards,
rakunavi
Daniel - PDF-XChange
Site Admin
Posts: 11586
Joined: Wed Jan 03, 2018 6:52 pm

Re: Issue inserting space when merging multi text elements inside app for languages that don't include space between wor

Post by Daniel - PDF-XChange »

Hello, rakunavi

My apologies, somehow I missed that. Since the other issues you raised recently focused on vertical languages, it seems that I got the details jumbled when creating the title. Thankfully, the ticket is simply a direct link to this topic, so there is no need to worry about a misunderstanding on that end. I will fix the title momentarily.

Kind regards,
Dan McIntyre - Support Technician
PDF-XChange Co. LTD
