Non-English text not searchable in resulting PDF

This forum is for software developers who need help with the Library DLL functions (only) of Tracker Software's PDF-Tools SDK. For PDF print driver topics, please use the PDF-XChange Drivers API SDK forum, or the PDF-XChange Viewer SDK forum if appropriate.

ashmid
User
Posts: 27
Joined: Fri Jan 06, 2012 12:35 am

Non-English text not searchable in resulting PDF

Post by ashmid »

Hello,
I've just gotten started with the SDK, which I am currently evaluating for purchase. Because we do a lot of work in non-English languages, specifically Hebrew, the first thing I tried was to create a sample document containing Hebrew characters. The document was created successfully; however, I found that I could not search for the Hebrew characters within Adobe Acrobat. In contrast, when I create documents with English characters, or with numbers, the characters are searchable. There seems to be something about the encoding/mapping of the non-ASCII characters that is preventing a proper search.
The code I am using goes like this (I use the Arial Unicode MS font because of its wide-ranging support for virtually all Unicode characters):
hr = PXCp_Init(&pdfdoc, NULL, NULL);
hr = PXC_NewDocument(&mydoc, NULL, NULL);
hr = PXC_AddPage(mydoc, 850, 1100, &page);
hr = PXC_AddFontW(mydoc, 400, false, L"Arial Unicode MS", &fontid);
hr = PXC_SetCurrentFont(page, fontid, 10);
hr = PXC_TextOutW(page, &or, UnicodeHebrewString, -1);
hr = PXC_WriteDocumentExW(mydoc, L"temp.pdf", -1, 0, NULL);

Additional Notes:
1] I found a similar topic posted some years ago on the forum (the thread was entitled: "Searchable PDFs"). However, although there was an acknowledgement in that thread that the bug affected multiple languages, there was no resolution posted within the thread.
2] I also tried starting with a properly searchable Hebrew PDF, and then running the ExtractTextToOtherPDFDocument() function [provided with the help file, in the entry for PXCp_ET_AnalyzePageContent], in order to copy the contents of the PDF file to a new PDF file. The new PDF file did indeed look just like the original. However, whereas the original file could be successfully searched in Acrobat, the newly created file could not. Indeed, when comparing the PDF structures of these two documents in Virgo UnionStation, I found that my original file's text nodes displayed as regular Hebrew, while the text nodes of the newly created file contained gibberish in the structure viewer. Of course, as noted, the new file did look correct, so the SDK seems to be successfully mapping these non-Hebrew "gibberish" characters to real Hebrew glyphs. The mapping back from those characters to Hebrew text appears to be missing, however, so the PDF is not searchable.
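One way to picture the "gibberish" text nodes described in note 2: if the embedded font addresses glyphs by internal codes rather than by Unicode values, the page content carries only those codes. The C sketch below assumes a hypothetical encoder that numbers glyphs in order of first use; GlyphTable, glyph_code_for, encode_text, and unicode_for are illustrative names, not part of the PDF-XChange SDK, and the numbering scheme the SDK really uses is not stated in this thread.

```c
#include <stdio.h>
#include <stddef.h>

#define MAX_GLYPHS 64

/* Hypothetical subset table: slot i holds the code point drawn by glyph code i + 1. */
typedef struct {
    unsigned int unicode[MAX_GLYPHS];
    size_t count;
} GlyphTable;

/* Return the glyph code for cp, assigning the next free code on first use.
   Code 0 is reserved for "no glyph" and is returned when the table is full. */
static unsigned int glyph_code_for(GlyphTable *t, unsigned int cp)
{
    for (size_t i = 0; i < t->count; ++i) {
        if (t->unicode[i] == cp)
            return (unsigned int)(i + 1);
    }
    if (t->count == MAX_GLYPHS)
        return 0;
    t->unicode[t->count++] = cp;
    return (unsigned int)t->count;
}

/* Encode text as the hex digits of 2-byte glyph codes, the form a viewer
   sees in the page content. Stops early if out cannot hold another code. */
static void encode_text(GlyphTable *t, const unsigned int *text, size_t n,
                        char *out, size_t outsz)
{
    size_t pos = 0;
    if (outsz == 0)
        return;
    out[0] = '\0';
    for (size_t i = 0; i < n && pos + 5 <= outsz; ++i)
        pos += (size_t)snprintf(out + pos, outsz - pos, "%04X",
                                glyph_code_for(t, text[i]));
}

/* Reverse lookup: the step a glyph-code-to-Unicode table makes possible.
   Returns 0 for an unknown code. */
static unsigned int unicode_for(const GlyphTable *t, unsigned int code)
{
    if (code == 0 || code > t->count)
        return 0;
    return t->unicode[code - 1];
}
```

For the word שלום (U+05E9, U+05DC, U+05D5, U+05DD), encode_text yields the codes 0001 0002 0003 0004. Those codes can still select the right glyphs for display, but a text search has nothing to match against unless the reverse table is also stored in the file.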
ashmid
User
Posts: 27
Joined: Fri Jan 06, 2012 12:35 am

Re: Non-English text not searchable in resulting PDF

Post by ashmid »

OK, I think I've found the solution. I added the following line:
hr = PXC_SetEmbeddingOptions(mydoc, TRUE, TRUE, TRUE);
The last TRUE adds a "ToUnicode" table to the PDF doc, and now my PDFs are indeed searchable.
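For readers wondering what the "ToUnicode" table contains: in the PDF format it is a CMap stream whose bfchar entries pair each character code used in the page content with the Unicode value it stands for. The C sketch below formats one such section; ToUnicodeEntry and write_bfchar are hypothetical helper names for illustration, the codes are made up, and the SDK generates the real table itself once the option is enabled.

```c
#include <stdio.h>
#include <stddef.h>

/* One mapping: character code used in the page content -> Unicode code point.
   This sketch only handles code points in the Basic Multilingual Plane. */
typedef struct {
    unsigned int code;
    unsigned int unicode;
} ToUnicodeEntry;

/* Write a bfchar section of a ToUnicode CMap into buf.
   Returns the number of characters written (excluding the terminator),
   or -1 if buf is too small or n exceeds 100, the conventional
   per-section entry limit for CMap files. */
static int write_bfchar(const ToUnicodeEntry *e, size_t n,
                        char *buf, size_t bufsz)
{
    size_t pos;
    int w;

    if (n > 100)
        return -1;

    w = snprintf(buf, bufsz, "%zu beginbfchar\n", n);
    if (w < 0 || (size_t)w >= bufsz)
        return -1;
    pos = (size_t)w;

    for (size_t i = 0; i < n; ++i) {
        /* Each line reads <source code> <UTF-16BE value>, both in hex. */
        w = snprintf(buf + pos, bufsz - pos, "<%04X> <%04X>\n",
                     e[i].code, e[i].unicode);
        if (w < 0 || (size_t)w >= bufsz - pos)
            return -1;
        pos += (size_t)w;
    }

    w = snprintf(buf + pos, bufsz - pos, "endbfchar\n");
    if (w < 0 || (size_t)w >= bufsz - pos)
        return -1;
    pos += (size_t)w;

    return (int)pos;
}
```

With entries such as {1, 0x05E9} and {2, 0x05DC}, the section contains the lines <0001> <05E9> and <0002> <05DC>. A viewer like Acrobat can then translate the codes in the page content back to Hebrew letters when searching or copying text.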
Paul - PDF-XChange
Site Admin
Posts: 7463
Joined: Wed Mar 25, 2009 10:37 pm

Re: Non-English text not searchable in resulting PDF

Post by Paul - PDF-XChange »

Hi ashmid,

That's great to hear! Just be sure to let us know if there is anything you need from us here.

Best regards

Paul O'Rorke
PDF-XChange Support
http://www.pdf-xchange.com