Non-English text not searchable in resulting PDF

Post by **ashmid** » Fri Jan 06, 2012 1:39 am

Hello,
I've just gotten started with the SDK, which I am currently evaluating for purchase. Because we do a lot of work in non-English languages, specifically Hebrew, the first thing I tried was to create a sample document containing Hebrew characters. The document was created successfully; however, I found that I was not able to search for the Hebrew characters within Adobe Acrobat. In contrast, when I create documents with English characters, or with numbers, the characters are searchable. There seems to be something about the encoding/mapping of the non-ASCII characters that is preventing a proper search.
The code I am using goes like this (I use the Arial Unicode MS font because of its wide-ranging support for virtually all Unicode characters):
hr = PXCp_Init(&pdfdoc, NULL, NULL);
hr = PXC_NewDocument(&mydoc, NULL, NULL);
hr = PXC_AddPage(mydoc, 850, 1100, &page);
hr = PXC_AddFontW(mydoc, 400, false, L"Arial Unicode MS", &fontid);
hr = PXC_SetCurrentFont(page, fontid, 10);
hr = PXC_TextOutW(page, &or, UnicodeHebrewString, -1);
hr = PXC_WriteDocumentExW(mydoc, L"temp.pdf", -1, 0, NULL);

Additional Notes:
1] I found a similar topic posted some years ago on the forum (the thread was entitled: "Searchable PDFs"). However, although there was an acknowledgement in that thread that the bug affected multiple languages, there was no resolution posted within the thread.
2] I also tried starting with a properly searchable Hebrew PDF, and then running the ExtractTextToOtherPDFDocument() function [provided with the help file, in the entry for PXCp_ET_AnalyzePageContent], in order to copy the contents of the PDF file to a new PDF file. The resulting PDF file did indeed look just like the original. However, whereas the original file could be successfully searched in Acrobat, the newly created file could not. Indeed, when comparing the PDF structures of these two documents in Virgo UnionStation, I found that my original file's text nodes displayed as regular Hebrew, while the text nodes of the newly created file contained gibberish in the structure viewer. Of course, as noted, the resulting file did look correct, so the SDK seems to be successfully mapping these non-Hebrew "gibberish" characters to real Hebrew glyphs. The end result, however, is a PDF that is not searchable.

Post by **ashmid** » Fri Jan 06, 2012 2:06 am

OK, I think I've found the solution. I added the following line:
hr = PXC_SetEmbeddingOptions(mydoc, TRUE, TRUE, TRUE);
The last TRUE adds a "ToUnicode" table to the PDF doc, and now my PDFs are indeed searchable.
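
For anyone wondering why that table matters, here is a rough toy model in C++17. It does not use the SDK at all: `ToyFont`, `makeToyFont`, `writeText`, `extractText` and `isSearchable` are hypothetical names made up for this illustration, and the glyph numbering is arbitrary. The idea it sketches is that the page content stores glyph codes rather than Unicode text, so a viewer can draw the glyphs correctly but can only search them if a reverse (glyph code to Unicode) table is available. Real viewers may apply other fallbacks, so treat this as a simplification.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Toy model only (not the PDF-XChange SDK): page content holds glyph codes.
using GlyphCode = std::uint16_t;

struct ToyFont {
    std::map<char32_t, GlyphCode> encode;     // used by the writer
    std::map<GlyphCode, char32_t> toUnicode;  // optional reverse table for readers
};

// Build a font covering the Hebrew letters U+05D0..U+05EA.
// Glyph code 0 is reserved for "missing glyph"; the numbering is an assumption.
ToyFont makeToyFont(bool withToUnicode) {
    ToyFont font;
    GlyphCode next = 1;
    for (char32_t ch = U'\u05D0'; ch <= U'\u05EA'; ++ch, ++next) {
        font.encode[ch] = next;
        if (withToUnicode) {
            font.toUnicode[next] = ch;
        }
    }
    return font;
}

// Writer side: turn Unicode text into the glyph codes stored on the page.
std::vector<GlyphCode> writeText(const ToyFont& font, const std::u32string& text) {
    std::vector<GlyphCode> codes;
    for (char32_t ch : text) {
        auto it = font.encode.find(ch);
        codes.push_back(it != font.encode.end() ? it->second : GlyphCode{0});
    }
    return codes;
}

// Reader side: recover Unicode text, which is only possible with the reverse table.
std::optional<std::u32string> extractText(const ToyFont& font,
                                          const std::vector<GlyphCode>& codes) {
    if (font.toUnicode.empty()) {
        return std::nullopt;  // glyphs can still be drawn, but not read back as text
    }
    std::u32string text;
    for (GlyphCode code : codes) {
        auto it = font.toUnicode.find(code);
        text.push_back(it != font.toUnicode.end() ? it->second : U'\uFFFD');
    }
    return text;
}

// Search succeeds only if the text can be extracted and contains the needle.
bool isSearchable(const ToyFont& font, const std::vector<GlyphCode>& codes,
                  const std::u32string& needle) {
    auto text = extractText(font, codes);
    return text.has_value() && text->find(needle) != std::u32string::npos;
}
```

In this model, `makeToyFont(false)` and `makeToyFont(true)` produce identical glyph codes for the same Hebrew string, so both "pages" would look the same, but `isSearchable` only returns true for the font built with the reverse table. That matches what I saw: the document rendered correctly either way and only became searchable once the ToUnicode table was written.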

Post by **Paul - PDF-XChange** » Fri Jan 06, 2012 5:14 pm

Hi ashmid,

That's great to hear! Just be sure to let us know if there is anything you need from us here.

regards
