Hello,
I've just gotten started with the SDK, which I am currently evaluating for purchase. Because we do a lot of work in non-English languages, specifically Hebrew, the first thing I tried was to create a sample document containing Hebrew characters. The document was created successfully; however, I found that I could not search for the Hebrew characters within Adobe Acrobat. In contrast, when I create documents with English characters, or with numbers, the characters are searchable. There seems to be something about the encoding/mapping of the non-ASCII characters that is preventing a proper search.
The code I am using goes like this (I use the Arial Unicode MS font because of its wide-ranging support for virtually all Unicode characters):
hr = PXCp_Init(&pdfdoc, NULL, NULL);
hr = PXC_NewDocument(&mydoc, NULL, NULL);
hr = PXC_AddPage(mydoc, 850, 1100, &page);
hr = PXC_AddFontW(mydoc, 400, false, L"Arial Unicode MS", &fontid);
hr = PXC_SetCurrentFont(page, fontid, 10);
hr = PXC_TextOutW(page, &or, UnicodeHebrewString, -1);
hr = PXC_WriteDocumentExW(mydoc, L"temp.pdf", -1, 0, NULL);
Additional Notes:
1] I found a similar topic posted some years ago on the forum (the thread was entitled: "Searchable PDFs"). However, although there was an acknowledgement in that thread that the bug affected multiple languages, there was no resolution posted within the thread.
2] I also tried starting with a properly searchable Hebrew PDF, and then running the ExtractTextToOtherPDFDocument() function [provided with the help file, in the entry for PXCp_ET_AnalyzePageContent], in order to copy the contents of the PDF file to a new PDF file. The new PDF file did indeed look just like the original. However, whereas the original file could be successfully searched in Acrobat, the newly created file could not. Indeed, when comparing the PDF structures of these two documents in Virgo UnionStation, I found that my original file's text nodes displayed as regular Hebrew, while the text nodes of the newly created file contained gibberish in the structure viewer. Of course, as noted, the new file did look correct, so the SDK seems to be successfully mapping these non-Hebrew "gibberish" characters to real Hebrew glyphs. The upshot, however, is that the new PDF is not searchable.
Non-English text not searchable in resulting PDF
-
ashmid
- User
- Posts: 27
- Joined: Fri Jan 06, 2012 12:35 am
Re: Non-English text not searchable in resulting PDF
OK, I think I've found the solution. I added the following line:
hr = PXC_SetEmbeddingOptions(mydoc, TRUE, TRUE, TRUE);
The last TRUE adds a "ToUnicode" table to the PDF document, and now my PDFs are indeed searchable.
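For readers wondering what a "ToUnicode" table contains: in the PDF format it is a small CMap stream attached to a font that maps each glyph code used in the page content back to a Unicode value, which is what a viewer such as Acrobat needs in order to search or copy the text. The C++ sketch below formats such a mapping as a "bfchar" section. It only illustrates the file format and is not the SDK's implementation; the helper name build_bfchar_section and the glyph codes in the usage note are made up for illustration.

```cpp
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>

// Hypothetical helper (not part of the SDK): formats glyph-code -> Unicode
// pairs as one "bfchar" section of a PDF ToUnicode CMap.
// Assumptions: 2-byte glyph codes, code points in the Basic Multilingual
// Plane only. A real CMap also needs a header and trailer, and long
// mappings are normally split into sections of at most 100 entries.
std::string build_bfchar_section(const std::map<std::uint16_t, std::uint16_t>& glyph_to_unicode) {
    std::string out = std::to_string(glyph_to_unicode.size()) + " beginbfchar\n";
    char line[32];
    for (const auto& entry : glyph_to_unicode) {
        // Each entry reads "<glyph code> <UTF-16BE value>", both in hex.
        std::snprintf(line, sizeof line, "<%04X> <%04X>\n",
                      static_cast<unsigned>(entry.first),
                      static_cast<unsigned>(entry.second));
        out += line;
    }
    out += "endbfchar\n";
    return out;
}
```

For example, passing {{0x02A1, 0x05D0}, {0x02A2, 0x05D1}} (made-up glyph codes for Alef and Bet) yields a section containing the lines <02A1> <05D0> and <02A2> <05D1>. Without such a table a viewer sees only the glyph codes, which would be consistent with the gibberish I saw in the structure viewer.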
-
Paul - PDF-XChange
- Site Admin
- Posts: 7463
- Joined: Wed Mar 25, 2009 10:37 pm
Re: Non-English text not searchable in resulting PDF
Hi ashmid,
that's great to hear! Just be sure to let us know if there is anything you need from us here.
regards
Best regards
Paul O'Rorke
PDF-XChange Support
http://www.pdf-xchange.com