OCR_LoadA not possible with umlauts

PDF-X OCR SDK is a new product from us, intended to complement our existing PDF and imaging tools and to provide developers with an expanding set of professional tools for Optical Character Recognition tasks.

Moderators: PDF-XChange Support, Daniel - PDF-XChange, Chris - PDF-XChange, Sean - PDF-XChange, Vasyl - PDF-XChange, Stefan - PDF-XChange

Dorwol
User
Posts: 275
Joined: Mon Aug 04, 2008 5:04 pm

OCR_LoadA not possible with umlauts

Post by Dorwol »

Hello!

I need a bug fix very quickly!

res = OCR_LoadA(doc, FileName) does not work if the file name contains "öäü" or "ÖÄÜ"!

For example

res = OCR_LoadA(doc, "C:\Test.pdf") will work!

res = OCR_LoadA(doc, "C:\Testö.pdf") will not work!

But the biggest problem is that I need a bug fix very urgently!

CAN YOU HELP PLEASE?
Stefan - PDF-XChange
Site Admin
Posts: 19942
Joined: Mon Jan 12, 2009 8:07 am

Re: OCR_LoadA not possible with umlauts

Post by Stefan - PDF-XChange »

Hello Dorwol,

I will pass this to Walter, and he will advise here in this topic a bit later today.

Best,
Stefan
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: OCR_LoadA not possible with umlauts

Post by Walter-Tracker Supp »

OCR_LoadA() only accepts ASCII strings (char*, "ascii string literal", LPSTR, CStringA, etc).

Use OCR_LoadW() and pass the filename as a wide string (i.e., L"unicodestring", or WCHAR*/wchar_t*/LPWSTR/CStringW/BSTR/etc). That will take care of your umlaut or any other Unicode character.

-Walter
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: OCR_LoadA not possible with umlauts

Post by Walter-Tracker Supp »

BTW, if this doesn't work and you still have trouble, please email us at [email protected]. We watch the forum constantly, but an email will get our attention fastest.

-Walter
Dorwol
User
Posts: 275
Joined: Mon Aug 04, 2008 5:04 pm

Re: OCR_LoadA not possible with umlauts

Post by Dorwol »

Walter-Tracker Supp wrote:Use OCR_LoadW() and pass the filename as a wide string ....That will take care of your umlaut or any other unicode character.
Thank you for this very fast help. Marvelous! And yes, this works! :D ...

...but...
Walter-Tracker Supp wrote:OCR_LoadA() only accepts ASCII strings
1. "Umlauts" are not Unicode. They are part of the regular ASCII table (for example, asc("ä") is ASCII value 228!).
2. OCR_SaveA() from your ocrtools.dll works with "umlauts" :!: So why does SaveA work but LoadA does not?
3. All your other components will work with "umlauts".

...so I am a little bit uncertain whether this is really correct. :|
Stefan - PDF-XChange
Site Admin
Posts: 19942
Joined: Mon Jan 12, 2009 8:07 am

Re: OCR_LoadA not possible with umlauts

Post by Stefan - PDF-XChange »

Hello Dorwol,

There are a myriad of ASCII variants, and the byte value interpreted at your end as "ä" could be the Greek letter delta (δ) in Windows-1253, or the letter "д" in the Cyrillic code table (Windows-1251).

So ideally OCR_LoadA() should only be used if all of the symbols in the name and path are from the "English part" of the codetable. I will ask Walter to elaborate on why SaveA works and LoadA doesn't in your case.
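
The point about code tables can be checked outside the SDK. The following is a minimal Python sketch (Python is used only for illustration and is not part of the OCR SDK) that decodes the single byte 228 under a few legacy code pages and reports the Unicode name of the resulting character. The selection of code pages is an assumption made for the example.

```python
import unicodedata

# Byte 228 (hex E4): the value quoted in this thread for a-umlaut.
BYTE = bytes([228])

# Code pages chosen for illustration only.
CODE_PAGES = ("cp1252", "cp1253", "cp1251", "cp437")

# Map each code page to the Unicode name of the character it assigns to byte 228.
names = {cp: unicodedata.name(BYTE.decode(cp)) for cp in CODE_PAGES}

for cp, name in names.items():
    print(f"{cp}: {name}")
```

Only the Western European code page (cp1252) yields the a-umlaut here; the other three map the same byte to a Greek or Cyrillic letter.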

Best,
Stefan
Dorwol
User
Posts: 275
Joined: Mon Aug 04, 2008 5:04 pm

Re: OCR_LoadA not possible with umlauts

Post by Dorwol »

Tracker Supp-Stefan wrote:I will ask Walter to elaborate on why SaveA works and LoadA doesn't in your case.
Yes, please! :)
Stefan - PDF-XChange
Site Admin
Posts: 19942
Joined: Mon Jan 12, 2009 8:07 am

Re: OCR_LoadA not possible with umlauts

Post by Stefan - PDF-XChange »

:)
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: OCR_LoadA not possible with umlauts

Post by Walter-Tracker Supp »

The short answer is that you should only use the ASCII version of the load and save functions for characters #32 (space) to #126 (tilde) - the lower "standard" part of the ASCII character table. These include unaccented Latin letters, numbers, and the symbols that happen to be on English keyboards (!@#$%^&*()_+-=, etc). Otherwise, use the wide / Unicode version. It avoids all kinds of potential complications.
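
As a rough illustration of that rule, here is a small Python sketch of a check for the #32 to #126 range. The helper name is_plain_ascii is hypothetical and is not part of the SDK; it only shows how a caller might test a path before deciding which variant to use.

```python
def is_plain_ascii(path: str) -> bool:
    """Return True if every character lies in the printable ASCII range 32..126."""
    return all(32 <= ord(ch) <= 126 for ch in path)


# The two file names from the first post; \u00f6 is o-umlaut.
for candidate in ("C:\\Test.pdf", "C:\\Test\u00f6.pdf"):
    verdict = "narrow version is safe" if is_plain_ascii(candidate) else "use the wide version"
    print(ascii(candidate), "->", verdict)
```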

In essence, not all umlaut-a are created equal. Each character in ASCII is only represented by 8 bits (1 byte), which only gives 256 possible values, so to accommodate different languages (European languages, Cyrillic, Greek, etc), extended code pages were developed. The extended regions stretch from #128 up to #255 (decimal), and the contents of this region may vary depending on the code page that is in effect. Letters which look the same may have different numbers in different code pages (or alternatively, the same number may have a different letter meaning depending on the code page in use).

For example, on code page 437 (which was the standard on many older pre-Unicode computers, and is sometimes still used), LATIN SMALL LETTER A WITH DIAERESIS (umlaut a) is hex code #84 (decimal 132). In other common code pages (and also in Unicode), the same letter is represented by hex code #E4 (decimal 228). So the meaning of umlaut-a is really not particularly clear cut in ASCII (even if most modern code pages try to use decimal 228 for it).
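
Those numbers can be reproduced with a short Python sketch (illustration only, unrelated to the SDK itself). It encodes the same letter under code page 437 and two Western European encodings, then compares the byte values with the Unicode code point.

```python
A_UMLAUT = "\u00e4"  # LATIN SMALL LETTER A WITH DIAERESIS

# Byte value of the letter in each single-byte encoding.
byte_values = {cp: A_UMLAUT.encode(cp)[0] for cp in ("cp437", "cp1252", "latin-1")}

# The Unicode code point of the same letter.
code_point = ord(A_UMLAUT)

for cp, value in byte_values.items():
    print(f"{cp}: decimal {value}, hex {value:02X}")
print(f"Unicode code point: decimal {code_point}, hex {code_point:02X}")
```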

So best to just bypass the whole mess and use the Unicode functions.

As to why Load fails but Save works - well, Load must perfectly match whatever is on the filesystem to find the file. Save can save to any filename you want ;)
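
To make that asymmetry concrete, here is a toy Python model. It is only a sketch of the reasoning, not of what the SDK does internally: the "disk" is a plain dictionary, and the load and save helpers as well as the choice of code pages are assumptions made for the example.

```python
RAW_NAME = b"Test\xe4.pdf"  # the bytes a caller hands to a narrow ("A") function


def load(disk: dict, raw: bytes, codec: str) -> bytes:
    """Look up the decoded name; raises KeyError unless it matches exactly."""
    return disk[raw.decode(codec)]


def save(disk: dict, raw: bytes, codec: str, data: bytes) -> str:
    """Store data under the decoded name, whatever that name turns out to be."""
    name = raw.decode(codec)
    disk[name] = data
    return name


# The file on the toy disk was created with the name as cp1252 reads it.
disk = {RAW_NAME.decode("cp1252"): b"original"}

try:
    load(disk, RAW_NAME, "cp437")  # same bytes, different code page
except KeyError as missing:
    print("load failed, no file named", ascii(missing.args[0]))

# Saving with the mismatched code page still succeeds; it just creates a new name.
print("saved as", ascii(save(disk, RAW_NAME, "cp437", b"copy")))
print("names on disk:", sorted(ascii(name) for name in disk))
```

In this model a mismatched save goes unnoticed because it simply creates a second, differently named entry, while a mismatched load has nothing to find.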

-Walter