OCR_LoadA not possible with umlauts

Dorwol · Post by **Dorwol** » Fri Feb 24, 2012 3:10 pm

Hello!

I need a bugfix very quickly!

res = OCR_LoadA(doc, FileName) will not work if the file name contains "öäü" or "ÖÄÜ"!

For example

res = OCR_LoadA(doc, "C:\Test.pdf") will work!

res = OCR_LoadA(doc, "C:\Testö.pdf") will not work!

But the biggest problem is that I need a bugfix very urgently!

CAN YOU HELP PLEASE?

Fri Feb 24, 2012 3:25 pm

Hello Dorwol,

I will pass this to Walter, and he will advise here in this topic a bit later today.

Best,
Stefan

Walter-Tracker Supp · Post by **Walter-Tracker Supp** » Fri Feb 24, 2012 5:10 pm

OCR_LoadA() only accepts ASCII strings (char*, "ascii string literal", LPSTR, CStringA, etc).

Use OCR_LoadW() and pass the filename as a wide string (i.e., L"unicodestring", or WCHAR*/wchar_t*/LPWSTR/CStringW/BSTR/etc). That will take care of your umlaut or any other Unicode character.

-Walter

Walter-Tracker Supp · Post by **Walter-Tracker Supp** » Fri Feb 24, 2012 5:24 pm

BTW, if this doesn't work and you still have trouble, please email us at [email protected]. We watch the forum constantly, but an email will get our attention fastest.

-Walter

Dorwol · Post by **Dorwol** » Sat Feb 25, 2012 5:53 am

Walter-Tracker Supp wrote:Use OCR_LoadW() and pass the filename as a wide string ....That will take care of your umlaut or any other unicode character.

Thank you for this very fast help. Marvelous! And yes, this works!

...

...but...

Walter-Tracker Supp wrote:OCR_LoadA() only accepts ASCII strings

1. Umlauts are not Unicode. They are part of the regular ASCII table (for example, asc("ä") is ASCII value 228!).
2. OCR_SaveA() from your ocrtools.dll works with umlauts. So why does SaveA work when LoadA does not?
3. All your other components work with umlauts.

...so I am a little uncertain whether this is really correct.

Mon Feb 27, 2012 11:16 am

Hello Dorwol,

There are a myriad of 8-bit extensions of ASCII (code pages), and the value interpreted at your end as "ä" could be a Greek delta (δ) in the Greek code table (Windows-1253), or the letter "д" in a Cyrillic version of the code table (Windows-1251).

So ideally OCR_LoadA() should only be used if all of the symbols in the name and path are from the "English part" of the code table. I will ask Walter to elaborate on why SaveA works and LoadA doesn't in your case.
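
To make the point concrete, here is a minimal sketch in Python (chosen only because it ships with the relevant code page tables; it does not touch ocrtools.dll). It decodes the single byte 228 under a few code pages and prints the Unicode name of the resulting character. The variable names are illustrative only.

```python
import unicodedata

# The single byte 228 (hex E4), as it would appear in an 8-bit "A" string.
raw = bytes([228])

# A few 8-bit code pages: Western European, Cyrillic, Greek, and old DOS.
code_pages = ["cp1252", "cp1251", "cp1253", "cp437"]

# Same byte, decoded under each code page.
decoded = {cp: raw.decode(cp) for cp in code_pages}

for cp, ch in decoded.items():
    # Print code point and official Unicode name (ASCII-only output).
    print(f"{cp}: U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The same byte comes out as a different letter under each code page, which is why an 8-bit filename is only unambiguous when both sides agree on the code page.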

Best,
Stefan

Dorwol · Post by **Dorwol** » Mon Feb 27, 2012 11:20 am

Tracker Supp-Stefan wrote:I will ask Walter to elaborate on why SaveA works and LoadA doesn't in your case.

Yes, please!

Mon Feb 27, 2012 11:53 am

Walter-Tracker Supp · Post by **Walter-Tracker Supp** » Mon Feb 27, 2012 6:15 pm

The short answer is that you should only use the ASCII version of the load and save functions for characters #32 (space) to #126 (tilde), the lower "standard" part of the ASCII character table. These include unaccented Latin letters, numbers, and the symbols that happen to be on English keyboards (!@#$%^&*()_+-=, etc). Otherwise, use the wide / Unicode version. It avoids all kinds of potential complications.
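
As a rough sketch of that rule (in Python, purely for illustration), a caller could check a path before deciding which entry point to use. The helper names `is_safe_for_ansi_api` and `pick_loader` are hypothetical, and the function only returns the name of the entry point; it does not call the DLL.

```python
def is_safe_for_ansi_api(path: str) -> bool:
    """Return True if every character is in the range #32 (space) to #126 (tilde)."""
    return all(32 <= ord(ch) <= 126 for ch in path)


def pick_loader(path: str) -> str:
    """Name the entry point suggested by the rule above (hypothetical helper)."""
    return "OCR_LoadA" if is_safe_for_ansi_api(path) else "OCR_LoadW"


print(pick_loader("C:\\Test.pdf"))
print(pick_loader("C:\\Test\u00f6.pdf"))  # path containing o-umlaut
```

In practice it is probably simpler to always call the wide version, since it also handles plain ASCII names.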

In essence, not all umlaut-a are created equal. Each character in these 8-bit tables is represented by only 8 bits (1 byte), which gives just 256 possible values, so to accommodate different languages (European languages, Cyrillic, Greek, etc), extended code pages were developed. The extended region stretches from #128 up to #255 (decimal), and the contents of this region may vary depending on the code page that is in effect. Letters which look the same may have different numbers in different code pages (or alternatively, the same number may have a different letter meaning depending on the code page in use).

For example, on code page 437 (which was the standard on many older pre-Unicode computers, and is sometimes still used), LATIN SMALL LETTER A WITH DIAERESIS (umlaut a) is hex code #84 (decimal 132). In other common code pages (and also Unicode), the same letter is represented by hex code #E4 (decimal 228). So the meaning of umlaut-a is really not particularly clear cut in ASCII (even if most modern code pages try to use decimal 228 for it).
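
The numbers above can be checked with a short Python sketch (again only an illustration using Python's built-in code page tables, with illustrative variable names). It encodes the same letter under code page 437 and under two Western European code pages, then compares with the Unicode code point.

```python
# LATIN SMALL LETTER A WITH DIAERESIS, written as an escape to avoid
# depending on the encoding of this source file.
a_umlaut = "\u00e4"

# Same letter, encoded under different 8-bit code pages.
encoded = {cp: a_umlaut.encode(cp) for cp in ("cp437", "cp1252", "latin-1")}

for cp, data in encoded.items():
    print(f"{cp}: hex {data.hex().upper()} (decimal {data[0]})")

# Unicode code point of the same letter.
print(f"Unicode: decimal {ord(a_umlaut)}")
```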

So it is best to just bypass the whole mess and use the Unicode functions.

As to why Load fails but Save works: Load must exactly match whatever name is on the filesystem to find the file, whereas Save can save to any filename you want.
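
A small simulation of that asymmetry, as a sketch only: it assumes (this is an assumption, not confirmed behavior of ocrtools.dll) that the name is written by the caller in one code page and interpreted in another. The "filesystem" is just a Python set, and `reinterpret` is a hypothetical helper.

```python
def reinterpret(name: str, written_as: str, read_as: str) -> str:
    """Encode a name in one code page and decode the same bytes in another."""
    return name.encode(written_as).decode(read_as)


# Simulated filesystem holding the real file name (with o-umlaut).
files_on_disk = {"Test\u00f6.pdf"}

# ASSUMPTION: the caller's bytes are cp1252, but they get read as cp437.
requested = reinterpret("Test\u00f6.pdf", "cp1252", "cp437")

# Load: the lookup has to match an existing name exactly, so it fails.
load_ok = requested in files_on_disk

# Save: any name is acceptable, so a (differently named) file is created.
files_on_disk.add(requested)

print("load found the file:", load_ok)
print("files after save:", len(files_on_disk))
```

Under this assumption a save "works" in the sense that a file gets written, but its name on disk may not be the one the caller intended.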

-Walter
