Latin and Gothic Letters for OCR

Posted: Wed Oct 16, 2013 4:36 pm
by Ludwig
Hi there,

I would like to ask if you could please add Latin to the list of additional OCR languages. Even more important for me would be the ability to OCR books printed in Gothic (blackletter) type. I am especially looking for the so-called "Unger-Fraktur", as many German books from the 19th century were printed in this typeface. Do you think this is possible?

Thanks a lot
Ludwig

Re: Latin and Gothic Letters for OCR

Posted: Wed Oct 16, 2013 4:53 pm
by Walter-Tracker Supp
We will add Slovakian, Swedish, and German "fraktur" language data in the final release of the editor. We will not have direct Latin support, though results using English (or even another Latin-alphabet language) as the language selection will be fairly good, since the word dictionary weighting is fairly weak (i.e., it will not dominate results too seriously).

-Walter

Re: Latin and Gothic Letters for OCR

Posted: Wed Oct 16, 2013 8:24 pm
by Ludwig
I am very much looking forward to the final release then! Do you have a rough time horizon?

Re: Latin and Gothic Letters for OCR

Posted: Wed Oct 16, 2013 8:46 pm
by Paul - PDF-XChange
Hi Ludwig,

Walter tells me this should be available in the next few weeks.

hth

Re: Latin and Gothic Letters for OCR

Posted: Sat Oct 19, 2013 12:09 am
by Walter-Tracker Supp
Ludwig, I have prepared the Fraktur language pack and sent it to our installation guys. It may be a few days before it becomes available on the website but I thought I would update you to let you know that it will be very soon. It will work with both the viewer and the editor.

-Walter

Re: Latin and Gothic Letters for OCR

Posted: Sat Oct 19, 2013 9:05 am
by Ludwig
Thanks Walter! This is really good news.

Re: Latin and Gothic Letters for OCR

Posted: Mon Oct 21, 2013 10:28 am
by Stefan - PDF-XChange
:)

Re: Latin and Gothic Letters for OCR

Posted: Tue Oct 22, 2013 6:19 pm
by Walter-Tracker Supp
Ludwig, I have attached the language pack to this post, because I guess it will still be a few days before it is on the website, since our installer people are very busy with the new editor release. You will have to place the files in your language directory yourself, and we cannot provide support for this, since we will have a proper installer generated pretty shortly. Languages for the *viewer* are placed in a directory called "ocrdats" under the main Viewer installation directory, e.g.:

C:\Program Files\Tracker Software\PDF Viewer\ocrdats

In the editor, you will have to find PluginsData\OCRLanguages, e.g.:

C:\Program Files\Tracker Software\PDF Editor\PluginsData\OCRLanguages

Copy all the .lng and .dat files into those directories and you should see the Fraktur choices in your OCR preferences / run dialog.
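
For anyone who prefers to script the copy step, here is a minimal sketch in Python. It assumes the attached pack has already been unpacked into a folder of your choosing. The function name and the folder in the usage comment are made up for illustration, and the two install paths are just the examples given above, so adjust them to your own installation. Writing under Program Files normally requires an elevated (administrator) prompt.

```python
import shutil
from pathlib import Path

# Example install locations quoted in this post; your paths may differ.
VIEWER_DIR = Path(r"C:\Program Files\Tracker Software\PDF Viewer\ocrdats")
EDITOR_DIR = Path(r"C:\Program Files\Tracker Software\PDF Editor\PluginsData\OCRLanguages")

LANGUAGE_SUFFIXES = {".lng", ".dat"}


def install_language_pack(src_dir, dest_dir):
    """Copy every .lng and .dat file from src_dir into dest_dir.

    Returns the names of the copied files, sorted alphabetically.
    (Hypothetical helper, not part of any Tracker Software tool.)
    """
    src = Path(src_dir)
    dest = Path(dest_dir)
    if not dest.is_dir():
        raise FileNotFoundError(f"language directory not found: {dest}")

    copied = []
    for path in sorted(src.iterdir()):
        if path.is_file() and path.suffix.lower() in LANGUAGE_SUFFIXES:
            shutil.copy2(path, dest / path.name)
            copied.append(path.name)
    return copied


# Usage (the source folder name is an assumption):
#   install_language_pack(r"C:\Temp\fraktur_pack", VIEWER_DIR)
#   install_language_pack(r"C:\Temp\fraktur_pack", EDITOR_DIR)
```

The destination check is deliberate: if the directory does not exist, the install path is probably wrong, and it is safer to stop than to create a folder the application never reads.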

-Walter

Re: Latin and Gothic Letters for OCR

Posted: Wed Oct 23, 2013 11:17 am
by Ludwig
Hi Walter,
Thank you very much for the files. Using the Viewer Pro (not the Editor), I have tried the new German Fraktur on three books so far (I don't really know what Swedish and Slovakian Fraktur are, so I didn't try those): very promising! Great job!
Ludwig

Re: Latin and Gothic Letters for OCR

Posted: Wed Oct 23, 2013 4:35 pm
by Will - Tracker Supp
Great! I'll pass the message along to Walter :D

Re: Latin and Gothic Letters for OCR

Posted: Thu Oct 24, 2013 11:05 am
by Ludwig
Hi, is there a way to train the OCR program for better recognition of Fraktur letters? I found that the program systematically misreads "ch", which is turned into just "c". For example, "Bezeicnung" instead of "Bezeichnung".

Re: Latin and Gothic Letters for OCR

Posted: Thu Oct 24, 2013 5:01 pm
by Walter-Tracker Supp
Not at the moment. We may release a tool to help with training in the future. However, if you feel ambitious you can email us at support@pdf-xchange.com and I can point you in the right direction, but can't provide detailed support for it - you'd be on your own.
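
Until a training tool exists, one possible workaround for a systematic error like the "ch" to "c" misreading is to post-correct the recognized text. The sketch below is not part of the OCR engine. It assumes you can export the OCR text and that you supply your own German word list; the function name and the tiny word list in the usage comment are hypothetical. A word is changed only if it is unknown and inserting an "h" after a "c" produces a known word.

```python
import re


def fix_missing_h(text, known_words):
    """Repair words where OCR dropped the "h" of a "ch" pair.

    known_words is any iterable of correctly spelled words (assumed to be
    supplied by the user). Words already in the list are left untouched.
    """
    known = {w.lower() for w in known_words}

    def repair(match):
        word = match.group(0)
        if word.lower() in known:
            return word
        # Try an "h" after each "c" that is not already followed by "h".
        for m in re.finditer(r"c(?!h)", word):
            candidate = word[:m.end()] + "h" + word[m.end():]
            if candidate.lower() in known:
                return candidate
        return word  # no safe correction found, keep the OCR output

    # [^\W\d_]+ matches runs of letters, including umlauts and eszett.
    return re.sub(r"[^\W\d_]+", repair, text)


# Usage with a toy word list:
#   fix_missing_h("Die Bezeicnung", ["die", "Bezeichnung"])
```

Requiring the corrected form to be in the word list keeps the rule conservative: legitimate words containing a bare "c" are not touched unless they are both unknown and one "h" away from a known word.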

Re: Latin and Gothic Letters for OCR

Posted: Sat Nov 09, 2013 9:39 am
by zzmarko
Walter-Tracker Supp wrote: We will add Slovakian, Swedish, and German "fraktur" language data in the final release of the editor. We will not have direct Latin support, though results using English (or even other Latin alphabet) language selection will be fairly good since the word dictionary weighting is fairly weak (ie, it will not dominate results too seriously).

-Walter
Will the Croatian language be added as well?

thank you

Re: Latin and Gothic Letters for OCR

Posted: Mon Nov 11, 2013 5:08 pm
by Walter-Tracker Supp
Croatian will be available on or before the next build, anticipated in about a month's time. Meanwhile you can use any other language we provide which uses the same diacritics, if applicable (I'm not familiar with Croatian myself), because the word dictionary coupling is weak.

I will update this forum posting once we have included it.

Re: Latin and Gothic Letters for OCR

Posted: Sat Nov 26, 2016 12:38 pm
by Leonatus
This thread is quite old; nevertheless, I wish to express my sincere thanks for the "German Fraktur" OCR set! I had been desperately searching for this feature!

Re: Latin and Gothic Letters for OCR

Posted: Sat Nov 26, 2016 12:42 pm
by John - Tracker Supp
Pleasure :)

Re: Latin and Gothic Letters for OCR

Posted: Fri Jun 14, 2024 7:19 am
by YC Niu
(Revised for clarity)

Hi, I would like to know which "OCR language" option can OCR the diacritics in the IAST set [1]. They comprise these 17 pairs of characters with diacritics:

Ā Ī Ū ṚṜ ḶḸḺ Ṃ ṄÑṆ Ḥ Ṭ Ḍ ŚṢ
ā ī ū ṛṝ ḷḹḻ ṃ ṅñṇ ḥ ṭ ḍ śṣ

Examples of IAST text can be seen in [2] and [3]. [2] is a scanned PDF, an example of what I want to OCR.

I am a native Chinese user unfamiliar with the "OCR language" choice in this situation. So, I blindly tried some Europe-related OCR language options, as listed in [4], but none was the correct choice.

Best Regards.

YC Niu

Reference
[1] Diacritics in IAST set
https://en.wikipedia.org/wiki/International_Alphabet_of_Sanskrit_Transliteration

[2] Example 1 of IAST text (scanned PDF)
https://archive.org/details/dhatukatha-pts/PTS-Digha-Nikaya-vol-I-TWRD-Carpenter-1899/page/1/mode/2up

[3] Example 2 of IAST text (HTML; right-half side of the web page)
https://suttacentral.net/dn1/en/sujato?layout=sidebyside&reference=none&notes=asterisk&highlight=false&script=latin

[4] In this list, none of them are suitable for IAST.
localname="Čeština" name="Czech"
localname="Deutsch" name="German"
localname="Español" name="Spanish; Castilian"
localname="Français" name="French"
localname="Română" name="Romanian; Moldavian; Moldovan"
localname="Suomi" name="Finnish"
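
When trying language options one by one, it can help to check the recognized text mechanically instead of by eye. The following Python sketch assumes the OCR result has been copied or exported as plain text; the function name is made up for illustration. It reports which of the 34 IAST characters listed above never appear in that text. Both the character list and the input are normalized to NFC, so letters that come out as a base letter plus combining marks still count.

```python
import unicodedata

# The 17 uppercase IAST letters from the post; lowercase forms are derived.
IAST_UPPER = unicodedata.normalize("NFC", "ĀĪŪṚṜḶḸḺṂṄÑṆḤṬḌŚṢ")
IAST_LOWER = IAST_UPPER.lower()


def iast_missing(ocr_text):
    """Return the IAST characters that do not occur in ocr_text.

    The order follows the list in the post (uppercase first, then lowercase).
    """
    text = unicodedata.normalize("NFC", ocr_text)
    return [ch for ch in IAST_UPPER + IAST_LOWER if ch not in text]


def describe(chars):
    """List code point and Unicode name for each character."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in chars]


# Usage: run OCR on a sample page that you know contains the characters,
# paste the recognized text, and inspect what is absent:
#   for line in describe(iast_missing(recognized_text)):
#       print(line)
```

A character being reported as missing only means it did not occur in the output. The sample page must actually contain it for the result to say anything about the language option.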

Re: Latin and Gothic Letters for OCR

Posted: Fri Jun 14, 2024 7:12 pm
by Daniel - PDF-XChange
Hello, YC Niu

OCR is intended for direct character recognition, not necessarily transliteration, which is a very complex feature to implement. We will look into it and see what we can offer, but I cannot promise that this will be possible, nor can I say that it is something we could offer in a short timeframe.

RT#6964: FR: transliterate phonetically

Kind regards,

Re: Latin and Gothic Letters for OCR

Posted: Sat Jun 15, 2024 4:07 am
by YC Niu
Dan,

Thank you for your reply. My English is quite limited. To make my question clearer and to avoid misunderstanding your answer, I will describe my question here in another way.

I want "direct character recognition" without "transliteration." The problem is that I do not know which "OCR language" option can correctly recognize these 17 pairs of characters with diacritics:

Ā Ī Ū ṚṜ ḶḸḺ Ṃ ṄÑṆ Ḥ Ṭ Ḍ ŚṢ
ā ī ū ṛṝ ḷḹḻ ṃ ṅñṇ ḥ ṭ ḍ śṣ

I want to OCR (direct character recognition) these 17 pairs of characters with diacritics.

For clarity, I revised my original post, including the IAST-related info [1]-[3].

Best Regards,

YC Niu

Re: Latin and Gothic Letters for OCR

Posted: Mon Jun 17, 2024 8:22 am
by Stefan - PDF-XChange
Hello YC Niu,

The issue is that the text you want to recognize is already a transliteration of the original Sanskrit. The letters our OCR engine would need to recognize therefore do not match any specific language and its corresponding dictionary. That is why this still counts as transliteration, and it is a considerably more complex task than recognizing letters written in the original script of a language.

Thanks for your clarification - I've added a note to the ticket Dan created so that our devs can check this further.

Kind regards,
Stefan

Re: Latin and Gothic Letters for OCR

Posted: Mon Jun 17, 2024 4:38 pm
by YC Niu
Hello, Stefan

Thank you for your explanation. Now I understand the difficulty, and it's clear that there is no ready-made solution. I'm ok with this. Also, thanks, Dan.

Best regards,

YC Niu

Re: Latin and Gothic Letters for OCR

Posted: Mon Jun 17, 2024 4:49 pm
by Daniel - PDF-XChange
:)