Latin and Gothic Letters for OCR

Discussion for the End User use of OCR in PDF-XChange Editor and Viewer

Ludwig
User
Posts: 17
Joined: Sun Feb 24, 2013 1:52 pm

Latin and Gothic Letters for OCR

Post by Ludwig »

Hi there,

I would like to ask if you could please add Latin to the list of additional OCR languages. Even more important for me would be the ability to OCR books printed in Gothic (blackletter) type. I am especially looking for the so-called "Unger-Fraktur", since many German books from the 19th century were printed in this typeface. Do you think this is possible?

Thanks a lot
Ludwig
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: Latin and Gothic Letters for OCR

Post by Walter-Tracker Supp »

We will add Slovakian, Swedish, and German "Fraktur" language data in the final release of the Editor. We will not have direct Latin support, though results using English (or another Latin-alphabet language) as the language selection should be fairly good, since the word dictionary weighting is fairly weak (i.e., it will not dominate the results too heavily).

-Walter
Ludwig
User
Posts: 17
Joined: Sun Feb 24, 2013 1:52 pm

Re: Latin and Gothic Letters for OCR

Post by Ludwig »

I am very much looking forward to the final release then! Do you have a rough time horizon?
Paul - PDF-XChange
Site Admin
Posts: 7356
Joined: Wed Mar 25, 2009 10:37 pm

Re: Latin and Gothic Letters for OCR

Post by Paul - PDF-XChange »

Hi Ludwig,

Walter tells me this should be available in the next few weeks.

hth
Best regards

Paul O'Rorke
PDF-XChange Support
http://www.pdf-xchange.com
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: Latin and Gothic Letters for OCR

Post by Walter-Tracker Supp »

Ludwig, I have prepared the Fraktur language pack and sent it to our installation guys. It may be a few days before it becomes available on the website but I thought I would update you to let you know that it will be very soon. It will work with both the viewer and the editor.

-Walter
Ludwig
User
Posts: 17
Joined: Sun Feb 24, 2013 1:52 pm

Re: Latin and Gothic Letters for OCR

Post by Ludwig »

Thanks Walter! This is really good news.
Stefan - PDF-XChange
Site Admin
Posts: 19794
Joined: Mon Jan 12, 2009 8:07 am

Re: Latin and Gothic Letters for OCR

Post by Stefan - PDF-XChange »

:)
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: Latin and Gothic Letters for OCR

Post by Walter-Tracker Supp »

Ludwig, I have attached the language pack to this post, because I expect it will still be a few days before it is on the website; our installer people are very busy with the new Editor release. You will have to place the files in your language directory yourself, and we cannot provide support for this, since a proper installer will be available shortly. Languages for the *Viewer* are placed in a directory called "ocrdats" under the main Viewer installation directory, e.g.:

C:\Program Files\Tracker Software\PDF Viewer\ocrdats

In the editor, you will have to find PluginsData\OCRLanguages, e.g.:

C:\Program Files\Tracker Software\PDF Editor\PluginsData\OCRLanguages

Copy all the .lng and .dat files into those directories and you should see the Fraktur choices in your OCR preferences / run dialog.
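
The manual copy step above can also be scripted. A minimal Python sketch, assuming the archive has already been extracted and that the products are installed in the example locations shown above; `install_language_files` is an illustrative helper, not a Tracker tool:

```python
import shutil
from pathlib import Path

# Assumed default locations, taken from the example paths above.
# Adjust them if your installation lives elsewhere.
VIEWER_DIR = Path(r"C:\Program Files\Tracker Software\PDF Viewer\ocrdats")
EDITOR_DIR = Path(r"C:\Program Files\Tracker Software\PDF Editor\PluginsData\OCRLanguages")


def install_language_files(extracted_dir, target_dir):
    """Copy every .lng and .dat file from extracted_dir into target_dir.

    Returns the sorted names of the files that were copied.
    """
    source = Path(extracted_dir)
    target = Path(target_dir)
    if not target.is_dir():
        raise FileNotFoundError(f"Language directory not found: {target}")

    copied = []
    for path in sorted(source.iterdir()):
        # Only the language data files are needed; skip anything else.
        if path.is_file() and path.suffix.lower() in {".lng", ".dat"}:
            shutil.copy2(path, target / path.name)
            copied.append(path.name)
    return copied
```

Call it as `install_language_files(r"C:\Temp\Fraktur-Language-Pack", VIEWER_DIR)`, where the first path is a hypothetical extraction folder. Writing under Program Files normally requires an elevated (administrator) prompt.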

-Walter
Attachments
Fraktur-Language-Pack.7z
(1.46 MiB) Downloaded 512 times
Ludwig
User
Posts: 17
Joined: Sun Feb 24, 2013 1:52 pm

Re: Latin and Gothic Letters for OCR

Post by Ludwig »

Hi Walter,
Thank you very much for the files. Using the Viewer Pro (not the Editor), I have tried the new German Fraktur on three books so far (I don't really know what Swedish and Slovakian Fraktur are, so I didn't try those): very promising! Great job!
Ludwig
Will - Tracker Supp
Site Admin
Posts: 6815
Joined: Mon Oct 15, 2012 9:21 pm

Re: Latin and Gothic Letters for OCR

Post by Will - Tracker Supp »

Great! I'll pass the message along to Walter :D
If posting files to this forum, you must archive the files to a ZIP, RAR or 7z file or they will not be uploaded.
Thank you.

Best regards

Will Travaglini
Tracker Support (Europe)
Tracker Software Products Ltd.
http://www.tracker-software.com
Ludwig
User
Posts: 17
Joined: Sun Feb 24, 2013 1:52 pm

Re: Latin and Gothic Letters for OCR

Post by Ludwig »

Hi, is there a way to train the OCR program for better detection of Fraktur letters? I found that the program systematically misreads "ch", which is turned into just "c". For example, "Bezeicnung" instead of "Bezeichnung".
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: Latin and Gothic Letters for OCR

Post by Walter-Tracker Supp »

Not at the moment. We may release a tool to help with training in the future. However, if you feel ambitious you can email us at support@pdf-xchange.com and I can point you in the right direction, but can't provide detailed support for it - you'd be on your own.
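
Until a training tool exists, a systematic misread like the "ch" to "c" one can sometimes be patched after the fact in the exported text. A minimal Python sketch, assuming you have a German word list available as a set of lowercase words; `restore_ch` and the word list are hypothetical, and this does not touch the OCR engine itself:

```python
import re


def restore_ch(text, known_words):
    """Repair words where OCR dropped the "h" of a "ch".

    A word is changed only if it is NOT in known_words and inserting an
    "h" after one of its "c" letters yields a word that IS in known_words.
    known_words must contain lowercase entries.
    """
    def fix(match):
        word = match.group(0)
        if word.lower() in known_words:
            return word  # already a valid word, leave it alone
        for i, letter in enumerate(word):
            # Candidate position: a "c" that is not already followed by "h".
            if letter in "cC" and word[i + 1:i + 2].lower() != "h":
                candidate = word[:i + 1] + "h" + word[i + 1:]
                if candidate.lower() in known_words:
                    return candidate
        return word  # no known correction, keep the OCR output

    # [^\W\d_]+ matches runs of Unicode letters, including ä, ö, ü and ß.
    return re.sub(r"[^\W\d_]+", fix, text)
```

For example, `restore_ch("Die Bezeicnung", {"die", "bezeichnung"})` returns `"Die Bezeichnung"`. Because every correction must match the word list, words the list does not know are left untouched rather than guessed at.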
zzmarko
User
Posts: 1
Joined: Sat Nov 09, 2013 9:25 am

Re: Latin and Gothic Letters for OCR

Post by zzmarko »

Walter-Tracker Supp wrote: We will add Slovakian, Swedish, and German "Fraktur" language data in the final release of the Editor. We will not have direct Latin support, though results using English (or another Latin-alphabet language) as the language selection should be fairly good, since the word dictionary weighting is fairly weak (i.e., it will not dominate the results too heavily).

-Walter
Will the Croatian language be added as well?

thank you
Walter-Tracker Supp
User
Posts: 381
Joined: Mon Jun 13, 2011 5:10 pm

Re: Latin and Gothic Letters for OCR

Post by Walter-Tracker Supp »

Croatian will be available on or before the next build, anticipated in about a month's time. Meanwhile you can use any other language we provide which uses the same diacritics, if applicable (I'm not familiar with Croatian myself), because the word dictionary coupling is weak.

I will update this forum posting once we have included it.
Leonatus
User
Posts: 1
Joined: Sat Nov 26, 2016 12:30 pm

Re: Latin and Gothic Letters for OCR

Post by Leonatus »

This thread is quite old; nevertheless, I wish to express my big thanks for the "German Fraktur" OCR set! I had been desperately searching for this feature!
John - Tracker Supp
Site Admin
Posts: 5223
Joined: Tue Jun 29, 2004 10:34 am

Re: Latin and Gothic Letters for OCR

Post by John - Tracker Supp »

Pleasure :)

Best regards
Tracker Support
http://www.tracker-software.com
YC Niu
User
Posts: 7
Joined: Fri Jun 14, 2024 7:05 am

Re: Latin and Gothic Letters for OCR

Post by YC Niu »

(Revised for clarity)

Hi, I would like to know which "OCR language" option can recognize the diacritics in the IAST set [1]. They comprise these 17 pairs of diacritic characters:

Ā Ī Ū ṚṜ ḶḸḺ Ṃ ṄÑṆ Ḥ Ṭ Ḍ ŚṢ
ā ī ū ṛṝ ḷḹḻ ṃ ṅñṇ ḥ ṭ ḍ śṣ

Examples of IAST text can be seen in [2] and [3]. [2] is a scanned PDF and is an example of what I want to OCR.

I am a native Chinese user and unfamiliar with which "OCR language" to choose in this situation. So I blindly tried some European OCR language options, as listed in [4], but none was the correct choice.

Best Regards.

YC Niu

Reference
[1] Diacritics in IAST set
https://en.wikipedia.org/wiki/International_Alphabet_of_Sanskrit_Transliteration

[2] Example 1 of IAST text (scanned PDF)
https://archive.org/details/dhatukatha-pts/PTS-Digha-Nikaya-vol-I-TWRD-Carpenter-1899/page/1/mode/2up

[3] Example 2 of IAST text (HTML; right-half side of the web page)
https://suttacentral.net/dn1/en/sujato?layout=sidebyside&reference=none&notes=asterisk&highlight=false&script=latin

[4] In this list, none of them are suitable for IAST.
localname="Čeština" name="Czech"
localname="Deutsch" name="German"
localname="Español" name="Spanish; Castilian"
localname="Français" name="French"
localname="Română" name="Romanian; Moldavian; Moldovan"
localname="Suomi" name="Finnish"
Last edited by YC Niu on Sat Jun 15, 2024 5:40 am, edited 17 times in total.
Daniel - PDF-XChange
Site Admin
Posts: 10910
Joined: Wed Jan 03, 2018 6:52 pm

Re: Latin and Gothic Letters for OCR

Post by Daniel - PDF-XChange »

Hello, YC Niu

OCR is intended for direct character recognition, not necessarily transliteration, which is a very complex feature to implement. We will look into it and see what we can offer, but I cannot promise that this will be possible, nor can I say that this is something we could offer in any short timeframe.

RT#6964: FR: transliterate phonetically

Kind regards,
Dan McIntyre - Support Technician
PDF-XChange Co. LTD

+++++++++++++++++++++++++++++++++++
Our web site domain and email address have changed as of 26/10/2023.
https://www.pdf-xchange.com
Support@pdf-xchange.com
YC Niu
User
Posts: 7
Joined: Fri Jun 14, 2024 7:05 am

Re: Latin and Gothic Letters for OCR

Post by YC Niu »

Dan,

Thank you for your reply. My English is quite limited. In case my question was unclear, and to avoid misunderstanding your answer, I will try to describe my question in another way.

I want "direct character recognition" without "transliteration." The problem is that I do not know which "OCR language" option can correctly recognize these 17 pairs of diacritic characters:

Ā Ī Ū ṚṜ ḶḸḺ Ṃ ṄÑṆ Ḥ Ṭ Ḍ ŚṢ
ā ī ū ṛṝ ḷḹḻ ṃ ṅñṇ ḥ ṭ ḍ śṣ

I want to OCR (direct character recognition) these 17 pairs of diacritic characters.
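
One way to compare OCR language options objectively is to run each on the same page and check which of these characters show up in the exported text. A minimal Python sketch, assuming the recognized text has been copied or exported to a string; `iast_report` is a hypothetical helper name, not part of PDF-XChange:

```python
import unicodedata

# The 17 upper/lower-case pairs listed above.
IAST_UPPER = "ĀĪŪṚṜḶḸḺṂṄÑṆḤṬḌŚṢ"
IAST_LOWER = "āīūṛṝḷḹḻṃṅñṇḥṭḍśṣ"

# Normalize to NFC so every entry is a single precomposed code point.
IAST_CHARS = set(unicodedata.normalize("NFC", IAST_UPPER + IAST_LOWER))


def iast_report(ocr_text):
    """Return (found, missing) sets of IAST diacritic characters in ocr_text."""
    # NFC folds "base letter + combining mark" sequences into the
    # precomposed characters, so both encodings are counted the same way.
    normalized = unicodedata.normalize("NFC", ocr_text)
    found = IAST_CHARS & set(normalized)
    missing = IAST_CHARS - found
    return found, missing
```

Running `iast_report` on the output of each language option for the same scanned page shows which option, if any, preserves the most diacritics. A character reported as missing may simply not occur on that page, so a test page that contains all 34 characters gives the clearest comparison.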

For clarity, I revised my original post, including the IAST-related info [1]-[3].

Best Regards,

YC Niu
Stefan - PDF-XChange
Site Admin
Posts: 19794
Joined: Mon Jan 12, 2009 8:07 am

Re: Latin and Gothic Letters for OCR

Post by Stefan - PDF-XChange »

Hello YC Niu,

The issue is that the text you want to recognize is already a transliteration of the original Sanskrit. The letters our OCR engine would need to recognize therefore do not match any specific language and its corresponding dictionary. That is why this is still a transliteration task, and a considerably more complex one than recognizing letters written in the original script of a language.

Thanks for your clarification - I've added a note to the ticket Dan created so that our devs can check this further.

Kind regards,
Stefan
YC Niu
User
Posts: 7
Joined: Fri Jun 14, 2024 7:05 am

Re: Latin and Gothic Letters for OCR

Post by YC Niu »

Hello, Stefan

Thank you for your explanation. Now I understand the difficulty, and it's clear that there is no ready-made solution. I'm ok with this. Also, thanks, Dan.

Best regards,

YC Niu
Daniel - PDF-XChange
Site Admin
Posts: 10910
Joined: Wed Jan 03, 2018 6:52 pm

Latin and Gothic Letters for OCR

Post by Daniel - PDF-XChange »

:)
Dan McIntyre - Support Technician
PDF-XChange Co. LTD