Killer Feature Request: LLM OCR and/or Translation

mCHSNUg5Pz8cPap · Post by **mCHSNUg5Pz8cPap** » Thu Feb 06, 2025 1:00 am

I have been using LLMs for OCR and translation of certain documents, especially older documents and handwritten documents. They do a fantastic job. For evidence of their OCR capability, I point you to this thread on Hacker News: https://news.ycombinator.com/item?id=42952605. I have been able to OCR horrible handwritten recipes written by parents and grandparents and translate old Japanese documents with great success.

My suggestion is to add a built-in ability to OCR and translate documents into PDF-XChange using the API of the big three AI companies—i.e., Google, Anthropic, and Gemini. You would control and enable the models the users can use so that you make sure they are using ones that offer vision, etc. Right now, the leading models for this purpose from each company are Gemini 2.0 Flash, Claude 3.5 Sonnet, and ChatGTP-4o, respectively.

The user would enable this by entering an API key for one or all three of the AI companies, which thereby unlock the ability to use that company's LLMs to OCR and translate documents.

This would be an absolute killer feature.

Loki@99 · Post by **Loki@99** » Thu Feb 06, 2025 7:10 am

I totally agree with this as I commented something somehow similar in this topic using Microsoft Snipping tool

There is also this other one related to translation.

I would assume that AI is mature enough now for such task depending on the model used.

Thu Feb 06, 2025 5:00 pm

Hello,

This is one area where we cannot make any promises for what may or may not happen in the future. There are aspects which we are looking into, but everything in this sector is much more complex to implement than it seems on the surface. Not just from a development viewpoint, but also from both the licensing (between us and the provider) and general use perspectives as well.

Something may happen in the future, but it goes beyond the purview of even the Dev teams decision making at this point in time, so I cannot comment, nor make a development ticket for it at this time.

Kind regards,

mCHSNUg5Pz8cPap · Post by **mCHSNUg5Pz8cPap** » Fri Feb 07, 2025 5:20 pm

Daniel - PDF-XChange wrote: ↑Thu Feb 06, 2025 5:00 pm This is one area where we cannot make any promises for what may or may not happen in the future. There are aspects which we are looking into, but everything in this sector is much more complex to implement than it seems on the surface. Not just from a development viewpoint, but also from both the licensing (between us and the provider) and general use perspectives as well.

I'm glad to hear you are looking into it. As for licensing, there is no need for licensing if you just create a connection to the model's API and users enter their API keys.

David.P · Post by **David.P** » Fri Feb 07, 2025 6:54 pm

I really like this feature idea.

Have the OCR capabilities of the LLMs improved that much in the meantime?

I remember a few months ago trying to use LLM-OCR a book handwritten by my great-grandfather's generation in the old German "Sütterlin" handwriting, which wasn't successful back then.

I also think that implementing this feature would be relatively simple, at least in a basic form that doesn't attempt to align the translated text with the original on the page.

Essentially, you just upload a bitmap image of the page to be translated to the LLM and receive the text back.

Nonetheless, I concur that this wouldn't entirely meet the conventional definition of OCR'ing a PDF.

mCHSNUg5Pz8cPap · Post by **mCHSNUg5Pz8cPap** » Fri Feb 07, 2025 10:45 pm

My first post cited a link extolling the virtues of LLM OCR. In fairness, here is a blog post from a startup working on exactly this idea saying it is more difficult than many realize. https://www.runpulse.com/blog/why-llms-suck-at-ocr

My limited experience has been that LLM translation and OCR is really good compared to whatever PDF-XChange uses (I believe it's ABBYY). Using an LLM is probably about the same as using the best OCR offered by Google Cloud and MS (I'm thinking of the OCR built-into the Snipping Tool in Windows). I guess that makes sense because I'm pretty sure Google and MS use some form of AI.

Mon Feb 10, 2025 5:27 pm

Hello, mCHSNUg5Pz8cPap

mCHSNUg5Pz8cPap wrote: ↑Fri Feb 07, 2025 5:20 pm there is no need for licensing if you just create a connection

If only it were so simple, even the most open ended "you need to pay for it yourself" configurations when implemented at a business level, need a license in some capacity to avoid legal trouble. But that is not my area of expertise, nor exactly a topic of product "support", so lets move on from that discussion. If it happens it happens, is the best I can say there.

mCHSNUg5Pz8cPap wrote: ↑Fri Feb 07, 2025 10:45 pm LLM translation and OCR is really good compared to whatever PDF-XChange uses (I believe it's ABBYY)

We do use ABBYY's OCR engine for the Enhanced OCR processes, however it is worth noting that it is not at this point in any way intended or designed to work with handwritten text. Using an "LLM", which in this specific context would translate to implementing an "ICR" (Intelligent character recognition) engine, Is more than a small step up from the current "OCR" (Optical character recognition) engines/technology. "ICR" is incomparably more complex, and by extension much more demanding. Neither option for implementation is ideal either:

Running it locally - This would be a much slower process than the current engines are, and the file sizes for the local aspects of such an engine would be much larger than the current ABBYY OCR engine (which on its own is already the majority of the Editor's application size).
Running it remotely - This would likely be faster, and would allow for a licensing arrangement whereby mutual clients need a license for both our software, and the designated ICR engine. This however has the issue that offering it in any capacity is a data security risk, since you are directly sending your document data over the internet. It also means offline users simply cannot access this functionality at all.
Remote running also hits the snag of personal use options. Most services like this offer daily/momthly use limits based on complexity... One mis-click (which unfortunately many people are prone to) and you could have accidentally gone way over your limit with the given service, charging yourself a hefty sum more than you expected to. This once again goes back to legal risks with such systems.

Both have merits and demerits, and it is far too early in the discussions to offer any answers about plans, and all of the above is my personal speculation/research after the discussions we have had, not an official stance.
All I can say in an official capacity, is that any implementation in this area is likely multiple Years (not months) away, if it happens at all.

Kind regards,

mCHSNUg5Pz8cPap · Post by **mCHSNUg5Pz8cPap** » Tue Feb 11, 2025 3:25 am

PDF-XChange should do the same thing so many other startups are doing with AI to avoid the issues you mentioned, which is to spin up your own instance of a given AI model in Azure (Open AI and Meta models), AWS Bedrock (Anthropic models), or Google Vertex (Google models). You then have control over all the data so it would be covered by your terms and conditions (you should agree to only provide inference services and not retain any data; that's my opinion).
https://azure.microsoft.com/en-us/products/ai-model-catalog#Models
https://aws.amazon.com/bedrock/
https://cloud.google.com/vertex-ai

You modify the PDF-XChange software to include an optional AI tab. In it, the user can choose to use AI services for which you will charge them. Or, you allow users to enter their own API key so they can directly access the APIs of Open AI, Anthropic, and Google (the three big players in AI). Users can then do things like select an area on the page (as an image) an send it to the AI service where it is OCR'd and/or translated. Another option, for example, is to have an image of an entire page sent to an AI and then insert the OCR and/or translation result in as a new page immediately after the translated page (or insert all the pages at the end of the document).

Come to think of it, can javascript be used in PDF-XChange to call an external API? Maybe I can figure out a way to do this on my own.

Tue Feb 11, 2025 7:01 pm

Hello, mCHSNUg5Pz8cPap

There are certainly options yes, however this also requires dev team time, which comes at a premium, depending on the configuration, could require us to locally host a server which again falls into the data security and offline users issues, ignoring the ongoing maintenance costs to keep something powerful enough to serve millions of users smoothly.
No matter how we slice it, this is an extremely complicate topic, and is not something that will be decided by our discussion here on the forums. I assure you there is far more to consider than just what we have discussed so far. So as before, it would be something that if it comes, is still a very long ways away. I am sorry to say I cannot add any more to that discussion at this point.

In the meantime, our "snapshot" or "select and cut" tools, as well as the default windows snipping tool, would allow you to cut/copy a section of the page which you can then directly paste into google translate another processor which can grant those results if needed.

Regarding your JS question, I am fairly certain that it is not possible to do so in that way, at the very least, you may find even if it can call an external api, it may be difficult to work with page base content items, when JS cannot access or modify them directly.

Kind regards,

Jensen Head · Post by **Jensen Head** » Sun Feb 23, 2025 11:44 am

zesarux_lKCIbKmaiK.png

_
Anthropic Claude

Q: Before ZEsarUX 10.1, there was a pcspeaker driver which no longer exists now. What happened?
A: The pcspeaker driver has been replaced by the onespeakerdriver. This driver can now handle PC speaker output but also Raspberry Pi GPIO speaker, you can select the type from the menu or command line configuration settings

Q: How can I use the Raspberry Pi GPIO Speaker?
A: As the name suggests, you must connect a speaker to two GPI pins: one goes to GND and the other one to a pin number you want. I use a cheap piezoelectric speaker directly plugged to those pins. Then use the onebitspeaker driver, set type to Raspberry Pi GPIO. The default pin number is set to 22, but you can change it to any other you want.

Q: How can I enable the MIDI output feature?
A: You must select a machine with Audio Chip (Spectrum 128k, QL, MSX, Master System, etc) and go to menu Settings-> Audio-> Audio Chip to MIDI Output

ChatGPT

Q: Before ZEsarUX 10.1, there was a pcspeaker driver which no longer exists now. What happened?
A: The pcspeaker driver has been replaced by the onespeaker driver. This driver can now handle PC speaker output but also Raspberry Pi GPIO Speaker, you can select the type from the menu or command line configuration settings

Q: How can I use the Raspberry Pi GPIO Speaker?
A: As the name suggests, you must connect a speaker to two GPIO pins: one goes to GND and the other one to a pin number you want. I use a cheap piezoelectric speaker directly plugged to those pins. Then use the onebitspeaker driver, set type to Raspberry Pi GPIO. The default pin number is set to 22, but you can change it to any other you want.

Q: How can I enable the MIDI output feature?
A: You must select a machine with Audio Chip (Spectrum 128K, QL, MSX, Master System, etc) and go to menu -> Settings -> Audio -> Audio Chip to MIDI Output

Le Chat Mistral

Q: Before ZEsarUX 10.1, there was a PcSpeaker driver which no longer exists now. What happened?
A: The PcSpeaker driver has been replaced by the onespeakerdriver. This driver can now handle PC speaker output but also Raspberry Pi GPIO speaker. You can select the type from the menu or command line configuration settings

Q: How can I use the Raspberry Pi GPIO Speaker?
A: As the name suggests, you must connect a speaker to two GPIO pins: one goes to GND and the other one to a pin number you want. I use a cheap Piezoelectric speaker directly plugged to those pins. Then use the onebitspeaker driver, set type to Raspberry Pi GPIO. The default pin number is set to 22, but you can change it to any other you want.

Q: How can I enable the MIDI output feature?
A: You must select a machine with Audio Chip (Spectrum 128K, QL, MSX, Master System, etc) and go to menu Settings->Audio->Audio Chip to MIDI Output

ABBYY FineReader PDF 16.0.14.6584; part # 1435,6

O: Before ZEsarUX 10.1 . there was a Respeaker driver which no Longer exists now, Uhat happened^ A: The Respeaker driver has been replaced by the onespeakerdriver .
This driver can now handle PC Speaker output but also Raspberry Pi GPIO Speaker^ you can select the type from the menu or command line configuration settings
O: How can I use the Raspberry Pi GPIO Sp e a k e r 7
R: Rs the name suggests., you must connect a speaker to two GPI pins: one goes to GND and the other one to a pin number you want. I use a cheap piezoelectric speaker directly plugged to those pins. Then use the onebi tspeaker driver., set type to Raspberry Pi GPIO. The default pin number is set to 22? but you can change it to any other you want.
O: How can I enable the MIDI output featured
R: You must select a machine with Rudio Chip (Spectrum 12Sk ? uL .
MSX? Master System^ etc) and go to menu Settings-> Rudio-> Rudio Chip to MIDI output

PDF-XChange Editor Plus 10.5.2, build 395 (enhanced OCR)

X FRO
O: Before ZEsarUX 10.1, there was
a pcspeaker driver which no Longer exists now. What happened?
R: The pcspeaker driver has been replaced by the onespeakerdriver . This driver can now handle PC bpeaker output but a Lso Raspberry
Pi GPIu Speaker, you can select
the type from the menu or command Line configuration settings
O|UHmdQ.nr+On JQ0 f+[UTroTH-TODO"7J" T- I’D I’D “i I’D ro □ H
ro oramm moi
-|[TCLr+ nt) ID fl) W O C flUC C r+ t) O n LTlE
UZ r+ -nt) W t) H- III r+ r+t)
o wroroiCH-Dw zrron
c<c c m if ro if if
O - r+ r+t) N □ r+ □
E C r+O T -O C O W □ m to ro c ro s tJOiiH
□ nt) XI i£i IT ij ili =!<i
,-1- to H- UJ o Ijj III III 21to III C ■ ZUi lUri-nT in
t) ro tir+ row ro n □ crcr “ijZ to - c TCronr+HOD ijJ r+ toS“lr+OnCQ.rHD T □ rr - w o ro ro
ro i£ Ti ,-n.n > w
ro “I roT0to7r+r+ u Tito o ro □ ro s w to
H- H- H- t-T hl 111,-b O ■-.-1-1/1 ro ro o t)
Gin ro r+Gl*C rr r+WTI D-IHTTIO ro
o ro ro H'= n
r+On □ C “I 1
to - w -OS £
□ r+ <■ roonc
£O ro DDW u
to ro W r* H-
O : How can I enable
ture?
t select a
(Spe c t r u m r Sys tern,
n g s - > Rud i
to MIDI output
the MIDI machine with
128K , OL ,
etc) and go to
O — > Rud i O Ch i p
output f e a
R : You mus Rud i O Uh i p MSX, Haste menu Se 1 1 i

Conclusions
I read the messages above about the difficulties of implementing online OCR in terms of organizing the server part of the required performance and security, as well as implementing the resulting test back into the document. However, judging boldly by the results of my microtesting, the use of cloud recognition systems is inevitable when planning to improve the quality of OCR even for local applications. Actually, Adobe has come to this several years ago. And Microsoft before it. Maybe it's your turn?

At the same time, this will help fight pirates, depriving them of significant functionality in hacked applications that have turned into pumpkins.

Mon Feb 24, 2025 11:01 pm

Hello, Jensen Head

I appreciate the enthusiasm, To begin I should note that our settings do allow for a much better result, simply choose low accuracy for this file, it is still far from perfect, and inarguably worse than the others, but it is what we can offer here and now.

Back to the main topic... At no point did I say such tools are never going to come to our software, I am simply impressing the fact that they are no simple task, and unlike most of our competitors, our team sizes are exceedingly small. Any complex task means considerably less time spent on other items.

It is not a discussion which I can draw out any longer, nor something which we can offer any definitive statements relating to plans, implementations, licensing, or potential capabilities, at this time.

Kind regards,

mCHSNUg5Pz8cPap · Post by **mCHSNUg5Pz8cPap** » Thu Mar 06, 2025 8:40 pm

I'm going to continue posting LLM related OCR developments in this thread as a way of archiving the progression of the technology and/or latest developments. I'm not doing it to further advocate for its adoption, which means PDF-XChange team members should not feel obligated to respond to each post with an explanation of how the technology is complicated, PDF-XChange is a small team (honestly, this is probably the reason the software is so much better than the competitors), the decision is above your pay grade, etc.

With that said, I note that Mistral has just announced an LLM tailored to performing OCR including OCR on complicated PDF documents.
https://mistral.ai/fr/news/mistral-ocr
The examples are impressive. The only downside I see is that it doesn't look like the OCR model will be open source. However, they are willing to provide it to certain organizations to self-host.

The docs for this particular LLM OCR solution list the following key features (https://docs.mistral.ai/capabilities/document/):
Key features:
– Extracts text content while maintaining document structure and hierarchy
– Preserves formatting like headers, paragraphs, lists and tables
– Returns results in markdown format for easy parsing and rendering
– Handles complex layouts including multi-column text and mixed content
– Processes documents at scale with high accuracy
– Supports multiple document formats including PDF, images, and uploaded documents

Notably, providing the results in markdown should make it easier to render them as a pdf (e.g., select OCR document in PDF-XChange, pdf document is sent to OCR server, the result is provided in markdown and converted to PDF (likely at the server), and then returned as a separate document to PDF-XChange).

Killer Feature Request: LLM OCR and/or Translation

Killer Feature Request: LLM OCR and/or Translation

Re: Killer Feature Request: LLM OCR and/or Translation

Re: Killer Feature Request: LLM OCR and/or Translation

Re: Killer Feature Request: LLM OCR and/or Translation

Re: Killer Feature Request: LLM OCR and/or Translation

Re: Killer Feature Request: LLM OCR and/or Translation

Re: Killer Feature Request: LLM OCR and/or Translation

Re: Killer Feature Request: LLM OCR and/or Translation

Re: Killer Feature Request: LLM OCR and/or Translation

Re: Killer Feature Request: LLM OCR and/or Translation

Re: Killer Feature Request: LLM OCR and/or Translation

Re: Killer Feature Request: LLM OCR and/or Translation