Add Language field to Text Properties

Post by **Jensen Head** » Wed Jan 01, 2025 4:23 pm

This is an important element of document accessibility.

Ideally, when a document is recognized using several languages, the OCR engine should itself assign language tags to the text objects it creates. When the entire document is recognized, it should also assign a language to the document itself, based on the predominant language of the document or of the first page with text (I believe this should be discussed separately).
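A minimal sketch of the "predominant language" idea, assuming the OCR engine already reports a language code for each recognized text block. The function names and the (language, text) input shape are hypothetical and do not correspond to any PDF-XChange or OCR engine API; counting non-whitespace characters is just one possible way to weigh the blocks.

```python
from collections import Counter


def predominant_language(blocks):
    """Return the language code covering the most non-whitespace characters.

    `blocks` is a list of (lang_code, text) pairs (a hypothetical input shape).
    Returns None when there are no blocks.
    """
    totals = Counter()
    for lang, text in blocks:
        totals[lang] += sum(1 for ch in text if not ch.isspace())
    if not totals:
        return None
    # On a tie, the language that appeared first in the document wins.
    return max(totals, key=lambda code: totals[code])


def plan_language_tags(blocks):
    """Pick a document language and list only the blocks that differ from it."""
    blocks = list(blocks)
    doc_lang = predominant_language(blocks)
    overrides = [(i, lang) for i, (lang, _) in enumerate(blocks) if lang != doc_lang]
    return doc_lang, overrides


blocks = [
    ("en", "Installation guide"),
    ("de", "Installationsanleitung"),
    ("en", "Connect the cable to the port."),
]
doc_lang, overrides = plan_language_tags(blocks)
print(doc_lang, overrides)
```

Only the blocks that differ from the document language end up in the override list, so most text objects would need no explicit language at all.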

Currently, the following set of fields is displayed in Text Properties:

§ Style
Fill Color
Stroke Color
Border Width
Rendering Mode
Font
Font Size

§ Font Details
Name
Embedded
Type
Encoding
Actual Font
Actual Font Type
Object Number

Thus, it is not possible to set the language of selected text elements.

[ℹ] Related links

1. Accessibility and Usability at Penn State / Language Tagging / PDF (2015) — accessibility.psu.edu/foreignlanguages/langtag/#pdf

2. W3C / Techniques for WCAG 2.0 (Techniques and Failures for Web Content Accessibility Guidelines 2.0) / PDF19: Specifying the language for a passage or phrase with the Lang entry in PDF documents — w3.org/TR/WCAG20-TECHS/PDF19.html

3. Adobe Community / Acrobat Discussions (community.adobe.com/t5/acrobat-discussions/pdf-accesibles-en-dos-idiomas-se-puede-hacer/m-p/13765225#bodyDisplay_2d908c73d37e97):

"Set the secondary language for the text (…). This is done on each tag that is in the secondary language. Select the tag / Right-click / Properties, and choose the language from the drop-down menu. If the language isn't in the menu, type in the language code listed here at ISO 639-2"

Post by **Stefan - PDF-XChange** » Thu Jan 02, 2025 11:40 am

Hello Jensen Head,

Thanks for your comment! I will ask our devs working on accessibility to take a look at this post and your suggestions. At this point I cannot make any promises as to whether or when such a feature might be available in our products.

P.S. It appears that it is possible to specify the Language tag for paragraphs for accessibility:

[Screenshot: image_2025_01_02T22_04_09_768Z.png]

I've also created a ticket for the feature-request part of your post, so that such tags can be added by the OCR engines in the future as well:
#7250: OCR: Add accessibility tags

Kind regards,
Stefan

Post by **Jensen Head** » Wed May 28, 2025 8:42 am

Stefan - PDF-XChange wrote: ↑Thu Jan 02, 2025 11:40 am
It appears that it is possible to specify the Language tag for paragraphs for accessibility

I can use the Select / Text command to select all text objects in a document and assign a language to them in bulk:

PDFXEdit (2025-05-28 11-32-21).png

But I can't seem to do the same with paragraphs. Should I make a separate feature request for this, or is the request in this thread sufficient?

Also, some users may find it more convenient to check the "Apply selected to all text objects in the document" checkbox in the Reading Options block of the Advanced tab of the Document Properties dialog box. However, this should be coordinated with the ability to set multiple languages for a document (relevant for terminological translation dictionaries, bilingual books for foreign language learners, and user manuals).

Wed May 28, 2025 5:18 pm

Hello, Jensen Head

A "paragraph" does not actually exist in a PDF. We just do a very good job of pretending it does, and the new accessibility tags offer it as a "block" object to help screen readers with content ordering. Practically speaking, no paragraph of text in a PDF document has ever been a single congruent object, and the spec does not seem likely to change this.

Page text objects (which can sometimes be a single floating point character) are not intended to contain this language data. The "tags" method Stefan mentioned above is the "accessibility" feature your links refer to, but it is not possible to add these tags to the base content.

The request Stefan made is for the OCR engine to create those "tagged" areas automatically and to attempt to assign the language based on what it has detected. No special data will be present in the actual text content; it would all be added to the secondary "accessibility tag" region that is created.

As for defining multiple languages in the document properties: as we have mentioned before, that is not the intent of the document properties, and it will not be changing at this time. If the specification changes to indicate that this should be a common case, we will reconsider it then.

Kind regards,

Post by **Jensen Head** » Wed May 28, 2025 9:24 pm

Stefan, Daniel, thank you for your help! Thanks to you, I now understand that accessibility functionality is an add-on to the basic entities of content, a kind of layer, an intermediary between the content and the specialized tool for reading the document. And the terms "block", "paragraph", "tags" and their properties are not the content itself, but allow you to do things with it that would be difficult or impossible to do without them.

I think I have figured out how to assign language properties to text objects. In fact, my approach so far has been wrong, since this should be part of comprehensive work on creating the document's tag structure. In addition to all text blocks receiving the correct Paragraph tag, their hierarchical arrangement should reflect the actual structure of the document. This means that headings, subheadings and the main text should be correctly nested inside each other, despite using the same tag type. Also, keep in mind that if the document contains tables, lists or images, their tags (Table, L, Figure) should not be converted to Paragraph but should be processed in the same way, making sure that the order in which the blocks on each page are traversed corresponds to the reading order ("Reading Order").

In my case, I do it this way. Please correct me if I'm doing something fundamentally wrong:

1. "The document has no tags. Create Tags Root to continue tagging document".
2. New Tag / Type: P, Title: <language name>.
3. Edit Objects / Edit Text / Select all blocks of one language with Ctrl.
4. "Create Tag from Selection" (the selected paragraph blocks are added as nested tags to the tag created above).
5. Repeat steps 2 through 4 for other document languages.
6. Repeat steps 2 through 4 for other data types for all document languages.
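The steps above can be modeled roughly as follows. This is only a sketch of the resulting tag hierarchy in plain Python, not PDF-XChange Editor's API: the `TagNode` class, the `build_language_groups` function, and the (language, text) input are all hypothetical, and setting `lang` on the group tag is an assumption about where the language would be assigned.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TagNode:
    """Hypothetical stand-in for one tag in the document's tag tree."""
    type: str
    title: str = ""
    lang: Optional[str] = None
    children: List["TagNode"] = field(default_factory=list)


def build_language_groups(blocks):
    """Group text blocks under one parent P tag per language.

    `blocks` is a list of (lang_code, text) pairs standing in for the
    paragraph blocks selected in step 3.
    """
    root = TagNode(type="Document", title="Tags Root")        # step 1
    groups = {}
    for lang, text in blocks:
        if lang not in groups:                                 # step 2
            groups[lang] = TagNode(type="P", title=lang, lang=lang)
            root.children.append(groups[lang])
        # Step 4: the selected block becomes a tag nested in the group tag.
        groups[lang].children.append(TagNode(type="P", title=text))
    return root                                                # step 5 is the loop


tree = build_language_groups([("en", "Hello"), ("fr", "Bonjour"), ("en", "World")])
for group in tree.children:
    print(group.title, [child.title for child in group.children])
```

Grouping by language keeps the language on a single parent tag per language, so the nested paragraph tags do not each need their own setting.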

I haven't figured out yet how to automatically remove empty <Span> containers, how to automatically move all objects from XForm containers to the root, how to delete tags without objects (the Delete Empty Tags command doesn't work). Also, I don't understand yet how to get to the tag (or tags) associated with this object from a paragraph block in the page space or from the Content Pane. But these are not related to the topic, which can be considered closed.

As for defining multiple languages in the document properties: maybe it would then be more accurate to call this property not "Language" but "Primary language of the document" (or something like that), specifying that this is the language of the title page, the cover, or the publication details of the document. This would remove all questions about what to do with this document tag in the case of multilingual documents.

P.S. And, as always, forgive me for my lack of understanding and speech impediment. I know English very poorly.

Wed May 28, 2025 11:32 pm

Hello, Jensen Head

Yes, you are essentially on the right path now. It is worth noting that most objects in a document do not need to have a language set. "Default" means that they will inherit the "language" from the document properties (this is part of why you should only have a single language present there). You only need to specify a language for content tags on content that is not written in the document's "primary" language. You later suggested changing the name of this option, but it is currently only "language" (singular), and this seems to be the general presentation offered by many other apps as well, so it is unlikely to change.
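The inheritance rule described above can be sketched like this. It is a simplified model, assuming each tag either carries its own language or is left at "Default" (represented as `None`) and takes the language of its nearest ancestor, falling back to the document language. The `Tag` class and `effective_languages` function are hypothetical names for illustration only.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Tag:
    """Hypothetical tag node; lang=None models the "Default" setting."""
    type: str
    lang: Optional[str] = None
    children: List["Tag"] = field(default_factory=list)


def effective_languages(tag: Tag, inherited: str) -> Iterator[Tuple[Tag, str]]:
    """Yield (tag, effective language) for a tag and its descendants, depth-first."""
    lang = tag.lang or inherited      # an explicit language overrides the inherited one
    yield tag, lang
    for child in tag.children:
        yield from effective_languages(child, lang)


document_language = "en"              # the single language from the document properties
root = Tag("Document", children=[
    Tag("P"),                                      # Default: inherits "en"
    Tag("P", lang="de", children=[Tag("Span")]),   # explicit override, passed down
    Tag("P"),
])
resolved = [(t.type, lang) for t, lang in effective_languages(root, document_language)]
print(resolved)
```

In this model only the one German paragraph needs an explicit setting; everything else resolves to the document language.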

Regarding your later questions about XForms, I believe these need to remain in place, as they hold some of the flags for tagging. In essence, a tagged PDF will look much more complex in the content panel. Part of this is necessity: as before, the tagging is designed to help other applications interpret the file content, and to communicate accurately with other computers you need to leave nothing to the imagination (after all, that is something most computers I know tend to lack at this point in time).

In essence, configuring tags will be very difficult in most situations. We are essentially giving a human the tools to try to explain to a computer, in words the computer will understand, how it should explain the content to another human.

Kind regards,
