reading order

PDF-XChange Viewer SDK for Developers
(ActiveX and Simple DLL Versions)


fabrizio
User
Posts: 33
Joined: Fri Jul 02, 2010 1:58 pm

reading order

Post by fabrizio »

Hi, we're developing a voice reader and are evaluating your PDF-XChange ActiveX component for integration into our application for PDF reading. Extracting text from a well-tagged, accessible PDF file works just fine, but we would like to provide some reading order even for untagged documents. Extracting text from such documents may result in badly ordered paragraphs (as in the attached sample). The text order from the ActiveX component is the same as what we get with the Adobe Acrobat extraction tool, which is quite good. But is there a way to get (and then order) the paragraph boxes (I mean contiguous text areas in the document) on the page?
Something like:
DoVerb("Documents[0].Pages[0].Text.Paragraphs[0].Quad", "Get", vDataIn, vDataOut)
I searched the ActiveX SDK manual but couldn't find a good solution for that... The idea is then to apply a simple ordering algorithm to the list (top to bottom and left to right).

For example, extracting text from an untagged document (bad order):

Code:

Emma and Ben are excited. Emma doesn’t 
want to go to Scott’s party now! 
We’re in the car park 
at Heathrow Airport 
in London...
Culture 
Spot 
Disney World, in Orlando, is the number

I was wondering if it's possible to get additional information like this:

Code:

[Box Coords: x1,y1,x2,y2...]
Emma and Ben are excited. Emma doesn’t 
want to go to Scott’s party now! 
[Box Coords: x1,y1,x2,y2...]
We’re in the car park 
at Heathrow Airport 
in London...
[Box Coords: x1,y1,x2,y2...]
Culture 
Spot 
Disney World, in Orlando, is the number

And then apply some ordering (to get the right reading order):

Code:

We’re in the car park 
at Heathrow Airport 
in London...
Emma and Ben are excited. Emma doesn’t 
want to go to Scott’s party now! 
Culture 
Spot 
Disney World, in Orlando, is the number
Is this possible?
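
To make the idea concrete, here is a minimal Python sketch of the top-to-bottom, left-to-right ordering, assuming the paragraph boxes were somehow available. The `Box` type, the `reading_order` function, the `row_tolerance` parameter, and all coordinates are hypothetical and chosen only for illustration; nothing here is part of the ActiveX SDK. It also assumes PDF-style coordinates where y grows upward, so a larger `top` means higher on the page.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical paragraph box: text plus its bounding rectangle."""
    text: str
    left: float
    bottom: float
    right: float
    top: float


def reading_order(boxes, row_tolerance=5.0):
    """Order boxes top to bottom, then left to right within a row.

    Boxes whose top edges differ by at most row_tolerance from the
    previous box (scanning downward) are treated as the same row.
    """
    rows = []
    # Assumption: y grows upward, so the highest box has the largest top.
    for box in sorted(boxes, key=lambda b: -b.top):
        if rows and abs(rows[-1][-1].top - box.top) <= row_tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])

    ordered = []
    for row in rows:
        ordered.extend(sorted(row, key=lambda b: b.left))
    return ordered


# Boxes in the (bad) extraction order, with made-up coordinates.
boxes = [
    Box("Emma and Ben are excited. ...", 300, 600, 550, 640),
    Box("We're in the car park ...", 40, 700, 250, 760),
    Box("Culture Spot ...", 40, 400, 550, 520),
]
ordered_texts = [b.text for b in reading_order(boxes)]
for text in ordered_texts:
    print(text)
```

With these sample coordinates the car park box comes first because it sits highest on the page. A plain sort like this will not handle multi-column layouts correctly; it is only the simple ordering described above.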

Thank you in advance (and apologies for my bad English :P)
Fabrizio
Vasyl - PDF-XChange
Site Admin
Posts: 2448
Joined: Thu Jun 30, 2005 4:11 pm

Re: reading order

Post by Vasyl - PDF-XChange »

Hi Fabrizio.

Sorry, there is no simple way to do this in the current implementation. Standard PDF content does not contain any information about paragraphs, lines, or words.

However, in the new build you will be able to use additional information about text lines (collected by our internal text-builder). This information can be used to reorder the lines into their visual order.
See the [Reference/Named Items/Documents/Item/Pages/Item/Text/Lines] section in the new help file.
Please wait for the next build...
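
As a rough illustration of how per-line information could be turned into paragraph boxes, here is a minimal Python sketch. The fields exposed by the Lines item are not shown in this thread, so the tuple layout `(text, left, bottom, right, top)`, the `group_lines` function, its threshold parameters, and the sample coordinates are all assumptions made for the example. It assumes y grows upward and that lines arrive in extraction order.

```python
def group_lines(lines, max_gap_factor=0.8, max_overlap_factor=0.5):
    """Merge consecutive text lines into paragraph-like blocks.

    lines: iterable of (text, left, bottom, right, top) tuples
    (a hypothetical layout). A line joins the previous block when it
    overlaps the block horizontally and sits just below it.
    """
    blocks = []
    for text, left, bottom, right, top in lines:
        height = top - bottom
        if blocks:
            block = blocks[-1]
            # Vertical distance from the block's lowest edge down to this line.
            gap = block["bottom"] - top
            overlaps = left < block["right"] and right > block["left"]
            close = -max_overlap_factor * height <= gap <= max_gap_factor * height
            if overlaps and close:
                # Extend the block and grow its bounding box.
                block["text"].append(text)
                block["left"] = min(block["left"], left)
                block["bottom"] = min(block["bottom"], bottom)
                block["right"] = max(block["right"], right)
                block["top"] = max(block["top"], top)
                continue
        blocks.append({"text": [text], "left": left, "bottom": bottom,
                       "right": right, "top": top})
    return blocks


# Lines in extraction order, with made-up coordinates.
lines = [
    ("Emma and Ben are excited. Emma doesn't", 300, 620, 550, 640),
    ("want to go to Scott's party now!", 300, 598, 520, 618),
    ("We're in the car park", 40, 740, 250, 760),
    ("at Heathrow Airport", 40, 718, 240, 738),
    ("in London...", 40, 696, 150, 716),
]
blocks = group_lines(lines)
for block in blocks:
    print((block["left"], block["bottom"], block["right"], block["top"]),
          " ".join(block["text"]))
```

The resulting blocks carry a bounding box each, so they can then be sorted top to bottom and left to right. The thresholds are guesses and would need tuning against real documents.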

Best regards.
PDF-XChange Co. LTD (Project Developer)
