Dear support team,
First, let me give you some context about what we're trying to achieve here.
I know this can be done with the ActiveX Viewer SDK, but for licensing reasons we can't use it, so it is not an option for my company.
We have been evaluating several PDF manipulation and rendering SDKs and we successfully managed to implement all of our requirements with yours, except for two features. Time is running out and we'll need to make a decision soon (most probably by the end of next week).
The 2 features are the following:
1. Given a rectangular selection area, copy all 'selected' characters to the clipboard (yes, copying to the clipboard itself is a C# one-liner)
2. Given a rectangular selection area, highlight all of the 'selected' characters
For highlighting, we may decide to cheat on that one (draw a transparent rectangle on top), but I suspect we can successfully use the low-level API to directly manipulate the values we want.
So for now, we need to test and verify that it's possible to retrieve the entire list of text objects selected, and for each of those, the entire list of character ranges.
For now, I'm successfully calling PXCp_ET_GetElement to retrieve all text elements in a page (the text, the matrix, etc), but I'm not really sure how to correctly compute the bounding box of:
1. Each text element (complete character range)
2. Each character in that element
Would it be possible to get some assistance or some pointers in that area?
Note 1: I'm using the PDF-Tools SDK 4.0.206
Note 2: The C# struct definition of PXP_TextElement in the file XCPro40_Declares.cs (used by the method PXCp_ET_GetElement) isn't defined correctly, see the attached archive for the corrected version
Thanks in advance.
Finding/highlighting text objects in a given rectangle
-
gschell
- User
- Posts: 4
- Joined: Thu Nov 01, 2012 2:10 pm
Finding/highlighting text objects in a given rectangle
-
Lzcat - Tracker Supp
- Site Admin
- Posts: 677
- Joined: Thu Jun 28, 2007 8:42 am
Re: Finding/highlighting text objects in a given rectangle
Hi gschell.
gschell wrote: For highlighting, we may decide to cheat on that one (draw a transparent rectangle on top), but I suspect we can successfully use the low-level API to directly manipulate the values we want.
Actually, there is no way to highlight within the PDF content itself (although you can of course modify the content to simulate it); highlighting is only available through Highlight annotations. Unfortunately, the PDF-Tools SDK does not provide the high-level API required for such annotations or content modifications, so this will not be a trivial task. However, if you do not need a permanent highlight, such a 'cheat' is a good solution.
gschell wrote: 1. Each text element (complete character range)
2. Each character in that element
As for text positions, you have a matrix that positions and scales the text in a text element. In addition, you should calculate a bounding parallelogram for each character using the following algorithm:
- 1. Calculate the bounding rect for the character (Characters[n]). Left is Offsets[n] from the PXP_TextElement structure, right is Offsets[n + 1]. Top is Ascent / 1000 * FontSize + Rise, bottom is Descent / 1000 * FontSize + Rise (Ascent and Descent are taken from the PXP_TEFontInfo structure for the font FontIndex).
- 2. Transform each of the four rectangle corners using Matrix. In general you will get a parallelogram, but when the b and c (or a and d) coefficients of Matrix are both zero you will get a rectangle, so it is then enough to transform only two diagonally opposite corners to calculate it.
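For readers who want to see the two steps side by side, here is a minimal sketch in JavaScript (chosen to match the snippets later in this thread; the same arithmetic carries over to C#). It assumes the PXP_TextElement and PXP_TEFontInfo data have already been copied into plain objects. The property names offsets, fontSize, rise, ascent, descent and matrix, and the helpers transformPoint and charQuad, are illustrative and not SDK names. It also assumes Matrix uses the usual PDF layout [a, b, c, d, e, f], where x' = a*x + c*y + e and y' = b*x + d*y + f.

```javascript
// Sketch only: property names and helper names are assumptions, not SDK identifiers.

// Apply a PDF-style matrix [a, b, c, d, e, f] to the point (x, y).
function transformPoint(m, x, y) {
  const [a, b, c, d, e, f] = m;
  return { x: a * x + c * y + e, y: b * x + d * y + f };
}

// Bounding parallelogram of character n in a text element.
function charQuad(element, font, n) {
  // Step 1: untransformed bounding rect of the character.
  const left = element.offsets[n];
  const right = element.offsets[n + 1];
  const top = (font.ascent / 1000) * element.fontSize + element.rise;
  const bottom = (font.descent / 1000) * element.fontSize + element.rise;

  // Step 2: transform the four corners with the element's matrix.
  return [
    transformPoint(element.matrix, left, top), // top left
    transformPoint(element.matrix, right, top), // top right
    transformPoint(element.matrix, left, bottom), // bottom left
    transformPoint(element.matrix, right, bottom), // bottom right
  ];
}
```

Descent is normally negative, so bottom ends up below the baseline without any extra sign handling.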
HTH.
Victor
Tracker Software
Project manager
Please archive any files posted to a ZIP, 7z or RAR file or they will be removed and not posted.
-
gschell
- User
- Posts: 4
- Joined: Thu Nov 01, 2012 2:10 pm
Re: Finding/highlighting text objects in a given rectangle
Hi,
Thank you very much, I got it working very quickly thanks to your detailed explanation!
As a note, I am transforming the coordinates of the 4 points (top left, top right, bottom left, bottom right). I then proceed to take the min/max of the 4 transformed points. The results are the coordinates of the smallest enclosing AABB for the character (or character range).
(AABB means axis-aligned bounding box, a term that's used a lot in collision detection and computer graphics)
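The min/max step can be sketched as follows. This is illustrative JavaScript rather than SDK code; aabbOfPoints and the {x, y} point shape are assumed names.

```javascript
// Smallest axis-aligned bounding box enclosing a set of points.
// Works for the 4 transformed corners of one character, or for the
// corners of a whole character range pooled together.
function aabbOfPoints(points) {
  const xs = points.map((p) => p.x);
  const ys = points.map((p) => p.y);
  return {
    minX: Math.min(...xs),
    minY: Math.min(...ys),
    maxX: Math.max(...xs),
    maxY: Math.max(...ys),
  };
}
```

For rotated or skewed text the AABB is larger than the character's parallelogram, which is usually acceptable for hit-testing against a rectangular selection.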
For others reading this, make sure you set up the mask member variable of the PXP_TextElement structure so you correctly get the matrix, font info and character offsets when you call PXCp_ET_GetElement.
I am now fairly confident that my boss will be looking for licensing details with your company next week.
-
Paul - PDF-XChange
- Site Admin
- Posts: 7463
- Joined: Wed Mar 25, 2009 10:37 pm
Re: Finding/highlighting text objects in a given rectangle
Best regards
Paul O'Rorke
PDF-XChange Support
http://www.pdf-xchange.com
-
afalsow
- User
- Posts: 17
- Joined: Thu Jun 14, 2012 1:28 pm
Re: Finding/highlighting text objects in a given rectangle
This post is directed at gschell...
Your task/challenge here is in some ways similar to my own, and I thought I might post in the hope you may be able to offer some insight.
My situation is that I wish to extract text which is already annotated (highlighted or underlined).
I have successfully iterated through the document using the low-level API and have reached the point where I have identified the particular objects of interest (i.e. Subtype.name = "Highlight"), and I should have the four (4) coordinates of the bounding rectangle ("Rect") shortly. I see there is also a key "QuadPoints" (which appears to consist of 10 such "quads" in my test document)... but I have not yet researched QuadPoints in the Adobe documentation.
I am uncertain from there, however, how to take the obtained coordinates, identify the contents (text elements?) that fall within that rectangle, and then extract the associated text.
One thing that has me particularly puzzled is that - as is clear from experimentation - a user may highlight beginning/ending mid-word. Coupling that with the fact that GetElement returns the character phrases in (seemingly) any order, and of various lengths, how might I "apply" that rectangle in the extraction process when it may begin/end in the middle of words (and thus in the "middle" of text elements)?
I am wondering if I must extract each character - individually - checking its specific offset(s), matrix, etc. (sorry, I have not explored the PXP_TextElement structure in great depth just yet, so please forgive errors). That approach seems burdensome, so I wonder if I am missing something. I have drilled down further into the object, but have not seen anything even moderately obvious.
Are there any quick hints you might be able to offer from your own success?
Thank you.
-
afalsow
- User
- Posts: 17
- Joined: Thu Jun 14, 2012 1:28 pm
Re: Finding/highlighting text objects in a given rectangle
On additional research, it appears it's possible to build a function (based on the Adobe SDK docs) that accepts, as arguments, an Annotation and a page number, and returns an array of all words on that page that intersect the quad space of the annot in question. Such a function looks like this:
function getHighlightedWords( annot, pagenumber ) {
    var annotQuads = annot.quads[0];
    var highlightedWords = new Array;
    // test every word on the page
    for (var i = 0; i < getPageNumWords(pagenumber); i++) {
        var q = getPageNthWordQuads( pagenumber ,i )[0];
        if ( q[1] == annotQuads[1])
            if ( q[0] >= annotQuads[0] &&
                 q[6] <= annotQuads[6] )
                highlightedWords.push(getPageNthWord( pagenumber ,i ));
    }
    return highlightedWords;
}
But I do not seem to see any "shortcut" low-level commands in PDF Tools to emulate this function.
-
John - Tracker Supp
- Site Admin
- Posts: 5225
- Joined: Tue Jun 29, 2004 10:34 am
Re: Finding/highlighting text objects in a given rectangle
There is certainly no 'high level' API shortcut, and whilst this is possible in principle using our low-level API, it could not be called a shortcut and the coding required would be considerable.
However, you could consider using our Viewer ActiveX, where this should be fairly easy to achieve using JavaScript. Do be aware that the license model for the Viewer ActiveX is different from that of the PDF-Tools SDK: you pay a license fee for every end-user installation of any application you enable with our ActiveX, though the cost is modest to say the least...
More info here:
https://www.pdf-xchange.com/product ... ctivex-sdk
If posting files to this forum - you must archive the files to a ZIP, RAR or 7z file or they will not be uploaded - thank you.
Best regards
Tracker Support
http://www.tracker-software.com
-
John - Tracker Supp
- Site Admin
- Posts: 5225
- Joined: Tue Jun 29, 2004 10:34 am
Re: Finding/highlighting text objects in a given rectangle
Hi,
I had a quick 'run through' of your task with one of the project team. On testing, JavaScript on the current build would not provide the required results. However, in an update in December it will function as required and, he insists, would actually outperform Adobe for the same task, using the code snippet below...
Code:
function getHighlightedWords(annot, pagenumber)
{
    var highlightedWords = new Array;
    // a highlight may consist of several quads (e.g. one per line), so check each
    for (var j = 0; j < annot.quads.length; j++)
    {
        var annotQuads = annot.quads[j];
        // test every word on the page
        for (var i = 0; i < getPageNumWords(pagenumber); i++)
        {
            var t = getPageNthWordQuads(pagenumber, i);
            // skip words for which no quads are returned
            if (t == null) continue;
            var q = t[0];
            // same y for the first vertex, and x extent inside the annotation quad
            if ((q[1] == annotQuads[1]) && (q[0] >= annotQuads[0] && q[6] <= annotQuads[6]))
                highlightedWords.push(getPageNthWord(pagenumber, i));
        }
    }
    return highlightedWords;
}
HTH
Best regards
Tracker Support
http://www.tracker-software.com
-
gschell
- User
- Posts: 4
- Joined: Thu Nov 01, 2012 2:10 pm
Re: Finding/highlighting text objects in a given rectangle
Hi afalsow.
I haven't touched the low-level API yet, as I've been too busy with other projects. So I can't comment on that part, but you seem to have it covered already.
You take the 4 end-points, compute the min/max, you get the "highlight AABB".
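As a rough sketch of that min/max step for an annotation: in the PDF format, QuadPoints is a flat array of 8 × n numbers, four (x, y) pairs per quadrilateral, so each group of eight gives one highlight AABB. The function name quadPointsToAabbs and the box shape are made up for illustration. Because only min/max is used, the order in which the producer wrote the four corners does not matter.

```javascript
// Convert a flat /QuadPoints array (8 numbers per quad) into one AABB per quad.
// Illustrative helper; it expects the numbers to have been read from the
// annotation dictionary already.
function quadPointsToAabbs(quadPoints) {
  const boxes = [];
  for (let i = 0; i + 8 <= quadPoints.length; i += 8) {
    const xs = [];
    const ys = [];
    for (let k = 0; k < 8; k += 2) {
      xs.push(quadPoints[i + k]);
      ys.push(quadPoints[i + k + 1]);
    }
    boxes.push({
      minX: Math.min(...xs),
      minY: Math.min(...ys),
      maxX: Math.max(...xs),
      maxY: Math.max(...ys),
    });
  }
  return boxes;
}
```

A highlight that spans several lines typically has one quad per line, so testing against each quad's AABB is tighter than testing against the single Rect, which encloses all of them.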
Here's how I would handle the character matching part:
- When you load a page, iterate through all text elements and pre-compute their AABBs (if it's super fast, you can pre-compute the AABB of every character in each element, else do it on demand).
- When you need to extract the highlighted stuff, you first do a collision test of the highlight AABB with every text element's AABB. You keep those passing the test (i.e. intersecting) in a text element list.
- For every text element in that list, you do a per-character collision test, and you push into a character list the characters (the Unicode character + its AABB) that are either fully enclosed or just intersecting the highlight AABB (depends on what you want, I guess).
- Then, you sort that list by the left X coordinate of each character's AABB, and then by the top Y coordinate (use a stable sorting algorithm for that).
- You now have a sorted list of all highlighted characters, from top to bottom, left to right.
- Concatenate them and you have your highlighted string.
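The steps above can be sketched like this. It is only an illustration under stated assumptions: each element is a plain object { box, chars: [{ ch, box }] } with boxes of the form { minX, minY, maxX, maxY } computed beforehand, coordinates are in PDF user space with Y growing upward (so "top" is maxY and top-to-bottom means descending maxY), and the JavaScript engine has a stable Array.prototype.sort (guaranteed since ECMAScript 2019). The names intersects and highlightedText are made up.

```javascript
// True when two AABBs overlap or touch.
function intersects(a, b) {
  return a.minX <= b.maxX && a.maxX >= b.minX &&
         a.minY <= b.maxY && a.maxY >= b.minY;
}

// Collect the characters hit by the highlight AABB and return them as a string.
function highlightedText(elements, highlight) {
  const hits = [];
  for (const el of elements) {
    // Cheap test first: skip elements that cannot contain a hit.
    if (!intersects(el.box, highlight)) continue;
    // Per-character test; this keeps characters that merely intersect.
    for (const c of el.chars) {
      if (intersects(c.box, highlight)) hits.push(c);
    }
  }
  // Two stable passes: by left X, then by top Y (descending, since Y grows upward).
  hits.sort((p, q) => p.box.minX - q.box.minX);
  hits.sort((p, q) => q.box.maxY - p.box.maxY);
  return hits.map((c) => c.ch).join("");
}
```

Sorting on the exact top coordinate only keeps a line together when all of its characters share the same top. With mixed fonts or sizes on one line, you would probably want to group characters into lines with a small tolerance before ordering by X.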
Hope that helps!
-
John - Tracker Supp
- Site Admin
- Posts: 5225
- Joined: Tue Jun 29, 2004 10:34 am
Re: Finding/highlighting text objects in a given rectangle
Thanks gschell - your input is most appreciated 
Best regards
Tracker Support
http://www.tracker-software.com