Icepdf does not detect correct position of the word inside document
andreiweb

Joined: 17/02/2010 00:00:00
Messages: 4
Offline


I use the ICEpdf core to detect the position of words inside PDF documents, and I am having trouble with one of these documents.
The problem is that the line width is reported as about 100 px, although the page is 400 px wide. The position of each word in the line is squeezed into those 100 px.

I should mention that in Acrobat Reader the affected lines are displayed correctly, at about 380 px in length.

For some pages the positions are reported correctly, but for others they are not. Page 21 is one of the affected pages.

If anybody can help me, thank you in advance.


The PDF with the problem is located at this address:
http://develop.ime.ro:8080/toread/pdfs/V0001.pdf

LE: you can verify this problem more easily using ICEpdf via Java Web Start from http://www.icepdf.org/demo/jws/icepdf.jnlp
Then choose File -> Open URL -> http://develop.ime.ro:8080/toread/pdfs/V0001.pdf
Go to page 21, choose the text select tool and use it to select the text on that page. (In Adobe Reader all the text is selected correctly.)



The code that I use to locate word position is this:

(Sorry, there is a problem with the [ code ] tag, so I am posting the code as it is.)

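import java.awt.geom.Rectangle2D;
import java.io.File;

import org.icepdf.core.pobjects.Document;
import org.icepdf.core.pobjects.PDimension;
import org.icepdf.core.pobjects.graphics.text.LineText;
import org.icepdf.core.pobjects.graphics.text.PageText;
import org.icepdf.core.pobjects.graphics.text.WordText;

// Note: setFile() throws checked exceptions, so this must run inside a
// method that declares or handles them.
Document document = new Document();
File filePath = new File("file.pdf");
document.setFile(filePath.getCanonicalPath());

for (int pageIdx = 0; pageIdx < document.getNumberOfPages(); pageIdx++) {

    StringBuilder pageBuffer = new StringBuilder();
    PageText pageText = document.getPageText(pageIdx);

    // Page size with no user rotation (second argument is 0f).
    PDimension pageSize = document.getPageDimension(pageIdx, 0f);
    float pageHeight = pageSize.getHeight();
    float pageWidth = pageSize.getWidth();

    pageBuffer.append("pageWidth=" + pageWidth + ";");
    pageBuffer.append("pageHeight=" + pageHeight + ";");

    System.out.println("page:" + pageIdx + " pageWidth:" + pageWidth + " pageHeight:" + pageHeight);

    int lineIdx = 0;
    for (Object olineText : pageText.getPageLines()) {

        LineText lineText = (LineText) olineText;
        Rectangle2D lineBounds = lineText.getBounds();

        // returnWithPrecision(...) is a number-formatting helper that is not shown in this post.
        System.out.println("line:" + lineIdx
                + " x: " + returnWithPrecision(lineBounds.getX())
                + " maxx: " + returnWithPrecision(lineBounds.getMaxX())
                + " y: " + returnWithPrecision(lineBounds.getY())
                + " width: " + returnWithPrecision(lineBounds.getWidth())
                + " height: " + returnWithPrecision(lineBounds.getHeight()));

        // Print the top-left corner reported for every word in the line.
        for (Object owordText : lineText.getWords()) {
            Rectangle2D wordBounds = ((WordText) owordText).getBounds();
            System.out.println("wordX:" + wordBounds.getX() + " wordY:" + wordBounds.getY());
        }

        lineIdx++;
    }
}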
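
The symptom described above (lines reported at about 100 px on a 400 px page) can be checked for automatically. The following is only a rough sketch, not part of ICEpdf: the class `NarrowLineCheck`, the method `medianWidthRatio` and the 0.5 threshold are hypothetical names and values chosen for illustration. It takes line bounds that have already been collected (for example from `lineText.getBounds()` in the loop above) and compares the median line width with the page width, so that a few genuinely short lines do not trigger a false alarm.

```java
import java.awt.geom.Rectangle2D;
import java.util.Arrays;
import java.util.List;

public class NarrowLineCheck {

    /**
     * Returns the median line width divided by the page width.
     * Returns NaN when there are no lines to measure.
     */
    public static double medianWidthRatio(List<? extends Rectangle2D> lineBounds, double pageWidth) {
        if (lineBounds.isEmpty() || pageWidth <= 0) {
            return Double.NaN;
        }
        double[] widths = new double[lineBounds.size()];
        for (int i = 0; i < widths.length; i++) {
            widths[i] = lineBounds.get(i).getWidth();
        }
        Arrays.sort(widths);
        int mid = widths.length / 2;
        // Even count: average the two middle values; odd count: take the middle one.
        double median = (widths.length % 2 == 0)
                ? (widths[mid - 1] + widths[mid]) / 2.0
                : widths[mid];
        return median / pageWidth;
    }

    public static void main(String[] args) {
        // Hypothetical line bounds resembling the numbers in this post.
        List<Rectangle2D.Float> lines = Arrays.asList(
                new Rectangle2D.Float(10, 20, 100, 12),
                new Rectangle2D.Float(10, 34, 98, 12),
                new Rectangle2D.Float(10, 48, 95, 12),
                new Rectangle2D.Float(10, 62, 40, 12));
        double ratio = medianWidthRatio(lines, 400);
        // The 0.5 threshold is an arbitrary assumption, tune it per document set.
        System.out.println("median width ratio: " + ratio + " suspicious: " + (ratio < 0.5));
    }
}
```

A page whose ratio is far below what the same page shows in another viewer is a candidate for the layer mismatch discussed in this thread. Pages with narrow columns or mostly short lines will also score low, so the result is a hint, not proof.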


Text selection in ICEpdf with the font engine ON: [screenshot]


Text selection in Adobe Reader: [screenshot]

patrick.corless

Joined: 26/10/2004 00:00:00
Messages: 1097
Online


Thanks for sending in the file. I've given it a good look and think I've identified the problem.

We've been seeing more and more PDFs generated like this one, where the original scan is visible and OCR was used to write out a hidden text layer for text extraction.

What seems to be happening is that we are incorrectly substituting the fonts of the OCR layer with a font whose glyph widths differ from those of the font used to generate the PDF. I've attached a screenshot that introduces an alpha value into the rendering stack so you can see the OCR text behind the image text.
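
To illustrate why a substituted font with different glyph widths shrinks the reported line, here is a minimal sketch. It is not ICEpdf code: the class `SubstitutedWidthDemo`, the method `lineWidth` and the width values are all hypothetical. It only relies on the fact that PDF font width tables are normally expressed in thousandths of the font size, so the width of a line is the sum of the glyph widths scaled by the font size.

```java
import java.util.HashMap;
import java.util.Map;

public class SubstitutedWidthDemo {

    /**
     * Sums per-glyph advance widths (in 1/1000 of the font size) and scales
     * the total by the font size. Characters missing from the table use
     * defaultWidth.
     */
    public static double lineWidth(String text, Map<Character, Integer> widths,
                                   int defaultWidth, double fontSize) {
        double total = 0;
        for (char c : text.toCharArray()) {
            total += widths.getOrDefault(c, defaultWidth);
        }
        return total * fontSize / 1000.0;
    }

    public static void main(String[] args) {
        // Both width tables are empty, so every glyph uses the default width.
        // 600 and 250 are made-up values standing in for the font the OCR tool
        // used and for a narrower substitute font.
        Map<Character, Integer> noOverrides = new HashMap<>();
        String text = "scanned text";

        double original = lineWidth(text, noOverrides, 600, 10.0);
        double substituted = lineWidth(text, noOverrides, 250, 10.0);

        System.out.println("original font width:    " + original);
        System.out.println("substituted font width: " + substituted);
    }
}
```

With the same text and font size, the narrower width table yields a much shorter line, and every word position computed from those widths is compressed in the same proportion.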

I'm on the road this week, but I'll see what I can do to get a bug opened to track this issue, as well as a fix.
[Attachment: snapshot.png, 216 Kbytes]

andreiweb



Hello Patrick, and thank you for your response!
You are right, the document was scanned and the text was extracted with OCR. For other PDFs with the same characteristics everything is OK, but for this one (and several others like it) the text and image layers do not match.
I'm looking forward to your bug fix.
andreiweb



Hello Patrick!
I saw that ICEpdf 4.1 was released in the meantime, but unfortunately it did not fix my problem.
Is there any chance it will be fixed in the next release or in a patch?
Thanks a lot!
patrick.corless



I've created bug http://jira.icefaces.org/browse/PDF-200 to track this issue. I'll be looking at it more closely this weekend and will let you know once I figure it out.