I use the ICEpdf core to detect the positions of words inside PDF documents, and I'm having trouble with one of these documents.
The problem is that the line width is reported as 100 px, even though the page is 400 px wide. The position of each word in the line is squeezed into those 100 px.
Note that in Acrobat Reader the problematic lines are displayed correctly, at about 380 px in length.
For some pages the positions are reported correctly, but for others they are not. Page 21 is one of the pages with the problem.
Thanks in advance to anyone who can help.
Thanks for sending in the file. I've given it a good look and think I've identified the problem.
We've been seeing more and more PDFs generated like this one, where the original scan is visible but an OCR technology was used to write out a layer for text extraction.
What seems to be happening is that we are incorrectly substituting the fonts for the OCR layer with a font that doesn't have the same character widths as the one used to generate the PDF. I've attached a screenshot that introduces an alpha value into the rendering stack so you can see the OCR text behind the image text.
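To illustrate the effect, here is a minimal sketch. It is not ICEpdf code: the class name, the fixed per-glyph advance, and the numbers 600 and 300 are all made up for the example. It only assumes the usual PDF convention that glyph advances are expressed in 1/1000 of the font size. When a substitute font has narrower advances, the whole line shrinks and every word's starting offset shrinks with it, which is the kind of compression described above.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only, not ICEpdf code. All names and values are hypothetical.
class SubstitutedFontWidthDemo {

    // Start x-offset of each space-separated word, assuming every glyph
    // (including the space) has the same advance, given in 1/1000 of the font size.
    static List<Double> wordOffsets(String line, int advance, double fontSize) {
        List<Double> offsets = new ArrayList<>();
        double x = 0;
        boolean inWord = false;
        for (char c : line.toCharArray()) {
            if (c != ' ' && !inWord) {
                offsets.add(x); // first glyph of a new word
            }
            inWord = (c != ' ');
            x += advance * fontSize / 1000.0;
        }
        return offsets;
    }

    // Total line width under the same fixed-advance assumption.
    static double lineWidth(String line, int advance, double fontSize) {
        return line.length() * advance * fontSize / 1000.0;
    }

    public static void main(String[] args) {
        String line = "text from the OCR layer";
        double fontSize = 10.0;
        int originalAdvance = 600;   // hypothetical: font the OCR layer was written with
        int substituteAdvance = 300; // hypothetical: narrower substitute font

        System.out.println("original width:   " + lineWidth(line, originalAdvance, fontSize));
        System.out.println("substitute width: " + lineWidth(line, substituteAdvance, fontSize));
        System.out.println("original offsets:   " + wordOffsets(line, originalAdvance, fontSize));
        System.out.println("substitute offsets: " + wordOffsets(line, substituteAdvance, fontSize));
    }
}
```

Real fonts have a different advance for each glyph, so the actual ratio between the two widths depends on the text, but the direction of the error is the same.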
I'm on the road this week, but I'll see what I can do to get a bug created to track this issue, as well as a fix.
Hello Patrick and thank you for your response!
You are right, the document was scanned and the text was extracted with OCR. For other PDFs with the same characteristics everything is fine, but for this one (and a few others like it) the text and image layers do not match.
I'm looking forward to your bug fix.
Hello Patrick!
I saw that ICEpdf 4.1 was released in the meantime, but unfortunately it did not fix my problem.
Is there any chance it will be fixed in the next release or in a separate patch?
Thanks a lot!
I've created bug http://jira.icefaces.org/browse/PDF-200 to track this issue. I'll be looking at it more closely this weekend and will let you know once I figure it out.