Extract All Text
ICEsoft.org Forums: ICEfaces, ICEmobile, ICEpdf
Knuckle

Joined: 22/11/2008 00:00:00
Messages: 93
Offline


Hi ya

I wish to extract all the text from a pdf document.
As noted in the example, a page number is required.

Code:
   int pageNumber = 0; // zero-based page index
   PageText pageText = document.getPageText(pageNumber);
 


Is it possible to programmatically obtain the number of pages in the document I want to extract text from?


Cheers
Wayne

Sorry, solved: document.getNumberOfPages() returns the page count.

patrick.corless

Joined: 26/10/2004 00:00:00
Messages: 1097
Offline


You can iterate over the document pages like this:

Code:
 for (int pageIndex = 0; pageIndex < document.getNumberOfPages(); pageIndex++) {
     // Fetch the text of one page; page indexes start at 0.
     PageText pageText = document.getPageText(pageIndex);
 }
Knuckle



Nice...

Thanks Patrick

Cheers
Wayne
Knuckle



Hi Patrick

I am having problems extracting text from PDF documents that were created by scanning and made searchable by OCR.

I am using a StringBuilder in combination with the code you supplied to extract the text, but I get an error.

Code:
         StringBuilder builder = new StringBuilder();
         for (int pageIndex = 0; pageIndex < document.getNumberOfPages(); pageIndex++) {
             PageText pageText = document.getPageText(pageIndex);
             if (pageText != null && pageText.getPageLines() != null) {
                 builder.append(pageText.toString());
             }
         }
         content = builder.toString();
 


java.lang.StringIndexOutOfBoundsException: String index out of range: 0

All other types of PDFs seem to be OK.

Cheers
Wayne
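
For readers hitting the same message: in Java, indexing position 0 of an empty string is one common way to get a StringIndexOutOfBoundsException at index 0. Whether that is what happens inside ICEpdf for OCR-generated pages is an assumption here, not something this thread confirms. The sketch below only illustrates the failure and a length check that avoids it; `EmptyStringGuard` and `firstCharOrDefault` are hypothetical names, not part of ICEpdf.

```java
public class EmptyStringGuard {
    /** Returns the first character of s, or the fallback when s is null or empty. */
    public static char firstCharOrDefault(String s, char fallback) {
        // s.charAt(0) on "" throws StringIndexOutOfBoundsException,
        // so check for null and length before indexing.
        if (s == null || s.isEmpty()) {
            return fallback;
        }
        return s.charAt(0);
    }

    public static void main(String[] args) {
        try {
            "".charAt(0); // index 0 does not exist in an empty string
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("Unguarded access failed: " + e.getMessage());
        }
        System.out.println(firstCharOrDefault("", '?'));
        System.out.println(firstCharOrDefault("text", '?'));
    }
}
```

The exact exception message text differs between Java versions, so the example prints it rather than assuming it.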
patrick.corless



Hi Wayne, is there any chance you can post the file in question or the full stack trace of the exception?
Knuckle



Hi Patrick

When I try to attach a simple PDF example, the forum generates this error:

Code:
 An error has occurred.
 
 For detailed error information, please see the HTML source code, and contact the forum Administrator.
 
 /var/lib/jforum/upload/2010/4/1/3b3a66d1404466abb526424dc6768b4a_54273.pdf_ (Permission denied)
  
   
 Forum Index  
 


And unfortunately I don't get a stack trace when the exception occurs, just this:
Code:
 ICEsoft ICEpdf Core 4.0.0 
 java.lang.StringIndexOutOfBoundsException: String index out of range: 0
 



When I replace the ICEpdf parser with a different parser, I don't experience any problems getting the text from this type of PDF file.
For example:
Code:
 org.pdfbox.util.PDFTextStripper stripper = new org.pdfbox.util.PDFTextStripper();
 stripper.writeText(pdfDocument, writer);
 


Thanks
Wayne
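
When only the exception message is visible, one way to collect more detail is to wrap each page's extraction in its own try/catch and record the page index and full stack trace. The sketch below assumes the exception propagates to the caller; the library may instead catch and log it internally, in which case this will not help. `PageSource` is a hypothetical stand-in for the document so the example runs on its own; it is not an ICEpdf type.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

public class PageTextDiagnostics {
    /** Hypothetical stand-in for a document; not an ICEpdf type. */
    public interface PageSource {
        int getNumberOfPages();
        String getPageText(int pageIndex) throws Exception;
    }

    /** Extracts every page, recording a full stack trace for each page that fails. */
    public static String extractAll(PageSource source, StringBuilder errors) {
        StringBuilder builder = new StringBuilder();
        for (int pageIndex = 0; pageIndex < source.getNumberOfPages(); pageIndex++) {
            try {
                String text = source.getPageText(pageIndex);
                if (text != null) {
                    builder.append(text);
                }
            } catch (Exception e) {
                // Keep going with the next page, but remember which page failed and why.
                StringWriter trace = new StringWriter();
                e.printStackTrace(new PrintWriter(trace, true));
                errors.append("Page ").append(pageIndex).append(" failed:\n").append(trace);
            }
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        // Fake three-page document whose second page fails like the one in this thread.
        PageSource fake = new PageSource() {
            public int getNumberOfPages() { return 3; }
            public String getPageText(int pageIndex) {
                if (pageIndex == 1) {
                    return String.valueOf("".charAt(0)); // simulated failure
                }
                return "page" + pageIndex + " ";
            }
        };
        StringBuilder errors = new StringBuilder();
        System.out.println(extractAll(fake, errors));
        System.err.print(errors);
    }
}
```

Isolating failures per page also means one bad page does not discard the text already extracted from the others.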
Knuckle



Example pdf as requested.

Cheers
Wayne
 Attachment: SimplePdfDocument.pdf (197 Kbytes)

patrick.corless



Thanks for creating bug http://jira.icefaces.org/browse/PDF-170 for this issue. The fix is pretty straightforward, and more detail can be found in JIRA.
 