Extract Text Line by Line from PDF using PDFBox

In this tutorial, we shall learn how to extract text line by line from all the pages of a PDF document using PDFBox.

There are two methods. The first uses the getText() method of PDFTextStripper and splits the result into lines. The second overrides the writeString() method of PDFTextStripper, which is invoked while writeText() processes the document.

Method 1 – Use PDFTextStripper.getText()

You may use the getText() method of PDFTextStripper, the same method used to extract all the text from a PDF. Splitting the returned string on the newline delimiter then gives the lines of the PDF document.

With this approach, the program has to read the whole document and strip all of its text before it can split the text into lines.
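The splitting step of Method 1 can be sketched as below. This is a minimal sketch that uses only the Java standard library: the class name SplitExtractedText and the helper toLines are hypothetical, and the sample string merely stands in for the text that getText() would return.

```java
import java.util.Arrays;
import java.util.List;

public class SplitExtractedText {

    /**
     * Splits extracted text into lines on any line break.
     * In a real program the text would come from PDFBox, for example:
     *   String text = new PDFTextStripper().getText(document);
     */
    public static List<String> toLines(String text) {
        // \R matches any line break sequence (\n, \r\n, \r, ...) in Java 8+
        return Arrays.asList(text.split("\\R"));
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by getText(); not real PDF output
        String text = "First line\r\nSecond line\nThird line";
        for (String line : toLines(text)) {
            System.out.println(line);
        }
    }
}
```

Splitting on \R rather than on "\n" alone avoids leaving a stray carriage return at the end of each line when the text uses Windows-style line endings.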

If you would like to process the line as soon as it is fetched, the following method is a better option.

Method 2 – Use PDFTextStripper.writeString()

The class org.apache.pdfbox.text.PDFTextStripper strips out all of the text from a PDF document.

To extract text line by line from a PDF document using PDFBox, we shall extend the PDFTextStripper class and override its writeString(String str, List<TextPosition> textPositions) method.

The first argument to the writeString method is typically a line of text. Depending on how the PDF positions its glyphs, PDFBox may deliver one visual line in more than one call. The line can be split into words using the word separator.
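Splitting a received line into words might look like the sketch below. It is independent of PDFBox: the class SplitLineIntoWords and the helper toWords are hypothetical names, and the separator is passed in as a parameter. Inside a PDFTextStripper subclass you could pass the value of getWordSeparator(), which is a single space by default.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class SplitLineIntoWords {

    /** Splits a line into words on the given separator, dropping empty pieces. */
    public static List<String> toWords(String line, String wordSeparator) {
        List<String> words = new ArrayList<>();
        // Pattern.quote makes the separator a literal, not a regular expression
        for (String word : line.split(Pattern.quote(wordSeparator))) {
            if (!word.isEmpty()) { // repeated separators produce empty pieces
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Sample line taken from the output shown in this tutorial
        System.out.println(toWords("Welcome to The Apache Software Foundation!", " "));
        // prints [Welcome, to, The, Apache, Software, Foundation!]
    }
}
```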

Extract Lines from PDF Document

Following is a step-by-step process to extract text line by line from a PDF.

1. Extend PDFTextStripper

Create a Java class that extends PDFTextStripper.

public class GetLinesFromPDF extends PDFTextStripper {
  . . .
}

2. Call writeText method

Set the page range (from the first page to the last page) and call the writeText method. Page numbers in PDFTextStripper are 1-based.

PDFTextStripper stripper = new GetLinesFromPDF();
stripper.setSortByPosition( true );
stripper.setStartPage( 1 );
stripper.setEndPage( document.getNumberOfPages() );

Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
stripper.writeText(document, dummy);

3. Override writeString

The writeString method receives a line of text as its first argument, which is what we need.

@Override
protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
    . . .
}
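One possible way to process each line as it arrives, instead of collecting all lines first, is to forward the string to a callback. The sketch below does not depend on PDFBox: LineForwarder is a hypothetical helper, and the assumption is that you would call its accept method from inside the overridden writeString. The trimming and numbering are illustrative choices only.

```java
import java.util.function.Consumer;

public class LineForwarder {

    private final Consumer<String> handler;
    private int lineNumber = 0;

    public LineForwarder(Consumer<String> handler) {
        this.handler = handler;
    }

    /** Intended to be called from writeString(str, textPositions) with str. */
    public void accept(String str) {
        String line = str.trim();
        if (line.isEmpty()) {
            return; // ignore whitespace-only text
        }
        lineNumber++;
        // Hand the numbered line to the callback immediately
        handler.accept(lineNumber + ": " + line);
    }

    public static void main(String[] args) {
        LineForwarder forwarder = new LineForwarder(System.out::println);
        forwarder.accept("Welcome to The Apache Software Foundation!");
        forwarder.accept("   ");
        forwarder.accept("Custom Search");
    }
}
```

Keeping the per-line logic in a separate object like this means it can be exercised without a PDF file, while the stripper subclass stays a thin adapter.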

Example 1 – Extract Text Line by Line from PDF using Apache PDFBox

GetLinesFromPDF.java

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
 
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
 
/**
* This is an example of how to extract text line by line from a PDF document
*/
public class GetLinesFromPDF extends PDFTextStripper {
    
    static List<String> lines = new ArrayList<String>();
 
    public GetLinesFromPDF() throws IOException {
    }
 
    /**
     * @throws IOException If there is an error parsing the document.
     */
    public static void main( String[] args ) throws IOException {
        PDDocument document = null;
        String fileName = "apache.pdf";
        try {
            // PDFBox 2.x API; PDFBox 3.x replaces this with Loader.loadPDF(...)
            document = PDDocument.load( new File(fileName) );
            PDFTextStripper stripper = new GetLinesFromPDF();
            stripper.setSortByPosition( true );
            stripper.setStartPage( 1 ); // page numbers are 1-based
            stripper.setEndPage( document.getNumberOfPages() );
 
            Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
            stripper.writeText(document, dummy);
            
            // print lines
            for(String line:lines){
                System.out.println(line); 
            }
        }
        finally {
            if( document != null ) {
                document.close();
            }
        }
    }
 
    /**
     * Override the default functionality of PDFTextStripper.writeString()
     */
    @Override
    protected void writeString(String str, List<TextPosition> textPositions) throws IOException {
        lines.add(str);
        // you may process the line here itself, as and when it is obtained
    }
}

Output

2017-8-6
Welcome to The Apache Software Foundation!
Custom Search
The Apache Way (/foundation/governance/)
 (http://apache.org/foundation/contributing.html)
Contribute (https://community.apache.org/contributors/)
ASF Sponsors (/foundation/thanks.html)
OPEN.
THE APACHE SOFTWARE FOUNDATION

If you would like to use the same PDF file, download apache.pdf. Otherwise, set fileName in the Java program to the path of your own PDF file.

Conclusion

In this PDFBox tutorial, we have learnt how to extract text line by line from a PDF. You may also refer to how we extract words from a PDF document.