Extract Words from PDF Document

To extract words from all the pages of a PDF document, we shall override the writeString method of PDFTextStripper.

The class org.apache.pdfbox.text.PDFTextStripper strips out all of the text from a PDF document.

To extract words from a PDF document, we shall extend the PDFTextStripper class and override its writeString(String str, List<TextPosition> textPositions) method.

The first argument to the writeString method is a line of text. This line can be split into words using the word separator.

Steps to Extract Words from PDF Document

Following is a step-by-step process to extract words from a PDF:

1. Extend PDFTextStripper

Create a Java class that extends PDFTextStripper.

public class GetWordsFromPDF extends PDFTextStripper {
  . . .
}

2. Call writeText method

Set the page boundaries (from the first page to the last page) for text stripping, and call the writeText() method.

PDFTextStripper stripper = new GetWordsFromPDF();
stripper.setSortByPosition( true );
stripper.setStartPage( 1 ); // PDFTextStripper page numbers start at 1
stripper.setEndPage( document.getNumberOfPages() );

Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
stripper.writeText(document, dummy);

3. Override writeString

The writeString method receives a line of text as its first argument. It is called for each line of text in the PDF document.

@Override
protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
    . . .
}
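
The override mechanism can be seen in isolation with a small sketch that runs without PDFBox. The classes LineStripperStandIn and CollectingStripper below are hypothetical stand-ins invented for illustration, not PDFBox classes: the base class plays the role of PDFTextStripper (its writeText calls writeString once per line), and the subclass intercepts each line the way GetWordsFromPDF does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for PDFTextStripper so the sketch runs without PDFBox.
class LineStripperStandIn {

    // Plays the role of writeText(): hands every line to the writeString() hook.
    void writeText(List<String> lines) {
        for (String line : lines) {
            writeString(line);
        }
    }

    // Plays the role of writeString(): the default implementation ignores the line.
    protected void writeString(String line) {
    }
}

// Subclass that intercepts each line instead of letting the base class handle it.
class CollectingStripper extends LineStripperStandIn {

    final List<String> lines = new ArrayList<>();

    @Override
    protected void writeString(String line) {
        lines.add(line);
    }
}

class OverrideSketch {
    public static void main(String[] args) {
        CollectingStripper stripper = new CollectingStripper();
        stripper.writeText(Arrays.asList("Welcome to", "The Apache Way"));
        System.out.println(stripper.lines);
    }
}
```

Because the subclass never writes anything to an output, the caller only needs to read the collected list afterwards. This is also why the real program can pass a throwaway Writer to writeText().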

4. Get Words

Split the string received by the writeString method using the word separator.
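
The splitting step can be tried on its own with plain Java. The following is a minimal sketch, assuming a single space as the word separator (inside the stripper the value would come from getWordSeparator()). The helper splitWords is a hypothetical name used only for this illustration. It quotes the separator because String.split() interprets its argument as a regular expression, and it drops the empty tokens that repeated separators would otherwise produce.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; it is not part of PDFBox.
class WordSplitSketch {

    // Splits a line on a literal separator and skips empty tokens.
    static List<String> splitWords(String line, String separator) {
        List<String> result = new ArrayList<>();
        // Pattern.quote() makes split() treat the separator as literal text, not as a regex.
        for (String token : line.split(Pattern.quote(separator))) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // A single space is assumed as the separator for this sample line.
        System.out.println(splitWords("Welcome to The Apache Software Foundation!", " "));
    }
}
```

Whether empty tokens should be kept or dropped depends on what the word list is used for. The full example below keeps whatever split() returns.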

Example 1 – Extract Words from PDF

In this example, we take a PDF document and extract all the words from it.

GetWordsFromPDF.java

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
 
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
 
/**
* This is an example of how to extract words from a PDF document.
*/
public class GetWordsFromPDF extends PDFTextStripper {
    
    static List<String> words = new ArrayList<String>();
 
    public GetWordsFromPDF() throws IOException {
    }
 
    /**
     * @throws IOException If there is an error parsing the document.
     */
    public static void main( String[] args ) throws IOException {
        PDDocument document = null;
        String fileName = "apache.pdf"; // replace with your PDF file name
        try {
            document = PDDocument.load( new File(fileName) );
            PDFTextStripper stripper = new GetWordsFromPDF();
            stripper.setSortByPosition( true );
            stripper.setStartPage( 1 ); // PDFTextStripper page numbers start at 1
            stripper.setEndPage( document.getNumberOfPages() );
 
            Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
            stripper.writeText(document, dummy);
            
            // print words
            for(String word:words){
                System.out.println(word); 
            }
        }
        finally {
            if( document != null ) {
                document.close();
            }
        }
    }
 
    /**
     * Override the default functionality of PDFTextStripper.writeString()
     */
    @Override
    protected void writeString(String str, List<TextPosition> textPositions) throws IOException {
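        // split() takes a regular expression, so quote the separator to match it literally
        String[] wordsInStream = str.split(java.util.regex.Pattern.quote(getWordSeparator()));
        // split() never returns null, so no null check is needed
        for(String word :wordsInStream){
            words.add(word);
        }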
    }
}

Output

2017-8-6
Welcome
to
The
Apache
Software
Foundation!
Custom
Search
The
Apache
Way
(/foundation/governance/)

(http://apache.org/foundation/contributing.html)

To use the same PDF file, download apache.pdf. Otherwise, set fileName in the Java program to the path of your own PDF file.
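
Once the words are collected, the list can be processed like any other Java list. As a rough usage sketch, the following counts how often each word occurs. The sample data is copied from the first lines of the output above, and the class and method names (WordCountSketch, countWords) are hypothetical names chosen for this illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical post-processing helper; it works on any list of extracted words.
class WordCountSketch {

    // Counts occurrences of each word, keeping the order of first appearance.
    static Map<String, Integer> countWords(List<String> words) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String word : words) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample words taken from the example output of GetWordsFromPDF.
        List<String> sample = Arrays.asList(
                "Welcome", "to", "The", "Apache", "Software", "Foundation!",
                "Custom", "Search", "The", "Apache", "Way");
        System.out.println(countWords(sample));
    }
}
```

In GetWordsFromPDF, the same method could be called with the words list after writeText() returns.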

Conclusion

In this Apache PDFBox Tutorial, we learned how to extract words from a PDF document. You may also refer to the tutorial on extracting the coordinates or position of characters in a PDF.