Extract Words from PDF Document using Apache PDFBox

To extract words from a PDF document in Java, you can use Apache PDFBox and extend PDFTextStripper. The idea is to let PDFBox read the text stream from each page, intercept each text line in writeString(String str, List<TextPosition> textPositions), and split that line into individual words.

This method works well for normal text-based PDF files where the text is embedded in the document. If the PDF contains scanned images of pages, there may be no selectable text to extract. In that case, you need OCR before word extraction.

The class org.apache.pdfbox.text.PDFTextStripper strips text from PDF pages. By overriding writeString(), we can collect words as PDFBox reads the content of the document.

PDFBox Word Extraction Logic

PDFBox does not directly return a separate list of words from a PDF file. It extracts text as strings. Each string passed to writeString() is usually a line or a text fragment. You can split that string using the word separator returned by getWordSeparator().

The basic flow is:

  • Load the PDF file using PDDocument.load() (PDFBox 2.x; in PDFBox 3.x, use Loader.loadPDF() instead).
  • Create a custom class that extends PDFTextStripper.
  • Set the page range to process.
  • Call writeText() to make PDFBox parse the document.
  • Override writeString() and split each extracted string into words.

Steps to Extract Words from PDF Document

The following is a step-by-step process to extract words from a PDF using Apache PDFBox.

1. Extend PDFTextStripper for PDF Word Extraction

Create a Java class that extends PDFTextStripper.

public class GetWordsFromPDF extends PDFTextStripper {
  . . .
}

This custom class gives you access to the protected methods of PDFTextStripper, including writeString(), which is useful when you want to process text as it is extracted.

2. Call writeText to Read PDF Pages

Set page boundaries from the first page to the last page, and call the method writeText(). The writeText() method starts text extraction and internally calls writeString().

// GetWordsFromPDF is the custom class that extends PDFTextStripper
PDFTextStripper stripper = new GetWordsFromPDF();
stripper.setSortByPosition( true );
stripper.setStartPage( 1 ); // page numbers are 1-based
stripper.setEndPage( document.getNumberOfPages() );

// the writer output is discarded; words are collected in writeString()
Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
stripper.writeText(document, dummy);

The stripper object is created from GetWordsFromPDF, the custom class defined in step 1. The dummy writer is used because the extracted words are collected inside the overridden writeString() method instead of being written to a text file.

3. Override writeString to Intercept PDF Text

The writeString() method receives extracted text as the first argument. PDFBox calls this method while processing the PDF document. By overriding it, you can decide how to handle each extracted text fragment.

@Override
protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
    . . .
}

The second argument, textPositions, contains position details of the characters. In this tutorial, we only need the extracted string. If you need coordinates of text or characters, you can use the TextPosition objects.

4. Split Extracted PDF Text into Words

Split the string received by writeString() using the word separator. Each resulting value can be stored in a list, printed to the console, or written to a file.

For simple PDF files, splitting with getWordSeparator() is usually enough. For production code, you may also want to trim empty strings, normalize whitespace, remove punctuation, or handle hyphenated words based on your requirement.
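The cleanup steps mentioned above can be sketched in plain Java, independent of PDFBox. The WordCleaner class and its clean() method are hypothetical names used only for illustration. The sketch assumes that punctuation should be stripped from the start and end of each word while inner characters, such as the hyphen in a hyphenated word, are kept. Adjust the rules to your own requirement.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of PDFBox): turns one extracted line into clean words.
class WordCleaner {

    static List<String> clean(String line) {
        List<String> result = new ArrayList<>();
        // Split on any run of whitespace (spaces, tabs, line breaks).
        for (String token : line.trim().split("\\s+")) {
            // Strip leading and trailing punctuation; inner characters such as hyphens stay.
            String word = token.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
            // Skip tokens that are empty or consisted of punctuation only.
            if (!word.isEmpty()) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [Welcome, to, The, Apache, Software, Foundation, well-known]
        System.out.println(clean("  Welcome to  The Apache Software Foundation! (well-known) "));
    }
}
```

Inside writeString(), a helper like this could be called as words.addAll(WordCleaner.clean(str)) in place of the plain split.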

Example 1 – Extract Words from PDF using PDFBox

In this example, we will take a PDF document and extract all words from this PDF using a custom PDFTextStripper class.

GetWordsFromPDF.java

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
 
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
 
/**
* This is an example on how to extract words from PDF document
*/
public class GetWordsFromPDF extends PDFTextStripper {
    
    static List<String> words = new ArrayList<String>();
 
    public GetWordsFromPDF() throws IOException {
    }
 
    /**
     * @throws IOException If there is an error parsing the document.
     */
    public static void main( String[] args ) throws IOException {
        PDDocument document = null;
        String fileName = "apache.pdf"; // replace with your PDF file name
        try {
            document = PDDocument.load( new File(fileName) );
            PDFTextStripper stripper = new GetWordsFromPDF();
            stripper.setSortByPosition( true );
            stripper.setStartPage( 1 ); // page numbers are 1-based
            stripper.setEndPage( document.getNumberOfPages() );
 
            Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
            stripper.writeText(document, dummy);
            
            // print words
            for(String word:words){
                System.out.println(word); 
            }
        }
        finally {
            if( document != null ) {
                document.close();
            }
        }
    }
 
    /**
     * Override the default functionality of PDFTextStripper.writeString()
     */
    @Override
    protected void writeString(String str, List<TextPosition> textPositions) throws IOException {
        // split() treats the separator as a regular expression and never returns null
        for (String word : str.split(getWordSeparator())) {
            words.add(word);
        }
    }
}

Output

2017-8-6
Welcome
to
The
Apache
Software
Foundation!
Custom
Search
The
Apache
Way
(/foundation/governance/)

(http://apache.org/foundation/contributing.html)

To use the same PDF file, download apache.pdf. Otherwise, set fileName in the Java program to the path of your own PDF file.

Handling Empty Words and Extra Spaces in Extracted PDF Text

Some PDF files may contain multiple spaces, line breaks, tabs, or text fragments that produce empty words after splitting. If you want cleaner output, add a trim check before adding each word to the list.

String[] wordsInStream = str.split(getWordSeparator());
for (String word : wordsInStream) {
    word = word.trim();
    if (!word.isEmpty()) {
        words.add(word);
    }
}

You can also replace getWordSeparator() with a regular expression such as "\\s+" when you want to split by one or more whitespace characters.

String[] wordsInStream = str.split("\\s+");
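One detail worth keeping in mind is that String.split() interprets its argument as a regular expression. This applies to the value returned by getWordSeparator() too. The following sketch compares a literal split with a whitespace split in plain Java. SplitDemo, splitLiteral() and splitWhitespace() are hypothetical names used only for this illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class comparing two ways of splitting extracted text.
class SplitDemo {

    // Splits on a literal separator. Pattern.quote() prevents characters such as "." or "|"
    // from being interpreted as regular expression syntax.
    static String[] splitLiteral(String text, String separator) {
        return text.split(Pattern.quote(separator));
    }

    // Splits on runs of whitespace. Trimming first avoids a leading empty string.
    static String[] splitWhitespace(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        // Two consecutive spaces produce an empty word: prints [Apache, , Way]
        System.out.println(Arrays.toString(splitLiteral("Apache  Way", " ")));
        // Runs of spaces and tabs collapse: prints [Apache, Way]
        System.out.println(Arrays.toString(splitWhitespace(" Apache \t Way ")));
    }
}
```

With the default separator, a single space, quoting makes no difference. It matters only if you configure a separator that contains regular expression metacharacters.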

Why PDFBox May Not Extract Words from Some PDFs

If the program prints no words, or the output looks incomplete, the issue is usually related to the PDF structure rather than the Java loop. Common reasons include scanned pages, unusual font encoding, protected files, text stored as vector drawings, or complex layouts such as tables and multi-column documents.

| Issue in PDF | What happens during word extraction | Possible fix |
|---|---|---|
| Scanned PDF | No embedded text is available for PDFBox to read. | Run OCR first, then extract words from the searchable PDF. |
| Protected PDF | Text extraction may fail or return limited content. | Use a PDF that allows text extraction. |
| Custom font encoding | Extracted words may contain incorrect characters. | Test with another PDF or inspect the embedded fonts. |
| Multi-column layout | Words may appear in an unexpected order. | Use setSortByPosition(true) and validate the output. |
| Tables and forms | Text may be extracted as fragments instead of clean rows. | Use additional layout logic if row or column structure matters. |

Extracting PDF Words vs Converting PDF Text to Word

This tutorial extracts words programmatically from a PDF file. It does not recreate the PDF layout, formatting, tables, or images in a Word document. If your requirement is to convert PDF text to Word format, you need a PDF-to-Word conversion tool or a separate document-generation step after extracting the text.

For Java applications, word extraction is useful when you want to index PDF content, count words, search for keywords, analyze text, or feed the extracted words into another processing pipeline.
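As an illustration of one such use, the sketch below counts how often each word occurs in a list of extracted words. It is plain Java and does not depend on PDFBox. WordFrequency and count() are hypothetical names. The sketch assumes that case should be ignored when counting and that the input words have already been cleaned.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: counts occurrences of each extracted word, ignoring case.
class WordFrequency {

    static Map<String, Integer> count(List<String> words) {
        // TreeMap keeps the keys in alphabetical order for stable output.
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : words) {
            String key = word.toLowerCase(Locale.ROOT);
            // Add 1 to the existing count, or start at 1 for a new word.
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "The", "Apache", "Software", "Foundation", "The", "Apache", "Way");
        // prints {apache=2, foundation=1, software=1, the=2, way=1}
        System.out.println(count(words));
    }
}
```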

Best Practices for Extracting Words from PDF in Java

  • Always close the PDDocument after processing the PDF file.
  • Use setSortByPosition(true) when reading documents where visual reading order matters.
  • Trim each extracted word before storing or printing it.
  • Skip empty strings after splitting text.
  • Test the code with different PDF types, including simple text PDFs, tables, and multi-column documents.
  • Use OCR first if the PDF is scanned and does not contain selectable text.

FAQs on Extracting Words from PDF using PDFBox

Why can’t I extract words from a PDF using PDFBox?

You may not be able to extract words if the PDF is scanned, protected, or does not contain embedded text. PDFBox reads text objects from the PDF. If the page is only an image, OCR is required before extracting words.

Can PDFBox extract text from scanned PDF files?

PDFBox can extract embedded text from a PDF, but it does not perform OCR by itself. For scanned PDFs, first convert the scanned pages into searchable text using OCR, and then use PDFBox to extract words.

How do I extract words from all pages of a PDF in Java?

Set the start page and end page on PDFTextStripper, then call writeText(). Page numbers are 1-based, so to process every page, set the start page to 1 and the end page to document.getNumberOfPages().

Why is the extracted word order different from the PDF display order?

PDF files store text based on internal drawing instructions, not always in normal reading order. Calling setSortByPosition(true) can improve the order, but complex layouts may still need custom handling.

Can I extract word coordinates from a PDF using PDFBox?

Yes. The writeString() method also receives a list of TextPosition objects. You can use those objects to inspect character positions and build word-level coordinate logic.
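The grouping logic behind word-level coordinates can be sketched without PDFBox on the classpath. In the sketch below, Glyph is a hypothetical stand-in for the per-character data that a TextPosition provides through getters such as getUnicode(), getXDirAdj(), getYDirAdj() and getWidthDirAdj(). WordBox and group() are also illustrative names. The sketch assumes that the characters arrive in reading order on a single line and that whitespace characters mark word boundaries. Real documents may need gap-based detection as well.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: groups positioned characters into words with a horizontal extent.
class WordBoxes {

    // Stand-in for the data read from one TextPosition (assumed mapping, see above).
    static class Glyph {
        final String text;
        final float x;
        final float y;
        final float width;

        Glyph(String text, float x, float y, float width) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    // A word with the x range it covers and the y position of its first character.
    static class WordBox {
        final String word;
        final float startX;
        final float endX;
        final float y;

        WordBox(String word, float startX, float endX, float y) {
            this.word = word;
            this.startX = startX;
            this.endX = endX;
            this.y = y;
        }

        @Override
        public String toString() {
            return word + " [x=" + startX + ".." + endX + ", y=" + y + "]";
        }
    }

    static List<WordBox> group(List<Glyph> glyphs) {
        List<WordBox> boxes = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        float startX = 0f;
        float endX = 0f;
        float y = 0f;
        for (Glyph glyph : glyphs) {
            if (glyph.text.trim().isEmpty()) {
                // A whitespace character closes the word collected so far.
                if (current.length() > 0) {
                    boxes.add(new WordBox(current.toString(), startX, endX, y));
                    current.setLength(0);
                }
                continue;
            }
            if (current.length() == 0) {
                // First character of a new word fixes its start position.
                startX = glyph.x;
                y = glyph.y;
            }
            current.append(glyph.text);
            endX = glyph.x + glyph.width;
        }
        // Flush the last word if the line did not end with whitespace.
        if (current.length() > 0) {
            boxes.add(new WordBox(current.toString(), startX, endX, y));
        }
        return boxes;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("H", 10f, 100f, 5f));
        glyphs.add(new Glyph("i", 15f, 100f, 2f));
        glyphs.add(new Glyph(" ", 17f, 100f, 3f));
        glyphs.add(new Glyph("y", 20f, 100f, 5f));
        glyphs.add(new Glyph("o", 25f, 100f, 5f));
        for (WordBox box : group(glyphs)) {
            System.out.println(box);
        }
    }
}
```

In an overridden writeString(), each TextPosition in the textPositions list could be mapped to a Glyph before calling group().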

QA Checklist for PDFBox Word Extraction Tutorial

  • Confirm that the Java example uses a text-based PDF, not a scanned image-only PDF.
  • Check that PDDocument is closed after text extraction.
  • Verify that empty strings are handled if the output is used in production code.
  • Test whether setSortByPosition(true) improves word order for the sample PDF.
  • Confirm that OCR is mentioned when explaining why text cannot be extracted from scanned PDFs.

Extract Words from PDF with PDFBox: Summary

In this Apache PDFBox tutorial, we have learnt to extract words from a PDF by extending PDFTextStripper and overriding writeString(). You may also refer to the tutorial on extracting coordinates or positions of characters in a PDF if you want to work with character positions in a PDF document.