Extract Words from PDF Document using Apache PDFBox
To extract words from a PDF document in Java, you can use Apache PDFBox and extend PDFTextStripper. The idea is to let PDFBox read the text stream from each page, intercept each extracted string in writeString(String str, List<TextPosition> textPositions), and split that string into individual words.
This method works well for normal text-based PDF files where the text is embedded in the document. If the PDF contains scanned images of pages, there may be no selectable text to extract. In that case, you need OCR before word extraction.
The class org.apache.pdfbox.text.PDFTextStripper strips text from PDF pages. By overriding writeString(), we can collect words as PDFBox reads the content of the document.
PDFBox Word Extraction Logic
PDFBox does not directly return a separate list of words from a PDF file. It extracts text as strings. Each string passed to writeString() is usually a line or a text fragment. You can split that string using the word separator returned by getWordSeparator().
The basic flow is:
- Load the PDF file using PDDocument.load().
- Create a custom class that extends PDFTextStripper.
- Set the page range to process.
- Call writeText() to make PDFBox parse the document.
- Override writeString() and split each extracted string into words.
Steps to Extract Words from PDF Document
The following is a step-by-step process to extract words from a PDF using Apache PDFBox.
1. Extend PDFTextStripper for PDF Word Extraction
Create a Java class that extends PDFTextStripper.
public class GetWordsFromPDF extends PDFTextStripper {
. . .
}
This custom class gives you access to the protected methods of PDFTextStripper, including writeString(), which is useful when you want to process text as it is extracted.
2. Call writeText to Read PDF Pages
Set page boundaries from the first page to the last page, and call the method writeText(). The writeText() method starts text extraction and internally calls writeString().
PDFTextStripper stripper = new GetWordsFromPDF();
stripper.setSortByPosition( true );
stripper.setStartPage( 1 ); // page numbers are 1-based
stripper.setEndPage( document.getNumberOfPages() );
Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
stripper.writeText(document, dummy);
In the complete example below, the stripper object is created from GetWordsFromPDF. The dummy writer is used because the extracted words are collected inside the overridden writeString() method instead of being written to a text file.
3. Override writeString to Intercept PDF Text
The writeString() method receives extracted text as the first argument. PDFBox calls this method while processing the PDF document. By overriding it, you can decide how to handle each extracted text fragment.
@Override
protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
. . .
}
The second argument, textPositions, contains position details of the characters. In this tutorial, we only need the extracted string. If you need coordinates of text or characters, you can use the TextPosition objects.
4. Split Extracted PDF Text into Words
Split the string received by writeString() using the word separator. Each resulting value can be stored in a list, printed to the console, or written to a file.
For simple PDF files, splitting with getWordSeparator() is usually enough. For production code, you may also want to trim empty strings, normalize whitespace, remove punctuation, or handle hyphenated words based on your requirement.
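The cleanup steps mentioned above can be sketched with plain Java string handling. The following is a minimal sketch, not part of PDFBox: the class name WordCleaner and the method clean() are hypothetical, and the choice to strip punctuation only from the two ends of a token (so that inner hyphens and digits survive) is an assumption that may not suit every document.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: cleans one extracted string.
class WordCleaner {

    // Splits a line on runs of whitespace and strips punctuation
    // from the start and end of each token. Empty results are skipped.
    static List<String> clean(String line) {
        List<String> result = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            // Remove leading and trailing punctuation; inner characters
            // such as the hyphen in "well-known" are kept.
            String word = token.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
            if (!word.isEmpty()) {
                result.add(word);
            }
        }
        return result;
    }
}
```

Inside an overridden writeString(), you could call words.addAll(WordCleaner.clean(str)) instead of adding the raw tokens. Note that this also strips the surrounding parentheses and slashes from tokens that look like paths or URLs, which may or may not be what you want.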
Example 1 – Extract Words from PDF using PDFBox
In this example, we will take a PDF document and extract all words from this PDF using a custom PDFTextStripper class.
GetWordsFromPDF.java
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;

/**
 * This is an example of how to extract words from a PDF document.
 */
public class GetWordsFromPDF extends PDFTextStripper {

    // Words collected by writeString() while PDFBox parses the document
    static List<String> words = new ArrayList<String>();

    public GetWordsFromPDF() throws IOException {
    }

    /**
     * @throws IOException If there is an error parsing the document.
     */
    public static void main( String[] args ) throws IOException {
        PDDocument document = null;
        String fileName = "apache.pdf"; // replace with your PDF file name
        try {
            // PDFBox 2.x API; PDFBox 3.x uses Loader.loadPDF(File) instead
            document = PDDocument.load( new File(fileName) );
            PDFTextStripper stripper = new GetWordsFromPDF();
            stripper.setSortByPosition( true );
            stripper.setStartPage( 1 ); // page numbers are 1-based
            stripper.setEndPage( document.getNumberOfPages() );

            // The writer output is discarded; words are collected in writeString()
            Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
            stripper.writeText(document, dummy);

            // print words
            for (String word : words) {
                System.out.println(word);
            }
        }
        finally {
            if( document != null ) {
                document.close();
            }
        }
    }

    /**
     * Override the default functionality of PDFTextStripper.writeString()
     */
    @Override
    protected void writeString(String str, List<TextPosition> textPositions) throws IOException {
        // Split the extracted string on the word separator (a single space by default)
        String[] wordsInStream = str.split(getWordSeparator());
        for (String word : wordsInStream) {
            words.add(word);
        }
    }
}
Output
2017-8-6
Welcome
to
The
Apache
Software
Foundation!
Custom
Search
The
Apache
Way
(/foundation/governance/)
(http://apache.org/foundation/contributing.html)
If you would like to use the same PDF file, download apache.pdf. Otherwise, assign the path of your own PDF file to fileName in the Java program.
Handling Empty Words and Extra Spaces in Extracted PDF Text
Some PDF files may contain multiple spaces, line breaks, tabs, or text fragments that produce empty words after splitting. If you want cleaner output, add a trim check before adding each word to the list.
String[] wordsInStream = str.split(getWordSeparator());
for (String word : wordsInStream) {
    word = word.trim();
    if (!word.isEmpty()) {
        words.add(word);
    }
}
You can also replace getWordSeparator() with a regular expression such as "\\s+" when you want to split by one or more whitespace characters.
String[] wordsInStream = str.split("\\s+");
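To see how the two separators differ, here is a small sketch that uses only java.lang.String. The class and method names are hypothetical, and the single space passed to split() stands in for the default value returned by getWordSeparator().

```java
// Hypothetical comparison helper, not part of PDFBox.
class SplitComparison {

    // Splits on exactly one space, as with the default word separator.
    static String[] bySingleSpace(String line) {
        return line.split(" ");
    }

    // Splits on one or more whitespace characters (spaces, tabs, line breaks).
    static String[] byWhitespaceRun(String line) {
        return line.split("\\s+");
    }
}
```

For the input "The  Apache\tWay" (two spaces, then a tab), bySingleSpace() returns three tokens: "The", an empty string, and "Apache\tWay" with the tab still inside. byWhitespaceRun() returns "The", "Apache", and "Way". Even with "\\s+", a string that starts with whitespace produces an empty first token, so the isEmpty() check is still worth keeping.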
Why PDFBox May Not Extract Words from Some PDFs
If the program prints no words, or the output looks incomplete, the issue is usually related to the PDF structure rather than the Java loop. Common reasons include scanned pages, unusual font encoding, protected files, text stored as vector drawings, or complex layouts such as tables and multi-column documents.
| Issue in PDF | What happens during word extraction | Possible fix |
|---|---|---|
| Scanned PDF | No embedded text is available for PDFBox to read. | Run OCR first, then extract words from the searchable PDF. |
| Protected PDF | Text extraction may fail or return limited content. | Use a PDF that allows text extraction. |
| Custom font encoding | Extracted words may contain incorrect characters. | Test with another PDF or inspect the embedded fonts. |
| Multi-column layout | Words may appear in an unexpected order. | Use setSortByPosition(true) and validate the output. |
| Tables and forms | Text may be extracted as fragments instead of clean rows. | Use additional layout logic if row or column structure matters. |
Extracting PDF Words vs Converting PDF Text to Word
This tutorial extracts words programmatically from a PDF file. It does not recreate the PDF layout, formatting, tables, or images in a Word document. If your requirement is to convert PDF text to Word format, you need a PDF-to-Word conversion tool or a separate document-generation step after extracting the text.
For Java applications, word extraction is useful when you want to index PDF content, count words, search for keywords, analyze text, or feed the extracted words into another processing pipeline.
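As one illustration of such a pipeline, the sketch below counts how often each word occurs in a list of extracted words. It is an assumption-laden example rather than part of the tutorial's program: the class name WordFrequency is hypothetical, words are compared case-insensitively, and a TreeMap is used only so that the result is sorted alphabetically.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical post-processing step, not part of PDFBox.
class WordFrequency {

    // Counts occurrences of each word, ignoring letter case.
    static Map<String, Integer> count(List<String> words) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : words) {
            String key = word.toLowerCase(Locale.ROOT);
            // Insert 1 for a new word, otherwise add 1 to the existing count
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }
}
```

After writeText() has finished, the list filled by writeString() could be passed to WordFrequency.count() to get the totals per word.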
Best Practices for Extracting Words from PDF in Java
- Always close the PDDocument after processing the PDF file.
- Use setSortByPosition(true) when reading documents where visual reading order matters.
- Trim each extracted word before storing or printing it.
- Skip empty strings after splitting text.
- Test the code with different PDF types, including simple text PDFs, tables, and multi-column documents.
- Use OCR first if the PDF is scanned and does not contain selectable text.
FAQs on Extracting Words from PDF using PDFBox
Why can’t I extract words from a PDF using PDFBox?
You may not be able to extract words if the PDF is scanned, protected, or does not contain embedded text. PDFBox reads text objects from the PDF. If the page is only an image, OCR is required before extracting words.
Can PDFBox extract text from scanned PDF files?
PDFBox can extract embedded text from a PDF, but it does not perform OCR by itself. For scanned PDFs, first convert the scanned pages into searchable text using OCR, and then use PDFBox to extract words.
How do I extract words from all pages of a PDF in Java?
Set the start page and end page on PDFTextStripper, then call writeText(). To process every page, set the start page to 1 (page numbers are 1-based) and the end page to document.getNumberOfPages().
Why is the extracted word order different from the PDF display order?
PDF files store text based on internal drawing instructions, not always in normal reading order. Calling setSortByPosition(true) can improve the order, but complex layouts may still need custom handling.
Can I extract word coordinates from a PDF using PDFBox?
Yes. The writeString() method also receives a list of TextPosition objects. You can use those objects to inspect character positions and build word-level coordinate logic.
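The word-level coordinate logic can be sketched as follows. To keep the example runnable without PDFBox, a hypothetical Glyph class stands in for TextPosition. In a real stripper you would fill it from TextPosition.getUnicode(), getXDirAdj(), getYDirAdj(), and getWidthDirAdj(). The sketch assumes that the glyphs arrive in reading order on one line and that whitespace glyphs mark word boundaries, which depends on how the PDF encodes spaces.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of PDFBox: groups glyphs into word boxes.
class WordBoxes {

    // Stand-in for the values read from one TextPosition
    static class Glyph {
        final String text;
        final float x, y, width;
        Glyph(String text, float x, float y, float width) {
            this.text = text; this.x = x; this.y = y; this.width = width;
        }
    }

    // A word with the position of its first glyph and its total width
    static class WordBox {
        final String word;
        final float x, y, width;
        WordBox(String word, float x, float y, float width) {
            this.word = word; this.x = x; this.y = y; this.width = width;
        }
    }

    static List<WordBox> group(List<Glyph> glyphs) {
        List<WordBox> boxes = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        float startX = 0, startY = 0, endX = 0;
        for (Glyph g : glyphs) {
            if (g.text.trim().isEmpty()) {
                // A whitespace glyph closes the word collected so far
                if (current.length() > 0) {
                    boxes.add(new WordBox(current.toString(), startX, startY, endX - startX));
                    current.setLength(0);
                }
                continue;
            }
            if (current.length() == 0) {
                // First glyph of a new word fixes the word's origin
                startX = g.x;
                startY = g.y;
            }
            current.append(g.text);
            endX = g.x + g.width; // right edge of the latest glyph
        }
        if (current.length() > 0) {
            boxes.add(new WordBox(current.toString(), startX, startY, endX - startX));
        }
        return boxes;
    }
}
```

A gap-based rule (starting a new word when the horizontal distance between two glyphs exceeds some threshold) could replace the whitespace check for PDFs that do not contain space glyphs.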
QA Checklist for PDFBox Word Extraction Tutorial
- Confirm that the Java example uses a text-based PDF, not a scanned image-only PDF.
- Check that PDDocument is closed after text extraction.
- Verify that empty strings are handled if the output is used in production code.
- Test whether setSortByPosition(true) improves word order for the sample PDF.
- Confirm that OCR is mentioned when explaining why text cannot be extracted from scanned PDFs.
Extract Words from PDF with PDFBox: Summary
In this Apache PDFBox Tutorial, we have learnt to extract words from a PDF by extending PDFTextStripper and overriding writeString(). You may also refer to the tutorial on extracting coordinates or positions of characters in a PDF if you want to work with character positions in a PDF document.
TutorialKart.com