Read All Text from PDF Document using PDFBox 2.0

In this tutorial, we will learn how to read all the text from a PDF document using the PDFBox 2.0 library in a Java program.

A PDF document may contain text, embedded images, and other content. The PDFTextStripper class in PDFBox provides methods to extract all the text from a PDF document.

Apache PDFBox is a Java library used to create, read, and work with PDF documents. For text extraction, the commonly used class is org.apache.pdfbox.text.PDFTextStripper. It reads the text content that is already available in the PDF page content stream. If the PDF is a scanned image, PDFTextStripper cannot perform OCR by itself.

You can also refer to the official PDFBox documentation for PDFTextStripper and the PDFBox 2.0 command line tools when you want to compare Java code with built-in PDFBox utilities.

When PDFBox 2.0 can extract text from a PDF

PDFBox can extract text when the PDF contains selectable text. This is common in PDFs generated from Word documents, HTML pages, reports, invoices, or other digital sources. The extracted text may not always look exactly like the visual layout because a PDF stores text by drawing positions, not as a simple paragraph document.

  • Use PDFTextStripper for normal searchable PDFs.
  • Use page range settings when you need text only from selected pages.
  • Use sorting by position when layout order matters for columns or tables.
  • Use an OCR tool before PDFBox when the PDF contains only scanned images.

PDFBox 2.0 Maven dependency for reading PDF text

If PDFBox is not already added to your Java project, include the PDFBox dependency in your Maven pom.xml. Use the latest compatible PDFBox 2.0.x version approved for your project.

<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.x</version>
</dependency>

For a Gradle project, add the PDFBox dependency in the dependencies block.

implementation 'org.apache.pdfbox:pdfbox:2.0.x'

Steps to Extract All Text from PDF using PDFBox 2.0

The following steps extract the text from a PDF document.

Step 1: Load PDF into PDDocument

Load the PDF file into a PDDocument object:

PDDocument doc = PDDocument.load(new File("sample.pdf"));

PDDocument represents the loaded PDF file. In production code, close the document after reading it. The simplest way is to use a try-with-resources block so that the PDF file is closed even if an exception occurs.

Step 2: Extract text using PDFTextStripper.getText

Get the text from doc using PDFTextStripper:

String text = new PDFTextStripper().getText(doc);

PDFTextStripper does not preserve the formatting or exact placement of text chunks in the PDF document. It strips out the text from all the pages, and getText() returns that text as a single String.

For basic text extraction, this is enough. If you need text from only a few pages, or if the page has columns, headings, or table-like alignment, configure the PDFTextStripper object before calling getText().

Example 1 – Read All Text from PDF

In this example, we will load a PDF file and read all the text present in it using PDFTextStripper.

ExtractText.java

import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class ExtractText {

	public static void main(String[] args) {
		try {
			PDDocument doc = PDDocument.load(new File("sample.pdf"));
			// getText() returns the text of every page as one String
			String text = new PDFTextStripper().getText(doc);
			System.out.println("Text in PDF\n---------------------------------");
			System.out.println(text);
			doc.close(); // release the file once the text has been read
		} catch (IOException e) {
			e.printStackTrace();
		}
	}
}

Output

Text in PDF
---------------------------------
This is a sample PDF.

The PDF file used in this example is sample.pdf.

Safer PDFBox 2.0 Java example using try-with-resources

The previous example shows the basic idea. In a real Java program, prefer try-with-resources so that the loaded PDF document is closed automatically after text extraction.

import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class ExtractTextSafely {

    public static void main(String[] args) {
        File file = new File("sample.pdf");

        try (PDDocument document = PDDocument.load(file)) {
            PDFTextStripper stripper = new PDFTextStripper();
            String text = stripper.getText(document);
            System.out.println(text);
        } catch (IOException e) {
            System.err.println("Could not read text from PDF: " + e.getMessage());
        }
    }
}

This version is better for file handling because the PDDocument resource is closed automatically. It also separates the PDFTextStripper object into its own variable, which makes it easier to add page range and sorting options later.

Extract text from selected PDF pages using PDFTextStripper

Sometimes you may not want text from every page. PDFTextStripper allows you to set the starting and ending page before calling getText(). Page numbers start from 1, not 0.

import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class ExtractTextFromPages {

    public static void main(String[] args) {
        try (PDDocument document = PDDocument.load(new File("sample.pdf"))) {
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setStartPage(1);
            stripper.setEndPage(2);

            String text = stripper.getText(document);
            System.out.println(text);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

This approach is useful when you want to extract text from the first page of a report, a fixed page range in a PDF form, or a small section of a large document.
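Because page numbers are 1-based, a requested range that comes from user input may need adjusting before it is passed to the stripper. The following is a minimal sketch of one way to do that. PageRange and clamp are hypothetical names introduced here for illustration and are not part of PDFBox.

```java
/** Hypothetical helper (not part of PDFBox): a 1-based, inclusive page range. */
final class PageRange {
    final int start;
    final int end;

    private PageRange(int start, int end) {
        this.start = start;
        this.end = end;
    }

    /**
     * Fits the requested range into [1, pageCount].
     * A start below 1 is raised to 1 and an end above pageCount is lowered to pageCount.
     * Throws IllegalArgumentException if no page is left in the range.
     */
    static PageRange clamp(int requestedStart, int requestedEnd, int pageCount) {
        int start = Math.max(1, requestedStart);
        int end = Math.min(pageCount, requestedEnd);
        if (start > end) {
            throw new IllegalArgumentException(
                    "No pages in range " + requestedStart + "-" + requestedEnd
                    + " for a document with " + pageCount + " page(s)");
        }
        return new PageRange(start, end);
    }
}
```

With a loaded PDDocument, the page count could be taken from document.getNumberOfPages(), and the result applied with stripper.setStartPage(range.start) and stripper.setEndPage(range.end).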

Improve reading order for PDF text with columns or positioned text

PDF files may store text in a different order from what you see on the page. For documents with columns, tables, or text placed at different coordinates, try enabling position-based sorting.

PDFTextStripper stripper = new PDFTextStripper();
stripper.setSortByPosition(true);
String text = stripper.getText(document);

This does not convert a PDF table into a structured table. It only asks PDFBox to sort extracted text by page position. For complex layouts, you may need custom parsing logic, coordinates, or a separate PDF table extraction strategy.
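As one illustration of what custom parsing logic can look like, the sketch below searches the extracted text for a line of the form "label: value". It works on the String returned by getText() and uses only the Java standard library. LabelledValueParser, findValue, and the sample labels are hypothetical, and real documents may need a different pattern.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of PDFBox) for picking labelled values out of extracted text. */
final class LabelledValueParser {

    /**
     * Finds the first line that starts with "<label>:" (ignoring case and
     * surrounding whitespace) and returns the trimmed value after the colon.
     */
    static Optional<String> findValue(String extractedText, String label) {
        Pattern pattern = Pattern.compile(
                "^\\s*" + Pattern.quote(label) + "\\s*:\\s*(.+?)\\s*$",
                Pattern.MULTILINE | Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(extractedText);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }
}
```

For example, findValue(text, "Total") returns the value from a line such as "Total: 42.50" and skips a line such as "Subtotal: 40.00", because the label must appear at the start of the line.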

Why PDFBox does not extract text from scanned PDF files

A scanned PDF often contains images of pages instead of real text. In that case, PDFTextStripper may return an empty string or very little text because there is no selectable text layer to extract. To read text from a scanned PDF, first run OCR with an OCR tool, then use PDFBox on the searchable PDF output if required.

  • If you can select and copy text from the PDF viewer, PDFBox can usually extract it.
  • If the PDF page behaves like a photo, OCR is required before text extraction.
  • If only some pages are scanned, PDFBox may extract text from searchable pages and return little or nothing for scanned pages.
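To find out which pages of a mixed document may need OCR, one option is to extract the text page by page and look for pages that return nothing. The sketch below assumes the per-page text has already been collected into a list, for example by calling getText() once per page with setStartPage(n) and setEndPage(n) set to the same page number. ScannedPageCheck and pagesWithoutText are hypothetical names, and a page without text may also simply be a blank page.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of PDFBox) for spotting pages that returned no text. */
final class ScannedPageCheck {

    /**
     * Returns the 1-based numbers of the pages whose extracted text is
     * null, empty, or whitespace only. pageTexts.get(0) is the text of page 1.
     */
    static List<Integer> pagesWithoutText(List<String> pageTexts) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            String text = pageTexts.get(i);
            if (text == null || text.trim().isEmpty()) {
                result.add(i + 1); // convert list index to 1-based page number
            }
        }
        return result;
    }
}
```

The returned page numbers are candidates for OCR. Pages that are not in the list already produced some text.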

Common PDFBox text extraction issues and fixes

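  • Empty output: Check whether the PDF is scanned. An encrypted PDF that needs a password usually fails with an exception when it is loaded, instead of returning empty text.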
  • Text appears in wrong order: Try setSortByPosition(true) and inspect the PDF layout.
  • Missing spaces: PDF text positioning can affect spacing; test with the actual PDFs used by your application.
  • Tables are not preserved: PDFTextStripper extracts text, not table structure.
  • Large PDF is slow: Extract only the required page range when possible.
  • File remains locked: Use try-with-resources or close the PDDocument after reading.

PDFBox command line option for quick text extraction testing

Before debugging Java code, you can also test whether PDFBox can read the text from a PDF using the PDFBox command line tools. The exact command depends on how PDFBox is installed in your environment.

java -jar pdfbox-app-2.0.x.jar ExtractText sample.pdf sample.txt

If the command line tool also produces empty output, the issue is likely with the PDF content itself, such as scanned pages or missing text layer, rather than your Java code.

QA checklist for this PDFBox text extraction program

  • Test with a normal searchable PDF and confirm that text is printed.
  • Test with a scanned PDF and confirm that the program handles empty output safely.
  • Check whether the PDF document is closed after extraction.
  • Use setStartPage() and setEndPage() when only a page range is needed.
  • Try setSortByPosition(true) for column-based or layout-heavy PDFs.
  • Do not expect PDFTextStripper to preserve exact fonts, colors, table borders, or visual formatting.

FAQs on reading PDF text using PDFBox 2.0

Which PDFBox class is used to extract text from a PDF in Java?

The commonly used class is org.apache.pdfbox.text.PDFTextStripper. Load the PDF into PDDocument, create a PDFTextStripper object, and call getText(document).

Can PDFBox extract text from a scanned PDF?

PDFBox does not perform OCR through PDFTextStripper. If the PDF contains only scanned page images, run OCR first and then extract text from the searchable PDF output.

How do I extract text from only one page using PDFBox?

Use setStartPage(pageNumber) and setEndPage(pageNumber) on the PDFTextStripper object before calling getText(). PDFBox page numbers for this API start at 1.

Why is the text extracted by PDFBox not in the same order as on the PDF page?

A PDF stores text using drawing positions. The stored order may differ from the visual reading order. Try setSortByPosition(true), but complex layouts may still need custom parsing.

Does PDFTextStripper preserve PDF formatting?

No. PDFTextStripper is mainly for text extraction. It does not preserve exact fonts, colors, images, borders, or table structure from the PDF page.

Conclusion: reading all PDF text with PDFBox 2.0

In this PDFBox tutorial, we have learned how to read all the text from a PDF document using PDFBox 2.0. The main steps are to load the PDF using PDDocument, extract text using PDFTextStripper, and close the document after reading. For page-specific extraction, set the page range before calling getText(). For scanned PDFs, use OCR before attempting PDFBox text extraction.