Read All Text from PDF Document using PDFBox 2.0
In this tutorial, we shall learn to read all the text from a PDF document using the PDFBox 2.0 library in a Java program.
A PDF document may contain text, embedded images, and other content. The PDFTextStripper class in PDFBox provides methods to extract all the text from a PDF document.
Apache PDFBox is a Java library used to create, read, and work with PDF documents. For text extraction, the commonly used class is org.apache.pdfbox.text.PDFTextStripper. It reads the text content that is already available in the PDF page content stream. If the PDF is a scanned image, PDFTextStripper cannot perform OCR by itself.
You can also refer to the official PDFBox documentation for PDFTextStripper and the PDFBox 2.0 command line tools when you want to compare Java code with built-in PDFBox utilities.
When PDFBox 2.0 can extract text from a PDF
PDFBox can extract text when the PDF contains selectable text. This is common in PDFs generated from Word documents, HTML pages, reports, invoices, or other digital sources. The extracted text may not always look exactly like the visual layout because a PDF stores text by drawing positions, not as a simple paragraph document.
- Use PDFTextStripper for normal searchable PDFs.
- Use page range settings when you need text only from selected pages.
- Use sorting by position when layout order matters for columns or tables.
- Use an OCR tool before PDFBox when the PDF contains only scanned images.
PDFBox 2.0 Maven dependency for reading PDF text
If PDFBox is not already added to your Java project, include the PDFBox dependency in your Maven pom.xml. Use the latest compatible PDFBox 2.0.x version approved for your project.
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.x</version>
</dependency>
For a Gradle project, add the PDFBox dependency in the dependencies block.
implementation 'org.apache.pdfbox:pdfbox:2.0.x'
Steps to Extract All Text from PDF using PDFBox 2.0
The following steps extract the text from a PDF document.
Step 1: Load PDF into PDDocument
Load the PDF file into a PDDocument.
PDDocument doc = PDDocument.load(new File("sample.pdf"));
PDDocument represents the loaded PDF file. In production code, close the document after reading it. The simplest way is to use a try-with-resources block so that the PDF file is closed even if an exception occurs.
Step 2: Extract text using PDFTextStripper.getText
Get the text from doc using PDFTextStripper.
String text = new PDFTextStripper().getText(doc);
By default, PDFTextStripper does not preserve the formatting or the visual placement of text chunks in the PDF document. It returns the text from all the pages of the document as plain text.
getText() returns the text of the PDF document as a String.
For basic text extraction, this is enough. If you need text from only a few pages, or if the page has columns, headings, or table-like alignment, configure the PDFTextStripper object before calling getText().
Example 1 – Read All Text from PDF
In this example, we take a PDF file and read all the text in it using PDFTextStripper.
ExtractText.java
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
public class ExtractText {
    public static void main(String[] args) {
        try {
            // Load the PDF file into a PDDocument
            PDDocument doc = PDDocument.load(new File("sample.pdf"));
            // Extract the text from all pages
            String text = new PDFTextStripper().getText(doc);
            // Release the resources held by the document
            doc.close();
            System.out.println("Text in PDF\n---------------------------------");
            System.out.println(text);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Output
Text in PDF
------------------
This is a sample PDF.
The PDF file used in this example is sample.pdf.
Safer PDFBox 2.0 Java example using try-with-resources
The previous example shows the basic idea. In a real Java program, prefer try-with-resources so that the loaded PDF document is closed automatically after text extraction.
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
public class ExtractTextSafely {
    public static void main(String[] args) {
        File file = new File("sample.pdf");
        try (PDDocument document = PDDocument.load(file)) {
            PDFTextStripper stripper = new PDFTextStripper();
            String text = stripper.getText(document);
            System.out.println(text);
        } catch (IOException e) {
            System.err.println("Could not read text from PDF: " + e.getMessage());
        }
    }
}
This version is better for file handling because the PDDocument resource is closed automatically. It also separates the PDFTextStripper object into its own variable, which makes it easier to add page range and sorting options later.
Extract text from selected PDF pages using PDFTextStripper
Sometimes you may not want text from every page. PDFTextStripper allows you to set the starting and ending page before calling getText(). Page numbers start from 1, not 0.
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
public class ExtractTextFromPages {
    public static void main(String[] args) {
        try (PDDocument document = PDDocument.load(new File("sample.pdf"))) {
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setStartPage(1);
            stripper.setEndPage(2);
            String text = stripper.getText(document);
            System.out.println(text);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
This approach is useful when you want to extract text from the first page of a report, a fixed page range in a PDF form, or a small section of a large document.
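The same two setters can also be used to collect the text of each page separately. The following is a minimal sketch, assuming the file sample.pdf from the earlier examples; the class name ExtractTextPerPage and the helper method textPerPage are illustrative names, not part of PDFBox. It sets the start page and the end page to the same number on every pass of the loop.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

class ExtractTextPerPage {

    // Illustrative helper: returns one string per page, in page order.
    static List<String> textPerPage(PDDocument document) throws IOException {
        PDFTextStripper stripper = new PDFTextStripper();
        List<String> pages = new ArrayList<>();
        // Page numbers for setStartPage and setEndPage start at 1.
        for (int page = 1; page <= document.getNumberOfPages(); page++) {
            stripper.setStartPage(page);
            stripper.setEndPage(page);
            pages.add(stripper.getText(document));
        }
        return pages;
    }

    public static void main(String[] args) {
        try (PDDocument document = PDDocument.load(new File("sample.pdf"))) {
            List<String> pages = textPerPage(document);
            for (int i = 0; i < pages.size(); i++) {
                int length = pages.get(i).trim().length();
                System.out.println("Page " + (i + 1) + ": " + length + " characters");
            }
        } catch (IOException e) {
            System.err.println("Could not read text from PDF: " + e.getMessage());
        }
    }
}
```

Keeping the text per page makes it easier to spot which pages return little or no text, at the cost of one getText() call per page.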
Improve reading order for PDF text with columns or positioned text
PDF files may store text in a different order from what you see on the page. For documents with columns, tables, or text placed at different coordinates, try enabling position-based sorting.
PDFTextStripper stripper = new PDFTextStripper();
stripper.setSortByPosition(true);
String text = stripper.getText(document);
This does not convert a PDF table into a structured table. It only asks PDFBox to sort extracted text by page position. For complex layouts, you may need custom parsing logic, coordinates, or a separate PDF table extraction strategy.
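To make the idea of sorting by page position concrete, here is a simplified illustration in plain Java. It is not PDFBox's internal algorithm and it does not use any PDFBox class; the class ReadingOrderSketch, the Chunk type, and the coordinates are all hypothetical. It only shows how text stored in one order can be rearranged top to bottom and then left to right.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Simplified illustration only: this is not how PDFBox sorts internally.
final class ReadingOrderSketch {

    // Hypothetical value type: a piece of text and the position where it is drawn.
    static final class Chunk {
        final String text;
        final double x;        // distance from the left edge of the page
        final double yFromTop; // distance from the top edge of the page

        Chunk(String text, double x, double yFromTop) {
            this.text = text;
            this.x = x;
            this.yFromTop = yFromTop;
        }
    }

    // Returns the chunk texts ordered top to bottom, then left to right.
    static List<String> inReadingOrder(List<Chunk> storedOrder) {
        List<Chunk> sorted = new ArrayList<>(storedOrder);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.yFromTop)
                .thenComparingDouble(c -> c.x));
        List<String> texts = new ArrayList<>();
        for (Chunk c : sorted) {
            texts.add(c.text);
        }
        return texts;
    }
}
```

For example, chunks stored as "Total" at (50, 200), "Invoice" at (50, 100), "42.00" at (300, 200), and "No. 7" at (300, 100) come back as Invoice, No. 7, Total, 42.00. Real pages are harder because text on the same visual line rarely shares exactly the same vertical coordinate.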
Why PDFBox does not extract text from scanned PDF files
A scanned PDF often contains images of pages instead of real text. In that case, PDFTextStripper may return an empty string or very little text because there is no selectable text layer to extract. To read text from a scanned PDF, first run OCR with an OCR tool, then use PDFBox on the searchable PDF output if required.
- If you can select and copy text from the PDF viewer, PDFBox can usually extract it.
- If the PDF page behaves like a photo, OCR is required before text extraction.
- If only some pages are scanned, PDFBox may extract text from searchable pages and return little or nothing for scanned pages.
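A program can check the extracted text before using it, so that a scanned PDF is reported instead of silently producing blank output. The following is a minimal sketch in plain Java; the class ExtractedTextCheck and the method hasNoReadableText are illustrative names, and treating "no letters or digits" as the sign of a missing text layer is an assumption that may need tuning for your documents.

```java
// Illustrative helper, not part of PDFBox.
final class ExtractedTextCheck {

    // Returns true when the text is null or contains no letters or digits,
    // which is what a page without a text layer typically produces.
    static boolean hasNoReadableText(String extractedText) {
        if (extractedText == null) {
            return true;
        }
        for (int i = 0; i < extractedText.length(); ) {
            int codePoint = extractedText.codePointAt(i);
            if (Character.isLetterOrDigit(codePoint)) {
                return false;
            }
            i += Character.charCount(codePoint);
        }
        return true;
    }
}
```

Pass the string returned by getText() to this method; when it returns true, the program can log that OCR is probably needed rather than continuing with empty text.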
Common PDFBox text extraction issues and fixes
- Empty output: Check whether the PDF is scanned or password-protected.
- Text appears in wrong order: Try setSortByPosition(true) and inspect the PDF layout.
- Missing spaces: PDF text positioning can affect spacing; test with the actual PDFs used by your application.
- Tables are not preserved: PDFTextStripper extracts text, not table structure.
- Large PDF is slow: Extract only the required page range when possible.
- File remains locked: Use try-with-resources or close the PDDocument after reading.
PDFBox command line option for quick text extraction testing
Before debugging Java code, you can also test whether PDFBox can read the text from a PDF using the PDFBox command line tools. The exact command depends on how PDFBox is installed in your environment.
java -jar pdfbox-app-2.0.x.jar ExtractText sample.pdf sample.txt
If the command line tool also produces empty output, the issue is likely with the PDF content itself, such as scanned pages or missing text layer, rather than your Java code.
QA checklist for this PDFBox text extraction program
- Test with a normal searchable PDF and confirm that text is printed.
- Test with a scanned PDF and confirm that the program handles empty output safely.
- Check whether the PDF document is closed after extraction.
- Use setStartPage() and setEndPage() when only a page range is needed.
- Try setSortByPosition(true) for column-based or layout-heavy PDFs.
- Do not expect PDFTextStripper to preserve exact fonts, colors, table borders, or visual formatting.
FAQs on reading PDF text using PDFBox 2.0
Which PDFBox class is used to extract text from a PDF in Java?
The commonly used class is org.apache.pdfbox.text.PDFTextStripper. Load the PDF into PDDocument, create a PDFTextStripper object, and call getText(document).
Can PDFBox extract text from a scanned PDF?
PDFBox does not perform OCR through PDFTextStripper. If the PDF contains only scanned page images, run OCR first and then extract text from the searchable PDF output.
How do I extract text from only one page using PDFBox?
Use setStartPage(pageNumber) and setEndPage(pageNumber) on the PDFTextStripper object before calling getText(). PDFBox page numbers for this API start at 1.
Why is PDFBox extracted text not in the same order as the PDF page?
A PDF stores text using drawing positions. The stored order may differ from the visual reading order. Try setSortByPosition(true), but complex layouts may still need custom parsing.
Does PDFTextStripper preserve PDF formatting?
No. PDFTextStripper is mainly for text extraction. It does not preserve exact fonts, colors, images, borders, or table structure from the PDF page.
Conclusion: reading all PDF text with PDFBox 2.0
In this PDFBox tutorial, we have learnt to read all the text from a PDF document using PDFBox 2.0. The main steps are to load the PDF using PDDocument, extract text using PDFTextStripper, and close the document after reading. For page-specific extraction, set the page range before calling getText(). For scanned PDFs, use OCR before attempting PDFBox text extraction.
TutorialKart.com