Read All Text from PDF Document using PDFBox 2.0

In this tutorial, we will learn how to read all the text from a PDF document using the PDFBox 2.0 library in a Java program.

A PDF document may contain text, embedded images, and other content. The PDFTextStripper class in PDFBox provides methods to extract all the text from a PDF document.

Steps to Extract All Text from PDF

The following steps extract the text from a PDF document.

Step 1: Load PDF

Load the PDF file into a PDDocument. The document holds open resources, so close it once you are done with it.

PDDocument doc = PDDocument.load(new File("sample.pdf"));

Step 2: Use PDFTextStripper.getText method

Get the text from doc using a PDFTextStripper.

String text = new PDFTextStripper().getText(doc);

PDFTextStripper ignores the formatting and placement of text chunks in the PDF document; it simply strips the text out of all the pages. getText returns that text as a single String.
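Because getText returns one plain String, any structure you need afterwards has to be recovered from that string. The following is a minimal sketch, assuming the text has already been extracted. The class ExtractedTextLines and the method nonBlankLines are hypothetical helpers written for illustration and are not part of PDFBox. It splits the text into trimmed, non-blank lines using only the Java standard library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: post-processes already extracted text.
final class ExtractedTextLines {

    // Splits text (for example, the String returned by getText) into trimmed, non-blank lines.
    static List<String> nonBlankLines(String text) {
        List<String> lines = new ArrayList<>();
        // \R matches any line break (\n, \r\n or \r), so the platform separator does not matter.
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                lines.add(trimmed);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Stand-in value for the text returned by new PDFTextStripper().getText(doc).
        String text = "This is a sample PDF.\r\n\r\n  Second line.  \n";
        for (String line : nonBlankLines(text)) {
            System.out.println(line);
        }
    }
}
```

In a real program, the stand-in string in main would be replaced by the value returned from getText in Step 2.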


Example 1 – Read All Text from PDF

In this example, we take a PDF file and read all the text in it using PDFTextStripper.

ExtractText.java

import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class ExtractText {

	public static void main(String[] args) {
		// try-with-resources closes the document even if extraction fails
		try (PDDocument doc = PDDocument.load(new File("sample.pdf"))) {
			// Extract the text of every page as one String
			String text = new PDFTextStripper().getText(doc);
			System.out.println("Text in PDF\n---------------------------------");
			System.out.println(text);
		} catch (IOException e) {
			e.printStackTrace();
		}
	}
}

Output

Text in PDF
---------------------------------
This is a sample PDF.

The PDF file used in this example is sample.pdf.

Conclusion

In this PDFBox tutorial, we learned how to read all the text from a PDF document using PDFBox 2.0.