How to Read All the Text from a PDF Document using PDFBox 2.0

Read All Text from PDF Document using PDFBox 2.0

In this tutorial, we will learn how to read all the text from a PDF document using the PDFBox 2.0 library in a Java program.

A PDF document may contain text, embedded images, and other content. The PDFTextStripper class in PDFBox provides methods to extract all the text from a PDF document.

Steps to Extract All Text from PDF

The following steps extract the text from a PDF document.

Step 1: Load PDF

Load the PDF file into a PDDocument. The document holds open resources, so close it when you are done with it.

PDDocument doc = PDDocument.load(new File("sample.pdf"));

Step 2: Use PDFTextStripper.getText method

Get the text from doc using PDFTextStripper

String text = new PDFTextStripper().getText(doc);

PDFTextStripper does not preserve the formatting or the exact placement of text chunks in the PDF document. It strips out the plain text from all the pages of the document.
getText returns that text as a single String.
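
By default the stripper covers every page. As a hedged sketch of a few settings that PDFTextStripper exposes in PDFBox 2.0, the following code limits extraction to a page range with setStartPage and setEndPage (both 1-based and inclusive) and enables setSortByPosition, which asks the stripper to order text by its position on the page. The class name ExtractPageRange and the helper textOfPages are hypothetical names chosen for illustration, and sample.pdf is assumed to exist in the working directory.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

final class ExtractPageRange {

    // Hypothetical helper (not part of PDFBox): returns the text of pages
    // firstPage..lastPage, where both bounds are 1-based and inclusive.
    static String textOfPages(PDDocument doc, int firstPage, int lastPage) throws IOException {
        PDFTextStripper stripper = new PDFTextStripper();
        stripper.setSortByPosition(true); // order text by position on the page
        stripper.setStartPage(firstPage);
        stripper.setEndPage(lastPage);
        return stripper.getText(doc);
    }

    public static void main(String[] args) throws IOException {
        // Assumes sample.pdf exists; try-with-resources closes the document.
        try (PDDocument doc = PDDocument.load(new File("sample.pdf"))) {
            System.out.println(textOfPages(doc, 1, 1));
        }
    }
}
```

Creating a new PDFTextStripper for each call keeps the page range settings from leaking between extractions.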

Example 1 – Read All Text from PDF

In this example, we will take a PDF and read all the text present in PDF using PDFTextStripper.

ExtractText.java

import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class ExtractText {

	public static void main(String[] args) {
		// try-with-resources closes the document even if extraction fails
		try (PDDocument doc = PDDocument.load(new File("sample.pdf"))) {
			// extract the text of all pages as a single String
			String text = new PDFTextStripper().getText(doc);
			System.out.println("Text in PDF\n---------------------------------");
			System.out.println(text);
		} catch (IOException e) {
			e.printStackTrace();
		}
	}
}

Output

Text in PDF
---------------------------------
This is a sample PDF.

The PDF file used in this example is sample.pdf.
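
The extracted text comes back as one String with line breaks between the text lines. As a minimal sketch, using only the Java standard library, this is one way that String might be split into non-blank lines for further processing. The class and method names (ExtractedTextLines, nonBlankLines) are hypothetical and not part of PDFBox, and the sample string in main only stands in for the value returned by getText.

```java
import java.util.ArrayList;
import java.util.List;

final class ExtractedTextLines {

    // Hypothetical helper: splits extracted text on any line break and
    // returns the trimmed lines, skipping blank ones.
    static List<String> nonBlankLines(String text) {
        List<String> lines = new ArrayList<>();
        for (String line : text.split("\\R")) { // \R matches \n, \r\n, and \r
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                lines.add(trimmed);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Stand-in for the String returned by PDFTextStripper.getText
        String text = "Text line one\r\n\r\n  Text line two  \n";
        System.out.println(nonBlankLines(text)); // prints [Text line one, Text line two]
    }
}
```

Splitting on \R rather than a fixed "\n" avoids depending on which line separator the stripper used on a given platform.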

Conclusion

In this PDFBox tutorial, we learned how to read all the text from a PDF document using PDFBox 2.0.

TutorialKart
