Extract Images from PDF using PDFBox

We can extract images from a PDF using PDFBox.

In this tutorial, we shall learn to extract images from a PDF using PDFBox and save them to the local file system.

Steps to Extract Images from PDF using PDFBox

Following is a step-by-step process to extract images from a PDF using PDFBox.

1. Extend PDFStreamEngine

Create a Java class that extends PDFStreamEngine.

public class SaveImagesInPdf extends PDFStreamEngine

2. Call processPage()

For each page in the PDF document, call the method processPage(page).

for( PDPage page : document.getPages() ) {
	pageNum++;
	printer.processPage(page);
}

3. Override processOperator()

While processing a page, processPage() calls processOperator() for each operator in the page's content stream. We shall override processOperator().

@Override
protected void processOperator( Operator operator, List<COSBase> operands) throws IOException {
	. . .
}

4. Check for Image

When the operator is Do, its operand names an XObject in the page resources. Check if that XObject is an image object.

if( xobject instanceof PDImageXObject){
	. . .
}

5. Save the image locally

If the XObject is an image, get its BufferedImage and save it to a local file. PDImageXObject.getImage() returns a BufferedImage in an (A)RGB color space.

BufferedImage bImage = image.getImage();
ImageIO.write(bImage,"PNG",new File("image_name.png"));
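ImageIO.write() returns false, rather than throwing, when it cannot find a writer for the requested format, so a failed save can go unnoticed. The following is a minimal sketch of a helper that checks the result and writes into a chosen output directory. ImageSaver and savePng are hypothetical names introduced here for illustration; they are not part of PDFBox, and the helper uses only the JDK.

```java
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

import javax.imageio.ImageIO;

// Hypothetical helper (not part of PDFBox): saves a BufferedImage as PNG
// and fails loudly if ImageIO could not encode it.
final class ImageSaver {

    private ImageSaver() {
    }

    /**
     * Writes the image as a PNG file named fileName inside outputDir,
     * creating the directory first if it does not exist.
     */
    static Path savePng(BufferedImage image, Path outputDir, String fileName) throws IOException {
        Files.createDirectories(outputDir);
        Path target = outputDir.resolve(fileName);
        // ImageIO.write returns false when no appropriate writer is found.
        if (!ImageIO.write(image, "png", target.toFile())) {
            throw new IOException("No ImageIO writer could encode " + target);
        }
        return target;
    }
}
```

Inside processOperator(), a call such as ImageSaver.savePng(bImage, Paths.get("extracted"), "image_name.png") could then replace the direct ImageIO.write() call; the directory name "extracted" is only an example.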

Example 1 – Extract Images from PDF using PDFBox

In this example, we take a PDF and extract all of its images by overriding the processOperator() method of PDFStreamEngine.

SaveImagesInPdf.java

import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.contentstream.PDFStreamEngine;

import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.List;

import javax.imageio.ImageIO;

/**
 * This is an example of how to extract images from a PDF.
 */
public class SaveImagesInPdf extends PDFStreamEngine
{
	/**
	 * Default constructor. No operator handlers are registered here because
	 * processOperator() handles the "Do" operator directly.
	 */
	public SaveImagesInPdf()
	{
	}

	public int imageNumber = 1;

	/**
	 * @param args The command line arguments.
	 *
	 * @throws IOException If there is an error parsing the document.
	 */
	public static void main( String[] args ) throws IOException
	{
		PDDocument document = null;
		String fileName = "apache.pdf";
		try
		{
			document = PDDocument.load( new File(fileName) );
			SaveImagesInPdf printer = new SaveImagesInPdf();
			int pageNum = 0;
			for( PDPage page : document.getPages() )
			{
				pageNum++;
				System.out.println( "Processing page: " + pageNum );
				printer.processPage(page);
			}
		}
		finally
		{
			if( document != null )
			{
				document.close();
			}
		}
	}

	/**
	 * @param operator The operation to perform.
	 * @param operands The list of arguments.
	 *
	 * @throws IOException If there is an error processing the operation.
	 */
	@Override
	protected void processOperator( Operator operator, List<COSBase> operands) throws IOException
	{
		String operation = operator.getName();
		if( "Do".equals(operation) )
		{
			COSName objectName = (COSName) operands.get( 0 );
			PDXObject xobject = getResources().getXObject( objectName );
			if( xobject instanceof PDImageXObject)
			{
				PDImageXObject image = (PDImageXObject)xobject;

				// save image to a local file
				BufferedImage bImage = image.getImage();
				ImageIO.write(bImage,"PNG",new File("image_"+imageNumber+".png"));
				System.out.println("Image saved.");
				imageNumber++;

			}
			else if(xobject instanceof PDFormXObject)
			{
				PDFormXObject form = (PDFormXObject)xobject;
				showForm(form);
			}
		}
		else
		{
			super.processOperator( operator, operands);
		}
	}

}

Output

Processing page: 1
Image saved.
Image saved.
Image saved.
Processing page: 2
Image saved.
Image saved.
Processing page: 3
Processing page: 4
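The console output shows which page each image came from, but the file names image_1.png, image_2.png and so on do not. One possible refinement is to encode the page number in the file name. The sketch below assumes a hypothetical ImageFileNames helper and assumes the current page number is made available to processOperator(), for example through a field that main() updates before each processPage() call; neither exists in the example above.

```java
import java.util.Locale;

// Hypothetical helper (not part of PDFBox): builds zero-padded file names
// so that extracted images sort by page and then by position.
final class ImageFileNames {

    private ImageFileNames() {
    }

    /** Both numbers are 1-based, matching the counters used in the example. */
    static String forImage(int pageNumber, int imageNumber) {
        if (pageNumber < 1 || imageNumber < 1) {
            throw new IllegalArgumentException("page and image numbers start at 1");
        }
        // Locale.ROOT keeps the digits ASCII regardless of the default locale.
        return String.format(Locale.ROOT, "page_%03d_image_%03d.png", pageNumber, imageNumber);
    }
}
```

For instance, ImageFileNames.forImage(2, 1) returns "page_002_image_001.png".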

Download the PDF document apache.pdf if you would like to use the same file. Otherwise, set fileName in the Java program to the path of your own PDF file.
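Because the example saves a file every time a Do operator draws an image, an image that is drawn more than once (a logo repeated on every page, for example) is written more than once. If that is not wanted, one option is to fingerprint the pixels and skip images already seen. The following is a minimal sketch under that assumption; SeenImages is a hypothetical class that uses only the JDK, and hashing every pixel may be slow for very large images.

```java
import java.awt.image.BufferedImage;
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper (not part of PDFBox): remembers which images have
// already been seen, based on a SHA-256 hash of their size and pixels.
final class SeenImages {

    private final Set<String> fingerprints = new HashSet<>();

    /** Returns true only the first time an image with these pixels is seen. */
    boolean markIfNew(BufferedImage image) {
        return fingerprints.add(fingerprint(image));
    }

    /** Hashes width, height and every ARGB pixel value of the image. */
    static String fingerprint(BufferedImage image) {
        int width = image.getWidth();
        int height = image.getHeight();
        int[] argb = image.getRGB(0, 0, width, height, null, 0, width);
        ByteBuffer buffer = ByteBuffer.allocate(8 + argb.length * 4);
        buffer.putInt(width).putInt(height);
        for (int pixel : argb) {
            buffer.putInt(pixel);
        }
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            return Base64.getEncoder().encodeToString(sha256.digest(buffer.array()));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is not available", e);
        }
    }
}
```

In processOperator(), the save could then be wrapped in a check such as if (seen.markIfNew(bImage)) { ... }, where seen is a SeenImages field added to the class.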

Conclusion

In this Apache PDFBox Tutorial, we have learnt to extract images from a PDF using the PDFStreamEngine class and to save each image, returned as an (A)RGB BufferedImage, to a local file.