Category: PDFBox

How to extract words from PDF document

Apache PDFBox Tutorial – We shall learn how to extract words from a PDF document (from all the pages) using the writeText method of PDFTextStripper.

The class org.apache.pdfbox.text.PDFTextStripper (its package in PDFBox 2.0) strips out all of the text.

To extract words from a PDF document, we shall extend the PDFTextStripper class and override its writeString(String str, List<TextPosition> textPositions) method.

The first argument to the writeString method is a line of text. This line can be split into words using the word separator.

Extract words from pdf document

Following is a step-by-step process to extract words from a PDF:

  1. Extend PDFTextStripper

    Create a Java Class and extend it with PDFTextStripper.

  2. Call writeText method

    Set page boundaries (from first page to last page) to strip text and call the method writeText.

  3. Override writeString

    writeString method receives a line of text as the first argument. writeString method is called for each line of text in the PDF document.

  4. Get Words

    Split the string received by writeString method by word separator.

Example Java Program to extract words from PDF

Download the PDF document apache.pdf if you would like to use the same PDF file. Otherwise, assign your own PDF file path to fileName in the Java program.
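The four steps above can be sketched as follows. This is a minimal sketch, not the original listing: it assumes the PDFBox 2.0.x API (for example PDDocument.load), and the class name WordExtractor and the helper extractWords are hypothetical names chosen for illustration. The words are collected into a list instead of being printed inside writeString, so that the result can be inspected.

```java
import java.io.File;
import java.io.IOException;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

// Step 1: extend PDFTextStripper (WordExtractor is a hypothetical name).
class WordExtractor extends PDFTextStripper {
    private final List<String> words = new ArrayList<>();

    WordExtractor() throws IOException {
        super();
    }

    // Step 3: writeString is called by writeText with a chunk of text.
    // Step 4: split that chunk on the word separator (a space by default).
    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException {
        for (String word : text.split(Pattern.quote(getWordSeparator()))) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
    }

    // Step 2: set the page boundaries and call writeText.
    static List<String> extractWords(PDDocument document) throws IOException {
        WordExtractor stripper = new WordExtractor();
        stripper.setSortByPosition(true);
        stripper.setStartPage(1); // page numbers are 1-based
        stripper.setEndPage(document.getNumberOfPages());
        // writeText requires a Writer; its output is not used in this sketch.
        stripper.writeText(document, new StringWriter());
        return stripper.words;
    }

    public static void main(String[] args) throws IOException {
        String fileName = args.length > 0 ? args[0] : "apache.pdf";
        try (PDDocument document = PDDocument.load(new File(fileName))) {
            for (String word : extractWords(document)) {
                System.out.println(word);
            }
        }
    }
}
```

Pass the PDF path as the first program argument, or place apache.pdf in the working directory.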

Conclusion :

In this Apache PDFBox Tutorial, we have learnt to extract words from a PDF. You may also refer to the tutorial on extracting the coordinates or position of characters in a PDF.

How to extract coordinates or position of characters in PDF – PDFBox

Apache PDFBox Tutorial – We shall learn how to extract coordinates or position of characters in PDF from all the pages using PDFTextStripper.

The class org.apache.pdfbox.text.PDFTextStripper (its package in PDFBox 2.0) strips out all of the text.

To extract the coordinates (location) and size of characters in a PDF, we shall extend the PDFTextStripper class and override its writeString(String string, List<TextPosition> textPositions) method.

TextPosition contains information regarding the character like its Unicode, X coordinate, Y coordinate, height, width, x-scaling value, y-scaling value, font size, space width, etc.

Extract coordinates or position of characters in pdf

Following is a step by step process to extract coordinates or position of characters in pdf :

  1. Extend PDFTextStripper

    Create a Java Class and extend it with PDFTextStripper.

  2. Call writeText method

    Set page boundaries (from first page to last page) to strip text and call the method writeText.

  3. Override writeString

    writeString method receives information about the text positions of characters in a stream. We shall override writeString method as shown below.

  4. Print Locations and Size

    For each item in list of TextPosition which is for an individual character, print the coordinates and size.

Example Java Program to extract coordinates or position of characters in PDF

Download the PDF document apache.pdf if you would like to use the same PDF file. Otherwise, assign your own PDF file path to fileName in the Java program.
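A minimal sketch of the steps above, assuming the PDFBox 2.0.x API. The class name CharacterPositions and the helpers collect and describe are hypothetical. The TextPosition getters used here (getUnicode, getXDirAdj, getYDirAdj, getFontSize, getXScale, getHeightDir, getWidthDirAdj, getWidthOfSpace) correspond to the properties listed earlier.

```java
import java.io.File;
import java.io.IOException;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

// Step 1: extend PDFTextStripper (CharacterPositions is a hypothetical name).
class CharacterPositions extends PDFTextStripper {
    private final List<TextPosition> positions = new ArrayList<>();

    CharacterPositions() throws IOException {
        super();
    }

    // Step 3: each TextPosition in the list describes one character.
    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException {
        positions.addAll(textPositions);
    }

    // Step 2: set the page boundaries and call writeText.
    static List<TextPosition> collect(PDDocument document) throws IOException {
        CharacterPositions stripper = new CharacterPositions();
        stripper.setSortByPosition(true);
        stripper.setStartPage(1);
        stripper.setEndPage(document.getNumberOfPages());
        stripper.writeText(document, new StringWriter());
        return stripper.positions;
    }

    // Step 4: format the location and size of a single character.
    static String describe(TextPosition tp) {
        return tp.getUnicode()
                + " [x=" + tp.getXDirAdj() + ", y=" + tp.getYDirAdj()
                + ", fontSize=" + tp.getFontSize() + ", xScale=" + tp.getXScale()
                + ", height=" + tp.getHeightDir() + ", width=" + tp.getWidthDirAdj()
                + ", spaceWidth=" + tp.getWidthOfSpace() + "]";
    }

    public static void main(String[] args) throws IOException {
        String fileName = args.length > 0 ? args[0] : "apache.pdf";
        try (PDDocument document = PDDocument.load(new File(fileName))) {
            for (TextPosition tp : collect(document)) {
                System.out.println(describe(tp));
            }
        }
    }
}
```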

Conclusion :

In this Apache PDFBox Tutorial, we have learnt to extract the coordinates or position of characters in a PDF document, along with their Unicode value, X coordinate, Y coordinate, height, width, x-scaling value, y-scaling value, font size, space width, etc.

How to extract images from pdf using PDFBox

In this Apache PDFBox Tutorial, we shall learn to extract images from a PDF using PDFBox and save the images to the local disk.

Extract images from pdf using PDFBox

Following is a step by step process to extract images from pdf using PDFBox :

  1. Extend PDFStreamEngine

    Create a Java Class and extend it with PDFStreamEngine.

  2. Call processPage()

    For each of the pages in PDF document, call the method processPage(page).

  3. Override processOperator()

    For each operator in the content stream of a PDF page, processOperator() is called from processPage(). We shall override processOperator().

  4. Check for Image

    Check whether the operator passed to processOperator() draws an XObject (the Do operator) and whether that XObject is an image.

  5. Save the image to local

    If the object is an image object, get the BufferedImage and save it to the local disk. PDImageXObject.getImage() returns the image as a BufferedImage in an (A)RGB color space.

Example Java program to extract images from pdf using PDFBox :

Download the PDF document apache.pdf if you would like to use the same PDF file. Otherwise, assign your own PDF file path to fileName in the Java program.
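The five steps above might look like the following sketch. It assumes the PDFBox 2.0.x API. The class name ImageExtractor, the helper extract and the output file names image_1.png, image_2.png and so on are hypothetical. Images nested inside form XObjects are reached by calling showForm.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import javax.imageio.ImageIO;

import org.apache.pdfbox.contentstream.PDFStreamEngine;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;

// Step 1: extend PDFStreamEngine (ImageExtractor is a hypothetical name).
class ImageExtractor extends PDFStreamEngine {
    private final List<BufferedImage> images = new ArrayList<>();

    // Step 3: called for every operator in the page's content stream.
    @Override
    protected void processOperator(Operator operator, List<COSBase> operands) throws IOException {
        boolean drawsXObject = "Do".equals(operator.getName())
                && !operands.isEmpty() && operands.get(0) instanceof COSName;
        if (!drawsXObject) {
            super.processOperator(operator, operands);
            return;
        }
        PDXObject xobject = getResources().getXObject((COSName) operands.get(0));
        if (xobject instanceof PDImageXObject) {
            // Step 4: the XObject is an image; decode it.
            images.add(((PDImageXObject) xobject).getImage());
        } else if (xobject instanceof PDFormXObject) {
            // A form may contain images of its own; process its stream too.
            showForm((PDFormXObject) xobject);
        }
    }

    // Step 2: call processPage for each page of the document.
    static List<BufferedImage> extract(PDDocument document) throws IOException {
        ImageExtractor engine = new ImageExtractor();
        for (PDPage page : document.getPages()) {
            engine.processPage(page);
        }
        return engine.images;
    }

    public static void main(String[] args) throws IOException {
        String fileName = args.length > 0 ? args[0] : "apache.pdf";
        try (PDDocument document = PDDocument.load(new File(fileName))) {
            int index = 1;
            for (BufferedImage image : extract(document)) {
                // Step 5: save each image to the local disk as a PNG file.
                ImageIO.write(image, "png", new File("image_" + index++ + ".png"));
            }
        }
    }
}
```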
 

Conclusion :

In this Apache PDFBox Tutorial, we have learnt to extract images from a PDF using the PDFStreamEngine class of PDFBox and to save each resulting BufferedImage to the local disk.

How to get co-ordinates or location and size of images in PDF using PDFBox

Apache PDFBox Tutorial – We shall learn how to get co-ordinates or location and size of images in pdf from all the pages using PDFStreamEngine.

The class org.apache.pdfbox.contentstream.PDFStreamEngine handles and executes some of the operations in processing a PDF document by providing a callback interface.

To get the co-ordinates (location) and size of images in a PDF, we shall extend the PDFStreamEngine class and override its processOperator(Operator operator, List<COSBase> operands) method.

COSBase is the base class that all objects in the PDF document will extend.

For each operator in a page's content stream, the above-mentioned method processOperator() is called from PDFStreamEngine.processPage(page). For each such operator, we shall check whether it draws an image object and, if so, get the image's properties such as its (X,Y) co-ordinates and size.

Get co-ordinates or location and size of images in pdf

Following is a step by step process to get co-ordinates or location and size of images in pdf :

  1. Extend PDFStreamEngine

    Create a Java Class and extend it with PDFStreamEngine.

  2. Call processPage()

    For each of the pages in PDF document, call the method processPage(page).

  3. Override processOperator()

    For each operator in the content stream of a PDF page, processOperator() is called from processPage(). We shall override processOperator().

  4. Check for Image

    Check whether the operator passed to processOperator() draws an XObject (the Do operator) and whether that XObject is an image.

  5. Print Locations and Size

    If the object is an image object, print the locations and size of the image.

Example Java Program to get location and size of images in pdf

Download the PDF document apache.pdf if you would like to use the same PDF file. Otherwise, assign your own PDF file path to fileName in the Java program.
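A sketch of such a program, assuming the PDFBox 2.0.x API, in which the operator classes Concatenate, Save and Restore have no-argument constructors. The class name ImageLocations and the helper locate are hypothetical. The engine registers the operators that change the current transformation matrix (CTM), then reads the image position and displayed size from the CTM whenever an image is drawn.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.contentstream.PDFStreamEngine;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.contentstream.operator.state.Concatenate;
import org.apache.pdfbox.contentstream.operator.state.Restore;
import org.apache.pdfbox.contentstream.operator.state.Save;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
import org.apache.pdfbox.util.Matrix;

// ImageLocations is a hypothetical name.
class ImageLocations extends PDFStreamEngine {
    // One entry per image: {x, y, displayedWidth, displayedHeight, rawWidth, rawHeight}
    private final List<float[]> placements = new ArrayList<>();

    ImageLocations() {
        // Track the operators that modify the CTM: cm, q and Q.
        addOperator(new Concatenate());
        addOperator(new Save());
        addOperator(new Restore());
    }

    @Override
    protected void processOperator(Operator operator, List<COSBase> operands) throws IOException {
        boolean drawsXObject = "Do".equals(operator.getName())
                && !operands.isEmpty() && operands.get(0) instanceof COSName;
        if (!drawsXObject) {
            super.processOperator(operator, operands);
            return;
        }
        PDXObject xobject = getResources().getXObject((COSName) operands.get(0));
        if (xobject instanceof PDImageXObject) {
            PDImageXObject image = (PDImageXObject) xobject;
            Matrix ctm = getGraphicsState().getCurrentTransformationMatrix();
            placements.add(new float[] {
                ctm.getTranslateX(), ctm.getTranslateY(),          // position in user space units
                ctm.getScalingFactorX(), ctm.getScalingFactorY(),  // displayed size in user space units
                image.getWidth(), image.getHeight()                // raw size in pixels
            });
        } else if (xobject instanceof PDFormXObject) {
            showForm((PDFormXObject) xobject);
        }
    }

    static List<float[]> locate(PDDocument document) throws IOException {
        ImageLocations engine = new ImageLocations();
        for (PDPage page : document.getPages()) {
            engine.processPage(page);
        }
        return engine.placements;
    }

    public static void main(String[] args) throws IOException {
        String fileName = args.length > 0 ? args[0] : "apache.pdf";
        try (PDDocument document = PDDocument.load(new File(fileName))) {
            for (float[] p : locate(document)) {
                System.out.println("position = (" + p[0] + ", " + p[1] + "), displayed size = "
                        + p[2] + " x " + p[3] + ", raw size = " + (int) p[4] + " x " + (int) p[5] + " px");
            }
        }
    }
}
```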

Raw Size vs Displayed Size

The size at which an image is displayed in the PDF can differ from the actual size of the original (raw) image.
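To make the difference concrete, the small sketch below relates a raw width in pixels to a displayed width in PDF user space units, assuming the default unit of 1/72 inch. The class and method names are hypothetical and the numbers in main are made up for illustration. It uses only the Java standard library.

```java
// Hypothetical helper relating raw pixel size to displayed size.
class DisplayedImageSize {
    // Default PDF user space: 72 units (points) per inch.
    static final float POINTS_PER_INCH = 72f;

    static float pointsToInches(float points) {
        return points / POINTS_PER_INCH;
    }

    // Effective resolution at which the raw pixels are shown on the page.
    static float effectiveDpi(int rawPixels, float displayedPoints) {
        return rawPixels / pointsToInches(displayedPoints);
    }

    public static void main(String[] args) {
        int rawWidth = 600;          // made-up raw width in pixels
        float displayedWidth = 144f; // made-up displayed width in points (2 inches)
        System.out.println("Displayed width in inches: " + pointsToInches(displayedWidth));
        System.out.println("Effective resolution in DPI: " + effectiveDpi(rawWidth, displayedWidth));
    }
}
```

The same raw image displayed at a larger size has a lower effective resolution, and vice versa.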

(X,Y) location of image in PDF

The (X,Y) location that we get from PDFBox is the bottom-left corner of the image.

(X,Y) location of image in PDF

Conclusion :

In this Apache PDFBox Tutorial, we have learnt to get co-ordinates or location and size of images in pdf document and also learnt what x and y coordinates mean for an image in a pdf.

How to setup a Java project with PDFBox

In this PDFBox Tutorial, we shall learn to set up a Java project with PDFBox and start working with PDFBox examples.

Step by step process to set up a Java project with PDFBox

Following are the steps to set up PDFBox in an Eclipse Java project. The steps should remain much the same for other IDEs as well.

  1. Create a new Java Project in Eclipse, PdfBox2Examples.
    File → New → Java Project
  2. Download jars from https://pdfbox.apache.org/download.cgi.
    Download PdfBox Jars

  3. Download the Apache Commons Logging jar.
  4. Add all these jars to the Build Path.
    Select Project “PdfBox2Examples” → File → Properties → Java Build Path → Libraries → Add JARs

    Setup Build Path

  5. The Java Project, PdfBox2Examples, is ready to work with PDFBox libraries.
  6. Run the following example to verify if the setup is successful.

Example Program :
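A minimal check that the jars are on the build path could look like the sketch below. It is not the original example program. The class name SetupCheck is hypothetical, and the program only creates an in-memory document with one blank page and reports the page count, so it needs no input file.

```java
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;

// SetupCheck is a hypothetical name; the program compiles and runs only if
// the PDFBox jars (and their dependencies) are on the build path.
class SetupCheck {
    static int blankDocumentPageCount() throws IOException {
        try (PDDocument document = new PDDocument()) {
            document.addPage(new PDPage());
            return document.getNumberOfPages();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("PDFBox is available. Pages in test document: " + blankDocumentPageCount());
    }
}
```

If the program fails with a NoClassDefFoundError, one of the jars from steps 2 to 4 is probably missing from the build path.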

 

How to read all the text from pdf document using PDFBox 2.0

In this PDFBox Tutorial, we shall learn to read all the text from pdf document using PDFBox 2.0 libraries in a Java Program.

Read all the text from pdf document using PDFBox 2.0

A PDF document may contain text, embedded images, etc., as its contents. The PDFTextStripper class in PDFBox provides methods to extract all the text from a PDF document.
Following are the steps to extract the text from a PDF:

Step 1 : Load PDF

Load the pdf file into PDDocument

Step 2 : Use PDFTextStripper.getText method

Get the text from doc using PDFTextStripper

PDFTextStripper ignores formatting and placement of text chunks in the pdf document. PDFTextStripper just strips out all the text from all the pages of pdf document.
getText returns the text of the pdf document.

Complete Java Program
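The two steps can be combined as in the following sketch, which assumes the PDFBox 2.0.x API (PDDocument.load was replaced by a different loader class in later major versions). The class name ReadPdfText and the helper readText are hypothetical.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// ReadPdfText is a hypothetical name.
class ReadPdfText {
    // Step 2: strip the text of all pages from an already loaded document.
    static String readText(PDDocument document) throws IOException {
        return new PDFTextStripper().getText(document);
    }

    public static void main(String[] args) throws IOException {
        String fileName = args.length > 0 ? args[0] : "sample.pdf";
        // Step 1: load the PDF file into a PDDocument; it is closed automatically.
        try (PDDocument document = PDDocument.load(new File(fileName))) {
            System.out.println(readText(document));
        }
    }
}
```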

The PDF file used in the example is sample.pdf.

Reference :

You may find more information about the PDFTextStripper class in its Java documentation.

Conclusion :

We have learnt to read all the text from pdf document using PDFBox 2.0.

How to create a PDF file and write text into it using PDFBox?

Create a PDF file and write text into it using PDFBox 2.0

In this PDFBox Tutorial, we shall see how to create a PDF file and write text into it using PDFBox 2.0. We shall go through the process step by step.

Following are the programmatic steps required to create a PDF file and write text to it using PDFBox 2.0:

Step 1 :

Create a PDF document in-memory

Step 2 :

Create a PDF page.

Step 3 :

Add the page to the PDF document.

Step 4 :

Ready the contents to be written in the page. Use a stream. This stream has to be closed after usage.

Step 5 :

Begin some text operations.

Step 6 :

Set the font and font size of text, to draw it on PDF page.

Step 7 :

Start a new line at offset (x,y) as shown below (say, for the character ‘g’):

Offset (x,y) in PDFBox 2.0

 

Step 8 :

Show the text at the location specified.

Step 9 :

Stop the text operations.

Step 10 :

Close the content stream.

Step 11 :

Save the PDF document.

Step 12 :

Close the in-memory pdf document.

The complete program, CreatePdfWithTextDemo.java, is shown below :
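The twelve steps map onto PDFBox 2.0.x calls roughly as in the sketch below. It is a reconstruction under assumptions rather than the original listing: the helper createDocument, the output file name demo.pdf, the font, the offset (100, 700) and the sample text are all chosen for illustration. The PDType1Font.TIMES_ROMAN constant belongs to the 2.0.x API.

```java
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDType1Font;

class CreatePdfWithTextDemo {
    // Builds an in-memory document with one page containing the given text.
    static PDDocument createDocument(String text) throws IOException {
        PDDocument document = new PDDocument();                 // Step 1
        try {
            PDPage page = new PDPage();                         // Step 2
            document.addPage(page);                             // Step 3
            try (PDPageContentStream contents =
                    new PDPageContentStream(document, page)) {  // Step 4
                contents.beginText();                           // Step 5
                contents.setFont(PDType1Font.TIMES_ROMAN, 12);  // Step 6
                contents.newLineAtOffset(100, 700);             // Step 7
                contents.showText(text);                        // Step 8
                contents.endText();                             // Step 9
            }                                                   // Step 10: stream closed here
            return document;
        } catch (IOException e) {
            document.close(); // do not leak the document if writing fails
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        // The try-with-resources block closes the in-memory document (Step 12).
        try (PDDocument document = createDocument("Hello World")) {
            document.save("demo.pdf");                          // Step 11 (hypothetical file name)
        }
    }
}
```

The content stream is closed before the document is saved. Saving first would write a page whose text has not been flushed yet.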

The pdf generated is as shown in the below picture :

Create a PDF file and write text into it using PDFBox 2.0

The project structure is as shown below. The PDF file is created at the root of the project.

Demo Project Structure

Conclusion :

In this PDFBox Tutorial / PDFBox Example we have seen how to create a PDF file and write text into it using PDFBox 2.0.