Get Coordinates of Characters in PDF using Apache PDFBox

To extract the coordinates (location) and size of characters in a PDF using Apache PDFBox, extend the PDFTextStripper class and override the writeString(String string, List<TextPosition> textPositions) method.

The class org.apache.pdfbox.text.PDFTextStripper reads text content from a PDF page. While stripping text, PDFBox passes each text segment along with a list of TextPosition objects. Each TextPosition contains coordinate and measurement details for a character or glyph.

The List<TextPosition> passed to writeString() describes each character: its Unicode value, X coordinate, Y coordinate, height, width, font size, scaling values, and space width. These values are useful when you want to locate text in a PDF, highlight text, compare extracted positions, or build a custom PDF text extraction workflow.

PDFBox TextPosition Coordinates Explained

PDF text coordinates can be confusing because a PDF page has its own coordinate system, and text may also be affected by page rotation, text direction, font metrics, and transformations. In PDFBox, TextPosition provides adjusted methods such as getXDirAdj(), getYDirAdj(), getHeightDir(), and getWidthDirAdj(). These are commonly used when you want practical character positions from left to right and top to bottom.

| PDFBox method | What it gives | Common use |
| --- | --- | --- |
| getUnicode() | The character or glyph text | Print or compare extracted text |
| getXDirAdj() | Adjusted X coordinate | Find horizontal position of the character |
| getYDirAdj() | Adjusted Y coordinate | Find vertical position of the character |
| getHeightDir() | Adjusted character height | Estimate text box height |
| getWidthDirAdj() | Adjusted character width | Estimate character or word width |
| getFontSize() | Font size value | Inspect text style or size |

You can refer to the official PDFBox TextPosition JavaDoc for the complete list of available methods.
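
As a rough illustration of how the values in the table fit together, the sketch below builds a bounding rectangle from an adjusted X, Y, width, and height, which is the usual starting point for highlighting a character. It assumes the adjusted Y value marks the text baseline measured from the top of the page, so the top of the box is Y minus the height. That is a common working assumption, not a guarantee for every font, and descenders can fall below the baseline. GlyphBox and boxFor are hypothetical names, and the plain float parameters stand in for the values returned by getXDirAdj(), getYDirAdj(), getWidthDirAdj(), and getHeightDir().

```java
import java.awt.geom.Rectangle2D;

/** Hypothetical helper: builds an approximate bounding box for one character. */
final class GlyphBox {

    private GlyphBox() {
    }

    /**
     * Assumes (x, y) is the left end of the baseline in a top-left-origin
     * coordinate system. This is an assumption made for illustration.
     */
    static Rectangle2D.Float boxFor(float x, float y, float width, float height) {
        // The box starts one character height above the baseline.
        return new Rectangle2D.Float(x, y - height, width, height);
    }

    public static void main(String[] args) {
        // Values taken from the first line of the sample output of Example 1.
        Rectangle2D.Float box = boxFor(26.004425f, 22.003723f, 5.0907116f, 5.833024f);
        System.out.println(box);
    }
}
```

In a real stripper you would call boxFor() with the four adjusted values of each TextPosition and collect the rectangles per page.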

Steps to Extract Coordinates of Characters in PDF

The following steps show how to extract the coordinates (position) of characters in a PDF file with PDFBox.

1. Extend PDFTextStripper for Character Position Extraction

Create a Java class that extends PDFTextStripper.

public class GetCharLocationAndSize extends PDFTextStripper {
  . . .
}

This custom class receives the text positions while PDFBox reads the PDF content stream.

2. Call writeText Method on the PDF Document

Set the page range from the first page to the last page, then call writeText() to strip the text. Page numbers in PDFTextStripper are 1-based, so the first page is 1.

PDFTextStripper stripper = new GetCharLocationAndSize();
stripper.setSortByPosition( true );
stripper.setStartPage( 1 );
stripper.setEndPage( document.getNumberOfPages() );

// The writer is a throwaway sink; the positions are reported from writeString().
Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
stripper.writeText(document, dummy);

The setSortByPosition(true) call asks PDFBox to sort the extracted text by position. This is helpful when you want the output to follow the visual reading order more closely, especially for simple left-to-right documents.
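
To make "sorted by position" more concrete, here is a minimal sketch of a top-to-bottom, then left-to-right ordering. It is not PDFBox's internal sorting algorithm, which handles more cases; it only illustrates the idea for simple left-to-right text. ReadingOrder and Glyph are hypothetical names, and Glyph is a stand-in for the few TextPosition values used here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of a simple reading-order sort; not PDFBox's own logic. */
final class ReadingOrder {

    /** Stand-in for the adjusted X and Y values of a TextPosition. */
    static final class Glyph {
        final String text;
        final float x;
        final float y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    private ReadingOrder() {
    }

    /** Orders glyphs by Y first (smaller Y is higher on the page), then by X. */
    static List<Glyph> sort(List<Glyph> glyphs) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.<Glyph>comparingDouble(g -> g.y)
                .thenComparingDouble(g -> g.x));
        return sorted;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("b", 20f, 10f));
        glyphs.add(new Glyph("c", 5f, 30f));
        glyphs.add(new Glyph("a", 5f, 10f));

        StringBuilder out = new StringBuilder();
        for (Glyph g : sort(glyphs)) {
            out.append(g.text);
        }
        System.out.println(out); // prints "abc"
    }
}
```

The exact comparison of Y values is a simplification. Characters on the same visual line can have slightly different Y values, so a practical version would compare Y with a small tolerance.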

3. Override writeString to Access TextPosition Objects

The writeString() method receives the text positions of the characters in each text segment. Override it as shown below.

@Override
protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
    . . .
}

The string parameter contains the text fragment, and textPositions contains the corresponding position objects. In many simple PDFs, each item in textPositions maps to an individual visible character. In some PDFs, ligatures, encoded glyphs, or special font mappings can make this relationship less direct.

4. Print Character Locations and Size from TextPosition

For each TextPosition in the list, which typically represents an individual character, print the coordinates and size.

The important values for a simple character coordinate report are Unicode, X position, Y position, height, and width.
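
If you want a tidier report than raw float values, one option is to format each line with a fixed number of decimal places. The sketch below is only an illustration: CharReport and formatLine are hypothetical names, and the parameters stand in for the values read from a TextPosition.

```java
import java.util.Locale;

/** Hypothetical helper: formats one character's position and size for a report. */
final class CharReport {

    private CharReport() {
    }

    /** Formats the values with two decimal places. */
    static String formatLine(String unicode, float x, float y, float height, float width) {
        // Locale.ROOT keeps the decimal separator a dot regardless of the system locale.
        return String.format(Locale.ROOT, "%s [(X=%.2f,Y=%.2f) height=%.2f width=%.2f]",
                unicode, x, y, height, width);
    }

    public static void main(String[] args) {
        // Sample values; in the stripper they would come from a TextPosition.
        System.out.println(formatLine("W", 226.4448f, 22.003723f, 5.833024f, 7.911499f));
    }
}
```

Rounding is for display only. Keep the original float values if you compare positions or compute word boundaries.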

Example 1 – Extract Coordinates or Position of Characters in PDF

In this example, we will take a PDF with text, and extract the (X, Y) coordinates of characters.

GetCharLocationAndSize.java

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
 
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.List;
 
/**
* This is an example on how to get the x/y coordinates and size of each character in PDF
*/
public class GetCharLocationAndSize extends PDFTextStripper {
 
    public GetCharLocationAndSize() throws IOException {
    }
 
    /**
     * @throws IOException If there is an error parsing the document.
     */
    public static void main( String[] args ) throws IOException {
        PDDocument document = null;
        String fileName = "apache.pdf";
        try {
            document = PDDocument.load( new File(fileName) );
            PDFTextStripper stripper = new GetCharLocationAndSize();
            stripper.setSortByPosition( true );
            stripper.setStartPage( 1 ); // page numbers are 1-based
            stripper.setEndPage( document.getNumberOfPages() );
 
            Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
            stripper.writeText(document, dummy);
        }
        finally {
            if( document != null ) {
                document.close();
            }
        }
    }
 
    /**
     * Override the default functionality of PDFTextStripper.writeString()
     */
    @Override
    protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
        for (TextPosition text : textPositions) {
            System.out.println(text.getUnicode()+ " [(X=" + text.getXDirAdj() + ",Y=" +
                    text.getYDirAdj() + ") height=" + text.getHeightDir() + " width=" +
                    text.getWidthDirAdj() + "]");
        }
    }
}

Output

2 [(X=26.004425,Y=22.003723) height=5.833024 width=5.0907116]
0 [(X=31.095137,Y=22.003723) height=5.833024 width=5.0907116]
1 [(X=36.18585,Y=22.003723) height=5.833024 width=5.0907097]
7 [(X=41.276558,Y=22.003723) height=5.833024 width=5.0907097]
- [(X=46.367268,Y=22.003723) height=5.833024 width=2.8872108]
8 [(X=49.25448,Y=22.003723) height=5.833024 width=5.0907097]
- [(X=54.34519,Y=22.003723) height=5.833024 width=2.8872108]
6 [(X=57.2324,Y=22.003723) height=5.833024 width=5.0907097]
W [(X=226.4448,Y=22.003723) height=5.833024 width=7.911499]
e [(X=233.88747,Y=22.003723) height=5.833024 width=4.922714]
l [(X=238.81018,Y=22.003723) height=5.833024 width=2.2230377]
c [(X=241.03322,Y=22.003723) height=5.833024 width=4.399185]
o [(X=245.4324,Y=22.003723) height=5.833024 width=4.895355]
m [(X=250.32776,Y=22.003723) height=5.833024 width=7.7943115]
e [(X=258.12207,Y=22.003723) height=5.833024 width=4.922699]

You can download apache.pdf if you would like to use the same PDF file. Otherwise, set fileName in the Java program to the path of your own PDF file.

PDFBox Version Notes for Loading a PDF File

The example above uses the PDFBox 2.x style of loading a PDF document with PDDocument.load(new File(fileName)). If your project uses PDFBox 3.x, loading is usually done with the Loader utility class instead. The text extraction idea remains the same: load the document, create your custom PDFTextStripper, and call writeText().

File file = new File("apache.pdf");
try (PDDocument document = Loader.loadPDF(file)) {
    PDFTextStripper stripper = new GetCharLocationAndSize();
    stripper.setSortByPosition(true);
    stripper.writeText(document, new OutputStreamWriter(new ByteArrayOutputStream()));
}

If you use this PDFBox 3.x loading style, add the required import for org.apache.pdfbox.Loader. Do not mix PDFBox 2.x and 3.x loading code in the same project.

How PDF Coordinates Affect Extracted Text Positions

PDF coordinates are measured in user space units. A typical PDF page uses points, where 72 points are equal to one inch. In the raw PDF coordinate system, the origin is usually at the bottom-left of the page and the Y axis increases upward. PDFBox adjusted text position methods are easier to use for text extraction because they account for text direction and page orientation in a more practical way.

When checking the output, remember these points:

  • X changes as characters move from left to right.
  • Y changes when text appears on a different line or area of the page.
  • width depends on the font and the glyph, so narrow letters and wide letters have different values.
  • height is related to the rendered text height, not simply the declared font size.
  • Scanned PDFs may not return useful text positions unless OCR text is present.
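
The unit and origin notes above can be sketched with a little arithmetic. The example assumes an unrotated page whose visible area starts at the page origin and a page height of 792 points (US Letter), both chosen only for illustration. PdfUnits and its methods are hypothetical names and are not part of PDFBox.

```java
/** Hypothetical helper for converting PDF user-space values; not a PDFBox class. */
final class PdfUnits {

    static final float POINTS_PER_INCH = 72f;
    static final float MM_PER_INCH = 25.4f;

    private PdfUnits() {
    }

    static float pointsToInches(float points) {
        return points / POINTS_PER_INCH;
    }

    static float pointsToMillimeters(float points) {
        return points / POINTS_PER_INCH * MM_PER_INCH;
    }

    /**
     * Converts a Y value between a top-left origin and a bottom-left origin.
     * The same formula works in both directions.
     */
    static float flipY(float y, float pageHeight) {
        return pageHeight - y;
    }

    public static void main(String[] args) {
        float pageHeight = 792f; // US Letter height in points, assumed for illustration

        System.out.println(pointsToInches(36f));    // prints 0.5
        System.out.println(flipY(22f, pageHeight)); // prints 770.0
    }
}
```

For rotated pages or pages with an offset crop box, this simple flip is not enough, and you would need to account for the rotation and box origin as well.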

Extract Word Coordinates from Character Coordinates

The example prints coordinates for individual characters. If you want the position of a word, collect consecutive TextPosition values until you reach a space or a line break. The word’s left coordinate can be taken from the first character, and the right boundary can be estimated from the last character’s X coordinate plus its width.

private void printWordPosition(String word, List<TextPosition> positions) {
    if (word.isEmpty() || positions.isEmpty()) {
        return;
    }

    TextPosition first = positions.get(0);
    TextPosition last = positions.get(positions.size() - 1);

    float x = first.getXDirAdj();
    float y = first.getYDirAdj();
    float width = (last.getXDirAdj() + last.getWidthDirAdj()) - x;
    float height = first.getHeightDir();

    System.out.println(word + " [(X=" + x + ",Y=" + y + ") height=" + height + " width=" + width + "]");
}

This approach works well for simple text runs. For multi-line words, rotated text, custom encodings, or complex scripts, you may need additional grouping logic based on Y coordinate, spacing, and text direction.
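
The grouping step itself can be sketched as follows. This is one possible approach under simple assumptions (left-to-right text, characters already in reading order), not a definitive implementation. WordGrouper, Glyph, and Word are hypothetical names; Glyph stands in for the adjusted X, Y, and width of a TextPosition, and the maxGap and lineTolerance thresholds are values you would tune for your documents.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: groups characters into words by space, gap, and line change. */
final class WordGrouper {

    /** Stand-in for the values read from a TextPosition. */
    static final class Glyph {
        final String text;
        final float x;
        final float y;
        final float width;

        Glyph(String text, float x, float y, float width) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /** A word with its left X, its Y, and its estimated width. */
    static final class Word {
        final String text;
        final float x;
        final float y;
        final float width;

        Word(String text, float x, float y, float width) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    private WordGrouper() {
    }

    static List<Word> group(List<Glyph> glyphs, float maxGap, float lineTolerance) {
        List<Word> words = new ArrayList<>();
        List<Glyph> current = new ArrayList<>();
        for (Glyph g : glyphs) {
            boolean isSpace = g.text.trim().isEmpty();
            if (!current.isEmpty()) {
                Glyph prev = current.get(current.size() - 1);
                boolean newLine = Math.abs(g.y - prev.y) > lineTolerance;
                boolean bigGap = g.x - (prev.x + prev.width) > maxGap;
                // Close the current word on a space, a line change, or a wide gap.
                if (isSpace || newLine || bigGap) {
                    words.add(toWord(current));
                    current.clear();
                }
            }
            if (!isSpace) {
                current.add(g);
            }
        }
        if (!current.isEmpty()) {
            words.add(toWord(current));
        }
        return words;
    }

    private static Word toWord(List<Glyph> glyphs) {
        StringBuilder text = new StringBuilder();
        for (Glyph g : glyphs) {
            text.append(g.text);
        }
        Glyph first = glyphs.get(0);
        Glyph last = glyphs.get(glyphs.size() - 1);
        // Width runs from the first character's left edge to the last one's right edge.
        return new Word(text.toString(), first.x, first.y, (last.x + last.width) - first.x);
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("H", 10f, 20f, 5f));
        glyphs.add(new Glyph("i", 15f, 20f, 2f));
        glyphs.add(new Glyph(" ", 17f, 20f, 3f));
        glyphs.add(new Glyph("y", 20f, 20f, 5f));
        glyphs.add(new Glyph("o", 25f, 20f, 5f));

        for (Word w : group(glyphs, 3f, 2f)) {
            System.out.println(w.text + " [(X=" + w.x + ",Y=" + w.y + ") width=" + w.width + "]");
        }
    }
}
```

The gap check matters because some PDFs position words without emitting a space character between them, as the jump in X between "6" and "W" in the sample output suggests.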

Common Issues When Getting Text Coordinates in PDFBox

| Issue | Reason | What to check |
| --- | --- | --- |
| No text is extracted | The PDF may be scanned image content | Check whether the PDF contains selectable text or OCR text |
| Coordinates do not match visible order | PDF text content order can differ from display order | Use setSortByPosition(true) and inspect page layout |
| Y coordinate looks unexpected | PDF and adjusted coordinate systems differ | Compare raw and adjusted values for your page |
| Characters appear as missing or incorrect symbols | Font encoding or ToUnicode mapping may be limited | Test with another PDF or inspect embedded fonts |
| Word boxes are inaccurate | Characters may have variable widths or unusual spacing | Group by spacing thresholds and line position |

Editorial QA Checklist for this PDFBox Character Coordinate Tutorial

  • Confirm whether the project uses PDFBox 2.x or PDFBox 3.x before copying the PDF loading code.
  • Verify that the PDF contains selectable text and is not only a scanned image.
  • Check coordinates on more than one page when the document has rotation, headers, footers, or columns.
  • Use TextPosition adjusted methods consistently when comparing X, Y, width, and height.
  • Test word grouping separately if the requirement is to locate words instead of individual characters.

Frequently Asked Questions about PDFBox Text Coordinates

How do I get coordinates of text in a PDF using PDFBox?

Extend PDFTextStripper, override writeString(), and read the TextPosition objects passed to that method. Use methods such as getXDirAdj(), getYDirAdj(), getHeightDir(), and getWidthDirAdj() to get adjusted text coordinates and size.

Can PDFBox extract coordinates for every character in a PDF?

PDFBox can extract character position information when the PDF contains text content that PDFBox can decode. If the PDF is a scanned image without OCR text, there may be no text positions to extract.

Why are PDFBox Y coordinates different from what I expected?

PDF pages use a coordinate system that may differ from screen coordinates. Raw PDF coordinates usually start near the bottom-left of the page, while adjusted text position methods in PDFBox are designed to be easier for text extraction and reading order.

How can I get word coordinates instead of character coordinates in PDFBox?

Collect consecutive TextPosition objects into a word until you encounter a space, line change, or spacing gap. Use the first character for the left X and Y position, then use the last character’s X position plus width to estimate the word width.

What is the difference between PDFBox and iText for text coordinates?

Both libraries can be used for PDF processing, but their APIs and licensing models are different. PDFBox is commonly used as an open-source Java library for reading and processing PDFs. iText also provides PDF features, but you should review its current license terms before using it in a project.

Conclusion

In this PDFBox tutorial, we learned how to extract the coordinates (position) of characters in a PDF document. The same TextPosition objects also expose the Unicode text, height, width, X and Y scaling values, font size, and space width.

The key idea is to use a custom PDFTextStripper and inspect each TextPosition object. Once character positions are available, you can build additional logic to locate words, lines, or regions of text in a PDF document.