Skip to content

PDF Extraction

Native PDF text extraction using pypdfium2, with optional table extraction via tabula-py.

Basic Usage

from unifex import PdfExtractor

with PdfExtractor("document.pdf") as extractor:
    result = extractor.extract()
    for page in result.document.pages:
        for text in page.texts[:2]:  # Show first 2 texts
            print(text.text)

Using the Factory

from unifex import create_extractor, ExtractorType

with create_extractor("document.pdf", ExtractorType.PDF) as extractor:
    result = extractor.extract()
    print(f"Extracted {len(result.document.pages)} pages")

Extracting Specific Pages

from unifex import PdfExtractor

with PdfExtractor("document.pdf") as extractor:
    # Extract only page 0
    result = extractor.extract(pages=[0])
    print(f"Extracted page: {result.document.pages[0].page}")

Parallel Extraction

Extract multiple pages concurrently for faster processing:

from unifex import PdfExtractor

with PdfExtractor("document.pdf") as extractor:
    # Use 2 parallel workers
    result = extractor.extract(max_workers=2)
    print(f"Extracted {len(result.document.pages)} pages in parallel")

See Parallel Processing for more details.

Coordinate Units

Control the output coordinate system for bounding boxes:

from unifex import PdfExtractor, CoordinateUnit

# Points (default) - 1/72 inch, PDF native
with PdfExtractor("document.pdf", output_unit=CoordinateUnit.POINTS) as extractor:
    result = extractor.extract()
    print(f"Points: {result.document.pages[0].texts[0].bbox}")

# Inches
with PdfExtractor("document.pdf", output_unit=CoordinateUnit.INCHES) as extractor:
    result = extractor.extract()
    print(f"Inches: {result.document.pages[0].texts[0].bbox}")

# Normalized (0-1 range relative to page dimensions)
with PdfExtractor("document.pdf", output_unit=CoordinateUnit.NORMALIZED) as extractor:
    result = extractor.extract()
    print(f"Normalized: {result.document.pages[0].texts[0].bbox}")

Available units:

Unit Description
POINTS 1/72 inch (PDF native, resolution-independent)
PIXELS Pixels at specified DPI (not supported for PDF)
INCHES Imperial inches
NORMALIZED 0-1 range relative to page dimensions

Note

PDF extractor doesn't support PIXELS output because PDFs don't have inherent DPI.

Table Extraction

Extract tables from PDFs using tabula-py. Requires the tables extra:

uv sync --extra tables

Basic Table Extraction

from unifex import PdfExtractor

with PdfExtractor("table.pdf") as extractor:
    result = extractor.extract(table_options={})
    for page in result.document.pages:
        for table in page.tables:
            print(f"Table with {len(table.rows)} rows")
            for row in table.rows:
                print([cell.text for cell in row])

Table Options

Pass tabula options to control extraction behavior. Coordinates in area and columns follow the extractor's output_unit setting (default: POINTS).

from unifex import PdfExtractor, CoordinateUnit

with PdfExtractor("table.pdf") as extractor:
    # Lattice mode: for tables with visible borders
    result = extractor.extract(table_options={"lattice": True})

    # Stream mode: for tables without visible borders
    result = extractor.extract(table_options={"stream": True})

    # Extract from specific area (top, left, bottom, right in points - the default)
    result = extractor.extract(table_options={
        "area": (100, 50, 400, 500),
        "columns": [100, 200, 350],  # column boundaries in points
    })

    # Multiple tables per page
    result = extractor.extract(table_options={"multiple_tables": True})

# Using inches instead of points
with PdfExtractor("table.pdf", output_unit=CoordinateUnit.INCHES) as extractor:
    result = extractor.extract(table_options={
        "area": (1.0, 0.5, 5.0, 7.0),   # in inches
        "columns": [1.5, 3.0, 5.5],      # in inches
    })

Path Objects

Both string paths and Path objects are supported:

from pathlib import Path
from unifex import PdfExtractor

with PdfExtractor(Path("document.pdf")) as extractor:
    result = extractor.extract()
    print(f"Extracted {len(result.document.pages)} pages")

When to Use PDF Extraction

PDF extraction is ideal for:

  • Documents with embedded text (not scanned images)
  • High-quality PDFs where text is selectable
  • When you need precise character positioning
  • Extracting structured tables from PDFs

For scanned PDFs or images, use OCR Extraction instead.