Skip to content

Configuration Reference

PDF Extractor Parameters

Constructor Parameters

Parameter Type Default Description
path Path \| str required Path to the PDF file
output_unit CoordinateUnit POINTS Output coordinate unit
character_merger CharacterMerger BasicLineMerger() Text merging strategy

Extract Parameters

Parameter Type Default Description
pages Sequence[int] None Pages to extract (0-indexed). None = all pages
max_workers int 1 Number of parallel workers
executor ExecutorType THREAD Executor type for parallelism
table_options dict None Tabula options for table extraction

Table Extraction Options (Tabula)

Pass these options via the table_options parameter to enable table extraction.

Common Options

Option Type Description
lattice bool Use lattice mode for tables with visible cell borders
stream bool Use stream mode for tables without visible borders
guess bool Automatically guess table areas
multiple_tables bool Extract multiple tables per page

Area Options

Coordinates in area and columns follow the extractor's output_unit setting. The default unit is POINTS (1/72 inch).

Option Type Description
area tuple[float, float, float, float] Extract area: (top, left, bottom, right) in output units
columns list[float] X-coordinates for column splitting in output units

Example

from unifex import PdfExtractor, CoordinateUnit

with PdfExtractor("table.pdf") as extractor:
    # Lattice mode for bordered tables
    result = extractor.extract(table_options={"lattice": True})

    # Stream mode for borderless tables
    result = extractor.extract(table_options={"stream": True})

    # Extract from specific area (coordinates in points - the default)
    result = extractor.extract(table_options={
        "area": (100, 50, 400, 500),  # top, left, bottom, right in points
        "columns": [100, 200, 350],   # column boundaries in points
        "lattice": True,
    })

# Using different coordinate units
with PdfExtractor("table.pdf", output_unit=CoordinateUnit.INCHES) as extractor:
    # Coordinates are now in inches
    result = extractor.extract(table_options={
        "area": (1.0, 0.5, 5.0, 7.0),  # top, left, bottom, right in inches
        "columns": [1.5, 3.0, 5.5],    # column boundaries in inches
    })

Environment Variables

OCR Extractors

Variable Description Default
UNIFEX_AZURE_DI_ENDPOINT Azure Document Intelligence endpoint URL -
UNIFEX_AZURE_DI_KEY Azure Document Intelligence API key -
UNIFEX_AZURE_DI_MODEL Azure model ID prebuilt-read
UNIFEX_GOOGLE_DOCAI_PROCESSOR_NAME Google Document AI processor name -
UNIFEX_GOOGLE_DOCAI_CREDENTIALS_PATH Path to Google service account JSON -

LLM Providers

Variable Description
OPENAI_API_KEY OpenAI API key
ANTHROPIC_API_KEY Anthropic API key
GOOGLE_API_KEY Google AI API key
AZURE_OPENAI_API_KEY Azure OpenAI API key
AZURE_OPENAI_ENDPOINT Azure OpenAI endpoint URL
AZURE_OPENAI_API_VERSION Azure OpenAI API version (default: 2024-02-15-preview)

Coordinate Units

All extractors support the output_unit parameter to control the coordinate system of bounding boxes.

Unit Description Use Case
POINTS 1/72 inch PDF native, resolution-independent
PIXELS Pixels at specified DPI Image processing, display
INCHES Imperial inches Print layout
NORMALIZED 0-1 range relative to page ML models, relative positioning

Note

PDF extractor doesn't support PIXELS output because PDFs don't have inherent DPI. OCR extractors use the dpi parameter for pixel conversions.

DPI Settings

OCR extractors accept a dpi parameter that affects:

  1. PDF to image conversion - Higher DPI = larger images, better quality
  2. Coordinate conversion - Used when converting between PIXELS and other units

Recommended values:

Use Case DPI
Fast preview 72-100
Standard OCR 150-200
High quality 300+

Parallel Processing

max_workers

Number of parallel workers for page extraction. Default is 1 (sequential).

result = extractor.extract(max_workers=4)

executor

Type of executor for parallel processing:

  • ExecutorType.THREAD (default) - Thread pool executor
  • ExecutorType.PROCESS - Process pool executor
from unifex import ExecutorType

result = extractor.extract(max_workers=4, executor=ExecutorType.PROCESS)