Configuration Reference¶

PDF Extractor Parameters¶

Constructor Parameters¶

Parameter	Type	Default	Description
`path`	`Path \\| str`	required	Path to the PDF file
`output_unit`	`CoordinateUnit`	`POINTS`	Output coordinate unit
`character_merger`	`CharacterMerger`	`BasicLineMerger()`	Text merging strategy

Extract Parameters¶

Parameter	Type	Default	Description
`pages`	`Sequence[int]`	`None`	Pages to extract (0-indexed). `None` = all pages
`max_workers`	`int`	`1`	Number of parallel workers
`executor`	`ExecutorType`	`THREAD`	Executor type for parallelism
`table_options`	`dict`	`None`	Tabula options for table extraction

Table Extraction Options (Tabula)¶

Pass these options via the table_options parameter to enable table extraction.

Common Options¶

Option	Type	Description
`lattice`	`bool`	Use lattice mode for tables with visible cell borders
`stream`	`bool`	Use stream mode for tables without visible borders
`guess`	`bool`	Automatically guess table areas
`multiple_tables`	`bool`	Extract multiple tables per page

Area Options¶

Coordinates in area and columns follow the extractor's output_unit setting. The default unit is POINTS (1/72 inch).

Option	Type	Description
`area`	`tuple[float, float, float, float]`	Extract area: (top, left, bottom, right) in output units
`columns`	`list[float]`	X-coordinates for column splitting in output units

Example¶

from unifex import PdfExtractor, CoordinateUnit

with PdfExtractor("table.pdf") as extractor:
    # Lattice mode for bordered tables
    result = extractor.extract(table_options={"lattice": True})

    # Stream mode for borderless tables
    result = extractor.extract(table_options={"stream": True})

    # Extract from specific area (coordinates in points - the default)
    result = extractor.extract(table_options={
        "area": (100, 50, 400, 500),  # top, left, bottom, right in points
        "columns": [100, 200, 350],   # column boundaries in points
        "lattice": True,
    })

# Using different coordinate units
with PdfExtractor("table.pdf", output_unit=CoordinateUnit.INCHES) as extractor:
    # Coordinates are now in inches
    result = extractor.extract(table_options={
        "area": (1.0, 0.5, 5.0, 7.0),  # top, left, bottom, right in inches
        "columns": [1.5, 3.0, 5.5],    # column boundaries in inches
    })

Environment Variables¶

OCR Extractors¶

Variable	Description	Default
`UNIFEX_AZURE_DI_ENDPOINT`	Azure Document Intelligence endpoint URL	-
`UNIFEX_AZURE_DI_KEY`	Azure Document Intelligence API key	-
`UNIFEX_AZURE_DI_MODEL`	Azure model ID	`prebuilt-read`
`UNIFEX_GOOGLE_DOCAI_PROCESSOR_NAME`	Google Document AI processor name	-
`UNIFEX_GOOGLE_DOCAI_CREDENTIALS_PATH`	Path to Google service account JSON	-

LLM Providers¶

Variable	Description
`OPENAI_API_KEY`	OpenAI API key
`ANTHROPIC_API_KEY`	Anthropic API key
`GOOGLE_API_KEY`	Google AI API key
`AZURE_OPENAI_API_KEY`	Azure OpenAI API key
`AZURE_OPENAI_ENDPOINT`	Azure OpenAI endpoint URL
`AZURE_OPENAI_API_VERSION`	Azure OpenAI API version (default: `2024-02-15-preview`)

Coordinate Units¶

All extractors support the output_unit parameter to control the coordinate system of bounding boxes.

Unit	Description	Use Case
`POINTS`	1/72 inch	PDF native, resolution-independent
`PIXELS`	Pixels at specified DPI	Image processing, display
`INCHES`	Imperial inches	Print layout
`NORMALIZED`	0-1 range relative to page	ML models, relative positioning

Note

PDF extractor doesn't support PIXELS output because PDFs don't have inherent DPI. OCR extractors use the dpi parameter for pixel conversions.

DPI Settings¶

OCR extractors accept a dpi parameter that affects:

PDF to image conversion - Higher DPI = larger images, better quality
Coordinate conversion - Used when converting between PIXELS and other units

Recommended values:

Use Case	DPI
Fast preview	72-100
Standard OCR	150-200
High quality	300+

Parallel Processing¶

max_workers¶

Number of parallel workers for page extraction. Default is 1 (sequential).

result = extractor.extract(max_workers=4)

executor¶

Type of executor for parallel processing:

ExecutorType.THREAD (default) - Thread pool executor
ExecutorType.PROCESS - Process pool executor

from unifex import ExecutorType

result = extractor.extract(max_workers=4, executor=ExecutorType.PROCESS)