API Reference¶

Factory Function¶

create_extractor¶

The main entry point for creating extractors.

unifex.create_extractor ¶

create_extractor(
    path: Path | str,
    extractor_type: ExtractorType,
    *,
    languages: list[str] | None = None,
    dpi: int = 200,
    use_gpu: bool = False,
    credentials: dict[str, str] | None = None,
    output_unit: CoordinateUnit = CoordinateUnit.POINTS,
    character_merger: str | None = None,
) -> BaseExtractor

Create an extractor by type with unified parameters.

Parameters:

Name	Type	Description	Default
`path`	`Path \| str`	Path to document/image file (Path object or string).	required
`extractor_type`	`ExtractorType`	ExtractorType enum value specifying which extractor to use: - ExtractorType.PDF - Native PDF extraction - ExtractorType.EASYOCR - EasyOCR for images and PDFs (auto-detects) - ExtractorType.TESSERACT - Tesseract for images and PDFs (auto-detects) - ExtractorType.PADDLE - PaddleOCR for images and PDFs (auto-detects) - ExtractorType.AZURE_DI - Azure Document Intelligence - ExtractorType.GOOGLE_DOCAI - Google Document AI	required
`languages`	`list[str] \| None`	Language codes for OCR (default: ["en"]). EasyOCR/Tesseract use full list, PaddleOCR uses first language.	`None`
`dpi`	`int`	DPI for PDF-to-image conversion (default: 200).	`200`
`use_gpu`	`bool`	Enable GPU acceleration where supported (default: False).	`False`
`credentials`	`dict[str, str] \| None`	Override credentials dict. If None, reads from env vars: - UNIFEX_AZURE_DI_ENDPOINT, UNIFEX_AZURE_DI_KEY for Azure - UNIFEX_GOOGLE_DOCAI_PROCESSOR_NAME, UNIFEX_GOOGLE_DOCAI_CREDENTIALS_PATH for Google	`None`
`output_unit`	`CoordinateUnit`	Coordinate unit for output (default: POINTS). - CoordinateUnit.POINTS - 1/72 inch (PDF native, resolution-independent) - CoordinateUnit.PIXELS - Pixels at the specified DPI - CoordinateUnit.INCHES - Imperial inches - CoordinateUnit.NORMALIZED - 0-1 range relative to page dimensions	`POINTS`
`character_merger`	`str \| None`	Character merger strategy for PDF extractor (default: basic-line). - "basic-line" - Merge characters into lines - "keep-char" - Keep each character as separate TextBlock	`None`

Returns:

Type	Description
`BaseExtractor`	Configured extractor instance.

Raises:

Type	Description
`ValueError`	If extractor_type is invalid or required credentials are missing.

Example

from unifex import create_extractor, ExtractorType, CoordinateUnit with create_extractor("doc.pdf", ExtractorType.PDF) as ext: ... doc = ext.extract() # Coordinates in points (default) with create_extractor("doc.pdf", ExtractorType.EASYOCR, ... output_unit=CoordinateUnit.PIXELS) as ext: ... doc = ext.extract() # Coordinates in pixels

Source code in unifex/text_factory.py

def create_extractor(  # noqa: PLR0913
    path: Path | str,
    extractor_type: ExtractorType,
    *,
    languages: list[str] | None = None,
    dpi: int = 200,
    use_gpu: bool = False,
    credentials: dict[str, str] | None = None,
    output_unit: CoordinateUnit = CoordinateUnit.POINTS,
    character_merger: str | None = None,
) -> BaseExtractor:
    """Create an extractor by type with unified parameters.

    Args:
        path: Path to document/image file (Path object or string).
        extractor_type: ExtractorType enum value specifying which extractor to use:
            - ExtractorType.PDF - Native PDF extraction
            - ExtractorType.EASYOCR - EasyOCR for images and PDFs (auto-detects)
            - ExtractorType.TESSERACT - Tesseract for images and PDFs (auto-detects)
            - ExtractorType.PADDLE - PaddleOCR for images and PDFs (auto-detects)
            - ExtractorType.AZURE_DI - Azure Document Intelligence
            - ExtractorType.GOOGLE_DOCAI - Google Document AI
        languages: Language codes for OCR (default: ["en"]).
            EasyOCR/Tesseract use full list, PaddleOCR uses first language.
        dpi: DPI for PDF-to-image conversion (default: 200).
        use_gpu: Enable GPU acceleration where supported (default: False).
        credentials: Override credentials dict. If None, reads from env vars:
            - UNIFEX_AZURE_DI_ENDPOINT, UNIFEX_AZURE_DI_KEY for Azure
            - UNIFEX_GOOGLE_DOCAI_PROCESSOR_NAME, UNIFEX_GOOGLE_DOCAI_CREDENTIALS_PATH for Google
        output_unit: Coordinate unit for output (default: POINTS).
            - CoordinateUnit.POINTS - 1/72 inch (PDF native, resolution-independent)
            - CoordinateUnit.PIXELS - Pixels at the specified DPI
            - CoordinateUnit.INCHES - Imperial inches
            - CoordinateUnit.NORMALIZED - 0-1 range relative to page dimensions
        character_merger: Character merger strategy for PDF extractor (default: basic-line).
            - "basic-line" - Merge characters into lines
            - "keep-char" - Keep each character as separate TextBlock

    Returns:
        Configured extractor instance.

    Raises:
        ValueError: If extractor_type is invalid or required credentials are missing.

    Example:
        >>> from unifex import create_extractor, ExtractorType, CoordinateUnit
        >>> with create_extractor("doc.pdf", ExtractorType.PDF) as ext:
        ...     doc = ext.extract()  # Coordinates in points (default)
        >>> with create_extractor("doc.pdf", ExtractorType.EASYOCR,
        ...                       output_unit=CoordinateUnit.PIXELS) as ext:
        ...     doc = ext.extract()  # Coordinates in pixels
    """
    languages = languages or ["en"]

    if extractor_type == ExtractorType.PDF:
        from unifex.pdf import PdfExtractor

        merger = get_character_merger(character_merger) if character_merger else None
        return PdfExtractor(path, output_unit=output_unit, character_merger=merger)

    elif extractor_type == ExtractorType.EASYOCR:
        from unifex.ocr.extractors.easy_ocr import EasyOcrExtractor

        return EasyOcrExtractor(
            path, languages=languages, gpu=use_gpu, dpi=dpi, output_unit=output_unit
        )

    elif extractor_type == ExtractorType.TESSERACT:
        from unifex.ocr.extractors.tesseract_ocr import TesseractOcrExtractor

        return TesseractOcrExtractor(path, languages=languages, dpi=dpi, output_unit=output_unit)

    elif extractor_type == ExtractorType.PADDLE:
        from unifex.ocr.extractors.paddle_ocr import PaddleOcrExtractor

        # PaddleOCR uses single language string
        lang = languages[0] if languages else "en"
        return PaddleOcrExtractor(
            path, lang=lang, use_gpu=use_gpu, dpi=dpi, output_unit=output_unit
        )

    elif extractor_type == ExtractorType.AZURE_DI:
        from unifex.ocr.extractors.azure_di import AzureDocumentIntelligenceExtractor

        endpoint = _get_credential("UNIFEX_AZURE_DI_ENDPOINT", credentials)
        key = _get_credential("UNIFEX_AZURE_DI_KEY", credentials)

        if not endpoint or not key:
            raise ValueError(
                "Azure credentials required. Set UNIFEX_AZURE_DI_ENDPOINT and UNIFEX_AZURE_DI_KEY "
                "environment variables or pass credentials dict."
            )

        return AzureDocumentIntelligenceExtractor(
            path, endpoint=endpoint, key=key, output_unit=output_unit
        )

    elif extractor_type == ExtractorType.GOOGLE_DOCAI:
        from unifex.ocr.extractors.google_docai import GoogleDocumentAIExtractor

        processor_name = _get_credential("UNIFEX_GOOGLE_DOCAI_PROCESSOR_NAME", credentials)
        credentials_path = _get_credential("UNIFEX_GOOGLE_DOCAI_CREDENTIALS_PATH", credentials)

        if not processor_name:
            raise ValueError(
                "Google Document AI processor name required. "
                "Set UNIFEX_GOOGLE_DOCAI_PROCESSOR_NAME env var or pass credentials dict."
            )

        if not credentials_path:
            raise ValueError(
                "Google Document AI credentials path required. "
                "Set UNIFEX_GOOGLE_DOCAI_CREDENTIALS_PATH env var or pass credentials dict."
            )

        return GoogleDocumentAIExtractor(
            path,
            processor_name=processor_name,
            credentials_path=credentials_path,
            output_unit=output_unit,
        )

    else:
        raise ValueError(f"Unknown extractor type: {extractor_type}")

Extractor Types¶

ExtractorType¶

Enum for available extractor types.

unifex.ExtractorType ¶

Bases: StrEnum

Source code in unifex/base/models.py

class ExtractorType(StrEnum):
    PDF = "pdf"
    EASYOCR = "easyocr"
    TESSERACT = "tesseract"
    PADDLE = "paddle"
    AZURE_DI = "azure-di"
    GOOGLE_DOCAI = "google-docai"

Coordinate Units¶

CoordinateUnit¶

Enum for coordinate output units.

unifex.CoordinateUnit ¶

Bases: StrEnum

Units for coordinate output.

Source code in unifex/base/models.py

class CoordinateUnit(StrEnum):
    """Units for coordinate output."""

    PIXELS = "pixels"  # Image pixels at a given DPI
    POINTS = "points"  # 1/72 inch (PDF native, default)
    INCHES = "inches"  # Imperial inches
    NORMALIZED = "normalized"  # 0-1 relative to page dimensions

Executor Types¶

ExecutorType¶

Enum for parallel execution modes.

unifex.base.ExecutorType ¶

Bases: str, Enum

Type of executor for parallel extraction.

Source code in unifex/base/base.py

class ExecutorType(str, Enum):
    """Type of executor for parallel extraction."""

    THREAD = "thread"
    PROCESS = "process"

LLM Extraction¶

extract_structured¶

Extract structured data from a document using an LLM (single request).

unifex.llm_factory.extract_structured ¶

extract_structured(
    path: Path | str,
    model: str,
    *,
    schema: type[T],
    prompt: str | None = None,
    pages: list[int] | None = None,
    dpi: int = 200,
    max_retries: int = 3,
    temperature: float = 0.0,
    credentials: dict[str, str] | None = None,
    base_url: str | None = None,
    headers: dict[str, str] | None = None,
    _extractor: Any = None,
) -> LLMExtractionResult[T]

extract_structured(
    path: Path | str,
    model: str,
    *,
    schema: None = None,
    prompt: str | None = None,
    pages: list[int] | None = None,
    dpi: int = 200,
    max_retries: int = 3,
    temperature: float = 0.0,
    credentials: dict[str, str] | None = None,
    base_url: str | None = None,
    headers: dict[str, str] | None = None,
    _extractor: Any = None,
) -> LLMExtractionResult[dict[str, Any]]

extract_structured(
    path: Path | str,
    model: str,
    *,
    schema: type[T] | None = None,
    prompt: str | None = None,
    pages: list[int] | None = None,
    dpi: int = 200,
    max_retries: int = 3,
    temperature: float = 0.0,
    credentials: dict[str, str] | None = None,
    base_url: str | None = None,
    headers: dict[str, str] | None = None,
    _extractor: SingleExtractor[T] | None = None,
) -> LLMExtractionResult[T | dict[str, Any]]

Extract structured data from a document using an LLM.

All specified pages are sent in a single request.

Parameters:

Name	Type	Description	Default
`path`	`Path \| str`	Path to document/image file.	required
`model`	`str`	Model identifier (e.g., "openai/gpt-4o", "anthropic/claude-3-5-sonnet").	required
`schema`	`type[T] \| None`	Pydantic model for structured output. None for free-form dict.	`None`
`prompt`	`str \| None`	Custom extraction prompt. Auto-generated from schema if None.	`None`
`pages`	`list[int] \| None`	Page numbers to extract from (0-indexed). None for all pages.	`None`
`dpi`	`int`	DPI for PDF-to-image conversion.	`200`
`max_retries`	`int`	Max retry attempts with validation feedback.	`3`
`temperature`	`float`	Sampling temperature (0.0 = deterministic).	`0.0`
`credentials`	`dict[str, str] \| None`	Override credentials dict (otherwise uses env vars).	`None`
`base_url`	`str \| None`	Custom API base URL for OpenAI-compatible APIs (vLLM, Ollama, etc.).	`None`
`headers`	`dict[str, str] \| None`	Custom HTTP headers for OpenAI-compatible APIs.	`None`
`_extractor`	`SingleExtractor[T] \| None`	Internal parameter for dependency injection (testing only).	`None`

Returns:

Type	Description
`LLMExtractionResult[T \| dict[str, Any]]`	LLMExtractionResult containing extracted data,
`LLMExtractionResult[T \| dict[str, Any]]`	model info, and provider.

Source code in unifex/llm_factory.py

def extract_structured(  # noqa: PLR0913
    path: Path | str,
    model: str,
    *,
    schema: type[T] | None = None,
    prompt: str | None = None,
    pages: list[int] | None = None,
    dpi: int = 200,
    max_retries: int = 3,
    temperature: float = 0.0,
    credentials: dict[str, str] | None = None,
    base_url: str | None = None,
    headers: dict[str, str] | None = None,
    _extractor: SingleExtractor[T] | None = None,
) -> LLMExtractionResult[T | dict[str, Any]]:
    """Extract structured data from a document using an LLM.

    All specified pages are sent in a single request.

    Args:
        path: Path to document/image file.
        model: Model identifier (e.g., "openai/gpt-4o", "anthropic/claude-3-5-sonnet").
        schema: Pydantic model for structured output. None for free-form dict.
        prompt: Custom extraction prompt. Auto-generated from schema if None.
        pages: Page numbers to extract from (0-indexed). None for all pages.
        dpi: DPI for PDF-to-image conversion.
        max_retries: Max retry attempts with validation feedback.
        temperature: Sampling temperature (0.0 = deterministic).
        credentials: Override credentials dict (otherwise uses env vars).
        base_url: Custom API base URL for OpenAI-compatible APIs (vLLM, Ollama, etc.).
        headers: Custom HTTP headers for OpenAI-compatible APIs.
        _extractor: Internal parameter for dependency injection (testing only).

    Returns:
        [LLMExtractionResult][unifex.llm.models.LLMExtractionResult] containing extracted data,
        model info, and provider.
    """
    path = Path(path) if isinstance(path, str) else path
    extractor = _extractor or _extract_single

    return extractor(
        path,
        model,
        schema,
        prompt,
        pages,
        dpi,
        max_retries,
        temperature,
        credentials,
        base_url,
        headers,
    )

extract_structured_async¶

Async version of extract_structured.

unifex.llm_factory.extract_structured_async `async` ¶

extract_structured_async(
    path: Path | str,
    model: str,
    *,
    schema: type[T],
    prompt: str | None = None,
    pages: list[int] | None = None,
    dpi: int = 200,
    max_retries: int = 3,
    temperature: float = 0.0,
    credentials: dict[str, str] | None = None,
    base_url: str | None = None,
    headers: dict[str, str] | None = None,
    _extractor: Any = None,
) -> LLMExtractionResult[T]

extract_structured_async(
    path: Path | str,
    model: str,
    *,
    schema: None = None,
    prompt: str | None = None,
    pages: list[int] | None = None,
    dpi: int = 200,
    max_retries: int = 3,
    temperature: float = 0.0,
    credentials: dict[str, str] | None = None,
    base_url: str | None = None,
    headers: dict[str, str] | None = None,
    _extractor: Any = None,
) -> LLMExtractionResult[dict[str, Any]]

extract_structured_async(
    path: Path | str,
    model: str,
    *,
    schema: type[T] | None = None,
    prompt: str | None = None,
    pages: list[int] | None = None,
    dpi: int = 200,
    max_retries: int = 3,
    temperature: float = 0.0,
    credentials: dict[str, str] | None = None,
    base_url: str | None = None,
    headers: dict[str, str] | None = None,
    _extractor: AsyncSingleExtractor[T] | None = None,
) -> LLMExtractionResult[T | dict[str, Any]]

Async version of extract_structured.

All specified pages are sent in a single request.

Parameters:

Name	Type	Description	Default
`path`	`Path \| str`	Path to document/image file.	required
`model`	`str`	Model identifier (e.g., "openai/gpt-4o", "anthropic/claude-3-5-sonnet").	required
`schema`	`type[T] \| None`	Pydantic model for structured output. None for free-form dict.	`None`
`prompt`	`str \| None`	Custom extraction prompt. Auto-generated from schema if None.	`None`
`pages`	`list[int] \| None`	Page numbers to extract from (0-indexed). None for all pages.	`None`
`dpi`	`int`	DPI for PDF-to-image conversion.	`200`
`max_retries`	`int`	Max retry attempts with validation feedback.	`3`
`temperature`	`float`	Sampling temperature (0.0 = deterministic).	`0.0`
`credentials`	`dict[str, str] \| None`	Override credentials dict (otherwise uses env vars).	`None`
`base_url`	`str \| None`	Custom API base URL for OpenAI-compatible APIs (vLLM, Ollama, etc.).	`None`
`headers`	`dict[str, str] \| None`	Custom HTTP headers for OpenAI-compatible APIs.	`None`
`_extractor`	`AsyncSingleExtractor[T] \| None`	Internal parameter for dependency injection (testing only).	`None`

Returns:

Type	Description
`LLMExtractionResult[T \| dict[str, Any]]`	LLMExtractionResult containing extracted data,
`LLMExtractionResult[T \| dict[str, Any]]`	model info, and provider.

Source code in unifex/llm_factory.py

async def extract_structured_async(  # noqa: PLR0913
    path: Path | str,
    model: str,
    *,
    schema: type[T] | None = None,
    prompt: str | None = None,
    pages: list[int] | None = None,
    dpi: int = 200,
    max_retries: int = 3,
    temperature: float = 0.0,
    credentials: dict[str, str] | None = None,
    base_url: str | None = None,
    headers: dict[str, str] | None = None,
    _extractor: AsyncSingleExtractor[T] | None = None,
) -> LLMExtractionResult[T | dict[str, Any]]:
    """Async version of extract_structured.

    All specified pages are sent in a single request.

    Args:
        path: Path to document/image file.
        model: Model identifier (e.g., "openai/gpt-4o", "anthropic/claude-3-5-sonnet").
        schema: Pydantic model for structured output. None for free-form dict.
        prompt: Custom extraction prompt. Auto-generated from schema if None.
        pages: Page numbers to extract from (0-indexed). None for all pages.
        dpi: DPI for PDF-to-image conversion.
        max_retries: Max retry attempts with validation feedback.
        temperature: Sampling temperature (0.0 = deterministic).
        credentials: Override credentials dict (otherwise uses env vars).
        base_url: Custom API base URL for OpenAI-compatible APIs (vLLM, Ollama, etc.).
        headers: Custom HTTP headers for OpenAI-compatible APIs.
        _extractor: Internal parameter for dependency injection (testing only).

    Returns:
        [LLMExtractionResult][unifex.llm.models.LLMExtractionResult] containing extracted data,
        model info, and provider.
    """
    path = Path(path) if isinstance(path, str) else path
    extractor = _extractor or _extract_single_async

    return await extractor(
        path,
        model,
        schema,
        prompt,
        pages,
        dpi,
        max_retries,
        temperature,
        credentials,
        base_url,
        headers,
    )

extract_structured_parallel¶

Extract structured data in parallel (one page per request).

unifex.llm_factory.extract_structured_parallel ¶

extract_structured_parallel(
    path: Path | str,
    model: str,
    *,
    schema: type[T],
    prompt: str | None = None,
    pages: list[int] | None = None,
    max_workers: int = 4,
    executor: ExecutorType = ExecutorType.THREAD,
    dpi: int = 200,
    max_retries: int = 3,
    temperature: float = 0.0,
    credentials: dict[str, str] | None = None,
    base_url: str | None = None,
    headers: dict[str, str] | None = None,
    _extractor: Any = None,
) -> LLMBatchExtractionResult[T]

extract_structured_parallel(
    path: Path | str,
    model: str,
    *,
    schema: None = None,
    prompt: str | None = None,
    pages: list[int] | None = None,
    max_workers: int = 4,
    executor: ExecutorType = ExecutorType.THREAD,
    dpi: int = 200,
    max_retries: int = 3,
    temperature: float = 0.0,
    credentials: dict[str, str] | None = None,
    base_url: str | None = None,
    headers: dict[str, str] | None = None,
    _extractor: Any = None,
) -> LLMBatchExtractionResult[dict[str, Any]]

extract_structured_parallel(
    path: Path | str,
    model: str,
    *,
    schema: type[T] | None = None,
    prompt: str | None = None,
    pages: list[int] | None = None,
    max_workers: int = 4,
    executor: ExecutorType = ExecutorType.THREAD,
    dpi: int = 200,
    max_retries: int = 3,
    temperature: float = 0.0,
    credentials: dict[str, str] | None = None,
    base_url: str | None = None,
    headers: dict[str, str] | None = None,
    _extractor: SingleExtractor[T] | None = None,
) -> LLMBatchExtractionResult[T | dict[str, Any]]

Extract structured data from a document in parallel (one page per request).

Each page is extracted in a separate request, allowing parallel processing. Errors on individual pages are captured in the result, not raised.

Parameters:

Name	Type	Description	Default
`path`	`Path \| str`	Path to document/image file.	required
`model`	`str`	Model identifier (e.g., "openai/gpt-4o", "anthropic/claude-3-5-sonnet").	required
`schema`	`type[T] \| None`	Pydantic model for structured output. None for free-form dict.	`None`
`prompt`	`str \| None`	Custom extraction prompt. Auto-generated from schema if None.	`None`
`pages`	`list[int] \| None`	Page numbers to extract from (0-indexed). None for all pages.	`None`
`max_workers`	`int`	Number of parallel workers.	`4`
`executor`	`ExecutorType`	Type of executor (THREAD or PROCESS) for parallel extraction.	`THREAD`
`dpi`	`int`	DPI for PDF-to-image conversion.	`200`
`max_retries`	`int`	Max retry attempts with validation feedback.	`3`
`temperature`	`float`	Sampling temperature (0.0 = deterministic).	`0.0`
`credentials`	`dict[str, str] \| None`	Override credentials dict (otherwise uses env vars).	`None`
`base_url`	`str \| None`	Custom API base URL for OpenAI-compatible APIs (vLLM, Ollama, etc.).	`None`
`headers`	`dict[str, str] \| None`	Custom HTTP headers for OpenAI-compatible APIs.	`None`
`_extractor`	`SingleExtractor[T] \| None`	Internal parameter for dependency injection (testing only).	`None`

Returns:

Type	Description
`LLMBatchExtractionResult[T \| dict[str, Any]]`	LLMBatchExtractionResult containing
`LLMBatchExtractionResult[T \| dict[str, Any]]`	per-page PageExtractionResult with data or errors.
`LLMBatchExtractionResult[T \| dict[str, Any]]`	Results are guaranteed to be in the same order as the input pages,
`LLMBatchExtractionResult[T \| dict[str, Any]]`	i.e., `result.results[i]` corresponds to `pages[i]`.

Source code in unifex/llm_factory.py

def extract_structured_parallel(  # noqa: PLR0913
    path: Path | str,
    model: str,
    *,
    schema: type[T] | None = None,
    prompt: str | None = None,
    pages: list[int] | None = None,
    max_workers: int = 4,
    executor: ExecutorType = ExecutorType.THREAD,
    dpi: int = 200,
    max_retries: int = 3,
    temperature: float = 0.0,
    credentials: dict[str, str] | None = None,
    base_url: str | None = None,
    headers: dict[str, str] | None = None,
    _extractor: SingleExtractor[T] | None = None,
) -> LLMBatchExtractionResult[T | dict[str, Any]]:
    """Extract structured data from a document in parallel (one page per request).

    Each page is extracted in a separate request, allowing parallel processing.
    Errors on individual pages are captured in the result, not raised.

    Args:
        path: Path to document/image file.
        model: Model identifier (e.g., "openai/gpt-4o", "anthropic/claude-3-5-sonnet").
        schema: Pydantic model for structured output. None for free-form dict.
        prompt: Custom extraction prompt. Auto-generated from schema if None.
        pages: Page numbers to extract from (0-indexed). None for all pages.
        max_workers: Number of parallel workers.
        executor: Type of executor (THREAD or PROCESS) for parallel extraction.
        dpi: DPI for PDF-to-image conversion.
        max_retries: Max retry attempts with validation feedback.
        temperature: Sampling temperature (0.0 = deterministic).
        credentials: Override credentials dict (otherwise uses env vars).
        base_url: Custom API base URL for OpenAI-compatible APIs (vLLM, Ollama, etc.).
        headers: Custom HTTP headers for OpenAI-compatible APIs.
        _extractor: Internal parameter for dependency injection (testing only).

    Returns:
        [LLMBatchExtractionResult][unifex.llm.models.LLMBatchExtractionResult] containing
        per-page [PageExtractionResult][unifex.llm.models.PageExtractionResult] with data or errors.
        Results are guaranteed to be in the same order as the input pages,
        i.e., `result.results[i]` corresponds to `pages[i]`.
    """
    from concurrent.futures import ProcessPoolExecutor

    path = Path(path) if isinstance(path, str) else path
    provider, model_name = _parse_model_string(model)
    extractor = _extractor or _extract_single

    # Get all pages if not specified
    if pages is None:
        from unifex.base import ImageLoader

        loader = ImageLoader(path, dpi=dpi)
        pages = list(range(loader.page_count))
        loader.close()

    # Parallel execution
    executor_class = ProcessPoolExecutor if executor == ExecutorType.PROCESS else ThreadPoolExecutor

    page_results: list[PageExtractionResult[T | dict[str, Any]] | None] = [None] * len(pages)
    total_usage: dict[str, int] = {}

    with executor_class(max_workers=max_workers) as pool:
        future_to_idx = {
            pool.submit(
                extractor,
                path,
                model,
                schema,
                prompt,
                [page],  # Single page per request
                dpi,
                max_retries,
                temperature,
                credentials,
                base_url,
                headers,
            ): (i, page)
            for i, page in enumerate(pages)
        }
        for future in as_completed(future_to_idx):
            idx, page = future_to_idx[future]
            try:
                result = future.result()
                page_results[idx] = PageExtractionResult(
                    page=page,
                    data=result.data,
                    usage=result.usage,
                )
                if result.usage:
                    for key, value in result.usage.items():
                        total_usage[key] = total_usage.get(key, 0) + value
            except Exception as e:
                page_results[idx] = PageExtractionResult(
                    page=page,
                    data=None,
                    error=str(e),
                )

    return LLMBatchExtractionResult(
        results=page_results,  # type: ignore[arg-type]
        model=model_name,
        provider=provider,
        total_usage=total_usage if total_usage else None,
    )

extract_structured_parallel_async¶

Async parallel extraction.

unifex.llm_factory.extract_structured_parallel_async `async` ¶

extract_structured_parallel_async(
    path: Path | str,
    model: str,
    *,
    schema: type[T],
    prompt: str | None = None,
    pages: list[int] | None = None,
    max_workers: int = 4,
    dpi: int = 200,
    max_retries: int = 3,
    temperature: float = 0.0,
    credentials: dict[str, str] | None = None,
    base_url: str | None = None,
    headers: dict[str, str] | None = None,
    _extractor: Any = None,
) -> LLMBatchExtractionResult[T]

extract_structured_parallel_async(
    path: Path | str,
    model: str,
    *,
    schema: None = None,
    prompt: str | None = None,
    pages: list[int] | None = None,
    max_workers: int = 4,
    dpi: int = 200,
    max_retries: int = 3,
    temperature: float = 0.0,
    credentials: dict[str, str] | None = None,
    base_url: str | None = None,
    headers: dict[str, str] | None = None,
    _extractor: Any = None,
) -> LLMBatchExtractionResult[dict[str, Any]]

extract_structured_parallel_async(
    path: Path | str,
    model: str,
    *,
    schema: type[T] | None = None,
    prompt: str | None = None,
    pages: list[int] | None = None,
    max_workers: int = 4,
    dpi: int = 200,
    max_retries: int = 3,
    temperature: float = 0.0,
    credentials: dict[str, str] | None = None,
    base_url: str | None = None,
    headers: dict[str, str] | None = None,
    _extractor: AsyncSingleExtractor[T] | None = None,
) -> LLMBatchExtractionResult[T | dict[str, Any]]

Async parallel extraction (one page per request).

Each page is extracted in a separate async request, with concurrency limited by max_workers via semaphore. Errors on individual pages are captured in the result, not raised.

Parameters:

Name	Type	Description	Default
`path`	`Path \| str`	Path to document/image file.	required
`model`	`str`	Model identifier (e.g., "openai/gpt-4o", "anthropic/claude-3-5-sonnet").	required
`schema`	`type[T] \| None`	Pydantic model for structured output. None for free-form dict.	`None`
`prompt`	`str \| None`	Custom extraction prompt. Auto-generated from schema if None.	`None`
`pages`	`list[int] \| None`	Page numbers to extract from (0-indexed). None for all pages.	`None`
`max_workers`	`int`	Number of concurrent requests (semaphore limit).	`4`
`dpi`	`int`	DPI for PDF-to-image conversion.	`200`
`max_retries`	`int`	Max retry attempts with validation feedback.	`3`
`temperature`	`float`	Sampling temperature (0.0 = deterministic).	`0.0`
`credentials`	`dict[str, str] \| None`	Override credentials dict (otherwise uses env vars).	`None`
`base_url`	`str \| None`	Custom API base URL for OpenAI-compatible APIs (vLLM, Ollama, etc.).	`None`
`headers`	`dict[str, str] \| None`	Custom HTTP headers for OpenAI-compatible APIs.	`None`
`_extractor`	`AsyncSingleExtractor[T] \| None`	Internal parameter for dependency injection (testing only).	`None`

Returns:

Type	Description
`LLMBatchExtractionResult[T \| dict[str, Any]]`	LLMBatchExtractionResult containing
`LLMBatchExtractionResult[T \| dict[str, Any]]`	per-page PageExtractionResult with data or errors.
`LLMBatchExtractionResult[T \| dict[str, Any]]`	Results are guaranteed to be in the same order as the input pages,
`LLMBatchExtractionResult[T \| dict[str, Any]]`	i.e., `result.results[i]` corresponds to `pages[i]`.

Source code in unifex/llm_factory.py

async def extract_structured_parallel_async(  # noqa: PLR0913
    path: Path | str,
    model: str,
    *,
    schema: type[T] | None = None,
    prompt: str | None = None,
    pages: list[int] | None = None,
    max_workers: int = 4,
    dpi: int = 200,
    max_retries: int = 3,
    temperature: float = 0.0,
    credentials: dict[str, str] | None = None,
    base_url: str | None = None,
    headers: dict[str, str] | None = None,
    _extractor: AsyncSingleExtractor[T] | None = None,
) -> LLMBatchExtractionResult[T | dict[str, Any]]:
    """Async parallel extraction (one page per request).

    Each page is extracted in a separate async request, with concurrency
    limited by max_workers via semaphore. Errors on individual pages
    are captured in the result, not raised.

    Args:
        path: Path to document/image file.
        model: Model identifier (e.g., "openai/gpt-4o", "anthropic/claude-3-5-sonnet").
        schema: Pydantic model for structured output. None for free-form dict.
        prompt: Custom extraction prompt. Auto-generated from schema if None.
        pages: Page numbers to extract from (0-indexed). None for all pages.
        max_workers: Number of concurrent requests (semaphore limit).
        dpi: DPI for PDF-to-image conversion.
        max_retries: Max retry attempts with validation feedback.
        temperature: Sampling temperature (0.0 = deterministic).
        credentials: Override credentials dict (otherwise uses env vars).
        base_url: Custom API base URL for OpenAI-compatible APIs (vLLM, Ollama, etc.).
        headers: Custom HTTP headers for OpenAI-compatible APIs.
        _extractor: Internal parameter for dependency injection (testing only).

    Returns:
        [LLMBatchExtractionResult][unifex.llm.models.LLMBatchExtractionResult] containing
        per-page [PageExtractionResult][unifex.llm.models.PageExtractionResult] with data or errors.
        Results are guaranteed to be in the same order as the input pages,
        i.e., `result.results[i]` corresponds to `pages[i]`.
    """
    path = Path(path) if isinstance(path, str) else path
    provider, model_name = _parse_model_string(model)
    extractor = _extractor or _extract_single_async

    # Get all pages if not specified
    if pages is None:
        from unifex.base import ImageLoader

        loader = ImageLoader(path, dpi=dpi)
        pages = list(range(loader.page_count))
        loader.close()

    # Parallel async execution with semaphore to limit concurrency
    semaphore = asyncio.Semaphore(max_workers)

    async def extract_page(
        page: int,
    ) -> PageExtractionResult[T | dict[str, Any]]:
        async with semaphore:
            try:
                result = await extractor(
                    path,
                    model,
                    schema,
                    prompt,
                    [page],
                    dpi,
                    max_retries,
                    temperature,
                    credentials,
                    base_url,
                    headers,
                )
                return PageExtractionResult(
                    page=page,
                    data=result.data,
                    usage=result.usage,
                )
            except Exception as e:
                return PageExtractionResult(
                    page=page,
                    data=None,
                    error=str(e),
                )

    # Run all extractions concurrently (limited by semaphore)
    tasks = [extract_page(page) for page in pages]
    page_results = await asyncio.gather(*tasks)

    # Aggregate usage
    total_usage: dict[str, int] = {}
    for pr in page_results:
        if pr.usage:
            for key, value in pr.usage.items():
                total_usage[key] = total_usage.get(key, 0) + value

    return LLMBatchExtractionResult(
        results=list(page_results),
        model=model_name,
        provider=provider,
        total_usage=total_usage if total_usage else None,
    )

API Reference¶

Factory Function¶

create_extractor¶

unifex.create_extractor ¶

Extractor Types¶

ExtractorType¶

unifex.ExtractorType ¶

Coordinate Units¶

CoordinateUnit¶

unifex.CoordinateUnit ¶

Executor Types¶

ExecutorType¶

unifex.base.ExecutorType ¶

LLM Extraction¶

extract_structured¶

unifex.llm_factory.extract_structured ¶

extract_structured_async¶

unifex.llm_factory.extract_structured_async async ¶

extract_structured_parallel¶

unifex.llm_factory.extract_structured_parallel ¶

extract_structured_parallel_async¶

unifex.llm_factory.extract_structured_parallel_async async ¶

unifex.llm_factory.extract_structured_async `async` ¶

unifex.llm_factory.extract_structured_parallel_async `async` ¶