LLM Extraction¶

Extract structured data from documents using vision-capable LLMs.

Supported Providers¶

OpenAI: GPT-4o, GPT-4o-mini
Anthropic: Claude Sonnet, Claude Opus
Google: Gemini Pro, Gemini Flash
Azure OpenAI: GPT-4o via Azure
OpenAI-Compatible: vLLM, Ollama, and other compatible APIs

Basic Usage¶

Free-form Extraction¶

from unifex.llm import extract_structured

result = extract_structured(
    "document.pdf",
    model="openai/gpt-4o",
)
print(result.data)

With Custom Prompt¶

from unifex.llm import extract_structured

result = extract_structured(
    "image.png",
    model="anthropic/claude-sonnet-4-20250514",
    prompt="Extract all visible text from this image",
)

Structured Extraction with Pydantic¶

Define a Pydantic model for type-safe structured output:

from pydantic import BaseModel
from unifex.llm import extract_structured

class DocumentContent(BaseModel):
    title: str | None
    paragraphs: list[str]

result = extract_structured(
    "document.pdf",
    model="openai/gpt-4o",
    schema=DocumentContent,
)
content: DocumentContent = result.data
print(f"Found {len(content.paragraphs)} paragraphs")

OpenAI-Compatible APIs¶

Use custom base URLs for self-hosted or alternative APIs:

from unifex.llm import extract_structured

# vLLM server
result = extract_structured(
    "document.pdf",
    model="openai/meta-llama/Llama-3.2-90B-Vision-Instruct",
    base_url="http://localhost:8000/v1",
)

# Ollama
result = extract_structured(
    "image.png",
    model="openai/llava",
    base_url="http://localhost:11434/v1",
)

# With custom headers
result = extract_structured(
    "document.pdf",
    model="openai/gpt-4o",
    base_url="https://your-proxy.com/v1",
    headers={"X-Custom-Auth": "your-token"},
)

Parallel Extraction¶

Process multiple pages in parallel for faster extraction using extract_structured_parallel:

from unifex.llm import extract_structured, extract_structured_parallel

# Sequential: all pages sent in one request
result = extract_structured("document.pdf", model="openai/gpt-4o")
# result.data is the extracted data

# Parallel: each page processed separately with 4 concurrent workers
batch_result = extract_structured_parallel(
    "document.pdf",
    model="openai/gpt-4o",
    max_workers=4,
)
# batch_result.results is a list of PageExtractionResult
# batch_result.total_usage contains aggregated token usage

for page_result in batch_result.results:
    if page_result.error:
        print(f"Page {page_result.page} failed: {page_result.error}")
    else:
        print(f"Page {page_result.page}: {page_result.data}")

Results are guaranteed to be in the same order as input pages.

Async API¶

import asyncio
from unifex.llm import extract_structured_async, extract_structured_parallel_async

async def extract():
    # Single request
    result = await extract_structured_async(
        "document.pdf",
        model="openai/gpt-4o",
    )
    return result.data

async def extract_parallel():
    # Parallel requests
    batch_result = await extract_structured_parallel_async(
        "document.pdf",
        model="openai/gpt-4o",
        max_workers=4,
    )
    return batch_result.results

data = asyncio.run(extract())

Environment Variables¶

Variable	Description
`OPENAI_API_KEY`	OpenAI API key
`ANTHROPIC_API_KEY`	Anthropic API key
`GOOGLE_API_KEY`	Google AI API key
`AZURE_OPENAI_API_KEY`	Azure OpenAI API key
`AZURE_OPENAI_ENDPOINT`	Azure OpenAI endpoint URL
`AZURE_OPENAI_API_VERSION`	Azure OpenAI API version