unifex

A Python library for document text extraction with local and cloud OCR solutions.

Overview

unifex is built for tasks such as fraud detection, where extraction precision matters. It offers a single interface for both PDF and image processing, with OCR available through local engines or cloud services.

Key Features

Multiple OCR Backends: Local (EasyOCR, Tesseract, PaddleOCR) and cloud (Azure Document Intelligence, Google Document AI)
PDF Text Extraction: Native PDF text extraction using pypdfium2
LLM Extraction: Extract structured data using GPT-4o, Claude, Gemini, or OpenAI-compatible APIs
Parallel Extraction: Process multiple pages concurrently with thread or process executors
Async Support: Native async/await API for integration with async applications
Unified Extractors: Each OCR extractor auto-detects file type (PDF vs image) and handles conversion internally
Pydantic Models: Type-safe document representation with pydantic v1/v2 compatibility
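
To make the "auto-detects file type" idea concrete, here is a minimal, library-independent sketch of one common way to tell PDFs from images: inspecting the leading "magic" bytes of the file. This is an illustration only, not unifex's actual implementation; the names `sniff_kind` and `detect_file_kind` are hypothetical, and the signature table covers only a few formats.

```python
# Illustrative sketch only: NOT unifex's implementation.
# Leading "magic" bytes for a few common formats (deliberately incomplete).
_SIGNATURES = {
    b"%PDF-": "pdf",
    b"\x89PNG\r\n\x1a\n": "image",  # PNG
    b"\xff\xd8\xff": "image",       # JPEG
    b"II*\x00": "image",            # TIFF, little-endian
    b"MM\x00*": "image",            # TIFF, big-endian
}


def sniff_kind(header: bytes) -> str:
    """Classify a file header as 'pdf', 'image', or 'unknown'."""
    for magic, kind in _SIGNATURES.items():
        if header.startswith(magic):
            return kind
    return "unknown"


def detect_file_kind(path: str) -> str:
    """Read the first bytes of a file and classify it (hypothetical helper)."""
    with open(path, "rb") as fh:
        return sniff_kind(fh.read(8))
```

Checking content rather than the file extension means a mislabeled file (for example, a PDF saved as `.jpg`) would still be routed correctly. A real extractor would likely need to handle more formats than this sketch does.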

Quick Example

from unifex import create_extractor, ExtractorType

# PDF extraction (native text)
with create_extractor("document.pdf", ExtractorType.PDF) as extractor:
    result = extractor.extract()
    print(f"Extracted {len(result.document.pages)} pages")
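
The parallel extraction feature can be pictured with a generic sketch using only Python's standard library. It is not the unifex API: `extract_page` is a hypothetical stand-in for whatever per-page OCR or text extraction is being performed, and `extract_pages` and `max_workers` are names chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_page(page_number: int) -> str:
    """Hypothetical stand-in for per-page OCR or text extraction."""
    return f"text of page {page_number}"


def extract_pages(page_numbers, max_workers=4):
    """Run extract_page over pages concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, not completion order.
        return list(pool.map(extract_page, page_numbers))


if __name__ == "__main__":
    print(extract_pages([1, 2, 3]))
```

Threads tend to suit I/O-bound work such as calls to cloud OCR services. For CPU-bound local OCR, `ProcessPoolExecutor` exposes the same `map` interface and avoids contention on the interpreter lock, at the cost of requiring picklable arguments and results.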

Alternatives

For broader document processing, consider Docling or Kreuzberg.

License

BSD 3-Clause License. See LICENSE for details.