Architecture

Project Structure

unifex/
├── cli.py              # Command-line interface
├── coordinates.py      # Coordinate unit conversions
├── models.py           # Core data models
├── text_factory.py     # Unified factory interface
├── base/               # Base classes and models
│   ├── base.py         # BaseExtractor class
│   ├── models.py       # Document, Page, TextBlock, etc.
│   └── coordinates.py  # Coordinate conversion logic
├── pdf/                # PDF text extraction
│   ├── pdf.py          # PdfExtractor implementation
│   └── character_mergers.py  # Text merging strategies
├── ocr/                # OCR extraction
│   ├── adapters/       # External API → internal models
│   │   ├── azure_di.py
│   │   ├── google_docai.py
│   │   ├── easy_ocr.py
│   │   ├── paddle_ocr.py
│   │   └── tesseract_ocr.py
│   └── extractors/     # OCR extractor implementations
│       ├── azure_di.py
│       ├── google_docai.py
│       ├── easy_ocr.py
│       ├── paddle_ocr.py
│       └── tesseract_ocr.py
├── llm/                # LLM-based extraction
│   ├── factory.py      # extract_structured functions
│   ├── models.py       # LLM-specific models
│   ├── adapters/
│   │   └── image_encoder.py
│   └── extractors/
│       ├── anthropic.py
│       ├── openai.py
│       ├── azure_openai.py
│       └── google.py
└── utils/              # Shared utilities
    ├── geometry.py
    └── image_loader.py

Layered Architecture

The project follows a layered architecture, enforced by import-linter. Each layer may import only from the layers below it:

cli.py
   ↓
text_factory.py
   ↓
pdf/, ocr/, llm/
   ↓
base/

Rules

- OCR and LLM are independent: they don't import from each other
- Base has no upward dependencies: it doesn't import from pdf, ocr, llm, or cli
- OCR extractors are independent of each other: each OCR extractor is self-contained
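
The first two rules can be illustrated with a small stand-alone check. This is a minimal sketch, not import-linter itself and not the project's actual configuration: the `LAYERS` list, the `INDEPENDENT` set, and the hand-written `example_imports` graph are all assumptions made for the example, and the third rule (independence between individual OCR extractors) is not covered.

```python
from typing import Optional

# Illustrative only: a tiny layer check, not import-linter itself.
# LAYERS mirrors the diagram above, highest layer first (names are assumptions).
LAYERS = [
    {"unifex.cli"},
    {"unifex.text_factory"},
    {"unifex.pdf", "unifex.ocr", "unifex.llm"},
    {"unifex.base"},
]
# Packages in the same layer that must not import each other.
INDEPENDENT = {"unifex.ocr", "unifex.llm"}


def package_of(module: str) -> Optional[str]:
    """Return the layer package a dotted module name belongs to, if any."""
    for layer in LAYERS:
        for package in layer:
            if module == package or module.startswith(package + "."):
                return package
    return None


def layer_index(package: str) -> int:
    """Position of a package in LAYERS (0 is the top layer)."""
    return next(i for i, layer in enumerate(LAYERS) if package in layer)


def find_violations(imports: dict) -> list:
    """Return (importer, imported) pairs that break the layering rules."""
    violations = []
    for importer, targets in imports.items():
        src = package_of(importer)
        if src is None:
            continue
        for target in targets:
            dst = package_of(target)
            if dst is None or dst == src:
                continue
            upward = layer_index(dst) < layer_index(src)
            sideways = src in INDEPENDENT and dst in INDEPENDENT
            if upward or sideways:
                violations.append((importer, target))
    return violations


# Hand-written import graph for demonstration (hypothetical data).
example_imports = {
    "unifex.text_factory": ["unifex.pdf.pdf", "unifex.ocr.extractors.easy_ocr"],
    "unifex.ocr.extractors.easy_ocr": ["unifex.base.models"],
    "unifex.base.base": ["unifex.pdf.pdf"],                  # upward import
    "unifex.ocr.adapters.azure_di": ["unifex.llm.models"],   # ocr importing llm
}
violations = find_violations(example_imports)
```

In this made-up graph, only the last two entries are reported: `base` importing from `pdf`, and `ocr` importing from `llm`.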

Adapter Pattern

Adapters transform external API responses into internal models:

External API Response → Adapter → Page/TextBlock

This keeps extractors clean and makes it easy to:

- Add new OCR providers
- Update when APIs change
- Test transformations in isolation
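
The adapter idea can be sketched as follows. This is a minimal illustration under stated assumptions: the `TextBlock` and `Page` classes here are simplified stand-ins whose fields are guessed, and the response layout (`lines`, `content`, `box`, `score`) belongs to a made-up provider, not to any real OCR API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TextBlock:
    """Simplified stand-in for the internal TextBlock model (fields assumed)."""
    text: str
    bbox: tuple  # (x0, y0, x1, y1)
    confidence: Optional[float] = None


@dataclass
class Page:
    """Simplified stand-in for the internal Page model (fields assumed)."""
    page: int
    width: float
    height: float
    texts: list = field(default_factory=list)


def adapt_response(response: dict, page_number: int) -> Page:
    """Convert a hypothetical provider response into the internal models."""
    page = Page(page=page_number, width=response["width"], height=response["height"])
    for item in response.get("lines", []):
        # The made-up provider reports boxes as [left, top, width, height].
        left, top, box_width, box_height = item["box"]
        page.texts.append(
            TextBlock(
                text=item["content"],
                bbox=(left, top, left + box_width, top + box_height),
                confidence=item.get("score"),
            )
        )
    return page


# Hypothetical response payload, used only to exercise the adapter.
fake_response = {
    "width": 1000,
    "height": 1400,
    "lines": [{"content": "Invoice", "box": [100, 50, 200, 40], "score": 0.98}],
}
page = adapt_response(fake_response, page_number=0)
```

Because the adapter is a plain function over plain data, it can be tested with a stored response and no network access.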

Extractor Interface

All extractors implement BaseExtractor:

class BaseExtractor:
    def extract(
        self,
        pages: Sequence[int] | None = None,
        max_workers: int = 1,
        executor: ExecutorType = ExecutorType.THREAD,
        **kwargs,
    ) -> ExtractionResult: ...

    async def extract_async(
        self,
        pages: Sequence[int] | None = None,
        max_workers: int = 1,
        **kwargs,
    ) -> ExtractionResult: ...

    def extract_page(self, page: int, **kwargs) -> PageExtractionResult: ...

    def get_page_count(self) -> int: ...

    def close(self) -> None: ...
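
To show how the methods relate, here is a toy extractor that mirrors the method names above. It is a sketch only: `InMemoryExtractor` and `PageResult` are hypothetical, the return type is simplified to a plain list, the `executor` parameter is left out, and treating `pages=None` as "all pages" is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class PageResult:
    """Stand-in for PageExtractionResult (fields assumed)."""
    page: int
    text: str


class InMemoryExtractor:
    """Toy extractor over a list of strings, mirroring the method names above."""

    def __init__(self, pages: list) -> None:
        self._pages = pages

    def get_page_count(self) -> int:
        return len(self._pages)

    def extract_page(self, page: int) -> PageResult:
        # Upper-casing stands in for real extraction work.
        return PageResult(page=page, text=self._pages[page].upper())

    def extract(
        self,
        pages: Optional[Sequence[int]] = None,
        max_workers: int = 1,
    ) -> list:
        # Assumption: None selects every page.
        indices = list(range(self.get_page_count())) if pages is None else list(pages)
        if max_workers <= 1:
            return [self.extract_page(i) for i in indices]
        # Executor.map preserves input order, so results line up with indices.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return list(pool.map(self.extract_page, indices))

    def close(self) -> None:
        self._pages = []


extractor = InMemoryExtractor(["first page", "second page", "third page"])
try:
    results = extractor.extract(pages=[0, 2], max_workers=2)
finally:
    extractor.close()
```

The `try`/`finally` around `extract()` ensures `close()` runs even if a page fails.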

Thread Safety

PDF Extractor

The PDF extractor uses pypdfium2, which is not thread-safe. To make concurrent page extraction safe, each PdfExtractor instance holds its own internal threading.Lock:

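import threading

class PdfExtractor:
    def __init__(self, *args, **kwargs):
        # One lock per instance guards the shared pypdfium2 document handle
        self._lock = threading.Lock()

    def extract_page(self, page: int, **kwargs):
        # Only one thread at a time may touch the PDF handle
        with self._lock:
            pdf_page = self._pdf[page]
            # ... extraction logic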

This means:

- Thread executor works: multiple threads can call extract_page() on the same extractor instance; the lock serializes PDF access, so the locked section handles one page at a time
- Process executor duplicates: each process gets its own PdfExtractor instance with its own PDF handle
- Single extractor, multiple threads: sharing one instance is safe due to the internal locking
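
The effect of the per-instance lock can be demonstrated without pypdfium2. In this sketch, `LockedReader` is a hypothetical stand-in for an extractor that wraps a non-thread-safe handle, and the `max_active` counter exists only to make the serialization observable.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class LockedReader:
    """Stand-in for an extractor wrapping a non-thread-safe handle."""

    def __init__(self, pages: list) -> None:
        self._pages = pages
        self._lock = threading.Lock()
        self._active = 0      # callers currently inside the guarded section
        self.max_active = 0   # highest concurrency observed inside it

    def extract_page(self, page: int) -> str:
        with self._lock:
            self._active += 1
            self.max_active = max(self.max_active, self._active)
            time.sleep(0.01)  # simulate work on the shared handle
            text = self._pages[page]
            self._active -= 1
        return text


reader = LockedReader([f"page {i}" for i in range(8)])
# Four threads share one instance; the lock lets only one in at a time.
with ThreadPoolExecutor(max_workers=4) as pool:
    texts = list(pool.map(reader.extract_page, range(8)))
```

All eight pages come back in order, and `max_active` never exceeds 1. The lock buys safety rather than speed: threads only help for work done outside the guarded section.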

OCR Extractors

All OCR extractors are thread-safe, though for different reasons:

Extractor     Thread-Safe  Notes
EasyOCR       Yes          Model shared across threads
Tesseract     Yes          Subprocess-based
PaddleOCR     Yes          Model shared across threads
Azure DI      Yes          HTTP client is thread-safe
Google DocAI  Yes          gRPC client is thread-safe

LLM Extractors

All LLM extractors are thread-safe because they use HTTP/gRPC clients that handle concurrent requests.

Coordinate System

All coordinates flow through a conversion pipeline:

Native Unit → Points → Output Unit

- PDF uses points natively
- OCR uses pixels natively (at specified DPI)
- Cloud APIs may use normalized coordinates

The CoordinateConverter handles all conversions.
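
The arithmetic behind the pipeline can be sketched as below. This is not the CoordinateConverter API: the function names and unit strings are assumptions made for illustration. The only fixed fact used is that one inch equals 72 points.

```python
POINTS_PER_INCH = 72.0


def pixels_to_points(value_px: float, dpi: float) -> float:
    """Pixels rendered at `dpi` dots per inch -> points."""
    return value_px * POINTS_PER_INCH / dpi


def normalized_to_points(value: float, page_size_pt: float) -> float:
    """Fraction of the page width or height (0..1) -> points."""
    return value * page_size_pt


def points_to_unit(
    value_pt: float,
    unit: str,
    dpi: float = 72.0,
    page_size_pt: float = 1.0,
) -> float:
    """Points -> requested output unit (unit names are hypothetical)."""
    if unit == "points":
        return value_pt
    if unit == "pixels":
        return value_pt * dpi / POINTS_PER_INCH
    if unit == "inches":
        return value_pt / POINTS_PER_INCH
    if unit == "normalized":
        return value_pt / page_size_pt
    raise ValueError(f"unknown unit: {unit}")


# OCR result: x = 300 px on an image rendered at 200 DPI.
x_pt = pixels_to_points(300, dpi=200)      # 300 * 72 / 200 = 108 points
x_in = points_to_unit(x_pt, "inches")      # 108 / 72 = 1.5 inches

# Cloud API result: x = 0.5 of a 612-point-wide (US Letter) page.
x_norm_pt = normalized_to_points(0.5, page_size_pt=612.0)  # 306 points
```

Routing every conversion through points means each native unit needs one conversion to points and each output unit one conversion from points, not a separate function for every pair.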