Architecture
Project Structure
unifex/
├── cli.py                    # Command-line interface
├── coordinates.py            # Coordinate unit conversions
├── models.py                 # Core data models
├── text_factory.py           # Unified factory interface
├── base/                     # Base classes and models
│   ├── base.py               # BaseExtractor class
│   ├── models.py             # Document, Page, TextBlock, etc.
│   └── coordinates.py        # Coordinate conversion logic
├── pdf/                      # PDF text extraction
│   ├── pdf.py                # PdfExtractor implementation
│   └── character_mergers.py  # Text merging strategies
├── ocr/                      # OCR extraction
│   ├── adapters/             # External API → internal models
│   │   ├── azure_di.py
│   │   ├── google_docai.py
│   │   ├── easy_ocr.py
│   │   ├── paddle_ocr.py
│   │   └── tesseract_ocr.py
│   └── extractors/           # OCR extractor implementations
│       ├── azure_di.py
│       ├── google_docai.py
│       ├── easy_ocr.py
│       ├── paddle_ocr.py
│       └── tesseract_ocr.py
├── llm/                      # LLM-based extraction
│   ├── factory.py            # extract_structured functions
│   ├── models.py             # LLM-specific models
│   ├── adapters/
│   │   └── image_encoder.py
│   └── extractors/
│       ├── anthropic.py
│       ├── openai.py
│       ├── azure_openai.py
│       └── google.py
└── utils/                    # Shared utilities
    ├── geometry.py
    └── image_loader.py
Layered Architecture
The project follows a layered architecture, enforced by import-linter contracts.
Rules
- OCR and LLM are independent - They don't import from each other
- Base has no upward dependencies - It doesn't import from pdf, ocr, llm, or cli
- OCR extractors are independent - Each OCR extractor is self-contained and does not import from the other OCR extractors
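To make the first two rules concrete, here is a minimal sketch of what a layering check amounts to. It is not how import-linter works internally and not the project's actual configuration; the `FORBIDDEN` table and both helper functions are hypothetical names invented for this illustration.

```python
import ast

# Hypothetical rule table mirroring the prose above: each package maps to
# the packages it must not import from.
FORBIDDEN = {
    "unifex.ocr": ("unifex.llm",),
    "unifex.llm": ("unifex.ocr",),
    "unifex.base": ("unifex.pdf", "unifex.ocr", "unifex.llm", "unifex.cli"),
}


def imported_modules(source: str) -> set:
    """Collect absolute module names imported by a piece of source code.

    Limitation of this sketch: `from unifex import llm` is recorded as
    `unifex`, so it is not detected; relative imports are ignored.
    """
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            found.add(node.module)
    return found


def violations(module_name: str, source: str) -> list:
    """Return the forbidden imports that `source` makes as `module_name`."""
    bad = []
    for package, banned in FORBIDDEN.items():
        if module_name == package or module_name.startswith(package + "."):
            for imported in sorted(imported_modules(source)):
                if any(imported == b or imported.startswith(b + ".") for b in banned):
                    bad.append(imported)
    return bad


# An OCR extractor importing from the LLM package breaks the first rule.
print(violations("unifex.ocr.extractors.easy_ocr", "from unifex.llm.models import X"))
```

In practice the real check is left to import-linter; the sketch only shows which import directions the rules rule out.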
Adapter Pattern
Adapters transform external API responses into internal models.
This keeps extractors clean and makes it easy to:
- Add new OCR providers
- Update when APIs change
- Test transformations in isolation
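As a rough illustration of the pattern, the sketch below converts a provider payload into an internal model. Both the payload shape and the `TextBlock` fields are assumptions made up for this example; the real adapters and models in `unifex` may look quite different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextBlock:
    """Simplified stand-in for the internal model; the real fields may differ."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float
    confidence: float


def adapt_ocr_result(raw: list) -> list:
    """Convert a hypothetical provider payload into TextBlock objects.

    Assumed payload shape (invented for this example):
    {"text": str, "box": [[x, y], [x, y], [x, y], [x, y]], "score": float}
    """
    blocks = []
    for item in raw:
        xs = [point[0] for point in item["box"]]
        ys = [point[1] for point in item["box"]]
        # Reduce the four-corner polygon to an axis-aligned bounding box.
        blocks.append(
            TextBlock(
                text=item["text"],
                x0=min(xs),
                y0=min(ys),
                x1=max(xs),
                y1=max(ys),
                confidence=item["score"],
            )
        )
    return blocks


payload = [{"text": "Hi", "box": [[10, 20], [50, 20], [50, 40], [10, 40]], "score": 0.9}]
print(adapt_ocr_result(payload))
```

Because the adapter is a pure function over plain data, it can be tested with recorded payloads and without calling the external service.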
Extractor Interface
All extractors implement BaseExtractor:
class BaseExtractor:
    def extract(
        self,
        pages: Sequence[int] | None = None,
        max_workers: int = 1,
        executor: ExecutorType = ExecutorType.THREAD,
        **kwargs,
    ) -> ExtractionResult: ...

    async def extract_async(
        self,
        pages: Sequence[int] | None = None,
        max_workers: int = 1,
        **kwargs,
    ) -> ExtractionResult: ...

    def extract_page(self, page: int, **kwargs) -> PageExtractionResult: ...

    def get_page_count(self) -> int: ...

    def close(self) -> None: ...
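The toy class below sketches how `extract()` could fan out over `extract_page()` using a thread pool. It does not subclass the real `BaseExtractor`. `InMemoryExtractor` and `PageResult` are hypothetical stand-ins, and two behaviours are assumptions rather than documented facts: that `pages=None` means "all pages" and that page indices are zero-based.

```python
from collections.abc import Sequence
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageResult:
    """Stand-in for PageExtractionResult; the real model may differ."""
    page: int
    text: str


class InMemoryExtractor:
    """Toy extractor over a list of strings, one string per page."""

    def __init__(self, pages: list) -> None:
        self._pages = pages

    def get_page_count(self) -> int:
        return len(self._pages)

    def extract_page(self, page: int) -> PageResult:
        # Assumption: page indices are zero-based.
        return PageResult(page=page, text=self._pages[page])

    def extract(
        self,
        pages: Optional[Sequence[int]] = None,
        max_workers: int = 1,
    ) -> list:
        # Assumption: None selects every page.
        selected = list(pages) if pages is not None else list(range(self.get_page_count()))
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # map() yields results in input order, so output matches `selected`.
            return list(pool.map(self.extract_page, selected))

    def close(self) -> None:
        self._pages = []


extractor = InMemoryExtractor(["first", "second", "third"])
print([result.text for result in extractor.extract(max_workers=2)])
extractor.close()
```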
Thread Safety
PDF Extractor
The PDF extractor uses pypdfium2, which is not thread-safe. To enable parallel page extraction, PdfExtractor uses an internal threading.Lock per instance:
class PdfExtractor:
    def __init__(self, ...):
        self._lock = threading.Lock()

    def extract_page(self, page: int, ...):
        with self._lock:
            pdf_page = self._pdf[page]
            # ... extraction logic
This means:
- Thread executor works - Multiple threads can call extract_page() on the same extractor instance; the lock serializes PDF access
- Process executor duplicates - Each process gets its own PdfExtractor instance with its own PDF handle
- Single extractor, multiple threads - Safe due to internal locking
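The effect of the per-instance lock can be sketched with a fake document handle that counts overlapping accesses. `FakeDocument` and `LockedExtractor` are hypothetical names for this illustration; neither uses pypdfium2 or the real `PdfExtractor`.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class FakeDocument:
    """Stand-in for a non-thread-safe handle; it counts overlapping reads."""

    def __init__(self) -> None:
        self._busy = False
        self.overlaps = 0

    def read(self, page: int) -> str:
        if self._busy:
            # Another thread is inside read() at the same time.
            self.overlaps += 1
        self._busy = True
        time.sleep(0.001)  # widen the window in which a race could occur
        self._busy = False
        return f"page {page}"


class LockedExtractor:
    """Serializes access to the shared handle with one lock per instance."""

    def __init__(self, document: FakeDocument) -> None:
        self._document = document
        self._lock = threading.Lock()

    def extract_page(self, page: int) -> str:
        with self._lock:
            return self._document.read(page)


document = FakeDocument()
extractor = LockedExtractor(document)
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(extractor.extract_page, range(8)))

print(results[0], document.overlaps)
```

Because every read happens while the lock is held, the overlap counter stays at zero no matter how many worker threads are used. The trade-off is that access to the handle itself is serialized, so threads only help with work done outside the lock.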
OCR Extractors
All OCR extractors are thread-safe, though for different reasons:
| Extractor | Thread-Safe | Notes |
|---|---|---|
| EasyOCR | Yes | Model shared across threads |
| Tesseract | Yes | Subprocess-based |
| PaddleOCR | Yes | Model shared across threads |
| Azure DI | Yes | HTTP client is thread-safe |
| Google DocAI | Yes | gRPC client is thread-safe |
LLM Extractors
All LLM extractors are thread-safe because they use HTTP/gRPC clients that support concurrent requests.
Coordinate System
All coordinates flow through a conversion pipeline:
- PDF uses points natively
- OCR uses pixels natively (at specified DPI)
- Cloud APIs may use normalized coordinates
The CoordinateConverter handles all conversions.
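The arithmetic behind these conversions is simple. The helpers below are a minimal sketch, not the `CoordinateConverter` API: the function names are hypothetical and assume that one point is 1/72 inch and that normalized values are fractions of the page extent in the range 0 to 1.

```python
POINTS_PER_INCH = 72.0  # one PDF point is 1/72 of an inch


def points_to_pixels(value: float, dpi: float) -> float:
    """Scale a length in points to pixels at the given rendering DPI."""
    return value * dpi / POINTS_PER_INCH


def pixels_to_points(value: float, dpi: float) -> float:
    """Scale a length in pixels (rendered at `dpi`) back to points."""
    return value * POINTS_PER_INCH / dpi


def normalized_to_points(value: float, page_extent_pt: float) -> float:
    """Map a 0..1 fraction of the page width or height to points."""
    return value * page_extent_pt


# A US Letter page is 612 points wide; at 300 DPI that is 2550 pixels.
print(points_to_pixels(612, 300))
print(normalized_to_points(0.5, 612))
```

The sketch covers unit scaling only. It does not handle axis orientation or page rotation, which a full converter may also need to account for.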