OCR Extraction

unifex supports multiple OCR backends for extracting text from images and scanned PDFs.

Language Codes

All OCR extractors accept 2-letter ISO 639-1 language codes (e.g., "en", "fr", "de", "it"). Extractors whose underlying engine expects a different format (Tesseract, for example, uses 3-letter codes such as "eng") convert internally.
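To make the internal conversion concrete, here is a minimal sketch of how 2-letter codes could be mapped to Tesseract's 3-letter codes. The `ISO_TO_TESSERACT` table and the `to_tesseract_lang` helper are hypothetical names invented for illustration; unifex's own mapping is not shown in this page and may differ.

```python
# Hypothetical mapping from ISO 639-1 codes to Tesseract's 3-letter codes.
# Illustrative only: this is not unifex's internal table.
ISO_TO_TESSERACT = {
    "en": "eng",
    "fr": "fra",
    "de": "deu",
    "it": "ita",
}


def to_tesseract_lang(codes):
    """Convert a list of 2-letter codes to a Tesseract language string.

    Tesseract accepts several languages joined with '+', e.g. "eng+fra".
    """
    try:
        return "+".join(ISO_TO_TESSERACT[code] for code in codes)
    except KeyError as exc:
        raise ValueError(f"unsupported language code: {exc.args[0]}") from None
```

For example, `to_tesseract_lang(["en", "fr"])` yields the string Tesseract expects for English plus French, while an unknown code raises `ValueError` instead of failing later inside the engine.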

Local OCR Engines

EasyOCR

from unifex import EasyOcrExtractor

# For images
with EasyOcrExtractor("image.png", languages=["en"]) as extractor:
    result = extractor.extract()

# For PDFs (auto-converts to images)
with EasyOcrExtractor("scanned.pdf", languages=["en"], dpi=200) as extractor:
    result = extractor.extract()

Tesseract

Requires Tesseract to be installed on the system:

macOS: brew install tesseract
Ubuntu: apt-get install tesseract-ocr
Windows: Download from UB-Mannheim/tesseract

from unifex import TesseractOcrExtractor

# For images
with TesseractOcrExtractor("image.png", languages=["en"]) as extractor:
    result = extractor.extract()

# For PDFs (auto-converts to images)
with TesseractOcrExtractor("scanned.pdf", languages=["en"], dpi=200) as extractor:
    result = extractor.extract()
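Because Tesseract is an external system dependency, it can help to confirm the binary is on `PATH` before constructing the extractor. The sketch below uses only the standard library; `tesseract_available` is a hypothetical helper, and it assumes the executable is named `tesseract`, which may not hold for every install.

```python
import shutil


def tesseract_available(binary="tesseract"):
    """Return True if the named executable can be found on PATH."""
    return shutil.which(binary) is not None


if __name__ == "__main__":
    if not tesseract_available():
        print("Tesseract not found on PATH; install it before running OCR.")
```

A check like this gives a clear message up front rather than an engine error partway through extraction.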

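PaddleOCR

PaddleOCR offers strong accuracy across multiple languages, especially Chinese. Unlike the other local extractors, it takes a single lang string rather than a languages list.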

from unifex import PaddleOcrExtractor

# For images
with PaddleOcrExtractor("image.png", lang="en") as extractor:
    result = extractor.extract()

# For PDFs (auto-converts to images)
with PaddleOcrExtractor("scanned.pdf", lang="en", dpi=200) as extractor:
    result = extractor.extract()

# For Chinese text
with PaddleOcrExtractor("chinese_doc.png", lang="ch") as extractor:
    result = extractor.extract()

Cloud OCR Services

Azure Document Intelligence

from unifex import AzureDocumentIntelligenceExtractor

with AzureDocumentIntelligenceExtractor(
    "document.pdf",
    endpoint="https://your-resource.cognitiveservices.azure.com",
    key="your-api-key",
) as extractor:
    result = extractor.extract()

Or use environment variables:

export UNIFEX_AZURE_DI_ENDPOINT=https://your-resource.cognitiveservices.azure.com
export UNIFEX_AZURE_DI_KEY=your-api-key
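If you want to see which settings are in effect, or fail early when one is missing, the variables can be read explicitly. This is a minimal sketch: `azure_di_settings` is a hypothetical helper, and the precedence shown (explicit argument first, then environment) is an assumption, not a description of how unifex resolves its configuration.

```python
import os


def azure_di_settings(endpoint=None, key=None):
    """Resolve Azure settings: explicit arguments win, then environment variables.

    Hypothetical helper; the variable names are the ones documented above.
    """
    endpoint = endpoint or os.environ.get("UNIFEX_AZURE_DI_ENDPOINT")
    key = key or os.environ.get("UNIFEX_AZURE_DI_KEY")
    missing = [name for name, value in (("endpoint", endpoint), ("key", key)) if not value]
    if missing:
        raise ValueError("missing Azure settings: " + ", ".join(missing))
    return {"endpoint": endpoint, "key": key}
```

The returned dictionary can be unpacked into the constructor, for example `AzureDocumentIntelligenceExtractor("document.pdf", **azure_di_settings())`.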

Google Document AI

from unifex import GoogleDocumentAIExtractor

with GoogleDocumentAIExtractor(
    "document.pdf",
    processor_name="projects/your-project/locations/us/processors/your-processor-id",
    credentials_path="/path/to/service-account.json",
) as extractor:
    result = extractor.extract()

Or use environment variables:

export UNIFEX_GOOGLE_DOCAI_PROCESSOR_NAME=projects/your-project/locations/us/processors/123
export UNIFEX_GOOGLE_DOCAI_CREDENTIALS_PATH=/path/to/credentials.json

Parallel Extraction

All OCR extractors support parallel page extraction through the max_workers argument of extract():

from unifex import EasyOcrExtractor

with EasyOcrExtractor("scanned.pdf", languages=["en"]) as extractor:
    result = extractor.extract(max_workers=4)

See Parallel Processing for more details.
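As a rough mental model of what per-page parallelism looks like, the sketch below fans pages out to a worker pool with the standard library. It is not unifex's implementation (this page does not say whether threads or processes are used); `ocr_page` is a stand-in that returns placeholder text, and `extract_pages` is a hypothetical name.

```python
from concurrent.futures import ThreadPoolExecutor


def ocr_page(page_number):
    """Stand-in for a per-page OCR call; returns placeholder text."""
    return f"text of page {page_number}"


def extract_pages(page_numbers, max_workers=4):
    """Process pages concurrently and return results in input order.

    Executor.map() preserves the order of its input, so page order
    is kept even when pages finish at different times.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(ocr_page, page_numbers))
```

Raising `max_workers` only helps while there are more pages than workers and the machine has spare capacity, so a value near the core count is a reasonable starting point.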

Coordinate Units

Use the output_unit parameter to control the coordinate system of the output:

from unifex import EasyOcrExtractor, CoordinateUnit

# Pixels (default for OCR, uses DPI for conversion)
with EasyOcrExtractor("image.png", languages=["en"],
                       output_unit=CoordinateUnit.PIXELS, dpi=150) as extractor:
    result = extractor.extract()

# Points (1/72 inch)
with EasyOcrExtractor("image.png", languages=["en"],
                       output_unit=CoordinateUnit.POINTS) as extractor:
    result = extractor.extract()

# Normalized (0-1 range)
with EasyOcrExtractor("image.png", languages=["en"],
                       output_unit=CoordinateUnit.NORMALIZED) as extractor:
    result = extractor.extract()
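The relationship between the three units can be written down directly. This sketch assumes the usual definitions (a point is 1/72 inch, and normalized coordinates divide by the page extent along the same axis); the function names are hypothetical and are not part of unifex.

```python
def pixels_to_points(value_px, dpi):
    """Convert pixels to points: dpi pixels per inch, 72 points per inch."""
    return value_px * 72.0 / dpi


def pixels_to_normalized(value_px, extent_px):
    """Convert a pixel coordinate to the 0-1 range along one axis.

    extent_px is the page width (for x) or height (for y) in pixels.
    """
    return value_px / extent_px
```

At 150 DPI, for instance, 150 pixels span one inch, so `pixels_to_points(150, 150)` gives 72 points. This is why a DPI value matters when moving between pixels and points.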

Choosing an OCR Engine

| Engine | Best For | Speed | Accuracy |
|--------|----------|-------|----------|
| EasyOCR | General purpose, many languages | Medium | High |
| Tesseract | Fast processing, good accuracy | Fast | Medium-High |
| PaddleOCR | Chinese text, high accuracy | Medium | Very High |
| Azure DI | Production workloads, tables | Fast | Very High |
| Google DocAI | Production workloads, forms | Fast | Very High |