OCR Extraction¶
unifex supports multiple OCR backends for extracting text from images and scanned PDFs.
Language Codes¶
All OCR extractors use 2-letter ISO 639-1 language codes (e.g., "en", "fr", "de", "it").
Extractors whose underlying engine expects a different format (such as Tesseract, which uses 3-letter codes like "eng") convert these codes internally.
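To make the internal conversion concrete, here is a minimal sketch of how 2-letter codes could be translated for Tesseract. The mapping table and the `to_tesseract_lang` helper are hypothetical and not part of unifex; the 3-letter codes and the `+` separator follow Tesseract's own conventions.

```python
# Hypothetical mapping for illustration; unifex's internal table may differ.
ISO_639_1_TO_TESSERACT = {
    "en": "eng",
    "fr": "fra",
    "de": "deu",
    "it": "ita",
}


def to_tesseract_lang(languages):
    """Convert 2-letter codes to Tesseract's '+'-joined language string."""
    try:
        return "+".join(ISO_639_1_TO_TESSERACT[code] for code in languages)
    except KeyError as exc:
        # Fail loudly instead of passing an unknown code to the engine.
        raise ValueError(f"No Tesseract mapping for language code {exc.args[0]!r}") from None


print(to_tesseract_lang(["en", "de"]))  # eng+deu
```

Callers keep passing `languages=["en"]` everywhere; only the engine-facing string changes.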
Local OCR Engines¶
EasyOCR¶
from unifex import EasyOcrExtractor

# For images
with EasyOcrExtractor("image.png", languages=["en"]) as extractor:
    result = extractor.extract()

# For PDFs (auto-converts to images)
with EasyOcrExtractor("scanned.pdf", languages=["en"], dpi=200) as extractor:
    result = extractor.extract()
Tesseract¶
Requires Tesseract to be installed on the system:
- macOS: brew install tesseract
- Ubuntu: apt-get install tesseract-ocr
- Windows: Download from UB-Mannheim/tesseract
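Because Tesseract is a system dependency, it can help to check for the binary before constructing an extractor. The sketch below uses only the standard library; `tesseract_available` is a hypothetical helper, not a unifex function, and it assumes the executable is named `tesseract` and is on `PATH`.

```python
import shutil


def tesseract_available():
    """Return True if a `tesseract` executable is found on PATH."""
    return shutil.which("tesseract") is not None


if not tesseract_available():
    print("Tesseract not found on PATH; install it before using TesseractOcrExtractor.")
```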
from unifex import TesseractOcrExtractor

# For images
with TesseractOcrExtractor("image.png", languages=["en"]) as extractor:
    result = extractor.extract()

# For PDFs (auto-converts to images)
with TesseractOcrExtractor("scanned.pdf", languages=["en"], dpi=200) as extractor:
    result = extractor.extract()
PaddleOCR¶
PaddleOCR offers excellent accuracy across multiple languages, especially Chinese.
from unifex import PaddleOcrExtractor

# For images
with PaddleOcrExtractor("image.png", lang="en") as extractor:
    result = extractor.extract()

# For PDFs (auto-converts to images)
with PaddleOcrExtractor("scanned.pdf", lang="en", dpi=200) as extractor:
    result = extractor.extract()

# For Chinese text
with PaddleOcrExtractor("chinese_doc.png", lang="ch") as extractor:
    result = extractor.extract()
Cloud OCR Services¶
Azure Document Intelligence¶
from unifex import AzureDocumentIntelligenceExtractor

with AzureDocumentIntelligenceExtractor(
    "document.pdf",
    endpoint="https://your-resource.cognitiveservices.azure.com",
    key="your-api-key",
) as extractor:
    result = extractor.extract()
Or use environment variables:
export UNIFEX_AZURE_DI_ENDPOINT=https://your-resource.cognitiveservices.azure.com
export UNIFEX_AZURE_DI_KEY=your-api-key
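The fallback from explicit arguments to environment variables can be pictured roughly as follows. This is a sketch under assumptions: `resolve_azure_settings` is a hypothetical helper, and the precedence shown (explicit arguments first, then environment) is a guess at how unifex behaves, not a documented guarantee.

```python
import os


def resolve_azure_settings(endpoint=None, key=None):
    """Prefer explicit arguments, then fall back to the environment variables."""
    endpoint = endpoint or os.environ.get("UNIFEX_AZURE_DI_ENDPOINT")
    key = key or os.environ.get("UNIFEX_AZURE_DI_KEY")
    # Collect every missing setting so the error names all of them at once.
    missing = [name for name, value in (("endpoint", endpoint), ("key", key)) if not value]
    if missing:
        raise ValueError(f"Missing Azure settings: {', '.join(missing)}")
    return endpoint, key
```

Keeping the key in the environment rather than in source code also avoids committing credentials by accident.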
Google Document AI¶
from unifex import GoogleDocumentAIExtractor

with GoogleDocumentAIExtractor(
    "document.pdf",
    processor_name="projects/your-project/locations/us/processors/your-processor-id",
    credentials_path="/path/to/service-account.json",
) as extractor:
    result = extractor.extract()
Or use environment variables:
export UNIFEX_GOOGLE_DOCAI_PROCESSOR_NAME=projects/your-project/locations/us/processors/123
export UNIFEX_GOOGLE_DOCAI_CREDENTIALS_PATH=/path/to/credentials.json
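The processor name follows the `projects/.../locations/.../processors/...` layout shown above, so a malformed value can be caught before any request is made. The helper below is a hypothetical validation sketch and not part of unifex.

```python
import re

# Matches the resource-name layout used in the examples above.
PROCESSOR_NAME = re.compile(
    r"^projects/(?P<project>[^/]+)"
    r"/locations/(?P<location>[^/]+)"
    r"/processors/(?P<processor>[^/]+)$"
)


def parse_processor_name(name):
    """Split a processor name into its project, location, and processor parts."""
    match = PROCESSOR_NAME.match(name)
    if match is None:
        raise ValueError(f"Unexpected processor name format: {name!r}")
    return match.groupdict()


print(parse_processor_name("projects/your-project/locations/us/processors/123"))
```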
Parallel Extraction¶
All OCR extractors support parallel page extraction:
from unifex import EasyOcrExtractor

with EasyOcrExtractor("scanned.pdf", languages=["en"]) as extractor:
    result = extractor.extract(max_workers=4)
See Parallel Processing for more details.
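As a rough mental model of what `max_workers` controls, the sketch below fans per-page work out to a thread pool and collects results in page order. It is an illustration only: `ocr_page` and `extract_pages` are hypothetical stand-ins, and unifex may use a different concurrency strategy internally.

```python
from concurrent.futures import ThreadPoolExecutor


def ocr_page(page_number):
    """Placeholder for per-page OCR; returns a fake result for illustration."""
    return f"text of page {page_number}"


def extract_pages(page_numbers, max_workers=4):
    """Run ocr_page over all pages concurrently, preserving page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, even if pages finish out of order.
        return list(pool.map(ocr_page, page_numbers))


print(extract_pages([1, 2, 3]))
```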
Coordinate Units¶
Control the output coordinate system:
from unifex import EasyOcrExtractor, CoordinateUnit

# Pixels (default for OCR, uses DPI for conversion)
with EasyOcrExtractor("image.png", languages=["en"],
                      output_unit=CoordinateUnit.PIXELS, dpi=150) as extractor:
    result = extractor.extract()

# Points (1/72 inch)
with EasyOcrExtractor("image.png", languages=["en"],
                      output_unit=CoordinateUnit.POINTS) as extractor:
    result = extractor.extract()

# Normalized (0-1 range)
with EasyOcrExtractor("image.png", languages=["en"],
                      output_unit=CoordinateUnit.NORMALIZED) as extractor:
    result = extractor.extract()
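The relationship between the three units is simple arithmetic. The sketch below shows the conversions, assuming a point is 1/72 inch and that normalized values are pixel positions divided by the page extent; the function names are hypothetical and not part of unifex.

```python
def pixels_to_points(value_px, dpi):
    """Convert a pixel measurement to points (1 point = 1/72 inch)."""
    return value_px * 72.0 / dpi


def pixels_to_normalized(value_px, extent_px):
    """Convert a pixel position to a 0-1 fraction of the page width or height."""
    return value_px / extent_px


print(pixels_to_points(300, dpi=150))   # 144.0
print(pixels_to_normalized(300, 1200))  # 0.25
```

Normalized coordinates are independent of DPI, which makes them convenient when comparing results rendered at different resolutions.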
Choosing an OCR Engine¶
| Engine | Best For | Speed | Accuracy |
|---|---|---|---|
| EasyOCR | General purpose, many languages | Medium | High |
| Tesseract | Fast processing, good accuracy | Fast | Medium-High |
| PaddleOCR | Chinese text, high accuracy | Medium | Very High |
| Azure DI | Production workloads, tables | Fast | Very High |
| Google DocAI | Production workloads, forms | Fast | Very High |