Testing

Running Tests

# Run all tests
uv run pytest

# Run fast tests only (unit tests, <0.5s per test)
uv run pytest tests/base tests/ocr tests/llm

# Run integration tests only (slow, load ML models)
uv run pytest tests/integration

# Run with coverage
uv run pytest --cov=unifex --cov-report=term-missing

Test Structure

tests/
├── base/           # Fast unit tests (<0.5s each) - run in pre-commit
├── ocr/            # OCR adapter unit tests (mocked) - run in pre-commit
├── llm/            # LLM unit tests (mocked) - run in pre-commit
└── integration/    # Slow tests - NOT in pre-commit
    ├── ocr/        # OCR integration tests (load real ML models)
    └── llm/        # LLM integration tests (call real APIs)

Pre-commit runs: tests/base, tests/ocr, and tests/llm, with a 0.5s timeout per test.

CI runs: all tests, including the integration tests.

Test Data

Test files are located in tests/data/:

test_pdf_2p_text.pdf - 2-page PDF with text
test_pdf_2p_text_rotated.pdf - 2-page PDF with rotated text
test_pdf_table.pdf - PDF with tables
test_image.png - Test image for OCR
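
Tests usually need a reliable way to resolve these files. A minimal sketch of a lookup helper, assuming tests run from the repository root; `DATA_DIR`, `data_file`, and the `data_dir` parameter are hypothetical names for illustration, not part of the project:

```python
from pathlib import Path

# Assumed location of the test data, relative to the repository root.
DATA_DIR = Path("tests") / "data"


def data_file(name: str, data_dir: Path = DATA_DIR) -> Path:
    """Return the path to a test data file, failing loudly if it is missing."""
    path = data_dir / name
    if not path.is_file():
        raise FileNotFoundError(f"Test data file not found: {path}")
    return path
```

A test could then call `data_file("test_pdf_2p_text.pdf")` and get a clear error for a mistyped file name instead of a confusing extractor failure.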

Integration Tests

Integration tests load real ML models and call real services.

Local Extractors (No Credentials Required)

PdfExtractor - Tests PDF text extraction
EasyOcrExtractor - Tests image and PDF OCR with EasyOCR
TesseractOcrExtractor - Tests image and PDF OCR with Tesseract
PaddleOcrExtractor - Tests image and PDF OCR with PaddleOCR

Cloud Extractors (Require Credentials)

Tests are automatically skipped if credentials are not configured.
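
One way such a skip could be decided is by checking the environment for the variables named in the setup sections of this page. This is a sketch under that assumption, not the project's actual mechanism; `missing_credentials` and the two tuples are hypothetical names:

```python
import os
from typing import Mapping, Sequence

# Variable names taken from the Azure and Google setup sections of this page.
AZURE_VARS = ("UNIFEX_AZURE_DI_ENDPOINT", "UNIFEX_AZURE_DI_KEY")
GOOGLE_VARS = (
    "UNIFEX_GOOGLE_DOCAI_PROCESSOR_NAME",
    "UNIFEX_GOOGLE_DOCAI_CREDENTIALS_PATH",
)


def missing_credentials(
    required: Sequence[str], env: Mapping[str, str] = os.environ
) -> list[str]:
    """Return the required variables that are unset or empty in env."""
    return [name for name in required if not env.get(name)]
```

In a test module, the result could drive a marker such as `pytest.mark.skipif(bool(missing_credentials(AZURE_VARS)), reason="Azure credentials not configured")`.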

Azure Setup

cp .env.example .env
# Edit .env with your credentials:
# UNIFEX_AZURE_DI_ENDPOINT=https://your-resource.cognitiveservices.azure.com
# UNIFEX_AZURE_DI_KEY=your-api-key

# Run tests
set -a; source .env; set +a  # export every variable defined in .env (tolerates comments and quoted values)
uv run pytest tests/integration -v

Google Setup

# Edit .env:
# UNIFEX_GOOGLE_DOCAI_PROCESSOR_NAME=projects/your-project/locations/us/processors/123
# UNIFEX_GOOGLE_DOCAI_CREDENTIALS_PATH=/path/to/service-account.json

set -a; source .env; set +a  # export every variable defined in .env (tolerates comments and quoted values)
uv run pytest tests/integration -v
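
As an alternative to exporting the variables in the shell, the same file can be read from Python, for example in a local script. A minimal sketch, assuming plain `KEY=VALUE` lines; `load_env_file` is a hypothetical helper, not part of unifex, and it does not handle multi-line values or `export` prefixes:

```python
import os
from pathlib import Path
from typing import MutableMapping


def load_env_file(
    path: str, env: MutableMapping[str, str] = os.environ
) -> dict[str, str]:
    """Copy KEY=VALUE pairs from a .env file into env and return them."""
    loaded: dict[str, str] = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        loaded[key] = value
    env.update(loaded)
    return loaded
```

Commented-out entries, such as the placeholder lines shown in the setup snippets, are ignored, so only values that were actually filled in reach the environment.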

TDD Workflow

This project follows Test-Driven Development:

Red - Write a failing test first
Green - Write minimal code to pass the test
Refactor - Clean up while keeping tests green
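
The three steps above can be sketched on a toy example. `normalize_whitespace` is a hypothetical function invented for illustration; it is not part of unifex:

```python
# Step 1 (Red): write the test first. Run at this point, it fails
# because normalize_whitespace is not defined yet.
def test_normalize_whitespace_collapses_runs():
    assert normalize_whitespace("two  page\n PDF ") == "two page PDF"


# Step 2 (Green): the smallest implementation that makes the test pass.
def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace into one space and trim the ends."""
    return " ".join(text.split())


# Step 3 (Refactor): restructure freely, re-running the test after each change.
test_normalize_whitespace_collapses_runs()
```

In the real suite the test would live under tests/base and be collected by pytest rather than called directly.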

VCR Cassettes

For API tests, we use VCR to record HTTP interactions:

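import pytest


@pytest.mark.vcr()
def test_api_call():
    # The first run performs the real HTTP calls and records them to a cassette.
    # Subsequent runs replay the cassette instead of hitting the network.
    ...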

Cassettes are stored in tests/cassettes/.
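
Because cassettes are committed alongside the tests, credentials should be scrubbed before recording. A minimal sketch of the options this might use, assuming a pytest plugin such as pytest-recording or pytest-vcr that reads a `vcr_config` fixture; the header names are assumptions about what the cloud APIs send, and in a real `conftest.py` the function would carry a `@pytest.fixture` decorator:

```python
# Header names assumed to carry credentials; adjust to the APIs under test.
SENSITIVE_HEADERS = ("authorization", "Ocp-Apim-Subscription-Key")


def vcr_config() -> dict:
    """Return VCR options that replace credential headers in cassettes."""
    return {
        # Each (name, replacement) pair overwrites that header's value
        # before the request is written to the cassette.
        "filter_headers": [(name, "REDACTED") for name in SENSITIVE_HEADERS],
    }
```

With this in place, a recorded cassette would contain `REDACTED` instead of the real key, so it can be reviewed and committed safely.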