# Extractors Reference

## PDF Extractor

### PdfExtractor

Native PDF text extraction using pypdfium2.

#### unifex.pdf.PdfExtractor

Bases: `BaseExtractor`

Extract text and metadata from PDF files using pypdfium2.

Source code in `unifex/pdf/pdf.py`
#### extract_page

Extract a single page by number (0-indexed).

Thread-safe: uses an internal lock for parallel access.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `page` | `int` | Page number (0-indexed). | *required* |
| `table_options` | `dict[str, Any] \| None` | Optional dict of tabula options for table extraction. If provided, tables are extracted and added to `Page.tables`. Common options: `lattice`, `stream`, `columns`, `area`, `guess`, `multiple_tables`. | `None` |
Source code in unifex/pdf/pdf.py
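A minimal usage sketch. The file name is a placeholder, and the assumption that `PdfExtractor` is constructed from a file path mirrors the OCR extractors documented below; the call is guarded so the snippet degrades gracefully when unifex or the sample file is unavailable.

```python
# Hedged sketch: "report.pdf" is a placeholder, and the assumption that
# PdfExtractor is constructed from a file path mirrors the OCR extractors.
page_number = 0  # extract_page is 0-indexed

try:
    from unifex.pdf import PdfExtractor

    extractor = PdfExtractor("report.pdf")
    page = extractor.extract_page(page_number)
    # Passing table_options also populates Page.tables for this page:
    page_with_tables = extractor.extract_page(
        page_number, table_options={"lattice": True}
    )
except Exception:
    pass  # unifex not installed or sample file missing
```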
#### extract_tables

```python
extract_tables(
    pages: Sequence[int] | None = None,
    table_options: dict[str, Any] | None = None,
) -> list[Table]
```
Extract tables from PDF pages using tabula.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `pages` | `Sequence[int] \| None` | Sequence of page numbers to extract (0-indexed). If None, extracts from all pages. | `None` |
| `table_options` | `dict[str, Any] \| None` | Dict of tabula options. Common options: `lattice` (bool): use lattice mode for tables with cell borders; `stream` (bool): use stream mode for tables without borders; `columns` (list[float]): column x-coordinates for splitting; `area` (tuple[float, float, float, float]): (top, left, bottom, right); `guess` (bool): guess table areas automatically; `multiple_tables` (bool): extract multiple tables per page; `pandas_options` (dict): options for pandas. | `None` |

Returns:

| Type | Description |
|---|---|
| `list[Table]` | List of `Table` objects with `page` field indicating the source page. |
Source code in unifex/pdf/pdf.py
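A hedged sketch of table extraction; the file name is a placeholder and the library call is guarded so the snippet runs even where unifex is not installed.

```python
# Hedged sketch of extract_tables; "invoice.pdf" is a placeholder file.
table_options = {
    "lattice": True,          # tables delimited by ruled cell borders
    "multiple_tables": True,  # allow several tables per page
}

try:
    from unifex.pdf import PdfExtractor

    extractor = PdfExtractor("invoice.pdf")
    tables = extractor.extract_tables(pages=[0, 1], table_options=table_options)
    for table in tables:
        print(table.page)  # each Table records its source page
except Exception:
    pass  # unifex not installed or sample file missing
```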
## Local OCR Extractors

### EasyOcrExtractor

OCR using the EasyOCR library.

#### unifex.ocr.extractors.easy_ocr.EasyOcrExtractor

Bases: `BaseExtractor`

Extract text from images or PDFs using EasyOCR.

Composes ImageLoader for image handling, EasyOCR for OCR processing, and EasyOCRAdapter for result conversion.

Source code in `unifex/ocr/extractors/easy_ocr.py`
#### `__init__`

```python
__init__(
    path: Path | str,
    languages: list[str] | None = None,
    gpu: bool = False,
    dpi: int = 200,
    output_unit: CoordinateUnit = CoordinateUnit.POINTS,
) -> None
```
Initialize EasyOCR extractor.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path \| str` | Path to the image or PDF file (Path object or string). | *required* |
| `languages` | `list[str] \| None` | List of language codes for OCR. Defaults to `["en"]`. | `None` |
| `gpu` | `bool` | Whether to use GPU acceleration. | `False` |
| `dpi` | `int` | DPI for PDF-to-image conversion. Default 200. | `200` |
| `output_unit` | `CoordinateUnit` | Coordinate unit for output. Default POINTS. | `POINTS` |
Source code in unifex/ocr/extractors/easy_ocr.py
#### get_page_count

#### extract_page

Extract text from a single image/page.

Source code in `unifex/ocr/extractors/easy_ocr.py`

#### get_extractor_metadata

Return extractor metadata.

Source code in `unifex/ocr/extractors/easy_ocr.py`

#### get_init_params

Return parameters for recreating this extractor in a worker process.

Source code in `unifex/ocr/extractors/easy_ocr.py`
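A hedged usage sketch; the image file name is a placeholder and the call is guarded against a missing installation.

```python
# Hedged sketch of EasyOcrExtractor; "scan.png" is a placeholder file.
languages = ["en", "fr"]  # EasyOCR language codes

try:
    from unifex.ocr.extractors.easy_ocr import EasyOcrExtractor

    extractor = EasyOcrExtractor("scan.png", languages=languages, gpu=False, dpi=200)
    for i in range(extractor.get_page_count()):
        page = extractor.extract_page(i)
except Exception:
    pass  # unifex not installed or sample file missing
```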
### TesseractOcrExtractor

OCR using Tesseract.

#### unifex.ocr.extractors.tesseract_ocr.TesseractOcrExtractor

Bases: `BaseExtractor`

Extract text from images or PDFs using Tesseract OCR.

Composes ImageLoader for image handling, Tesseract for OCR processing, and TesseractAdapter for result conversion.

Source code in `unifex/ocr/extractors/tesseract_ocr.py`
#### `__init__`

```python
__init__(
    path: Path | str,
    languages: list[str] | None = None,
    dpi: int = 200,
    output_unit: CoordinateUnit = CoordinateUnit.POINTS,
) -> None
```
Initialize Tesseract OCR extractor.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path \| str` | Path to the image or PDF file (Path object or string). | *required* |
| `languages` | `list[str] \| None` | List of 2-letter ISO 639-1 language codes (e.g., `["en", "fr"]`). Defaults to `["en"]`. Codes are converted to Tesseract format internally. | `None` |
| `dpi` | `int` | DPI for PDF-to-image conversion. Default 200. | `200` |
| `output_unit` | `CoordinateUnit` | Coordinate unit for output. Default POINTS. | `POINTS` |
Source code in unifex/ocr/extractors/tesseract_ocr.py
#### get_page_count

#### extract_page

Extract text from a single image/page.

Source code in `unifex/ocr/extractors/tesseract_ocr.py`

#### get_extractor_metadata

Return extractor metadata.

Source code in `unifex/ocr/extractors/tesseract_ocr.py`

#### get_init_params

Return parameters for recreating this extractor in a worker process.

Source code in `unifex/ocr/extractors/tesseract_ocr.py`
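A hedged usage sketch; the file name is a placeholder and the call is guarded against a missing installation.

```python
# Hedged sketch of TesseractOcrExtractor; "letter.pdf" is a placeholder file.
languages = ["en", "de"]  # ISO 639-1; converted to Tesseract codes internally

try:
    from unifex.ocr.extractors.tesseract_ocr import TesseractOcrExtractor

    extractor = TesseractOcrExtractor("letter.pdf", languages=languages, dpi=300)
    init_params = extractor.get_init_params()  # lets a worker process rebuild it
except Exception:
    pass  # unifex not installed or sample file missing
```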
### PaddleOcrExtractor

OCR using PaddleOCR.

#### unifex.ocr.extractors.paddle_ocr.PaddleOcrExtractor

Bases: `BaseExtractor`

Extract text from images or PDFs using PaddleOCR.

Composes ImageLoader for image handling, PaddleOCR for OCR, and PaddleOCRAdapter for result conversion.

The PaddleOCR model is loaded lazily on first extraction and cached globally.

Source code in `unifex/ocr/extractors/paddle_ocr.py`
#### `__init__`

```python
__init__(
    path: Path | str,
    lang: str = "en",
    use_gpu: bool = False,
    dpi: int = 200,
    output_unit: CoordinateUnit = CoordinateUnit.POINTS,
) -> None
```
Initialize PaddleOCR extractor.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path \| str` | Path to the image or PDF file (Path object or string). | *required* |
| `lang` | `str` | Language code for OCR. Common values: `"en"` (English), `"ch"` (Chinese), `"fr"` (French), `"german"` (German), `"japan"` (Japanese), `"korean"` (Korean). See the PaddleOCR docs for the full list. | `'en'` |
| `use_gpu` | `bool` | Whether to use GPU acceleration. | `False` |
| `dpi` | `int` | DPI for PDF-to-image conversion. Default 200. | `200` |
| `output_unit` | `CoordinateUnit` | Coordinate unit for output. Default POINTS. | `POINTS` |
Source code in unifex/ocr/extractors/paddle_ocr.py
#### get_page_count

#### extract_page

Extract text from a single image/page.

Source code in `unifex/ocr/extractors/paddle_ocr.py`

#### extract_tables

Extract tables from the document using PPStructure.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `pages` | `list[int] \| None` | List of page numbers to extract (0-indexed). If None, extracts from all pages. | `None` |

Returns:

| Type | Description |
|---|---|
| `list[Table]` | List of `Table` objects with `page` field indicating the source page. |

Source code in `unifex/ocr/extractors/paddle_ocr.py`

#### get_extractor_metadata

Return extractor metadata.

Source code in `unifex/ocr/extractors/paddle_ocr.py`

#### get_init_params

Return parameters for recreating this extractor in a worker process.

Source code in `unifex/ocr/extractors/paddle_ocr.py`
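A hedged usage sketch covering both plain OCR and PPStructure table extraction; the file name is a placeholder and the call is guarded against a missing installation.

```python
# Hedged sketch of PaddleOcrExtractor; "form.pdf" is a placeholder file.
lang = "en"  # PaddleOCR code; others include "ch", "fr", "german", "japan"

try:
    from unifex.ocr.extractors.paddle_ocr import PaddleOcrExtractor

    extractor = PaddleOcrExtractor("form.pdf", lang=lang, use_gpu=False)
    text_page = extractor.extract_page(0)          # plain OCR text
    tables = extractor.extract_tables(pages=None)  # PPStructure, all pages
except Exception:
    pass  # unifex not installed or sample file missing
```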
## Cloud OCR Extractors

### AzureDocumentIntelligenceExtractor

Azure Document Intelligence OCR.

#### unifex.ocr.extractors.azure_di.AzureDocumentIntelligenceExtractor

Bases: `BaseExtractor`

Extract text from documents using Azure Document Intelligence.

Source code in `unifex/ocr/extractors/azure_di.py`

#### extract_page

Extract a single page by number (0-indexed).

Source code in `unifex/ocr/extractors/azure_di.py`
### GoogleDocumentAIExtractor

Google Document AI OCR.

#### unifex.ocr.extractors.google_docai.GoogleDocumentAIExtractor

Bases: `BaseExtractor`

Extract text from documents using Google Document AI.

Source code in `unifex/ocr/extractors/google_docai.py`
#### `__init__`

```python
__init__(
    path: Path | str,
    processor_name: str,
    credentials_path: str,
    mime_type: str | None = None,
    output_unit: CoordinateUnit = CoordinateUnit.POINTS,
) -> None
```
Initialize Google Document AI extractor.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path \| str` | Path to the document file. | *required* |
| `processor_name` | `str` | Full processor resource name, e.g., `projects/{project}/locations/{location}/processors/{processor_id}`. | *required* |
| `credentials_path` | `str` | Path to a service account JSON credentials file. | *required* |
| `mime_type` | `str \| None` | Optional MIME type. If not provided, it is inferred from the file extension. | `None` |
| `output_unit` | `CoordinateUnit` | Coordinate unit for output. Default POINTS. | `POINTS` |
Source code in unifex/ocr/extractors/google_docai.py
#### extract_page

Extract a single page by number (0-indexed).

Source code in `unifex/ocr/extractors/google_docai.py`
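A hedged usage sketch; the project, location, processor ID, and file names below are all placeholders, and the call is guarded since real credentials are required.

```python
# Hedged sketch of GoogleDocumentAIExtractor; the project, location,
# processor ID, and file names below are all placeholders.
processor_name = "projects/my-project/locations/us/processors/abc123"

try:
    from unifex.ocr.extractors.google_docai import GoogleDocumentAIExtractor

    extractor = GoogleDocumentAIExtractor(
        "contract.pdf",
        processor_name=processor_name,
        credentials_path="service-account.json",
        # mime_type omitted: it is inferred from the file extension
    )
    page = extractor.extract_page(0)
except Exception:
    pass  # unifex not installed or credentials missing
```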
## LLM Extractors

### extract_structured

Synchronous LLM extraction function.

#### unifex.llm_factory.extract_structured
```python
extract_structured(
    path: Path | str,
    model: str,
    *,
    schema: type[T],
    prompt: str | None = None,
    pages: list[int] | None = None,
    dpi: int = 200,
    max_retries: int = 3,
    temperature: float = 0.0,
    credentials: dict[str, str] | None = None,
    base_url: str | None = None,
    headers: dict[str, str] | None = None,
    _extractor: Any = None,
) -> LLMExtractionResult[T]

extract_structured(
    path: Path | str,
    model: str,
    *,
    schema: None = None,
    prompt: str | None = None,
    pages: list[int] | None = None,
    dpi: int = 200,
    max_retries: int = 3,
    temperature: float = 0.0,
    credentials: dict[str, str] | None = None,
    base_url: str | None = None,
    headers: dict[str, str] | None = None,
    _extractor: Any = None,
) -> LLMExtractionResult[dict[str, Any]]

extract_structured(
    path: Path | str,
    model: str,
    *,
    schema: type[T] | None = None,
    prompt: str | None = None,
    pages: list[int] | None = None,
    dpi: int = 200,
    max_retries: int = 3,
    temperature: float = 0.0,
    credentials: dict[str, str] | None = None,
    base_url: str | None = None,
    headers: dict[str, str] | None = None,
    _extractor: SingleExtractor[T] | None = None,
) -> LLMExtractionResult[T | dict[str, Any]]
```
Extract structured data from a document using an LLM.
All specified pages are sent in a single request.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path \| str` | Path to document/image file. | *required* |
| `model` | `str` | Model identifier (e.g., `"openai/gpt-4o"`, `"anthropic/claude-3-5-sonnet"`). | *required* |
| `schema` | `type[T] \| None` | Pydantic model for structured output. None for a free-form dict. | `None` |
| `prompt` | `str \| None` | Custom extraction prompt. Auto-generated from the schema if None. | `None` |
| `pages` | `list[int] \| None` | Page numbers to extract from (0-indexed). None for all pages. | `None` |
| `dpi` | `int` | DPI for PDF-to-image conversion. | `200` |
| `max_retries` | `int` | Max retry attempts with validation feedback. | `3` |
| `temperature` | `float` | Sampling temperature (0.0 = deterministic). | `0.0` |
| `credentials` | `dict[str, str] \| None` | Override credentials dict (otherwise uses environment variables). | `None` |
| `base_url` | `str \| None` | Custom API base URL for OpenAI-compatible APIs (vLLM, Ollama, etc.). | `None` |
| `headers` | `dict[str, str] \| None` | Custom HTTP headers for OpenAI-compatible APIs. | `None` |
| `_extractor` | `SingleExtractor[T] \| None` | Internal parameter for dependency injection (testing only). | `None` |

Returns:

| Type | Description |
|---|---|
| `LLMExtractionResult[T \| dict[str, Any]]` | LLMExtractionResult containing the extracted data, model info, and provider. |

Source code in `unifex/llm_factory.py`
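A hedged usage sketch with a Pydantic schema. The model identifier comes from the docstring above; the file name and the `Invoice` fields are placeholders, and the attributes of the returned `LLMExtractionResult` are not inspected because they are not documented here.

```python
# Hedged sketch of extract_structured; "invoice.pdf" and the Invoice fields
# are placeholders. Guarded so the snippet runs without the dependencies.
model = "openai/gpt-4o"

try:
    from pydantic import BaseModel

    from unifex.llm_factory import extract_structured

    class Invoice(BaseModel):
        vendor: str
        total: float

    result = extract_structured(
        "invoice.pdf",
        model=model,
        schema=Invoice,   # parsed into an Invoice instance
        pages=[0],        # send only the first page
        temperature=0.0,  # deterministic output
    )
except Exception:
    pass  # dependencies or API credentials unavailable
```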
### extract_structured_async

Asynchronous LLM extraction function.

#### unifex.llm_factory.extract_structured_async (async)
```python
extract_structured_async(
    path: Path | str,
    model: str,
    *,
    schema: type[T],
    prompt: str | None = None,
    pages: list[int] | None = None,
    dpi: int = 200,
    max_retries: int = 3,
    temperature: float = 0.0,
    credentials: dict[str, str] | None = None,
    base_url: str | None = None,
    headers: dict[str, str] | None = None,
    _extractor: Any = None,
) -> LLMExtractionResult[T]

extract_structured_async(
    path: Path | str,
    model: str,
    *,
    schema: None = None,
    prompt: str | None = None,
    pages: list[int] | None = None,
    dpi: int = 200,
    max_retries: int = 3,
    temperature: float = 0.0,
    credentials: dict[str, str] | None = None,
    base_url: str | None = None,
    headers: dict[str, str] | None = None,
    _extractor: Any = None,
) -> LLMExtractionResult[dict[str, Any]]

extract_structured_async(
    path: Path | str,
    model: str,
    *,
    schema: type[T] | None = None,
    prompt: str | None = None,
    pages: list[int] | None = None,
    dpi: int = 200,
    max_retries: int = 3,
    temperature: float = 0.0,
    credentials: dict[str, str] | None = None,
    base_url: str | None = None,
    headers: dict[str, str] | None = None,
    _extractor: AsyncSingleExtractor[T] | None = None,
) -> LLMExtractionResult[T | dict[str, Any]]
```
Async version of extract_structured.
All specified pages are sent in a single request.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path \| str` | Path to document/image file. | *required* |
| `model` | `str` | Model identifier (e.g., `"openai/gpt-4o"`, `"anthropic/claude-3-5-sonnet"`). | *required* |
| `schema` | `type[T] \| None` | Pydantic model for structured output. None for a free-form dict. | `None` |
| `prompt` | `str \| None` | Custom extraction prompt. Auto-generated from the schema if None. | `None` |
| `pages` | `list[int] \| None` | Page numbers to extract from (0-indexed). None for all pages. | `None` |
| `dpi` | `int` | DPI for PDF-to-image conversion. | `200` |
| `max_retries` | `int` | Max retry attempts with validation feedback. | `3` |
| `temperature` | `float` | Sampling temperature (0.0 = deterministic). | `0.0` |
| `credentials` | `dict[str, str] \| None` | Override credentials dict (otherwise uses environment variables). | `None` |
| `base_url` | `str \| None` | Custom API base URL for OpenAI-compatible APIs (vLLM, Ollama, etc.). | `None` |
| `headers` | `dict[str, str] \| None` | Custom HTTP headers for OpenAI-compatible APIs. | `None` |
| `_extractor` | `AsyncSingleExtractor[T] \| None` | Internal parameter for dependency injection (testing only). | `None` |

Returns:

| Type | Description |
|---|---|
| `LLMExtractionResult[T \| dict[str, Any]]` | LLMExtractionResult containing the extracted data, model info, and provider. |
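A hedged async sketch using `schema=None`, which per the overloads above yields a free-form dict. The file name and `base_url` are placeholders for an OpenAI-compatible local server such as vLLM or Ollama.

```python
# Hedged sketch of extract_structured_async; "report.pdf" and the base_url
# are placeholders. Guarded so the snippet runs without the dependencies.
import asyncio

model = "openai/gpt-4o"

async def main() -> None:
    try:
        from unifex.llm_factory import extract_structured_async

        result = await extract_structured_async(
            "report.pdf",
            model=model,
            schema=None,  # free-form dict result
            base_url="http://localhost:8000/v1",
        )
    except Exception:
        pass  # dependencies or API credentials unavailable

asyncio.run(main())
```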