Misc File Extraction Processor #
Comprehensive file processing for various file types. Extracts text content, metadata, generates thumbnails, and performs face detection.
Configuration #
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
message_field | string | documents | The field in the pipeline context containing the documents to process | |
output_queue | object | null | Optional queue configuration for sending processed documents to a output queue | |
tika_endpoint | string | No | http://127.0.0.1:9998 | Apache Tika server URL for content extraction |
tika_timeout_in_seconds | int | No | 120 | Tika processing timeout for each file |
vision_model_provider | string | Yes | - | AI provider for image analysis |
vision_model | string | Yes | - | Model name for image analysis |
pigo_facefinder_path | string | Yes | - | Path to Pigo face detection binary |
chunk_size | int | Yes | - | Text chunking size for extracted content |
image_content_format | string | No | “data_uri” | Could be “data_uri” or “binary”. The format that an image will be encoded in in order to be sent to a vision model |
Example #
- file_extraction:
tika_endpoint: http://127.0.0.1:9998
tika_timeout_in_seconds: 120
chunk_size: 7000
vision_model_provider: openai
vision_model: gpt-4o
pigo_facefinder_path: /path/to/pigo/facefinder