Misc File Extraction Processor

Processes files of many types. It extracts text content and metadata, generates thumbnails, and performs face detection.

Configuration

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| message_field | string | No | documents | The field in the pipeline context that contains the documents to process |
| output_queue | object | No | null | Optional queue configuration for sending processed documents to an output queue |
| tika_endpoint | string | No | http://127.0.0.1:9998 | Apache Tika server URL for content extraction |
| tika_timeout_in_seconds | int | No | 120 | Tika processing timeout for each file, in seconds |
| vision_model_provider | string | Yes | - | AI provider for image analysis |
| vision_model | string | Yes | - | Model name for image analysis |
| pigo_facefinder_path | string | Yes | - | Path to Pigo face detection binary |
| chunk_size | int | Yes | - | Text chunking size for extracted content |
| image_content_format | string | No | "data_uri" | Either "data_uri" or "binary". The format in which an image is encoded before it is sent to the vision model |
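
The two image_content_format values can be illustrated with a short Python sketch. It assumes that "data_uri" refers to the standard base64 data URI form (`data:<mime type>;base64,<payload>`) and that "binary" means the raw bytes are passed through unchanged. The function name `encode_image` and its parameters are hypothetical and are not part of the processor.

```python
import base64


def encode_image(image_bytes: bytes, mime_type: str, content_format: str = "data_uri"):
    """Hypothetical helper: encode image bytes for a vision model request.

    Assumption: "data_uri" is a standard base64 data URI and "binary"
    is the unmodified byte payload.
    """
    if content_format == "data_uri":
        # Base64-encode the bytes and wrap them in a data URI.
        payload = base64.b64encode(image_bytes).decode("ascii")
        return f"data:{mime_type};base64,{payload}"
    if content_format == "binary":
        # Pass the raw bytes through untouched.
        return image_bytes
    raise ValueError(f"unsupported image_content_format: {content_format!r}")


if __name__ == "__main__":
    print(encode_image(b"abc", "image/png"))
```

A data URI is plain text, so it can be embedded directly in a JSON request body, while raw bytes usually require a provider API that accepts binary uploads.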

Example

- file_extraction:
    tika_endpoint: http://127.0.0.1:9998
    tika_timeout_in_seconds: 120
    chunk_size: 7000
    vision_model_provider: openai
    vision_model: gpt-4o
    pigo_facefinder_path: /path/to/pigo/facefinder
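
To check that the tika_endpoint and tika_timeout_in_seconds values from the example work against a running Tika server, a small Python sketch like the following may help. It uses the Tika server's `PUT /tika` resource, which returns extracted text. How the processor itself calls Tika is not documented here, so the helper names `build_tika_request` and `extract_text` are hypothetical and only illustrate what the two settings control.

```python
import urllib.request


def build_tika_request(tika_endpoint: str, file_bytes: bytes) -> urllib.request.Request:
    """Hypothetical helper: build a PUT request for Tika's /tika resource."""
    url = tika_endpoint.rstrip("/") + "/tika"
    return urllib.request.Request(
        url,
        data=file_bytes,
        method="PUT",
        headers={"Accept": "text/plain"},  # ask Tika for plain-text output
    )


def extract_text(tika_endpoint: str, file_bytes: bytes, timeout_in_seconds: int = 120) -> str:
    """Send one file to Tika and return the extracted text.

    The timeout is applied to the underlying socket operations, which
    approximates a per-file processing limit.
    """
    request = build_tika_request(tika_endpoint, file_bytes)
    with urllib.request.urlopen(request, timeout=timeout_in_seconds) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Requires a Tika server listening on the endpoint from the example.
    print(extract_text("http://127.0.0.1:9998", b"hello from a plain text file"))
```

If the call fails with a connection error or times out, the endpoint or timeout in the processor configuration likely needs adjusting before large files are processed.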