Attachment Text Extraction

Attachment Text Extraction Processor #

Extracts text content from an attachment and stores the result as the attachment’s searchable text.

For image attachments, the processor uses a vision model to generate a text description. For document attachments, it uses Apache Tika text extraction.
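
As a rough illustration of the split described above, the following sketch routes a file to either the vision model or Tika based on its guessed MIME type. How the processor actually classifies attachments is not documented here, so the `choose_extractor` helper and its filename-based detection are assumptions made purely for illustration.

```python
import mimetypes


def choose_extractor(filename: str) -> str:
    """Return "vision" for image files and "tika" for everything else.

    Hypothetical helper: the real processor may classify attachments
    differently (for example, from stored metadata rather than the name).
    """
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is not None and mime_type.startswith("image/"):
        return "vision"
    # Unknown or non-image types fall through to document extraction.
    return "tika"


if __name__ == "__main__":
    for name in ("diagram.png", "report.pdf"):
        print(name, "->", choose_extractor(name))
```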

This processor does not extract embedded attachments from the attachment being processed. Embedded images inside PDFs, PPTX files, and similar formats are left out of the extracted text rather than uploaded as child attachments.

Requirements #

Service	Required for
Apache Tika server	Text extraction from PDF, DOCX, XLSX, PPTX, and other document formats
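
To check that a Tika server is reachable and returns text, a request like the one below can help. It is a minimal sketch that uses Tika Server's standard `PUT /tika` resource with an `Accept: text/plain` header; whether this processor calls the same resource internally is an assumption, and the helper names `build_tika_request` and `extract_text` are hypothetical.

```python
import urllib.request


def build_tika_request(endpoint: str, data: bytes) -> urllib.request.Request:
    """Build a PUT request asking Tika Server for plain-text output."""
    url = endpoint.rstrip("/") + "/tika"
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(endpoint: str, data: bytes, timeout_seconds: int = 120) -> str:
    """Send the file bytes to Tika and return the extracted text.

    The default timeout mirrors the documented `tika_timeout_in_seconds`
    default; this function needs a running Tika server to succeed.
    """
    request = build_tika_request(endpoint, data)
    with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
        return response.read().decode("utf-8")
```

For example, `extract_text("http://127.0.0.1:9998", open("report.pdf", "rb").read())` should return the document's plain text when a Tika server is listening on the default endpoint.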

Supported formats #

Supported format	Notes
Image	Description generated by a vision model
PDF	Plain-text extraction; embedded images are ignored
PPTX / PPT / PPTM	Plain-text extraction; embedded images are ignored
DOCX, XLSX, and other office formats	Plain-text extraction

Configuration #

Parameter	Type	Required	Default	Description
`message_field`	string	No	`messages`	Pipeline context key for the input messages
`tika_endpoint`	string	No	`http://127.0.0.1:9998`	Apache Tika server URL
`tika_timeout_in_seconds`	int	No	`120`	Timeout, in seconds, for each file sent to Tika
`vision_model_provider`	string	No	—	Provider ID for the vision model used to describe images
`vision_model`	string	No	—	Model name for image description
`image_content_format`	string	No	`data_uri`	How images are sent to the vision model: `data_uri` or `binary`
`llm_generation_lang`	string	No	(app default)	BCP 47 language tag for LLM-generated content (e.g. `en-US`, `zh-CN`)
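
For `image_content_format`, the `data_uri` option most likely means the image bytes are embedded in the request as a base64 `data:` URI. The sketch below shows that encoding under this assumption; the `to_data_uri` helper is hypothetical and is not part of the processor.

```python
import base64


def to_data_uri(image_bytes: bytes, mime_type: str) -> str:
    """Encode raw image bytes as a base64 data URI (RFC 2397 style)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


if __name__ == "__main__":
    # Tiny placeholder payload; a real call would pass the image file's bytes.
    print(to_data_uri(b"hi", "image/png"))
```

Base64 output is roughly a third larger than the raw bytes, which may be a reason to prefer `binary` when the vision model provider accepts it.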

Example #

- attachment_text_extraction:
    tika_endpoint: http://127.0.0.1:9998
    vision_model_provider: openai
    vision_model: gpt-4o