Attachment Text Extraction Processor #
Extracts text content from an attachment and stores the result as the attachment’s searchable text.
For image attachments, the processor uses a vision model to generate a text description. For document attachments, it uses Apache Tika text extraction.
This processor does not extract embedded attachments from the attachment being processed. Embedded images inside PDFs, PPTX files, and similar formats are removed from the extracted text instead of being uploaded as child attachments.
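
The routing described above (images go to a vision model, everything else goes to Tika) can be sketched in a few lines of Python. This is only an illustration: the source does not say how the processor identifies an image, so the MIME-type check and the name `choose_extractor` are assumptions.

```python
def choose_extractor(mime_type: str) -> str:
    """Pick an extraction path for an attachment.

    Assumption: attachments are classified by MIME type. The real
    processor may use a different signal (file extension, metadata, ...).
    """
    if mime_type.strip().lower().startswith("image/"):
        # Images get a generated text description from a vision model.
        return "vision_model"
    # Everything else is treated as a document and sent to Apache Tika.
    return "tika"
```

Under this assumption, `image/png` maps to the vision model path, while `application/pdf` maps to Tika.
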
Requirements #
| Service | Required for |
|---|---|
| Apache Tika server | Text extraction from PDF, DOCX, XLSX, PPTX, and other document formats |
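
To confirm that a Tika server is reachable and returns text before enabling the processor, a small Python check along the following lines may help. It relies on Tika Server's standard `PUT /tika` resource with an `Accept: text/plain` header. The function names are hypothetical, and this is not how the processor itself calls Tika. The endpoint and timeout defaults mirror `tika_endpoint` and `tika_timeout_in_seconds`.

```python
import urllib.request


def build_tika_request(tika_endpoint: str, data: bytes) -> urllib.request.Request:
    """Build a PUT request to Tika's /tika resource asking for plain text."""
    url = tika_endpoint.rstrip("/") + "/tika"
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(
    data: bytes,
    tika_endpoint: str = "http://127.0.0.1:9998",
    timeout_seconds: int = 120,
) -> str:
    """Send file bytes to a running Tika server and return the extracted text."""
    request = build_tika_request(tika_endpoint, data)
    with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
        return response.read().decode("utf-8", errors="replace")
```

For example, `extract_text(open("report.pdf", "rb").read())` should return the plain text of the PDF when a Tika server is listening on the default endpoint (`report.pdf` is a placeholder file name).
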
Supported formats #
| Supported format | Notes |
|---|---|
| Image | Description generated by a vision model |
| PDF | Plain-text extraction; embedded images are ignored |
| PPTX / PPT / PPTM | Plain-text extraction; embedded images are ignored |
| DOCX, XLSX and other office formats | Plain-text extraction |
Configuration #
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| message_field | string | No | messages | Pipeline context key for the input messages |
| tika_endpoint | string | No | http://127.0.0.1:9998 | Apache Tika server URL |
| tika_timeout_in_seconds | int | No | 120 | Per-file Tika timeout |
| vision_model_provider | string | No | (none) | Provider ID for the vision model used to describe images |
| vision_model | string | No | (none) | Model name for image description |
| image_content_format | string | No | data_uri | How images are sent to the vision model: data_uri or binary |
| llm_generation_lang | string | No | (app default) | BCP 47 language tag for LLM-generated content (e.g. en-US, zh-CN) |
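
To make the `data_uri` option of image_content_format more concrete, here is a minimal Python sketch of encoding image bytes as a base64 data URI (the RFC 2397 format). The helper name `to_data_uri` is hypothetical and is not part of the processor. With `binary`, the raw bytes would presumably be sent instead of a string like this.

```python
import base64


def to_data_uri(data: bytes, mime_type: str) -> str:
    """Encode raw bytes as a base64 data URI, e.g. data:image/png;base64,..."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

Base64 encoding makes the payload roughly a third larger than the raw bytes, which may matter for large images.
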
Example #
- attachment_text_extraction:
    tika_endpoint: http://127.0.0.1:9998
    vision_model_provider: openai
    vision_model: gpt-4o