---
title: "Attachment Text Extraction"
date: 0001-01-01
summary: "Extracts text content from an attachment and stores the result as the attachment's searchable text.
For image attachments, the processor uses a vision model to generate a text description. For document attachments, it uses Apache Tika text extraction.
This processor does not extract embedded attachments from the attachment being processed. Embedded images inside PDFs, PPTX files, and similar formats are removed from the extracted text instead of being uploaded as child attachments."
---


## Attachment Text Extraction Processor

Extracts text content from an attachment and stores the result as the
attachment's searchable text.

For image attachments, the processor uses a vision model to generate a text
description. For document attachments, it uses Apache Tika text extraction.

This processor does **not** extract embedded attachments from the attachment
being processed. Embedded images inside PDFs, PPTX files, and similar formats
are removed from the extracted text instead of being uploaded as child attachments.

### Requirements

| Service | Required for |
|---|---|
| [Apache Tika](https://tika.apache.org/) server | Text extraction from PDF, DOCX, XLSX, PPTX, and other document formats |

### Supported formats

| Format | Notes |
|---|---|
| Image | Description generated by a vision model |
| PDF | Plain-text extraction; embedded images are ignored |
| PPTX / PPT / PPTM | Plain-text extraction; embedded images are ignored |
| DOCX, XLSX and other office formats | Plain-text extraction |

### Configuration

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `message_field` | string | No | `messages` | Pipeline context key for the input messages |
| `tika_endpoint` | string | No | `http://127.0.0.1:9998` | Apache Tika server URL |
| `tika_timeout_in_seconds` | int | No | `120` | Per-file timeout for Tika extraction, in seconds |
| `vision_model_provider` | string | No | — | Provider ID for the vision model used to describe images |
| `vision_model` | string | No | — | Model name for image description |
| `image_content_format` | string | No | `data_uri` | How images are sent to the vision model: `data_uri` or `binary` |
| `llm_generation_lang` | string | No | *(app default)* | BCP 47 language tag for LLM-generated content (e.g. `en-US`, `zh-CN`) |

### Example

```yaml
- attachment_text_extraction:
    tika_endpoint: http://127.0.0.1:9998
    vision_model_provider: openai
    vision_model: gpt-4o
```
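
For reference, here is a sketch of the same processor entry with every parameter from the configuration table spelled out. The Tika, timeout, message field, and image format values are the documented defaults. The vision model values are carried over from the example above, and `en-US` is only an illustrative language tag, so treat those three as assumptions to adapt to your deployment.

```yaml
- attachment_text_extraction:
    # Pipeline context key for the input messages (documented default).
    message_field: messages
    # Apache Tika server used for document formats (documented default).
    tika_endpoint: http://127.0.0.1:9998
    # Per-file Tika timeout, in seconds (documented default).
    tika_timeout_in_seconds: 120
    # Vision model used to describe image attachments.
    # Illustrative values, carried over from the example above.
    vision_model_provider: openai
    vision_model: gpt-4o
    # How images are sent to the vision model: data_uri or binary.
    image_content_format: data_uri
    # BCP 47 tag for LLM-generated content; en-US is an example value.
    llm_generation_lang: en-US
```

All parameters are optional, so any line that only restates its default can be omitted.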
