Document Text Attachment Extraction Processor #
Extracts the full text content from a document and splits it into chunks.
When extract_attachments is enabled, embedded images are extracted and linked
as attachments. When it is disabled, the processor extracts text only.
Note: For attachment text extraction, use attachment_text_extraction instead.
Requirements #
| Service | Required for |
|---|---|
| Apache Tika server | Text extraction from PDF, DOCX, XLSX, and other Tika-backed formats; also used for embedded-image OCR when extract_attachments is enabled |
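
Before enabling the processor, it can help to confirm that the Tika server is reachable. The following is a minimal sketch, not the processor's own implementation: it sends a document to Tika server's `PUT /tika` endpoint and asks for plain text. The helper names `build_tika_request` and `extract_text` are hypothetical, and the endpoint and timeout values simply mirror the `tika_endpoint` and `tika_timeout_in_seconds` defaults from the configuration table.

```python
import urllib.request


def build_tika_request(tika_endpoint: str, document: bytes) -> urllib.request.Request:
    """Build a PUT request for Tika server's /tika text-extraction endpoint."""
    url = tika_endpoint.rstrip("/") + "/tika"
    return urllib.request.Request(
        url,
        data=document,
        method="PUT",
        headers={"Accept": "text/plain"},  # ask Tika for plain-text output
    )


def extract_text(tika_endpoint: str, document: bytes, timeout_in_seconds: int = 120) -> str:
    """Send the document to Tika and return the extracted text (hypothetical helper)."""
    request = build_tika_request(tika_endpoint, document)
    with urllib.request.urlopen(request, timeout=timeout_in_seconds) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Assumes a Tika server is listening on the default endpoint
    # and that sample.pdf exists in the working directory.
    with open("sample.pdf", "rb") as f:
        print(extract_text("http://127.0.0.1:9998", f.read())[:500])
```

If the call times out on large files, raising `tika_timeout_in_seconds` in the processor configuration is the corresponding setting to adjust.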
Supported formats #
| Supported format | Notes |
|---|---|
| PDF | Text and OCR for image-only pages |
| PPTX / PPT / PPTM | Per-slide text and embedded image OCR |
| Image | Description generated by a vision model |
| DOCX, XLSX and other office formats | Plain-text extraction |
Configuration #
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| message_field | string | No | messages | Pipeline context key for the input messages |
| output_queue | object | No | null | Queue to push processed documents to |
| tika_endpoint | string | No | http://127.0.0.1:9998 | Apache Tika server URL |
| tika_timeout_in_seconds | int | No | 120 | Per-file Tika timeout |
| chunk_size | int | Yes | — | Maximum character length of each text chunk |
| extract_attachments | bool | No | true | Whether embedded attachments/images are extracted from documents |
| vision_model_provider | string | No | — | Provider ID for the vision model used to describe images |
| vision_model | string | No | — | Model name for image description |
| image_content_format | string | No | data_uri | How images are sent to the vision model: data_uri or binary |
| llm_generation_lang | string | No | (app default) | BCP 47 language tag for LLM-generated content (e.g. en-US, zh-CN) |
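
To make the effect of `chunk_size` concrete, here is a minimal sketch of fixed-width character chunking. It is an assumption-laden illustration only: the processor's actual splitting strategy (for example, whether it respects sentence or page boundaries) is not described here, and `split_into_chunks` is a hypothetical helper name.

```python
def split_into_chunks(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    # Naive split: step through the text in fixed strides of chunk_size.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


if __name__ == "__main__":
    # A 16,000-character document with chunk_size 7000 yields three chunks.
    chunks = split_into_chunks("a" * 16000, 7000)
    print([len(c) for c in chunks])  # [7000, 7000, 2000]
```

Under this naive scheme, every chunk except possibly the last is exactly `chunk_size` characters long, so a smaller value produces more, shorter chunks per document.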
Example #
- document_text_attachment_extraction:
    tika_endpoint: http://127.0.0.1:9998
    chunk_size: 7000
    extract_attachments: true
    vision_model_provider: openai
    vision_model: gpt-4o
    output_queue:
      name: "documents_with_text"