Document Text Attachment Extraction

Document Text Attachment Extraction Processor

Extracts the full text content from a document and splits it into chunks.

When extract_attachments is enabled, embedded images are extracted and linked as attachments. When it is disabled, the processor extracts text only.

Note: To extract text from attachments, use attachment_text_extraction instead.

Requirements

| Service | Required for |
| --- | --- |
| Apache Tika server | Text extraction from PDF, DOCX, XLSX, and other Tika-backed formats; also used for embedded-image OCR when extract_attachments is enabled |
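
To illustrate the dependency, the sketch below sends a document to a Tika server over its REST interface (PUT /tika with Accept: text/plain) and reads back the plain text. It is an illustration only: the helper names build_tika_request and extract_text are hypothetical, and this page does not describe how the processor itself calls Tika. The timeout default mirrors the tika_timeout_in_seconds default listed under Configuration.

```python
import urllib.request


def build_tika_request(endpoint: str, document: bytes) -> urllib.request.Request:
    """Build a PUT request asking Tika Server for plain-text output.

    Hypothetical helper: `endpoint` plays the role of `tika_endpoint`.
    """
    url = endpoint.rstrip("/") + "/tika"
    return urllib.request.Request(
        url,
        data=document,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(endpoint: str, document: bytes, timeout_seconds: int = 120) -> str:
    """Send the document to Tika and return the extracted text.

    Assumption: a Tika server is reachable at `endpoint`; the timeout
    corresponds to `tika_timeout_in_seconds`.
    """
    request = build_tika_request(endpoint, document)
    with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
        return response.read().decode("utf-8", errors="replace")


# Usage (requires a running Tika server, so it is left commented out):
# with open("report.pdf", "rb") as f:
#     print(extract_text("http://127.0.0.1:9998", f.read()))
```

Building the request separately from sending it makes it easy to confirm the URL and headers before a Tika server is available.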

Supported formats

| Supported format | Notes |
| --- | --- |
| PDF | Text and OCR for image-only pages |
| PPTX / PPT / PPTM | Per-slide text and embedded image OCR |
| Image | Description generated by a vision model |
| DOCX, XLSX and other office formats | Plain-text extraction |

Configuration

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| message_field | string | No | messages | Pipeline context key for the input messages |
| output_queue | object | No | null | Queue to push processed documents to |
| tika_endpoint | string | No | http://127.0.0.1:9998 | Apache Tika server URL |
| tika_timeout_in_seconds | int | No | 120 | Per-file Tika timeout |
| chunk_size | int | Yes | | Maximum character length of each text chunk |
| extract_attachments | bool | No | true | Whether embedded attachments/images are extracted from documents |
| vision_model_provider | string | No | | Provider ID for the vision model used to describe images |
| vision_model | string | No | | Model name for image description |
| image_content_format | string | No | data_uri | How images are sent to the vision model: data_uri or binary |
| llm_generation_lang | string | No | (app default) | BCP 47 language tag for LLM-generated content (e.g. en-US, zh-CN) |
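
For image_content_format, the data_uri value presumably refers to the standard data: URL form, in which the image bytes are base64-encoded and embedded inline. The sketch below shows that encoding under this assumption. The function to_data_uri is hypothetical, and the exact payload the processor sends to the vision model is not specified on this page.

```python
import base64


def to_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data: URL.

    Hypothetical helper; assumes `data_uri` means this standard form.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


if __name__ == "__main__":
    # A short placeholder payload stands in for real image bytes.
    print(to_data_uri(b"hello"))  # data:image/png;base64,aGVsbG8=
```

A data URI travels inside a text payload such as JSON at the cost of roughly one third more bytes, whereas binary avoids that overhead but needs a transport that can carry raw bytes.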

Example

- document_text_attachment_extraction:
    tika_endpoint: http://127.0.0.1:9998
    chunk_size: 7000
    extract_attachments: true
    vision_model_provider: openai
    vision_model: gpt-4o
    output_queue:
      name: "documents_with_text"
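
As a rough illustration of what chunk_size: 7000 means, the sketch below cuts a string into pieces of at most chunk_size characters. The function split_into_chunks is hypothetical. The processor's actual splitting strategy (for example, whether it respects sentence or page boundaries) is not described on this page, so treat this as a fixed-width approximation only.

```python
def split_into_chunks(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive pieces of at most chunk_size characters.

    Hypothetical helper; fixed-width slicing is an assumption, not the
    documented behavior of the processor.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


if __name__ == "__main__":
    extracted = "x" * 15000  # stand-in for text extracted from a document
    chunks = split_into_chunks(extracted, 7000)
    print([len(chunk) for chunk in chunks])  # [7000, 7000, 1000]
```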