Attachment Text Extraction

Attachment Text Extraction Processor #

Extracts text content from an attachment and stores the result as the attachment’s searchable text.

For image attachments, the processor uses a vision model to generate a text description. For document attachments, it uses Apache Tika text extraction.
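The split between image and document handling can be pictured with a minimal sketch. It assumes the choice is made from the attachment's MIME type, which this page does not state; `choose_extractor` and its return values are hypothetical names used only for illustration.

```python
import mimetypes


def choose_extractor(filename: str) -> str:
    """Pick an extraction path for an attachment (illustrative only).

    Assumption: the path is chosen from the MIME type guessed from the
    file name. The real processor may inspect the content instead.
    """
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is not None and mime_type.startswith("image/"):
        # Images are described by a vision model.
        return "vision_model"
    # Everything else is treated as a document and sent to Apache Tika.
    return "tika"
```

For example, `choose_extractor("photo.png")` selects the vision model path, while `choose_extractor("report.pdf")` selects Tika.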

This processor does not extract embedded attachments from the attachment being processed. Embedded images inside PDFs, PPTX files, and similar formats are removed from the extracted text instead of being uploaded as child attachments.

Requirements #

| Service | Required for |
| --- | --- |
| Apache Tika server | Text extraction from PDF, DOCX, XLSX, PPTX, and other document formats |
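To check that a Tika server is reachable and returns text, a request like the following can be sent by hand. This is a sketch that assumes the standard Tika Server `/tika` resource (an HTTP `PUT` with `Accept: text/plain`); whether the processor calls this exact resource is not stated on this page, and `build_tika_request` and `extract_text` are hypothetical helper names.

```python
import urllib.request


def build_tika_request(endpoint: str, data: bytes) -> urllib.request.Request:
    """Build a PUT request asking a Tika server for plain text."""
    url = endpoint.rstrip("/") + "/tika"
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(endpoint: str, data: bytes, timeout_seconds: float = 120) -> str:
    """Send the file bytes to Tika and return the extracted text.

    The default timeout mirrors the documented tika_timeout_in_seconds
    default; it is applied here per request as an assumption.
    """
    request = build_tika_request(endpoint, data)
    with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
        return response.read().decode("utf-8", errors="replace")
```

With a server running on the default endpoint, `extract_text("http://127.0.0.1:9998", open("report.pdf", "rb").read())` should return the document's plain text.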

Supported formats #

| Supported format | Notes |
| --- | --- |
| Image | Description generated by a vision model |
| PDF | Plain-text extraction; embedded images are ignored |
| PPTX / PPT / PPTM | Plain-text extraction; embedded images are ignored |
| DOCX, XLSX and other office formats | Plain-text extraction |

Configuration #

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| message_field | string | No | messages | Pipeline context key for the input messages |
| tika_endpoint | string | No | http://127.0.0.1:9998 | Apache Tika server URL |
| tika_timeout_in_seconds | int | No | 120 | Per-file Tika timeout |
| vision_model_provider | string | No | | Provider ID for the vision model used to describe images |
| vision_model | string | No | | Model name for image description |
| image_content_format | string | No | data_uri | How images are sent to the vision model: data_uri or binary |
| llm_generation_lang | string | No | (app default) | BCP 47 language tag for LLM-generated content (e.g. en-US, zh-CN) |
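To make the `data_uri` option concrete, here is a minimal sketch of how image bytes are commonly encoded as a data URI before being passed to a vision model. The exact string the processor builds is not documented on this page, so this follows the general `data:` URI scheme; `to_data_uri` is a hypothetical helper name.

```python
import base64


def to_data_uri(image_bytes: bytes, mime_type: str) -> str:
    """Encode raw image bytes as a base64 data URI (illustrative only)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

With `binary`, by contrast, the raw bytes would presumably be sent without this text encoding, which can matter for providers that do not accept data URIs.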

Example #

- attachment_text_extraction:
    tika_endpoint: http://127.0.0.1:9998
    vision_model_provider: openai
    vision_model: gpt-4o