Input PDF

1 version

Extract text from PDF documents for LLM analysis

Use This When

Building document Q&A systems that need full PDF text content for LLM context
Creating document processing pipelines that extract information from PDF reports
Implementing automated document analysis for compliance or content review
Preparing PDF text for RAG systems or knowledge base construction

What It Does

Extracts plain text from PDF files using the pypdf library (or converts to Markdown when enabled)
Optionally prepends document filename as context for multi-document processing (plain text or Markdown)
Streams documents sequentially from configured directory with optional looping
Returns the concatenated text from all pages as a single String message
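
The plain-text path described above can be sketched roughly as follows. This is a minimal illustration, not the component's actual source: the function names build_message and extract_pdf_text, the add_context parameter, and the "Document: <filename>" header format are all assumptions made for the example. Only PdfReader, reader.pages, and page.extract_text() come from pypdf. Markdown conversion is not covered.

```python
from pathlib import Path


def build_message(filename: str, page_texts: list[str], add_context: bool = False) -> str:
    """Join per-page text into one string, optionally prefixed by the filename."""
    body = "\n".join(page_texts)
    if add_context:
        # Hypothetical header format; the component's real header may differ.
        return f"Document: {filename}\n\n{body}"
    return body


def extract_pdf_text(path: str, add_context: bool = False) -> str:
    """Read every page of a PDF with pypdf and return the concatenated text."""
    # Imported here so build_message stays usable without pypdf installed.
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(path)
    # Pages without a text layer contribute an empty string.
    page_texts = [page.extract_text() or "" for page in reader.pages]
    return build_message(Path(path).name, page_texts, add_context)
```

Calling extract_pdf_text("report.pdf", add_context=True) on a hypothetical file would return one string that starts with the filename header, followed by the text of every page in order.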

Works Best With

This component → query-llm or llm-analyze-document for document Q&A
PDF collections → this component → text analysis or information extraction
Integration with chatbot-analyze-document for conversational document exploration
Feeding downstream text processing for summarization or classification

Caveats

Text extraction quality depends on PDF structure; scanned PDFs without an OCR text layer yield empty text
Markdown conversion quality depends on PDF layout; complex layouts (tables/multi-column) may be imperfect or out of order
Document context flag adds filename header; disable for clean text extraction
Repeat mode loops indefinitely; disable for one-pass batch document processing
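
Two of the caveats above can be illustrated with a short sketch, assuming a flat directory of .pdf files. The names iter_pdf_paths, looks_scanned, repeat, and min_chars are hypothetical and chosen for this example; they are not the component's configuration keys.

```python
from pathlib import Path
from typing import Iterator


def iter_pdf_paths(directory: str, repeat: bool = False) -> Iterator[Path]:
    """Yield PDF paths in sorted order; loop indefinitely when repeat is True."""
    while True:
        paths = sorted(Path(directory).glob("*.pdf"))
        yield from paths
        # Stop after one pass, or if the directory has no PDFs to loop over.
        if not repeat or not paths:
            return


def looks_scanned(text: str, min_chars: int = 1) -> bool:
    """Flag extractions with no usable text, which often means OCR is needed."""
    return len(text.strip()) < min_chars
```

With repeat=False the generator ends after a single pass, which suits batch jobs. With repeat=True the consumer has to decide when to stop. A check like looks_scanned lets a pipeline route empty extractions to an OCR step instead of sending blank context to an LLM.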

Versions

dff24c25 latest default linux/amd64
Automated release
4/7/2026