Input PDF
Extract text from PDF documents for LLM analysis
Use This When
- Building document Q&A systems that need full PDF text content for LLM context
- Creating document processing pipelines that extract information from PDF reports
- Implementing automated document analysis for compliance or content review
- Preparing PDF text for RAG systems or knowledge base construction
What It Does
- Extracts plain text from PDF files using the pypdf library (or converts to Markdown when enabled)
- Optionally prepends the document filename as context for multi-document processing (plain text or Markdown)
- Streams documents sequentially from the configured directory, with optional looping
- Returns the concatenated text from all pages as a single String message
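The extraction step described above can be sketched roughly as follows. This is a minimal illustration, not the component's actual source: the function names `build_message` and `extract_pdf_text`, the `include_context` parameter, and the `Document: <filename>` header format are all assumptions made for the example. Only `PdfReader`, `reader.pages`, and `page.extract_text()` are real pypdf calls.

```python
from pathlib import Path


def build_message(filename: str, page_texts: list[str], include_context: bool = True) -> str:
    """Join per-page text into one string, optionally prefixed with the filename."""
    body = "\n".join(page_texts)
    if include_context:
        # Hypothetical header format; the component's real header may differ.
        return f"Document: {filename}\n\n{body}"
    return body


def extract_pdf_text(path: str, include_context: bool = True) -> str:
    """Read every page of a PDF with pypdf and return one concatenated string."""
    # Imported lazily so build_message can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    # Image-only pages carry no text layer, so fall back to an empty string.
    page_texts = [page.extract_text() or "" for page in reader.pages]
    return build_message(Path(path).name, page_texts, include_context)
```

Calling `extract_pdf_text("report.pdf", include_context=False)` would return only the page text, which corresponds to disabling the document context flag.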
Works Best With
- This component → query-llm or llm-analyze-document for document Q&A
- PDF collections → this component → text analysis or information extraction
- Integration with chatbot-analyze-document for conversational document exploration
- Feeding downstream text processing for summarization or classification
Caveats
- Text extraction quality depends on PDF structure; scanned PDFs without an OCR text layer yield empty text
- Markdown conversion quality depends on PDF layout; complex layouts (tables, multi-column pages) may come out imperfect or out of order
- The document context flag adds a filename header; disable it for clean text extraction
- Repeat mode loops indefinitely; disable it for one-pass batch document processing
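Two of these caveats, repeat mode and empty output from scanned PDFs, can be illustrated with a small sketch. It assumes the component walks its directory in sorted filename order, which the listing does not state. The names `iter_pdf_paths`, `first_n`, and `looks_scanned`, and the `min_chars` threshold, are hypothetical helpers for this example rather than part of the component.

```python
from itertools import islice
from pathlib import Path
from typing import Iterator


def iter_pdf_paths(directory: str, repeat: bool = False) -> Iterator[Path]:
    """Yield PDF paths in sorted order; loop forever when repeat is True."""
    paths = sorted(Path(directory).glob("*.pdf"))
    if not paths:
        return  # avoid spinning forever on an empty directory
    while True:
        yield from paths
        if not repeat:
            break


def first_n(directory: str, n: int, repeat: bool = False) -> list[Path]:
    """Take at most n paths, which bounds an otherwise endless repeat loop."""
    return list(islice(iter_pdf_paths(directory, repeat), n))


def looks_scanned(text: str, min_chars: int = 20) -> bool:
    """Heuristic: almost no extracted text suggests an image-only PDF."""
    return len(text.strip()) < min_chars
```

With `repeat=False` the generator stops after one pass, which suits batch jobs. A check like `looks_scanned` lets a pipeline route empty results to an OCR step instead of sending blank context to an LLM.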
Versions
- dff24c25 (latest, default, linux/amd64)
Automated release