Input PDF

1 version

Extract text from PDF documents for LLM analysis

Use This When

  • Building document Q&A systems that need full PDF text content for LLM context
  • Creating document processing pipelines that extract information from PDF reports
  • Implementing automated document analysis for compliance or content review
  • Preparing PDF text for RAG systems or knowledge base construction

What It Does

  • Extracts plain text from PDF files using the pypdf library (or converts to Markdown when enabled)
  • Optionally prepends the document filename as context for multi-document processing (plain text or Markdown)
  • Streams documents sequentially from a configured directory, with optional looping
  • Returns the concatenated text from all pages as a single String message
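
The extraction step described above can be sketched roughly as follows. This is a minimal illustration, not the component's actual implementation: the function names `join_pages` and `extract_pdf_text`, the `include_filename` parameter, and the `Document: <name>` header format are all assumptions made for the example. Only `PdfReader`, `reader.pages`, and `page.extract_text()` come from pypdf itself.

```python
from pathlib import Path
from typing import Iterable, Optional


def join_pages(page_texts: Iterable[str], filename: Optional[str] = None) -> str:
    """Concatenate per-page text into one string, skipping blank pages."""
    body = "\n".join(text.strip() for text in page_texts if text and text.strip())
    if filename is None:
        return body
    # Hypothetical header format; the real component may format this differently.
    return f"Document: {filename}\n\n{body}"


def extract_pdf_text(path: str, include_filename: bool = False) -> str:
    """Read every page with pypdf and return the combined text."""
    # Imported lazily so join_pages stays usable without pypdf installed.
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(path)
    # Pages without a text layer contribute an empty string.
    page_texts = [page.extract_text() or "" for page in reader.pages]
    name = Path(path).name if include_filename else None
    return join_pages(page_texts, name)
```

Calling `extract_pdf_text("report.pdf", include_filename=True)` would return the filename header followed by the page text, which is the shape a downstream LLM prompt can consume directly.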

Works Best With

  • This component → query-llm or llm-analyze-document for document Q&A
  • PDF collections → this component → text analysis or information extraction
  • Integration with chatbot-analyze-document for conversational document exploration
  • Feeding downstream text processing for summarization or classification

Caveats

  • Text extraction quality depends on PDF structure; scanned PDFs without an OCR text layer yield empty text
  • Markdown conversion quality depends on PDF layout; complex layouts (tables, multi-column pages) may come out imperfect or out of order
  • The document context flag adds a filename header; disable it for clean text extraction
  • Repeat mode loops indefinitely; disable it for one-pass batch document processing
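
Two of these caveats, the indefinite loop in repeat mode and the empty output for scanned PDFs, can be illustrated with a short sketch. The names `iter_pdf_paths`, `has_extractable_text`, and the `repeat` parameter are hypothetical and stand in for the component's own settings; the sorted `*.pdf` ordering is likewise an assumption.

```python
from pathlib import Path
from typing import Iterator


def iter_pdf_paths(directory: str, repeat: bool = False) -> Iterator[Path]:
    """Yield PDF paths in sorted order; start over forever when repeat is True."""
    while True:
        paths = sorted(Path(directory).glob("*.pdf"))
        if not paths:
            # Nothing to stream; stop instead of spinning on an empty directory.
            return
        yield from paths
        if not repeat:
            # One-pass batch mode: stop after the last document.
            return


def has_extractable_text(text: str) -> bool:
    """Return False for empty or whitespace-only text, as scanned PDFs produce."""
    return bool(text and text.strip())
```

With `repeat=True` the generator never finishes on a non-empty directory, so a batch job that consumes it must either bound the iteration itself or leave repeat disabled. Checking `has_extractable_text` on each result is one way to route scanned documents to an OCR step rather than passing empty strings downstream.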

Versions

  • dff24c25 latest default linux/amd64

    Automated release