📑 Nano PDF Parser

Document Processing · Local Extraction · Visual Deconstruction

A lightweight, fast, fully offline universal PDF extractor. Built to reduce LLM token usage, this document-parsing skill lets an AI agent read hundred-page business reports in under a second and also handles scanned documents through OCR.

OpenClaw Team

🚀 Quick Install

Run the following command in your terminal to install:

npx clawhub install nano-pdf

📊 Stats Overview

⭐ Stars	☁️ Total Calls	👥 Active Users	🎯 Stable Version
986	4.52M	5,100	v2.2.0

🎛️ How It Works

Feeding raw PDF files directly to an LLM often overflows the context window and produces garbled, hallucinated output. This component acts as a fast pre-filter that restructures how documents are fed to the model:

⚡ Millisecond Dual-track Extraction: It auto-detects the document type. For natively generated digital PDFs, it uses a lightweight streaming parser to produce clean text almost instantly; for scanned documents, it hands the pages to Tesseract for OCR.
🧹 Layout & Table Restoration: Most basic extractors mangle two-column academic papers. Nano-PDF applies coordinate-correction algorithms that preserve the original paragraph hierarchy and basic table structure as far as possible, so the model does not lose context.
✂️ Paginated Lazy-load Sampling: Supports page-range extraction. You can have the model extract only pages 15 through 20, avoiding a full import that would exceed token limits.
🔐 Fully Local, Offline Security: Many PDF tools require uploading sensitive contracts to cloud servers for parsing. This component runs entirely in local Node.js, retains no copies, and transmits nothing, which makes it well suited to financial and legal AI agents.
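
The dual-track behaviour described above can be approximated from the outside with a simple heuristic: run the fast text extraction first, and only request OCR when the result is nearly empty. This sketch is not nano-pdf's actual detection logic, which is not documented here; the function name needs_ocr and the default threshold of 50 are assumptions chosen for illustration.

```shell
# Hypothetical helper: decide whether extracted text is too sparse to have
# come from a native (digital) PDF, in which case an OCR pass is probably needed.
# Reads text on stdin; the threshold (default 50) is an illustrative assumption.
needs_ocr() {
  threshold="${1:-50}"
  # Count non-whitespace bytes; $(( )) normalizes any padding in wc output.
  chars=$(( $(tr -d '[:space:]' | wc -c) ))
  # Exit status 0 (true) means "too sparse, try OCR".
  [ "$chars" -lt "$threshold" ]
}
```

A wrapper script could pipe the output of the basic extraction command into needs_ocr and rerun with use_ocr=true only when it succeeds.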

🧭 Typical Use Cases

💼 Scenario 1: Quarterly Report Auto-dissection

Every earnings season, hand hundreds of pages of PDF annual reports to an AI agent equipped with this skill. It can open the file, locate and extract only the pages containing "Balance Sheet" and "Board Overview," and then, combined with downstream computation, generate a Markdown summary that a financial analyst can read in three minutes.

📜 Scenario 2: Bulk Legal Contract Risk Audit

When the legal document library receives a batch of old scanned lease contracts, the agent can use nano-pdf's OCR engine to process them page by page, then have the LLM flag risks such as "unfair clauses" or "penalty amount inconsistencies."

💻 Command Reference

After installation, the AI can invoke these commands on its own during a conversation, or you can run them manually from the CLI:

Basic command (extract a full PDF into compact plain text and print it to the terminal):

clawhub execute nano-pdf file="./report_2026.pdf"

Targeted extraction (pull only the core content on pages 10 to 15):

clawhub execute nano-pdf file="./contract_scan.pdf" \
  start_page=10 end_page=15
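
For documents too long to extract in one call, the same start_page and end_page parameters can be applied window by window. The sketch below only prints the commands it would run. The helper name print_page_commands and the chunk size are assumptions, pages are assumed to be numbered from 1 with an inclusive end_page, and the total page count must be supplied by the caller.

```shell
# Hypothetical helper: print one nano-pdf command per page window.
# Arguments: file path, total page count, pages per window (default 5).
print_page_commands() {
  file="$1"
  total="$2"
  step="${3:-5}"
  # Refuse a non-positive window size, which would loop forever.
  [ "$step" -ge 1 ] || return 1
  start=1
  while [ "$start" -le "$total" ]; do
    end=$((start + step - 1))
    # Clamp the last window to the final page.
    if [ "$end" -gt "$total" ]; then
      end="$total"
    fi
    printf 'clawhub execute nano-pdf file="%s" start_page=%d end_page=%d\n' \
      "$file" "$start" "$end"
    start=$((end + 1))
  done
}

# Dry run: list the commands for a 12-page file in 5-page windows.
print_page_commands "./contract_scan.pdf" 12 5
```

Printing the commands first makes it easy to review the windows before running them for real.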

For image-based or scanned PDFs, enable OCR and set the recognition languages (eng and chi_sim are Tesseract's codes for English and Simplified Chinese):

clawhub execute nano-pdf file="./old_receipts.pdf" \
  use_ocr=true language="eng+chi_sim"

🛡️ Requirements & Performance

💻 Pure Offline Computing: Standard text extraction requires only basic Node.js dependencies and runs in milliseconds.
🔧 OCR Engine Requirement: If your workflow depends on parsing scanned documents, install Tesseract on the host OS (Ubuntu/macOS) beforehand via a package manager (e.g., brew install tesseract on macOS, or sudo apt install tesseract-ocr on Ubuntu).
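
Before enabling OCR in a script, it can be worth confirming that Tesseract is actually on the PATH. This is a minimal sketch: the helper name have_command is an assumption, while command -v is the standard POSIX way to test for an executable.

```shell
# Hypothetical helper: succeed if the named program is on the PATH.
have_command() {
  command -v "$1" >/dev/null 2>&1
}

if have_command tesseract; then
  echo "tesseract found: OCR extraction should be available"
else
  echo "tesseract not found: install it before using use_ocr=true"
fi
```

If Tesseract is present, running tesseract --list-langs shows which language packs are installed, which is useful to check before passing a value such as eng+chi_sim.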

🔗 View Source on GitHub