📑 Nano PDF Parser

Document Processing · Local Extraction · Visual Parsing

A lightweight, fast, fully offline PDF extractor built for LLM token compression. This document parsing skill lets an AI agent ingest hundred-page business reports in under a second, and handles scanned documents through OCR.

OpenClaw Team

🚀 Quick Install

Run the following command in your terminal to install:

npx clawhub install nano-pdf

📊 Stats Overview

⭐ Stars	☁️ Total Calls	👥 Active Users	🎯 Stable Version
986	4.52M	5,100	v2.2.0

🎛️ How It Works

Feeding raw PDF files directly into an LLM often exceeds memory or context limits and produces garbled output and hallucinations. This component acts as a fast pre-filter that restructures the document ingestion pipeline:

⚡ Dual-track Extraction: The document type is detected automatically. Natively generated digital PDFs go through a lightweight parsing stream that produces clean text right away; older scanned documents are passed to Tesseract for OCR.
🧹 Layout and Table Restoration: Basic extractors often scramble two-column academic papers. Nano-PDF applies coordinate correction to preserve the original paragraph hierarchy and basic table structure as far as possible, so the model does not lose context.
✂️ Paginated Lazy-load Sampling: Granular page indexing is supported. You can ask the model to extract only pages 15 through 20 instead of importing the whole file and exhausting the token limit.
🔐 Air-gapped Security: Many PDF tools require uploading sensitive contracts to cloud servers for parsing. This component runs entirely in local Node.js, keeps no copies, and transmits nothing, which makes it well suited to financial and legal AI agents.
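
The dual-track routing and page-range sampling described in the list above can be sketched roughly as follows. This is an illustrative Python sketch, not Nano-PDF's actual implementation: the PageProbe structure, the choose_track heuristic, and the 20-character threshold are all assumptions made for the example, and the page range is assumed to be 1-based and inclusive.

```python
from dataclasses import dataclass


@dataclass
class PageProbe:
    """Hypothetical per-page summary produced by a cheap first pass."""
    number: int       # 1-based page number (assumed convention)
    text_chars: int   # characters found in the page's embedded text layer
    image_count: int  # raster images placed on the page


def choose_track(page: PageProbe, min_chars: int = 20) -> str:
    """Return "text" for pages with a usable text layer, "ocr" for image-only pages."""
    if page.text_chars >= min_chars:
        return "text"
    if page.image_count > 0:
        return "ocr"
    # Neither text nor images: treat as a blank page and skip the OCR cost.
    return "text"


def select_pages(pages: list[PageProbe], start_page: int, end_page: int) -> list[PageProbe]:
    """Keep only pages inside the inclusive range, mirroring start_page/end_page."""
    return [p for p in pages if start_page <= p.number <= end_page]


probes = [
    PageProbe(1, 1800, 0),  # digital page with a full text layer
    PageProbe(2, 0, 1),     # scanned page: one image, no text
    PageProbe(3, 5, 1),     # scan with a few stray characters
    PageProbe(4, 0, 0),     # blank page
]
plan = {p.number: choose_track(p) for p in select_pages(probes, 1, 3)}
print(plan)  # {1: 'text', 2: 'ocr', 3: 'ocr'}
```

Deciding per page rather than per file means a mostly digital report with a few scanned appendix pages only pays the OCR cost where it is needed.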

🧭 Typical Use Cases

💼 Scenario 1: Automated Quarterly Report Analysis

Every earnings season, hand the AI hundreds of pages of PDF annual reports with this skill enabled. It can locate and extract only the pages containing "Balance Sheet" and "Board Overview," then, combined with downstream computation, produce a Markdown summary that a financial analyst can read in three minutes.
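
One way to picture the page-location step is a keyword scan over per-page text followed by merging the hits into page ranges that can be passed as start_page and end_page. The sketch below is a hypothetical illustration in Python; the find_pages and to_ranges helpers are not part of nano-pdf, and the sample dictionary stands in for text that has already been extracted page by page.

```python
def find_pages(page_texts: dict[int, str], keywords: list[str]) -> list[int]:
    """Return sorted page numbers whose text mentions any keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    return sorted(
        number
        for number, text in page_texts.items()
        if any(k in text.lower() for k in lowered)
    )


def to_ranges(pages: list[int]) -> list[tuple[int, int]]:
    """Merge consecutive page numbers into inclusive (start_page, end_page) pairs."""
    ranges: list[tuple[int, int]] = []
    for number in pages:
        if ranges and number == ranges[-1][1] + 1:
            # Extend the current run by one page.
            ranges[-1] = (ranges[-1][0], number)
        else:
            # Start a new run.
            ranges.append((number, number))
    return ranges


# Toy stand-in for text already extracted page by page.
sample = {
    1: "Letter to shareholders",
    2: "Board Overview: members and committees",
    3: "Consolidated Balance Sheet",
    4: "Balance Sheet (continued)",
    5: "Notes to the financial statements",
}
hits = find_pages(sample, ["Balance Sheet", "Board Overview"])
print(to_ranges(hits))  # [(2, 4)]
```

Merging hits into ranges keeps the number of extraction calls small when the relevant pages are contiguous.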

📜 Scenario 2: Bulk Legal Contract Risk Audit

When the legal library receives a batch of old scanned lease contracts, the agent can use nano-pdf's OCR engine to process them page by page, then have the LLM flag risks such as unfair clauses or inconsistent penalty amounts.

💻 Command Reference

After installation, you can let the AI invoke the skill autonomously during a conversation, or trigger operations manually from the CLI:

Basic command: extract a full PDF as compact plain text and print it to the terminal:

clawhub execute nano-pdf file="./report_2026.pdf"

Targeted extraction: extract only pages 10 through 15:

clawhub execute nano-pdf file="./contract_scan.pdf" \
  start_page=10 end_page=15

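For image-based or scanned PDFs, enable OCR and set the recognition languages (the value shown follows Tesseract's language-code notation: English plus Simplified Chinese):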

clawhub execute nano-pdf file="./old_receipts.pdf" \
  use_ocr=true language="eng+chi_sim"
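
For scripted or batch use, the commands above could be assembled programmatically. The following Python sketch builds the argument list from the parameters shown in this section (file, start_page, end_page, use_ocr, language) and nothing else. The build_command helper is hypothetical, and whether page ranges and OCR can be combined in a single call is an assumption.

```python
import shlex
from typing import Optional


def build_command(
    file: str,
    start_page: Optional[int] = None,
    end_page: Optional[int] = None,
    use_ocr: bool = False,
    language: Optional[str] = None,
) -> list[str]:
    """Assemble an argv list using only the key=value parameters shown above."""
    argv = ["clawhub", "execute", "nano-pdf", f"file={file}"]
    if start_page is not None:
        argv.append(f"start_page={start_page}")
    if end_page is not None:
        argv.append(f"end_page={end_page}")
    if use_ocr:
        argv.append("use_ocr=true")
    if language:
        argv.append(f"language={language}")
    return argv


cmd = build_command("./old_receipts.pdf", use_ocr=True, language="eng+chi_sim")
print(shlex.join(cmd))
# To actually run it (requires clawhub on PATH):
#   subprocess.run(cmd, capture_output=True, text=True, check=True)
```

Passing an argument list instead of a single shell string avoids quoting problems with file names that contain spaces.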

🛡️ Requirements and Performance

💻 Pure Offline Computing: Standard text extraction requires only basic Node.js dependencies and runs in milliseconds.
🔧 OCR Engine Requirement: If your workflow relies on parsing scanned documents, install Tesseract on the host OS (Ubuntu/macOS) beforehand through a package manager (e.g., brew install tesseract on macOS, or sudo apt install tesseract-ocr on Ubuntu).
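
Before enabling use_ocr, it may help to confirm that Tesseract is actually reachable. The Python sketch below checks PATH for the tesseract executable and lists installed language packs with Tesseract's own --list-langs option. The helper names are hypothetical, and the idea that nano-pdf's language value maps directly onto Tesseract language packs is an assumption based on the notation used in the command reference.

```python
import shutil
import subprocess
from typing import Optional


def tesseract_path() -> Optional[str]:
    """Return the tesseract executable found on PATH, or None if absent."""
    return shutil.which("tesseract")


def installed_languages() -> list[str]:
    """List language packs via `tesseract --list-langs`; empty if Tesseract is missing."""
    if tesseract_path() is None:
        return []
    result = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True, check=False
    )
    # The first line is a header; the remaining lines are language codes.
    # Some older versions print the list to stderr instead of stdout.
    lines = (result.stdout or result.stderr).splitlines()
    return [line.strip() for line in lines[1:] if line.strip()]


def missing_languages(requested: str, available: list[str]) -> list[str]:
    """Given a spec such as "eng+chi_sim", return the codes not present in `available`."""
    return [code for code in requested.split("+") if code not in available]


print("tesseract found:", tesseract_path() is not None)
print(missing_languages("eng+chi_sim", ["eng", "osd"]))  # ['chi_sim']
```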

🔗 View source on GitHub