📑 Nano PDF Parser

Document Processing · Local Extraction · Visual Deconstruction

A lightweight, fast, fully offline PDF text extractor. Built to cut LLM token usage, this document-parsing skill lets an AI agent ingest hundred-page business reports and scanned documents, with native-text extraction completing in milliseconds.

OpenClaw Team

🚀 Quick Install

Run the following command in your terminal to install:

npx clawhub install nano-pdf

📊 Statistics Summary

⭐ Stars	☁️ Total Calls	👥 Active Users	🎯 Stable Version
986	4.52M	5,100	v2.2.0

🎛️ How It Works

Feeding raw PDF files straight into an LLM often causes memory overflow and heavily garbled, hallucinated output. This component acts as a fast pre-filter that restructures the document ingestion pipeline:

⚡ Millisecond Dual-track Extraction: The document type is detected automatically. For natively generated digital PDFs, a lightweight parsing stream produces clean text directly; for scanned documents, the Tesseract engine is invoked for OCR.
🧹 Layout and Table Restoration: Basic extractors tend to interleave the columns of two-column academic papers. Nano-PDF applies coordinate correction to preserve the original paragraph hierarchy and basic table structure as far as possible, so the model keeps the surrounding context.
✂️ Paginated Lazy-load Sampling: Supports page-range parameters. You can ask the model to extract only pages 15 through 20, avoiding a full import that would exceed token limits.
🔐 Air-gapped Security: Many PDF tools require uploading sensitive contracts to cloud servers for parsing. This component runs entirely in local Node.js, keeps no copies, and transmits nothing, which suits financial and legal AI agents.
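
The source does not say how the dual-track detection decides between the two paths. As a minimal sketch, assuming a common heuristic (probe a few pages with a plain text-layer parse and fall back to OCR when most of them return almost no characters), the routing could look like the following. The names `PageProbe`, `chooseTrack`, and the 20-character threshold are hypothetical and are not part of nano-pdf.

```typescript
// Hypothetical sketch: none of these names come from nano-pdf itself.
type Track = "native-text" | "ocr";

interface PageProbe {
  pageNumber: number;     // 1-based page index
  extractedChars: number; // characters returned by a plain text-layer parse
}

// Assumed threshold: pages below this character count are treated as image-only.
const MIN_CHARS_PER_TEXT_PAGE = 20;

// Pick one track for the whole document: native text when at least half of the
// probed pages carry a usable text layer, OCR otherwise.
function chooseTrack(
  probes: PageProbe[],
  minChars: number = MIN_CHARS_PER_TEXT_PAGE,
): Track {
  if (probes.length === 0) {
    throw new Error("at least one probed page is required");
  }
  const textPages = probes.filter((p) => p.extractedChars >= minChars).length;
  return textPages * 2 >= probes.length ? "native-text" : "ocr";
}

// Illustrative probe results (made-up numbers).
const digitalReport: PageProbe[] = [
  { pageNumber: 1, extractedChars: 1840 },
  { pageNumber: 2, extractedChars: 2210 },
];
const scannedLease: PageProbe[] = [
  { pageNumber: 1, extractedChars: 0 },
  { pageNumber: 2, extractedChars: 3 },
];

console.log(chooseTrack(digitalReport)); // native-text
console.log(chooseTrack(scannedLease));  // ocr
```

A per-document decision keeps the sketch short; a real extractor might instead route page by page, since some PDFs mix digital and scanned pages.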

🧭 Typical Use Cases

💼 Scenario 1: Quarterly Report Auto-dissection

Every earnings season, hand the AI hundreds of pages of PDF annual reports together with this skill. It can locate and extract only the pages containing "Balance Sheet" and "Board Overview," then, combined with downstream computation, generate a Markdown summary that a financial analyst can read in three minutes.

📜 Scenario 2: Bulk Legal Contract Risk Audit

When the legal library receives a batch of old scanned lease contracts, the agent can use nano-pdf's OCR engine to process them page by page, then have the LLM flag risks such as "unfair clauses" or "penalty amount inconsistencies."

💻 Command Reference

After installation, you can let the AI invoke these commands autonomously through conversation, or trigger operations manually from the CLI:

Basic command. Extract a full PDF into compact plain text and print it to the terminal:

clawhub execute nano-pdf file="./report_2026.pdf"

Targeted extraction. Extract only pages 10 to 15:

clawhub execute nano-pdf file="./contract_scan.pdf" \
  start_page=10 end_page=15
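
The page-range parameters can be illustrated with a small sketch. It assumes that `start_page` and `end_page` are 1-based and inclusive, which the example above suggests but the source does not state outright; `resolvePageRange` and `pagesInRange` are hypothetical helpers, not nano-pdf functions.

```typescript
// Hypothetical helpers illustrating a 1-based, inclusive page range (assumed semantics).
interface PageRange {
  start: number; // first page to extract, 1-based
  end: number;   // last page to extract, inclusive
}

// Validate the requested range and clamp its end to the document length.
function resolvePageRange(
  totalPages: number,
  startPage: number = 1,
  endPage: number = totalPages,
): PageRange {
  if (!Number.isInteger(totalPages) || totalPages < 1) {
    throw new Error("totalPages must be a positive integer");
  }
  if (!Number.isInteger(startPage) || !Number.isInteger(endPage)) {
    throw new Error("page numbers must be integers");
  }
  if (startPage < 1 || startPage > endPage || startPage > totalPages) {
    throw new Error(`invalid range ${startPage}-${endPage} for ${totalPages} pages`);
  }
  return { start: startPage, end: Math.min(endPage, totalPages) };
}

// Select only the requested pages from an array holding one text entry per page.
function pagesInRange(pages: string[], range: PageRange): string[] {
  return pages.slice(range.start - 1, range.end);
}

// A stand-in for a 120-page document.
const allPages = Array.from({ length: 120 }, (_, i) => `page ${i + 1}`);
const selected = pagesInRange(allPages, resolvePageRange(allPages.length, 10, 15));

console.log(selected.length); // 6
console.log(selected[0]);     // page 10
```

Only six of the 120 page texts reach the model in this example, which is the point of range extraction when token budgets are tight.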

For image-based or scanned PDFs, enable OCR and specify the recognition languages:

clawhub execute nano-pdf file="./old_receipts.pdf" \
  use_ocr=true language="eng+chi_sim"
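
When the commands above are launched from a script rather than typed by hand, the `key=value` arguments can be assembled programmatically. The sketch below only builds the argument list from the parameters shown in this section (`file`, `start_page`, `end_page`, `use_ocr`, `language`); the `NanoPdfOptions` type and `buildNanoPdfArgs` function are hypothetical conveniences, not an API shipped with nano-pdf.

```typescript
// Hypothetical wrapper: assembles the key=value arguments shown in the commands above.
interface NanoPdfOptions {
  file: string;       // path to the PDF
  startPage?: number; // maps to start_page
  endPage?: number;   // maps to end_page
  useOcr?: boolean;   // maps to use_ocr=true
  language?: string;  // maps to language, e.g. "eng+chi_sim"
}

// Return the arguments that follow the `clawhub` executable name.
function buildNanoPdfArgs(opts: NanoPdfOptions): string[] {
  const args: string[] = ["execute", "nano-pdf", `file=${opts.file}`];
  if (opts.startPage !== undefined) args.push(`start_page=${opts.startPage}`);
  if (opts.endPage !== undefined) args.push(`end_page=${opts.endPage}`);
  if (opts.useOcr) args.push("use_ocr=true");
  if (opts.language !== undefined) args.push(`language=${opts.language}`);
  return args;
}

const ocrArgs = buildNanoPdfArgs({
  file: "./old_receipts.pdf",
  useOcr: true,
  language: "eng+chi_sim",
});

console.log(["clawhub", ...ocrArgs].join(" "));
// clawhub execute nano-pdf file=./old_receipts.pdf use_ocr=true language=eng+chi_sim
```

The array can be handed to Node's `child_process.spawn("clawhub", args)`, which passes each entry as a separate argument, so the shell quoting around the file path is not needed.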

🛡️ Requirements and Performance

💻 Pure Offline Computing: Standard text extraction requires only basic Node.js dependencies and runs in milliseconds.
🔧 OCR Engine Requirement: If your workflow relies on parsing scanned documents, install Tesseract on the host OS (Ubuntu/macOS) beforehand through a package manager (e.g., brew install tesseract on macOS or sudo apt install tesseract-ocr on Ubuntu).

🔗 View Code on GitHub