🎙️ Whisper Audio Parser
Audio processing · Local transcription · Multilingual support
Let your LLM truly "hear" the human world! Built on OpenAI's powerful open-source Whisper model, it converts audio clips into high-quality text streams — entirely offline, on your local hardware.
The OpenClaw Team
🚀 Quick Install
Run the following command in your terminal to install:
```
npx clawhub install openai-whisper
```
📊 Stats at a Glance
| ⭐ Stars | ☁️ Total Calls | 👥 Active Users | 🎯 Stable Version |
|---|---|---|---|
| 871 | 6.13M | 7,800 | v2.1.4 |
🎛️ How It Works
Unlike pay-per-minute cloud speech-recognition services (Azure, AWS, etc.), this plugin runs entirely on your own compute:
- 💻 True Edge-side Inference: Completely free of internet restrictions. Pull the `tiny`, `base`, or even `large` Whisper weights onto your device and decode audio with the host CPU / GPU — keeping meeting recordings and personal audio fully private.
- 🌐 99+ Language Support: Whether the speaker has heavily accented English or Chinese dialogue peppered with Japanese vocabulary, Whisper's generalization ability can transcribe mixed-language phrases accurately and seamlessly.
- ⏱️ Auto-timestamping & SRT Output: Goes beyond plain text. When rich-format output is requested, it emits VTT / SRT timeline breakpoints accurate to the millisecond — a ready-made pre-processing stage for fully automated subtitling and video slicing.
- 🧹 Format-tolerant Input Handling: Automatically strips silent segments from the input audio stream, and natively accepts mp3, wav, m4a, ogg, and various other formats without manual FFmpeg re-encoding.
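The millisecond-accurate SRT output described above can be sketched in a few lines. This is a minimal, hypothetical helper (not part of the plugin) that assumes Whisper-style segments — dicts with `start` / `end` in seconds and a `text` field, which is the shape the open-source Whisper library returns:

```python
def to_srt(segments):
    """Render Whisper-style segments ({'start', 'end', 'text'} in seconds)
    as an SRT subtitle string with millisecond timestamps."""
    def ts(seconds):
        # SRT timestamps look like HH:MM:SS,mmm
        ms = int(round(seconds * 1000))
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(f"{i}\n{ts(seg['start'])} --> {ts(seg['end'])}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)
```

For example, `to_srt([{"start": 0.0, "end": 2.5, "text": " Hello"}])` produces a single numbered cue spanning `00:00:00,000 --> 00:00:02,500`.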
🧭 Typical Use Cases
📝 Scenario 1: Ultimate Meeting Minutes Extractor
Integrate it into internal workflows: after a three-hour international board meeting, simply drop the recorder's M4A file into a designated folder. The monitoring agent mounts openai-whisper for full-speed decoding, then immediately calls the LLM to compress tens of thousands of words of chaotic dialogue into "Key Agenda Items" and "Who Spoke" Markdown tables, and pushes them to the company Slack.
🤖 Scenario 2: Retro Hardware Voice Assistant (Siri Killer)
Mount the ultra-lightweight tiny.en model on a Raspberry Pi or similar IoT terminal as an always-on listener. No typing needed at home — just speak into the microphone; the plugin instantly converts speech to text and hands it to the LLM intent processor, achieving silky-smooth "streaming auditory feedback" home voice control.
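The "hand it to the intent processor" step can be as small as a keyword router in front of the LLM. A toy sketch (the intent names and the fallback behavior are invented for illustration; a real setup would forward unmatched text to the LLM):

```python
def route_intent(transcript):
    """Map a transcribed utterance to a (hypothetical) home-automation
    intent; anything unmatched is deferred to the LLM."""
    text = transcript.lower()
    if "light" in text:
        return "lights_toggle"
    if "temperature" in text or "thermostat" in text:
        return "thermostat_query"
    return "fallback_llm"
```

Cheap local routing like this keeps round-trips off the critical path: common commands resolve instantly on-device, and only ambiguous utterances pay the cost of a full LLM call.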
💻 Command Reference
Once installed, you can let the AI invoke it autonomously through conversation, or trigger operations manually from the CLI:
Fast transcription mode — use the default `base` model for Chinese audio extraction:
```
clawhub execute openai-whisper file="./meeting_01.mp3" language="zh"
```
Cross-language translation — make the model not just understand the raw audio, but translate it directly to English:
```
clawhub execute openai-whisper file="./french_interview.wav" task="translate"
```
Professional subtitles — output SRT subtitles with detailed timestamps:
```
clawhub execute openai-whisper file="./podcast_raw.m4a" output_format="srt" model="large-v3"
```
🛡️ Requirements & Performance
- 🔧 Required Toolchain: This is a heavyweight AI model module. Before running it, the host system must have `ffmpeg` (for underlying audio decoding) and a working `python3` (to support native Whisper's inference pipeline).
- 💻 Hardware Constraints: Running the top-tier `large` model on a thin laptop without GPU/CUDA acceleration may take as long as — or longer than — the recording itself. Low-spec machines should default to the `base` or `small` weights.
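A pre-flight check for the toolchain above is straightforward: look up each required executable on `PATH` before attempting a run. A minimal sketch (the function name is illustrative, not part of the plugin):

```python
import shutil

REQUIRED_TOOLS = ("ffmpeg", "python3")

def missing_tools(required=REQUIRED_TOOLS):
    """Return the required executables that are not found on PATH,
    using shutil.which for a portable lookup."""
    return [tool for tool in required if shutil.which(tool) is None]
```

An agent could call this before dispatching a transcription job and surface an actionable "please install ffmpeg" message instead of a cryptic decode failure.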
© 2026 OpenClaw. All rights reserved.
