openclaw-voice/USAGE_GUIDE.md
MCKRUZ 9fde3d31ba feat: Major performance optimizations and feature enhancements
2026-02-16 19:29:57 -05:00


OpenClaw Voice Bot - Usage Guide

What is This?

OpenClaw Voice Bot is a voice assistant for Discord that lets AI agents take part naturally in voice conversations. It ships with a stub LLM client that you replace with your own backend (OpenClaw, OpenAI, Anthropic, etc.), and provides:

  • Passive Voice Listening - No wake words or push-to-talk required
  • Smart Turn Detection - Uses Pipecat Smart Turn v3 to detect natural conversation completion
  • Intelligent Response Filtering - Two-tier relevance system (fast keyword + slow LLM) prevents over-responding
  • GPU-Accelerated STT/TTS - faster-whisper and Chatterbox TTS for low-latency processing
  • Multi-Agent Support - Switch between different AI personalities (Jarvis, Sage, etc.)
  • OpenAI-Compatible API - HTTP endpoints for TTS/STT that work with any client

Architecture Overview

Discord Voice Channel
  ↓
Per-user audio streams (opus → PCM 16kHz mono)
  ↓
Silero VAD (speech segmentation)
  ↓
Pipecat Smart Turn v3 (turn completion detection)
  ↓
faster-whisper STT (GPU-accelerated)
  ↓
Relevance Filter (should bot respond?)
  ↓
YOUR LLM BACKEND (OpenClaw / OpenAI / Anthropic / etc.)
  ↓
Chatterbox TTS (GPU-accelerated, paralinguistic)
  ↓
Discord Voice TX (48kHz stereo playback)

Plus: FastAPI server with OpenAI-compatible /v1/audio/speech and /v1/audio/transcriptions endpoints.
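
Discord voice audio decodes to 48 kHz stereo PCM, while the speech pipeline above works on 16 kHz mono. As a rough illustration of that conversion step only (the repository's own audio bridge is not shown in this guide), the sketch below downmixes and decimates raw 16-bit PCM with the standard library. The name stereo48k_to_mono16k is hypothetical, and the box average is a crude stand-in for a proper resampling filter.

```python
from array import array


def stereo48k_to_mono16k(pcm: bytes) -> bytes:
    """Convert interleaved 16-bit stereo PCM at 48 kHz to mono at 16 kHz.

    Hypothetical helper: averages left/right, then averages every 3 frames
    (48000 / 16000 = 3). Assumes native-endian 16-bit samples.
    """
    samples = array("h")
    samples.frombytes(pcm)  # raises ValueError if the length is not a multiple of 2

    # Downmix: the interleaved layout is L0 R0 L1 R1 ...
    mono = [(samples[i] + samples[i + 1]) // 2 for i in range(0, len(samples) - 1, 2)]

    # Decimate by 3 with a simple box average; leftover frames are dropped.
    out = array("h", (sum(mono[i:i + 3]) // 3 for i in range(0, len(mono) - 2, 3)))
    return out.tobytes()
```

A production pipeline would normally use a filtered resampler (for example via FFmpeg, which this project already requires) to avoid aliasing.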

System Requirements

Hardware

  • GPU: NVIDIA GPU with CUDA support (RTX 3060+ recommended, 8GB+ VRAM)
  • RAM: 16GB minimum, 32GB+ recommended
  • Storage: 10GB free space (for models and voice files)

Software

  • OS: Windows 10/11, Linux
  • Python: 3.12 or higher
  • CUDA: 12.x (for GPU acceleration)
  • FFmpeg: Required for audio processing
  • Git: For cloning repository

Installation

1. Clone Repository

git clone https://github.com/MCKRUZ/openclaw-voice.git
cd openclaw-voice

2. Install Dependencies

Windows:

setup.bat

Linux:

chmod +x setup.sh
./setup.sh

This will:

  • Create Python virtual environment
  • Install all dependencies
  • Download ML models (on first run)
  • Set up directory structure

3. Configure Environment

Create .env file:

cp .env.example .env

Edit .env with your configuration:

# Discord
DISCORD_BOT_TOKEN=your_discord_bot_token_here

# Your LLM Backend (choose one or configure custom)
# Option 1: OpenClaw Gateway (if you have OpenClaw running)
OPENCLAW_BASE_URL=http://localhost:18789
OPENCLAW_AUTH_TOKEN=your_gateway_token

# Option 2: OpenAI Direct
OPENAI_API_KEY=sk-...

# Option 3: Anthropic Direct
ANTHROPIC_API_KEY=sk-ant-...

# Server
SERVER_HOST=0.0.0.0
SERVER_PORT=8880

# Pipeline (optional overrides)
# PIPELINE__STT__MODEL_SIZE=medium
# PIPELINE__STT__DEVICE=cuda
# PIPELINE__TTS__DEVICE=cuda
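
Since the .env file offers three backend options, a startup check can catch a missing token before any models are loaded. The sketch below is an assumption about how such a check might look, not code from the repository: detect_backend is a hypothetical helper, and the priority order simply follows the order of the options in the file.

```python
import os
from typing import Mapping, Optional


def detect_backend(env: Optional[Mapping[str, str]] = None) -> str:
    """Return which LLM backend the .env values point at (hypothetical helper)."""
    env = os.environ if env is None else env
    if not env.get("DISCORD_BOT_TOKEN"):
        raise RuntimeError("DISCORD_BOT_TOKEN is not set")

    # Checked in the same order as the options in .env (an assumed priority)
    if env.get("OPENCLAW_BASE_URL") and env.get("OPENCLAW_AUTH_TOKEN"):
        return "openclaw"
    if env.get("OPENAI_API_KEY"):
        return "openai"
    if env.get("ANTHROPIC_API_KEY"):
        return "anthropic"
    raise RuntimeError("No LLM backend configured")
```

Passing a plain dict instead of os.environ makes the check easy to exercise in tests.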

4. Provide Voice Reference Files

Place 10-30 second voice samples in server/voices/:

  • server/voices/jarvis.wav - Voice reference for Jarvis agent
  • server/voices/sage.wav - Voice reference for Sage agent

Requirements:

  • Format: WAV
  • Sample rate: 22-48kHz
  • Duration: 10-30 seconds
  • Quality: Clean speech, minimal background noise

Validate voice files:

python scripts/validate_voices.py
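
The internals of scripts/validate_voices.py are not shown in this guide. As a rough illustration of the sample-rate and duration requirements listed above, a check along these lines can be written with the standard wave module. The function name check_voice_reference is hypothetical, and audio quality (background noise) is not something this sketch can judge.

```python
import wave


def check_voice_reference(path: str) -> list[str]:
    """Return a list of problems found in a WAV voice sample (empty if none)."""
    problems = []
    try:
        with wave.open(path, "rb") as wav:
            rate = wav.getframerate()
            duration = wav.getnframes() / rate
    except (wave.Error, EOFError) as exc:
        # wave only reads uncompressed PCM WAV files
        return [f"not a readable WAV file: {exc}"]

    if not 22_000 <= rate <= 48_000:
        problems.append(f"sample rate {rate} Hz is outside 22-48 kHz")
    if not 10.0 <= duration <= 30.0:
        problems.append(f"duration {duration:.1f}s is outside 10-30 s")
    return problems
```

For example, check_voice_reference("server/voices/jarvis.wav") returns an empty list when the file meets both requirements.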

5. Discord Bot Setup

  1. Go to the Discord Developer Portal
  2. Create a new application
  3. Go to "Bot" section → Click "Add Bot"
  4. Enable these Privileged Gateway Intents:
    • Server Members Intent
    • Message Content Intent
  5. Copy bot token to .env file
  6. Go to "OAuth2" → "URL Generator"
  7. Select scopes: bot, applications.commands
  8. Select permissions:
    • Send Messages
    • Connect (Voice)
    • Speak (Voice)
    • Use Voice Activity
  9. Use generated URL to invite bot to your server
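
Steps 6 to 9 can also be done without the URL Generator. The sketch below builds an equivalent invite link from the scopes and permissions listed above, using Discord's documented permission bit flags. The helper name build_invite_url is hypothetical, and the client ID is your application's ID from the Developer Portal.

```python
from urllib.parse import quote, urlencode

# Discord permission bit flags for the permissions listed above
SEND_MESSAGES = 1 << 11
CONNECT = 1 << 20
SPEAK = 1 << 21
USE_VAD = 1 << 25  # "Use Voice Activity"


def build_invite_url(client_id: str) -> str:
    """Build an OAuth2 invite URL for the bot (hypothetical helper)."""
    permissions = SEND_MESSAGES | CONNECT | SPEAK | USE_VAD
    query = urlencode(
        {
            "client_id": client_id,
            "permissions": permissions,
            "scope": "bot applications.commands",
        },
        quote_via=quote,  # encode the space between scopes as %20
    )
    return f"https://discord.com/oauth2/authorize?{query}"
```

The four flags combine to the permissions value 36702208.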

Integrating Your LLM Backend

The bot talks to the LLM through a single method, _send_request, in openclaw_client/client.py. You implement that method for your LLM backend.

Current Implementation (Stub)

The repository includes a stub implementation that you replace with your actual LLM integration:

# openclaw_client/client.py

async def _send_request(self, agent: str, message: str, context: str, speaker: str) -> str:
    """
    TODO: Replace with actual LLM API when available.

    This is where you integrate YOUR LLM backend:
    - OpenClaw Gateway (OpenAI-compatible endpoint)
    - OpenAI API (direct)
    - Anthropic API (direct)
    - Local LLM (llama.cpp, vLLM, etc.)
    - Custom API
    """
    # Your implementation here

Integration Options

Option 1: OpenClaw Gateway

If you run OpenClaw, use its OpenAI-compatible chat completion endpoint:

import httpx

async def _send_request(self, agent, message, context, speaker):
    url = f"{self.config.base_url}/v1/chat/completions"
    headers = {"Authorization": f"Bearer {self.config.auth_token}"}

    # Personality and recent transcript go in as system messages
    messages = [
        {"role": "system", "content": self.AGENT_PERSONALITIES[agent]},
        {"role": "system", "content": f"Recent conversation:\n{context}"},
        {"role": "user", "content": f"[Voice] {speaker} said: {message}"}
    ]

    async with httpx.AsyncClient() as client:
        response = await client.post(url, json={
            "model": agent,
            "messages": messages,
            "stream": False
        }, headers=headers)
        # Surface HTTP errors here rather than as a KeyError below
        response.raise_for_status()
        data = response.json()
        return data["choices"][0]["message"]["content"]

Option 2: OpenAI Direct

import os

from openai import AsyncOpenAI

async def _send_request(self, agent, message, context, speaker):
    # Reads the key set in .env
    client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))

    response = await client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": self.AGENT_PERSONALITIES[agent]},
            {"role": "system", "content": f"Recent conversation:\n{context}"},
            {"role": "user", "content": f"[Voice] {speaker} said: {message}"}
        ]
    )
    return response.choices[0].message.content

Option 3: Anthropic Direct

import os

from anthropic import AsyncAnthropic

async def _send_request(self, agent, message, context, speaker):
    # Reads the key set in .env
    client = AsyncAnthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

    # The Messages API takes the system prompt as a separate parameter
    system_prompt = f"{self.AGENT_PERSONALITIES[agent]}\n\nRecent conversation:\n{context}"

    response = await client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        system=system_prompt,
        messages=[
            {"role": "user", "content": f"[Voice] {speaker} said: {message}"}
        ]
    )
    return response.content[0].text
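
Options 1 and 2 build the same three-part message list, so it may be convenient to factor that step out. The helper below is a hypothetical sketch, not part of the repository. Skipping the context message when the transcript is empty is a design choice made here, not documented behavior.

```python
def build_messages(personality: str, context: str, speaker: str, message: str) -> list[dict[str, str]]:
    """Assemble chat messages in the shape used by the OpenAI-style options above."""
    messages = [{"role": "system", "content": personality}]
    if context:
        # Omit the context message when there is no transcript yet
        messages.append({"role": "system", "content": f"Recent conversation:\n{context}"})
    messages.append({"role": "user", "content": f"[Voice] {speaker} said: {message}"})
    return messages
```

Inside _send_request this would be called as build_messages(self.AGENT_PERSONALITIES[agent], context, speaker, message).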

Usage

Starting the Bot

Windows:

activate.bat
python run.py

Linux:

source venv/bin/activate
python run.py

You should see:

======================================================================
Jarvis Voice Bot Starting
======================================================================
Loading configuration...
Initializing TTS and STT engines...
✓ TTS engine initialized (cuda)
✓ STT engine initialized (medium on cuda)
✓ API server initialized (port 8880)
✓ Discord bot started
✓ API server started on 0.0.0.0:8880

All services running. Press Ctrl+C to stop.

Discord Commands

Voice Channel Commands:

  • /join [channel] - Join voice channel
  • /leave - Disconnect from voice channel
  • /status - Show bot status and statistics

Agent Configuration:

  • /agent <jarvis|sage> - Switch active agent
  • /sensitivity <low|medium|high> - Adjust relevance threshold
    • Low: Only responds to name mentions
    • Medium: Name mentions + relevant questions (default)
    • High: More proactive responses
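
To make the sensitivity levels concrete, the sketch below shows one way a two-tier decision (fast keyword check, then a slower relevance score) could map onto them. It is an illustration under assumptions, not the logic in pipeline/relevance_filter.py: the function name, the slow_check callback, and the threshold values 0.7 and 0.4 are all made up for the example.

```python
from typing import Callable, Iterable


def should_respond(
    text: str,
    sensitivity: str,
    keywords: Iterable[str],
    slow_check: Callable[[str], float],
) -> bool:
    """Two-tier relevance decision (illustrative; thresholds are invented)."""
    lowered = text.lower()

    # Fast path: a name mention always triggers a response.
    if any(word in lowered for word in keywords):
        return True

    # Low sensitivity stops here: name mentions only.
    if sensitivity == "low":
        return False

    # Slow path: ask a scorer (e.g. an LLM) how relevant the utterance is, 0.0-1.0.
    threshold = {"medium": 0.7, "high": 0.4}[sensitivity]
    return slow_check(text) >= threshold
```

Because the fast path returns early, the slow scorer is never called for utterances that mention an agent by name.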

API Endpoints

The bot exposes OpenAI-compatible endpoints:

Text-to-Speech:

curl -X POST http://localhost:8880/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello from Jarvis!",
    "voice": "jarvis",
    "response_format": "wav"
  }' \
  --output output.wav

Speech-to-Text:

curl -X POST http://localhost:8880/v1/audio/transcriptions \
  -F "file=@input.wav" \
  -F "model=whisper-1"

Health Check:

curl http://localhost:8880/health
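
The same text-to-speech call can be made from Python. The sketch below mirrors the curl example using only the standard library. The function names are hypothetical, and it assumes the server is running locally on port 8880 as configured above.

```python
import json
import urllib.request


def build_speech_request(text: str, voice: str = "jarvis",
                         base_url: str = "http://localhost:8880") -> urllib.request.Request:
    """Build the same POST as the text-to-speech curl example."""
    body = json.dumps({"input": text, "voice": voice, "response_format": "wav"}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text: str, path: str = "output.wav") -> None:
    """Send the request and write the returned audio bytes to `path`."""
    with urllib.request.urlopen(build_speech_request(text), timeout=30) as response:
        with open(path, "wb") as out:
            out.write(response.read())
```

Calling synthesize_to_file("Hello from Jarvis!") should produce the same output.wav as the curl command.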

Configuration

config.yaml

The main configuration file with all settings. Key sections:

discord:
  command_prefix: "/"

agents:
  default_agent: "jarvis"
  jarvis:
    name: "Jarvis"
    voice_file: "jarvis.wav"
    emotion_exaggeration: 1.0
  sage:
    name: "Sage"
    voice_file: "sage.wav"
    emotion_exaggeration: 0.8

openclaw:
  base_url: "http://localhost:18789"
  auth_token: null  # From env: OPENCLAW_AUTH_TOKEN
  timeout: 5.0

pipeline:
  vad:
    threshold: 0.5
    min_speech_duration: 0.2

  smart_turn:
    threshold: 0.7
    max_wait_timeout: 3.0

  stt:
    model_size: "medium"
    device: "cuda"
    beam_size: 5

  relevance:
    sensitivity: "medium"
    fast_path_keywords: ["jarvis", "sage"]

  tts:
    device: "cuda"
    sample_rate: 24000

Environment Variable Overrides

Override any config.yaml setting with an environment variable named in this format:

SECTION__SUBSECTION__KEY=value

Examples:

DISCORD__TOKEN=your_token
OPENCLAW__BASE_URL=http://192.168.1.100:8080
PIPELINE__STT__MODEL_SIZE=large-v3
SERVER__PORT=9000
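
The double underscore acts as a path separator into the nested config. The sketch below shows how such names could be mapped onto a config dictionary. It is an assumption about the mechanism, not the repository's loader: apply_env_overrides is a hypothetical helper, and it leaves values as strings rather than coercing types.

```python
import copy
from typing import Any, Mapping


def apply_env_overrides(config: dict[str, Any], env: Mapping[str, str]) -> dict[str, Any]:
    """Return a copy of `config` with SECTION__SUBSECTION__KEY overrides applied."""
    result = copy.deepcopy(config)
    for name, value in env.items():
        if "__" not in name:
            continue  # plain variables such as PATH are ignored
        *parents, key = name.lower().split("__")
        if parents[0] not in result:
            continue  # only touch sections that already exist in the config
        node = result
        for part in parents:
            node = node.setdefault(part, {})
        node[key] = value  # values stay strings; type coercion is left out
    return result
```

With this mapping, PIPELINE__STT__MODEL_SIZE=large-v3 replaces pipeline.stt.model_size and leaves its sibling keys untouched.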

Production Deployment

Before Going Live

  • Download real Smart Turn v3 model from HuggingFace pipecat-ai/smart-turn-v3
  • Remove mock ONNX model (scripts/create_mock_turn_model.py)
  • Configure actual LLM backend (replace stub in openclaw_client/client.py)
  • Provide high-quality voice reference files
  • Test end-to-end voice flow
  • Run full test suite: pytest
  • Monitor GPU memory and CPU usage
  • Test with multiple concurrent users
  • Set up logging/monitoring
  • Configure rate limiting (if exposing API publicly)
  • Review security settings (CORS, auth)

Performance Targets

| Stage | Target | Acceptable |
|---|---|---|
| Smart Turn | 50ms | 100ms |
| STT | 300ms | 500ms |
| Relevance (fast) | 10ms | 20ms |
| Relevance (slow) | 1000ms | 2000ms |
| LLM Backend | 2000ms | 5000ms |
| TTS first chunk | 300ms | 600ms |
| Total | ~3s | ~7s |
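
Only one of the two relevance paths runs for a given utterance, so the stage budgets add up differently depending on which one is taken. The short calculation below sums the figures from the table for each path. The stage keys are names chosen for this example.

```python
# Per-stage budgets in milliseconds, copied from the table: (target, acceptable)
BUDGET_MS = {
    "smart_turn": (50, 100),
    "stt": (300, 500),
    "relevance_fast": (10, 20),
    "relevance_slow": (1000, 2000),
    "llm": (2000, 5000),
    "tts_first_chunk": (300, 600),
}


def path_total(stages: list[str], column: int) -> int:
    """Sum one column (0 = target, 1 = acceptable) over the given stages."""
    return sum(BUDGET_MS[s][column] for s in stages)


FAST_PATH = ["smart_turn", "stt", "relevance_fast", "llm", "tts_first_chunk"]
SLOW_PATH = ["smart_turn", "stt", "relevance_slow", "llm", "tts_first_chunk"]

print(path_total(FAST_PATH, 0), path_total(FAST_PATH, 1))  # 2660 6220
print(path_total(SLOW_PATH, 0), path_total(SLOW_PATH, 1))  # 3650 8200
```

The fast-path sums (about 2.7 s and 6.2 s) are the ones closest to the stated totals, which suggests the totals describe the keyword fast path. That reading is an inference from the numbers, not something the guide states.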

GPU Memory Usage

| Model | VRAM Usage |
|---|---|
| faster-whisper (medium) | ~2GB |
| faster-whisper (large-v3) | ~4GB |
| Chatterbox TTS | ~2-3GB |
| Smart Turn v3 (CPU) | 0GB |
| Silero VAD (CPU) | 0GB |
| Total | ~4-7GB |

Troubleshooting

See README.md for a detailed troubleshooting guide.

Common issues:

  • Bot doesn't join voice channel → Check Discord permissions
  • No audio output → Validate voice reference files
  • Bot responds to everything → Lower sensitivity: /sensitivity low
  • GPU out of memory → Use smaller STT model: PIPELINE__STT__MODEL_SIZE=small
  • High latency → Check LLM backend response time

Testing

# Run all tests (318 tests)
pytest

# With coverage
pytest --cov=. --cov-report=html

# Specific test file
pytest tests/test_orchestrator.py -v

# Integration tests
pytest tests/test_integration.py -v

Project Structure

openclaw-voice/
├── config.yaml              # Main configuration
├── .env                     # Environment variables (create from .env.example)
├── run.py                   # Main entry point
├── requirements.txt         # Python dependencies
│
├── server/                  # FastAPI, STT, TTS
│   ├── app.py              # API server
│   ├── stt.py              # Speech-to-Text
│   ├── tts.py              # Text-to-Speech
│   └── voices/             # Voice reference files (user-provided)
│
├── discord_bot/            # Discord integration
│   ├── bot.py              # Bot setup
│   ├── commands.py         # Slash commands
│   ├── voice_session.py    # Session management
│   └── audio_bridge.py     # Audio I/O
│
├── pipeline/               # Voice processing
│   ├── orchestrator.py     # Main coordinator
│   ├── audio_buffer.py     # Ring buffers
│   ├── vad.py              # Voice activity detection
│   ├── turn_detector.py    # Smart Turn v3
│   ├── transcriber.py      # STT pipeline
│   ├── transcript_manager.py  # Conversation context
│   └── relevance_filter.py # Response filtering
│
├── openclaw_client/        # LLM Backend Integration (CUSTOMIZE THIS!)
│   └── client.py           # API client (replace stub with your LLM)
│
└── tests/                  # Unit tests (318 tests)

Contributing

This is a reference implementation. To adapt for your use:

  1. Fork the repository
  2. Implement your LLM backend in openclaw_client/client.py
  3. Update configuration for your setup
  4. Provide your own voice reference files
  5. Test thoroughly before deploying

Support

For issues, questions, or feature requests:


  • Status: 14/14 phases complete (100%) 🎉
  • Tests: 318 tests passing
  • GPU Memory: ~4-7GB (medium STT + TTS)
  • Latency: ~3-7 seconds end-to-end
  • Production Ready: Yes (after implementing your LLM backend)