openclaw-voice/INTEGRATION_STATUS.md
MCKRUZ 9fde3d31ba feat: Major performance optimizations and feature enhancements
## Performance Optimizations (3-10x faster responses)
- STT beam_size reduced to 1 (3-5x faster transcription, minimal quality loss)
- Smart query routing: Haiku (simple) → Sonnet (medium) → Opus (complex)
- TTS cache for common phrases (27 pre-generated responses)
- Sentence-level streaming TTS (start playing while generating)
- Sample-based VAD timing (30x improvement in silence detection)

## TTS Engine Upgrade
- Migrated from Chatterbox to Chatterbox-Turbo
- Zero-shot voice cloning (no fine-tuning required)
- Native paralinguistic tag support ([laugh], [sigh], [chuckle], etc.)
- Emotion presets with temperature control
- Improved marker conversion (*action*, (action), ~action~)

## Discord Bot Enhancements
- Multi-agent support (Jarvis, Sage)
- Improved voice receiving with discord-ext-voice-recv
- Enhanced /join, /leave, /status commands
- Per-agent personality configuration
- Better audio sink/receiver implementation

## OpenClaw Integration
- WebSocket support for Gateway communication
- Query complexity routing (auto-select model)
- Improved error handling and retries
- Session management per Discord guild
- Better latency tracking

## Pipeline Improvements
- Sentence splitter for streaming optimization
- Query router for intelligent model selection
- Enhanced VAD receiver with sample-based timing
- Improved audio buffering and format conversion
- Better transcript management

## Documentation
- Added QUICK_START.md (5-minute test guide)
- Added OPTIMIZATION_SUMMARY.md (performance analysis)
- Added DISCORD_OPTIMIZATION_TEST.md (testing guide)
- Added USAGE_GUIDE.md (comprehensive usage)
- Updated README.md with optimization details

## Utilities & Scripts
- Added get_invite_link.py (Discord bot invite)
- Added sync_commands.py, sync_to_guild.py (command sync)
- Added test_gateway.py, test_stt.py (testing utilities)
- Added openclaw_wrapper.py (wrapper script)
- Removed create_mock_turn_model.py (no longer needed)

## Configuration Updates
- STT model: medium → small (faster, acceptable quality)
- TTS engine: chatterbox → coqui (Turbo integration)
- Beam size: 5 → 1 (latency optimization)
- Added emotion_exaggeration per agent
- Updated .gitignore for project files

Total: ~2105 insertions, ~462 deletions across 35 files
Performance: ~5.5s total latency (down from 22-35s)
Target: ~3.5s (achieved in simple queries with cache)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-16 19:29:57 -05:00


OpenClaw Gateway Integration Status

Last Updated: 2026-02-13

Completed Tasks

1. OpenClaw Gateway WebSocket Client Implementation

Status: COMPLETE

Location: openclaw_client/client.py

Changes Made:

  • Implemented full WebSocket JSON-RPC protocol
  • Added connect handshake (connect.challenge → connect → hello-ok)
  • Implemented chat.send with event listening (delta → final)
  • Added session key generation (agent:<agentId>:discord:dm:<userId>)
  • Implemented automatic reconnection logic
  • Added per-guild client management via PerGuildOpenClawClient
  • Preserved existing send_message() interface for compatibility
  • Added connection statistics and latency tracking
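
The reconnection logic listed above is not reproduced in this document. As a rough sketch only, a capped exponential backoff schedule for reconnect attempts could look like this. The helper name `backoff_delays` and all of its default values are hypothetical and are not taken from openclaw_client/client.py:

```python
from typing import Iterator


def backoff_delays(base: float = 0.5, factor: float = 2.0,
                   cap: float = 30.0, attempts: int = 6) -> Iterator[float]:
    """Yield wait times in seconds between reconnect attempts.

    Hypothetical helper: the real client may use other values or add jitter.
    """
    delay = base
    for _ in range(attempts):
        # Never wait longer than the cap, however many attempts have failed.
        yield min(delay, cap)
        delay *= factor
```

A caller would sleep for each yielded delay before retrying the connection, and give up once the generator is exhausted.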

Protocol Flow:

WebSocket Connect → connect.challenge → connect request → hello-ok response
↓
chat.send (with sessionKey, idempotencyKey) → ack (with runId) → delta events → final event

Configuration:

  • Updated utils/config.py to support WebSocket URL format
  • Added agent_id and session_scope configuration options
  • Added retry_timeout for extended retry attempts
  • Updated config.yaml openclaw section with WebSocket settings
  • Updated .env.example with WebSocket URL format and auth token
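
Since the base URL moved to the WebSocket format, it can help to reject an http:// value early. The following is a minimal sketch of such a check using only the standard library. The function name `validate_gateway_url` is an assumption, and utils/config.py may validate the URL differently or not at all:

```python
from urllib.parse import urlparse


def validate_gateway_url(url: str) -> str:
    """Return the URL unchanged if it looks like a WebSocket endpoint.

    Hypothetical check, not the project's actual validation.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("ws", "wss"):
        raise ValueError(f"expected a ws:// or wss:// URL, got {url!r}")
    if not parsed.hostname:
        raise ValueError(f"missing host in {url!r}")
    return url
```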

Dependencies:

  • Added websockets>=12.0 to requirements.txt

Testing:

  • ⚠️ Existing unit tests need updates for WebSocket client
  • ⚠️ Integration tests need real Gateway connection

🔧 Remaining Integration Work

2. Connect OpenClaw Client to Discord Bot

Status: PENDING

What Needs to be Done:

The OpenClawClient is implemented but not yet wired into the Discord bot pipeline. Here's what needs to happen:

A. Bot Initialization (in run.py or discord_bot/bot.py)

Create and initialize the OpenClaw Gateway client on bot startup:

# In run.py, after loading config:

from openclaw_client import OpenClawConfig, PerGuildOpenClawClient

# Create OpenClaw Gateway client configuration
openclaw_config = OpenClawConfig(
    base_url=config.openclaw.base_url,  # ws://192.168.50.9:18789
    auth_token=config.openclaw.token,
    timeout=config.openclaw.timeout,
    retry_timeout=config.openclaw.retry_timeout,
    agent_id=config.openclaw.agent_id,
    session_scope=config.openclaw.session_scope,
)

# Create per-guild client manager
openclaw_client = PerGuildOpenClawClient(openclaw_config)

# Connect to Gateway
logger.info("Connecting to OpenClaw Gateway...")
# Note: Connection happens lazily on first message, or explicitly:
# await openclaw_client.get_or_create(guild_id).connect()

B. Pipeline Orchestrator Integration

The orchestrator expects an llm_client callable. Create a wrapper:

# In voice session or orchestrator setup:

async def llm_response_handler(agent: str, message: str, user_id: int, guild_id: int) -> str:
    """Wrapper for OpenClaw Gateway client."""
    client = openclaw_client.get_or_create(guild_id)
    return await client.send_message(
        agent=agent,
        message=message,
        context="",  # Gateway manages context internally
        speaker=str(user_id)  # Used for session key generation
    )

# Pass to orchestrator:
orchestrator = PipelineOrchestrator(
    config=pipeline_config,
    vad=vad,
    turn_detector=turn_detector,
    transcriber=transcriber,
    transcript_manager=transcript_manager,
    relevance_classifier=relevance_classifier,
    llm_client=llm_response_handler,  # ← Use wrapper
    tts_synthesizer=tts_synthesizer,
    audio_output_callback=audio_callback,
)

C. Agent Selection Integration

The VoiceSession tracks current_agent per guild. Ensure this is passed to the LLM handler:

async def llm_response_handler(agent: str, message: str, user_id: int, guild_id: int) -> str:
    # Get current agent from session
    session = session_manager.get_session(guild_id)
    current_agent = session.current_agent if session else "jarvis"

    # Send to Gateway with correct agent
    client = openclaw_client.get_or_create(guild_id)
    return await client.send_message(
        agent=current_agent,  # Use session's agent setting
        message=message,
        speaker=str(user_id)
    )

D. Cleanup on Disconnect

When the bot disconnects from Discord or leaves a guild, close that guild's Gateway connection:

# In voice session cleanup:

async def cleanup_guild(guild_id: int):
    # Remove voice session
    await session_manager.remove_session(guild_id)

    # Disconnect OpenClaw client for this guild
    client = openclaw_client.get_or_create(guild_id)
    await client.disconnect()
    openclaw_client.remove_guild(guild_id)

3. Download Smart Turn v3 Model

Status: PENDING

Current State:

  • Mock ONNX model at models/smart_turn_v3.onnx (164 bytes placeholder)
  • Mock creation script at scripts/create_mock_turn_model.py

What to Do:

# Install huggingface_hub if not already installed
pip install huggingface_hub

# Download real model
python -c "from huggingface_hub import hf_hub_download; hf_hub_download(repo_id='pipecat-ai/smart-turn-v3', filename='model.onnx', local_dir='models/')"

# Remove mock files
rm models/smart_turn_v3.onnx
rm scripts/create_mock_turn_model.py

# Verify model exists and is ~8MB
ls -lh models/model.onnx
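
The size check can also be done from Python, which makes it easy to refuse to start with the 164-byte placeholder. This is a sketch under assumptions: the helper name `looks_like_real_model` and the 1 MB threshold are illustrative choices, picked only because they sit between the placeholder size and the expected ~8MB:

```python
from pathlib import Path


def looks_like_real_model(path: str, min_bytes: int = 1_000_000) -> bool:
    """True if the file exists and is too large to be the mock placeholder.

    The threshold is an assumed value, not something the project defines.
    """
    p = Path(path)
    return p.is_file() and p.stat().st_size >= min_bytes
```

For example, `looks_like_real_model("models/model.onnx")` should return True after the download and False for a missing file.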

4. Configure TTS to Use Existing Sage-Voice Server

Status: PENDING

Decision Point: You have two TTS options:

Option A: Use Existing Sage-Voice Server (Recommended)

Your sage-voice server at http://192.168.50.47:8004 already works and has your voice models.

Modify server/tts.py to use an HTTP client instead of the built-in TTS:

# Replace Chatterbox/Coqui implementation with HTTP client

import httpx

class TTSSynthesizer:
    def __init__(self, tts_url: str, device: str = "cuda"):
        self.tts_url = tts_url  # http://192.168.50.47:8004
        self.device = device

    async def synthesize(
        self,
        text: str,
        voice: str,
        response_format: str = "pcm"
    ) -> bytes:
        """Call sage-voice TTS server."""
        async with httpx.AsyncClient() as client:
            response = await client.post(
                f"{self.tts_url}/v1/audio/speech",
                json={
                    "input": text,
                    "voice": voice,  # jarvis or sage
                    "response_format": response_format
                },
                timeout=10.0
            )
            # Raise on HTTP errors so an error body is never returned as audio
            response.raise_for_status()
            return response.content

Add to .env:

TTS_URL=http://192.168.50.47:8004

Option B: Use Built-in TTS (More Complex)

Provide voice reference files and use Coqui XTTS:

  • Place server/voices/jarvis.wav (10-30 seconds clean audio)
  • Place server/voices/sage.wav (10-30 seconds clean audio)
  • Keep existing server/tts.py implementation

Recommendation: Go with Option A to reuse your proven TTS infrastructure.


5. Environment Configuration

Status: PENDING

Create a .env file in the openclaw-voice directory:

# Copy example
cp .env.example .env

# Edit with your actual values

Required Configuration:

# Discord Bot (from Discord Developer Portal)
DISCORD_BOT_TOKEN=<your_discord_bot_token>

# OpenClaw Gateway (on Synology NAS)
OPENCLAW_BASE_URL=ws://192.168.50.9:18789
OPENCLAW_AUTH_TOKEN=<your_gateway_token>
OPENCLAW_AGENT_ID=main

# TTS Server (your existing sage-voice server)
TTS_URL=http://192.168.50.47:8004

# FastAPI Server (openclaw-voice API endpoints)
SERVER_HOST=0.0.0.0
SERVER_PORT=8880

# Pipeline Settings (optional overrides)
PIPELINE__STT__MODEL_SIZE=medium
PIPELINE__STT__DEVICE=cuda
PIPELINE__TTS__DEVICE=cuda

Where to Get Values:

  • DISCORD_BOT_TOKEN: Discord Developer Portal → Your Application → Bot → Token
  • OPENCLAW_AUTH_TOKEN: Check your NAS OpenClaw Gateway config or create a new token
  • TTS_URL: Already running at 192.168.50.47:8004
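
Before starting the bot, a quick check for unset or placeholder values can save a confusing startup failure. The sketch below assumes which variables are mandatory (the four without an obvious default in the list above). Both the `REQUIRED_VARS` tuple and the `missing_vars` helper are hypothetical and not part of utils/config.py:

```python
import os
from typing import List, Mapping

# Assumed set of mandatory variables; adjust to match your deployment.
REQUIRED_VARS = (
    "DISCORD_BOT_TOKEN",
    "OPENCLAW_BASE_URL",
    "OPENCLAW_AUTH_TOKEN",
    "TTS_URL",
)


def missing_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return required names that are unset, empty, or still <placeholders>."""
    missing = []
    for name in REQUIRED_VARS:
        value = env.get(name, "").strip()
        if not value or value.startswith("<"):
            missing.append(name)
    return missing
```

Calling `missing_vars()` after the .env file has been loaded into the process environment returns an empty list when everything is filled in.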

6. Testing End-to-End Flow

Status: PENDING

Test Plan:

A. Test OpenClaw Gateway Connection

# Create test script: test_gateway_connection.py

import asyncio
from openclaw_client import create_client

async def test_connection():
    client = create_client(
        base_url="ws://192.168.50.9:18789",
        auth_token="<your_token>",
        agent_id="main"
    )

    try:
        await client.connect()
        print("✓ Connected to Gateway")

        response = await client.send_message(
            agent="jarvis",
            message="Hello, this is a test",
            speaker="test_user"
        )
        print(f"✓ Received response: {response}")

        await client.disconnect()
        print("✓ Disconnected")

    except Exception as e:
        print(f"✗ Error: {e}")

asyncio.run(test_connection())

B. Test Discord Bot End-to-End

  1. Start openclaw-voice bot:

    python run.py
    
  2. Join Discord voice channel

  3. Use slash commands:

    /join
    /agent jarvis
    /sensitivity medium
    
  4. Speak into microphone:

    • Bot should detect voice (VAD)
    • Wait for Smart Turn completion
    • Transcribe speech (STT)
    • Check relevance
    • Send to OpenClaw Gateway
    • Generate TTS response
    • Play audio back

  5. Check logs for latency breakdown:

    VAD: XXms
    Smart Turn: XXms
    STT: XXms
    Relevance: XXms
    Gateway: XXXXms
    TTS: XXms
    Total: ~3-7s
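
If the logs do not yet show a per-stage breakdown like the one above, one way to produce it is a small timing helper wrapped around each pipeline stage. This is an illustrative sketch only. The `StageTimer` class is hypothetical and is not how pipeline/orchestrator.py is known to track latency:

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


class StageTimer:
    """Collect wall-clock durations per named stage (hypothetical helper)."""

    def __init__(self) -> None:
        self.durations_ms: Dict[str, float] = {}

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the stage raised an exception.
            self.durations_ms[name] = (time.perf_counter() - start) * 1000.0

    def report(self) -> str:
        lines = [f"{name}: {ms:.0f}ms" for name, ms in self.durations_ms.items()]
        lines.append(f"Total: {sum(self.durations_ms.values()):.0f}ms")
        return "\n".join(lines)
```

Usage would be `with timer.stage("STT"): ...` around each step, followed by logging `timer.report()` once the response has been played.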
    

C. Test Agent Switching

/agent sage
[speak] "Tell me about philosophy"
[expect Sage's voice and personality]

/agent jarvis
[speak] "What's the weather?"
[expect Jarvis's voice and personality]

D. Test Relevance Filtering

/sensitivity low
[speak unrelated conversation]
[expect bot to stay quiet]

[speak "Hey Jarvis, ..." or "Jarvis, ..."]
[expect bot to respond]

/sensitivity high
[speak relevant question without name]
[expect bot to respond]

📋 Quick Start Checklist

To get openclaw-voice running with your OpenClaw Gateway:

  • Implement OpenClaw Gateway WebSocket client (done)
  • Add websockets dependency (done)
  • Update configuration files (done)
  • Download Smart Turn v3 model from HuggingFace
  • Create .env file with your credentials
  • Modify server/tts.py to use your existing TTS server (Option A)
  • Wire OpenClawClient into bot initialization (run.py or discord_bot/bot.py)
  • Create LLM response handler wrapper for orchestrator
  • Test Gateway connection standalone
  • Install dependencies: pip install -r requirements.txt
  • Run end-to-end test with Discord voice

🎯 Next Steps

  1. Complete Task #3: Download real Smart Turn model
  2. Complete Task #4: Configure TTS (recommend Option A - use existing server)
  3. Complete Task #5: Create .env with your credentials
  4. Complete Task #2: Wire OpenClawClient into Discord bot initialization
  5. Complete Task #6: Test end-to-end flow

📚 Reference

Session Key Format

agent:<agentId>:discord:dm:<userId>

Examples:

  • agent:main:discord:dm:123456789 (user 123456789 talking to main agent)
  • agent:jarvis:discord:dm:987654321 (user 987654321 talking to jarvis agent)
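
The format above maps directly onto a small helper. The function name `build_session_key` is an assumption, and the client's own implementation may differ in naming and validation:

```python
def build_session_key(agent_id: str, user_id: int) -> str:
    """Build a key of the form agent:<agentId>:discord:dm:<userId>."""
    if ":" in agent_id:
        # A colon in the agent ID would make the key ambiguous to split.
        raise ValueError(f"agent_id must not contain ':': {agent_id!r}")
    return f"agent:{agent_id}:discord:dm:{user_id}"
```

With the first example, `build_session_key("main", 123456789)` produces `agent:main:discord:dm:123456789`.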

Gateway Protocol Summary

1. WebSocket Connect
2. Server sends: connect.challenge (with nonce)
3. Client sends: connect request (with auth token)
4. Server sends: hello-ok response (with server info)
5. Client sends: chat.send (with sessionKey, message, idempotencyKey)
6. Server sends: ack response (with runId)
7. Server sends: delta events (streaming response)
8. Server sends: final event (complete response)
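
Steps 5 to 8 can be sketched without a live Gateway. The frame layout below is an assumption made purely for illustration: the field names `method`, `params`, `type`, and `text` are hypothetical, and only `chat.send`, `sessionKey`, `idempotencyKey`, and the delta/final event names come from this document. Check openclaw_client/client.py for the real shapes.

```python
import uuid
from typing import Any, Dict, Iterable


def build_chat_send(session_key: str, message: str) -> Dict[str, Any]:
    """Build a chat.send request (assumed layout, not the confirmed protocol)."""
    return {
        "method": "chat.send",
        "params": {
            "sessionKey": session_key,
            "message": message,
            # A fresh key per request lets the server deduplicate retries.
            "idempotencyKey": str(uuid.uuid4()),
        },
    }


def collect_response(events: Iterable[Dict[str, Any]]) -> str:
    """Join streamed delta text until the final event arrives."""
    parts = []
    for event in events:
        if event.get("type") == "delta":
            parts.append(event.get("text", ""))
        elif event.get("type") == "final":
            # Prefer the final event's full text when it carries one.
            return event.get("text") or "".join(parts)
    raise RuntimeError("stream ended before a final event")
```

Treating the final event as authoritative keeps the result correct even if a delta was dropped, while falling back to the joined deltas covers servers that send an empty final event.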

File Locations

  • OpenClaw Client: openclaw_client/client.py
  • Configuration: utils/config.py, config.yaml, .env
  • Bot Entry: run.py
  • Discord Bot: discord_bot/bot.py
  • Voice Sessions: discord_bot/voice_session.py
  • Pipeline: pipeline/orchestrator.py
  • TTS: server/tts.py

🐛 Troubleshooting

WebSocket Connection Fails

  • Verify Gateway is running: ssh Hyriel@192.168.50.9 'sudo /usr/local/bin/docker logs --tail 50 openclaw-gateway'
  • Check NAS firewall allows port 18789
  • Verify auth token is correct
  • Check logs for connection errors

Bot Doesn't Respond to Voice

  • Check VAD is detecting speech (logs should show "speech detected")
  • Verify STT model is downloaded (first run downloads ~500MB-5GB)
  • Check OpenClaw Gateway receives messages (NAS logs)
  • Verify TTS server is reachable: curl http://192.168.50.47:8004/health

Agent Switching Doesn't Work

  • Verify session management is passing current_agent to LLM handler
  • Check that session.current_agent is updated by /agent command
  • Verify Gateway session key uses correct agent ID

Status Summary: 40% Complete (2/5 major tasks done)

Estimated Time to Completion: 2-4 hours (with testing)