MCKRUZ 9fde3d31ba feat: Major performance optimizations and feature enhancements

## Performance Optimizations (3-10x faster responses)
- STT beam_size reduced to 1 (3-5x faster transcription, minimal quality loss)
- Smart query routing: Haiku (simple) → Sonnet (medium) → Opus (complex)
- TTS cache for common phrases (27 pre-generated responses)
- Sentence-level streaming TTS (start playing while generating)
- Sample-based VAD timing (30x improvement in silence detection)

## TTS Engine Upgrade
- Migrated from Chatterbox to Chatterbox-Turbo
- Zero-shot voice cloning (no fine-tuning required)
- Native paralinguistic tag support ([laugh], [sigh], [chuckle], etc.)
- Emotion presets with temperature control
- Improved marker conversion (*action*, (action), ~action~)

## Discord Bot Enhancements
- Multi-agent support (Jarvis, Sage)
- Improved voice receiving with discord-ext-voice-recv
- Enhanced /join, /leave, /status commands
- Per-agent personality configuration
- Better audio sink/receiver implementation

## OpenClaw Integration
- WebSocket support for Gateway communication
- Query complexity routing (auto-select model)
- Improved error handling and retries
- Session management per Discord guild
- Better latency tracking

## Pipeline Improvements
- Sentence splitter for streaming optimization
- Query router for intelligent model selection
- Enhanced VAD receiver with sample-based timing
- Improved audio buffering and format conversion
- Better transcript management

## Documentation
- Added QUICK_START.md (5-minute test guide)
- Added OPTIMIZATION_SUMMARY.md (performance analysis)
- Added DISCORD_OPTIMIZATION_TEST.md (testing guide)
- Added USAGE_GUIDE.md (comprehensive usage)
- Updated README.md with optimization details

## Utilities & Scripts
- Added get_invite_link.py (Discord bot invite)
- Added sync_commands.py, sync_to_guild.py (command sync)
- Added test_gateway.py, test_stt.py (testing utilities)
- Added openclaw_wrapper.py (wrapper script)
- Removed create_mock_turn_model.py (no longer needed)

## Configuration Updates
- STT model: medium → small (faster, acceptable quality)
- TTS engine: chatterbox → coqui (Turbo integration)
- Beam size: 5 → 1 (latency optimization)
- Added emotion_exaggeration per agent
- Updated .gitignore for project files

Total: ~2105 insertions, ~462 deletions across 35 files
Performance: ~5.5s total latency (down from 22-35s)
Target: ~3.5s (achieved in simple queries with cache)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2026-02-16 19:29:57 -05:00

OpenClaw Gateway Integration Status

Last Updated: 2026-02-13

✅ Completed Tasks

1. OpenClaw Gateway WebSocket Client Implementation

Status: ✅ COMPLETE

Location: openclaw_client/client.py

Changes Made:

✅ Implemented full WebSocket JSON-RPC protocol
✅ Added connect handshake (connect.challenge → connect → hello-ok)
✅ Implemented chat.send with event listening (delta → final)
✅ Added session key generation (agent:<agentId>:discord:dm:<userId>)
✅ Implemented automatic reconnection logic
✅ Added per-guild client management via PerGuildOpenClawClient
✅ Preserved existing send_message() interface for compatibility
✅ Added connection statistics and latency tracking

Protocol Flow:

WebSocket Connect → connect.challenge → connect request → hello-ok response
↓
chat.send (with sessionKey, idempotencyKey) → ack (with runId) → delta events → final event
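The chat.send step of this flow can be sketched as below. The frame layout (a JSON-RPC 2.0 style envelope with `method` and `params`) and the helper name `build_chat_send` are assumptions for illustration only; the real wire format is whatever the Gateway and openclaw_client/client.py define. The sketch only shows that each request carries the session key plus a fresh idempotency key.

```python
import json
import uuid


def build_chat_send(session_key: str, message: str) -> dict:
    """Build a hypothetical chat.send request frame (layout is assumed)."""
    return {
        "jsonrpc": "2.0",  # assumed envelope, not confirmed by the Gateway docs
        "id": str(uuid.uuid4()),
        "method": "chat.send",
        "params": {
            "sessionKey": session_key,
            "message": message,
            # A new key per logical request lets the server drop duplicates on retry.
            "idempotencyKey": str(uuid.uuid4()),
        },
    }


frame = build_chat_send("agent:main:discord:dm:123456789", "Hello")
payload = json.dumps(frame)  # text frame that would be sent over the WebSocket
```

When retrying the same logical request, the idempotency key would normally be reused rather than regenerated, so a retry helper should build the frame once and resend it.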

Configuration:

✅ Updated utils/config.py to support WebSocket URL format
✅ Added agent_id and session_scope configuration options
✅ Added retry_timeout for extended retry attempts
✅ Updated config.yaml openclaw section with WebSocket settings
✅ Updated .env.example with WebSocket URL format and auth token

Dependencies:

✅ Added websockets>=12.0 to requirements.txt

Testing:

⚠️ Existing unit tests need updates for WebSocket client
⚠️ Integration tests need real Gateway connection

🔧 Remaining Integration Work

2. Connect OpenClaw Client to Discord Bot

Status: ⏳ PENDING

What Needs to be Done:

The OpenClawClient is implemented but not yet wired into the Discord bot pipeline. Here's what needs to happen:

A. Bot Initialization (in `run.py` or `discord_bot/bot.py`)

Create and initialize the OpenClaw Gateway client on bot startup:

# In run.py, after loading config:

from openclaw_client import OpenClawConfig, PerGuildOpenClawClient

# Create OpenClaw Gateway client configuration
openclaw_config = OpenClawConfig(
    base_url=config.openclaw.base_url,  # ws://192.168.50.9:18789
    auth_token=config.openclaw.token,
    timeout=config.openclaw.timeout,
    retry_timeout=config.openclaw.retry_timeout,
    agent_id=config.openclaw.agent_id,
    session_scope=config.openclaw.session_scope,
)

# Create per-guild client manager
openclaw_client = PerGuildOpenClawClient(openclaw_config)

# Connect to Gateway
logger.info("Connecting to OpenClaw Gateway...")
# Note: Connection happens lazily on first message, or explicitly:
# await openclaw_client.get_or_create(guild_id).connect()

B. Pipeline Orchestrator Integration

The orchestrator expects an llm_client callable. Create a wrapper:

# In voice session or orchestrator setup:

async def llm_response_handler(agent: str, message: str, user_id: int, guild_id: int) -> str:
    """Wrapper for OpenClaw Gateway client."""
    client = openclaw_client.get_or_create(guild_id)
    return await client.send_message(
        agent=agent,
        message=message,
        context="",  # Gateway manages context internally
        speaker=str(user_id)  # Used for session key generation
    )

# Pass to orchestrator:
orchestrator = PipelineOrchestrator(
    config=pipeline_config,
    vad=vad,
    turn_detector=turn_detector,
    transcriber=transcriber,
    transcript_manager=transcript_manager,
    relevance_classifier=relevance_classifier,
    llm_client=llm_response_handler,  # ← Use wrapper
    tts_synthesizer=tts_synthesizer,
    audio_output_callback=audio_callback,
)

C. Agent Selection Integration

The VoiceSession tracks current_agent per guild. Ensure this is passed to the LLM handler:

async def llm_response_handler(agent: str, message: str, user_id: int, guild_id: int) -> str:
    # Get current agent from session
    session = session_manager.get_session(guild_id)
    current_agent = session.current_agent if session else "jarvis"

    # Send to Gateway with correct agent
    client = openclaw_client.get_or_create(guild_id)
    return await client.send_message(
        agent=current_agent,  # Use session's agent setting
        message=message,
        speaker=str(user_id)
    )

D. Cleanup on Disconnect

When the bot disconnects from Discord or leaves a guild, close the Gateway connection for that guild:

# In voice session cleanup:

async def cleanup_guild(guild_id: int):
    # Remove voice session
    await session_manager.remove_session(guild_id)

    # Disconnect OpenClaw client for this guild
    client = openclaw_client.get_or_create(guild_id)
    await client.disconnect()
    openclaw_client.remove_guild(guild_id)
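One thing to check in the cleanup path: if `get_or_create` builds a client when none exists, calling it during cleanup would create a client only to close it. Whether PerGuildOpenClawClient behaves that way is not confirmed here. The sketch below uses a hypothetical `PerGuildRegistry` class (not the project's actual manager) to show a non-creating `pop` that avoids this.

```python
from typing import Callable, Dict, Generic, Optional, TypeVar

T = TypeVar("T")


class PerGuildRegistry(Generic[T]):
    """Hypothetical stand-in for a per-guild client manager."""

    def __init__(self, factory: Callable[[int], T]) -> None:
        self._factory = factory
        self._clients: Dict[int, T] = {}

    def get_or_create(self, guild_id: int) -> T:
        """Return the guild's client, creating it on first use."""
        if guild_id not in self._clients:
            self._clients[guild_id] = self._factory(guild_id)
        return self._clients[guild_id]

    def pop(self, guild_id: int) -> Optional[T]:
        """Remove and return the guild's client, or None if it never existed."""
        return self._clients.pop(guild_id, None)


registry = PerGuildRegistry(lambda guild_id: f"client-{guild_id}")
first = registry.get_or_create(42)
removed = registry.pop(42)   # returns the existing client
missing = registry.pop(99)   # returns None, nothing is created
```

With this shape, cleanup would call `pop(guild_id)` and disconnect only when the result is not None.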

3. Download Smart Turn v3 Model

Status: ⏳ PENDING

Current State:

Mock ONNX model at models/smart_turn_v3.onnx (164 bytes placeholder)
Mock creation script at scripts/create_mock_turn_model.py

What to Do:

# Install huggingface_hub if not already installed
pip install huggingface_hub

# Download real model
python -c "from huggingface_hub import hf_hub_download; hf_hub_download(repo_id='pipecat-ai/smart-turn-v3', filename='model.onnx', local_dir='models/')"

# Remove mock files
rm models/smart_turn_v3.onnx
rm scripts/create_mock_turn_model.py

# Verify model exists and is ~8MB
ls -lh models/model.onnx
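The size check can also be done at startup so the bot refuses to run with the placeholder. This is a minimal sketch: the function name `looks_like_real_model` and the 1 MB threshold are assumptions, chosen only because the mock is 164 bytes and the real model is expected to be around 8 MB.

```python
from pathlib import Path


def looks_like_real_model(path: str, min_bytes: int = 1_000_000) -> bool:
    """Return True if the file exists and is too large to be the mock placeholder."""
    model = Path(path)
    return model.is_file() and model.stat().st_size >= min_bytes
```

A caller could log a warning or raise when `looks_like_real_model("models/model.onnx")` is False.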

4. Configure TTS to Use Existing Sage-Voice Server

Status: ⏳ PENDING

Decision Point: You have two TTS options:

Option A: Use Your Existing TTS Server (Recommended)

Your sage-voice server at http://192.168.50.47:8004 already works and has your voice models.

Modify server/tts.py to use HTTP client instead of built-in TTS:

# Replace Chatterbox/Coqui implementation with HTTP client

import httpx

class TTSSynthesizer:
    def __init__(self, tts_url: str, device: str = "cuda"):
        self.tts_url = tts_url  # http://192.168.50.47:8004
        self.device = device

    async def synthesize(
        self,
        text: str,
        voice: str,
        response_format: str = "pcm"
    ) -> bytes:
        """Call sage-voice TTS server."""
        async with httpx.AsyncClient() as client:
            response = await client.post(
                f"{self.tts_url}/v1/audio/speech",
                json={
                    "input": text,
                    "voice": voice,  # jarvis or sage
                    "response_format": response_format
                },
                timeout=10.0
            )
            # Raise on 4xx/5xx so an error body is never returned as audio
            response.raise_for_status()
            return response.content

Add to .env:

TTS_URL=http://192.168.50.47:8004

Option B: Use Built-in TTS (More Complex)

Provide voice reference files and use Coqui XTTS:

Place server/voices/jarvis.wav (10-30 seconds clean audio)
Place server/voices/sage.wav (10-30 seconds clean audio)
Keep existing server/tts.py implementation

Recommendation: Go with Option A to reuse your proven TTS infrastructure.

5. Environment Configuration

Status: ⏳ PENDING

Create a .env file in the openclaw-voice directory:

# Copy example
cp .env.example .env

# Edit with your actual values

Required Configuration:

# Discord Bot (from Discord Developer Portal)
DISCORD_BOT_TOKEN=<your_discord_bot_token>

# OpenClaw Gateway (on Synology NAS)
OPENCLAW_BASE_URL=ws://192.168.50.9:18789
OPENCLAW_AUTH_TOKEN=<your_gateway_token>
OPENCLAW_AGENT_ID=main

# TTS Server (your existing sage-voice server)
TTS_URL=http://192.168.50.47:8004

# FastAPI Server (openclaw-voice API endpoints)
SERVER_HOST=0.0.0.0
SERVER_PORT=8880

# Pipeline Settings (optional overrides)
PIPELINE__STT__MODEL_SIZE=medium
PIPELINE__STT__DEVICE=cuda
PIPELINE__TTS__DEVICE=cuda
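The `PIPELINE__...` names suggest that a double underscore separates nesting levels. Assuming that convention (the actual loader in utils/config.py is not shown here), the mapping from flat variables to nested settings can be sketched as follows; `nest_env_overrides` is a hypothetical helper.

```python
def nest_env_overrides(env: dict, prefix: str = "PIPELINE__", sep: str = "__") -> dict:
    """Turn PIPELINE__A__B=value entries into {"a": {"b": value}}."""
    result: dict = {}
    for key, value in env.items():
        if not key.startswith(prefix):
            continue  # unrelated variables such as SERVER_PORT are ignored
        parts = key[len(prefix):].lower().split(sep)
        node = result
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return result


overrides = nest_env_overrides({
    "PIPELINE__STT__MODEL_SIZE": "medium",
    "PIPELINE__STT__DEVICE": "cuda",
    "PIPELINE__TTS__DEVICE": "cuda",
    "SERVER_PORT": "8880",
})
```

Here `overrides` groups the two STT settings under `"stt"` and the TTS device under `"tts"`. Values stay strings, so any type conversion would happen in the config layer.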

Where to Get Values:

DISCORD_BOT_TOKEN: Discord Developer Portal → Your Application → Bot → Token
OPENCLAW_AUTH_TOKEN: Check your NAS OpenClaw Gateway config or create new token
TTS_URL: Already running at 192.168.50.47:8004

6. Testing End-to-End Flow

Status: ⏳ PENDING

Test Plan:

A. Test OpenClaw Gateway Connection

# Create test script: test_gateway_connection.py

import asyncio
from openclaw_client import create_client

async def test_connection():
    client = create_client(
        base_url="ws://192.168.50.9:18789",
        auth_token="<your_token>",
        agent_id="main"
    )

    try:
        await client.connect()
        print("✓ Connected to Gateway")

        response = await client.send_message(
            agent="jarvis",
            message="Hello, this is a test",
            speaker="test_user"
        )
        print(f"✓ Received response: {response}")

        await client.disconnect()
        print("✓ Disconnected")

    except Exception as e:
        print(f"✗ Error: {e}")

asyncio.run(test_connection())

B. Test Discord Bot End-to-End

Start openclaw-voice bot:
```
python run.py
```
Join Discord voice channel

Use slash commands:

/join
/agent jarvis
/sensitivity medium

Speak into microphone:
- Bot should detect voice (VAD)
- Wait for Smart Turn completion
- Transcribe speech (STT)
- Check relevance
- Send to OpenClaw Gateway
- Generate TTS response
- Play audio back

Check logs for latency breakdown:

VAD: XXms
Smart Turn: XXms
STT: XXms
Relevance: XXms
Gateway: XXXXms
TTS: XXms
Total: ~3-7s
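If the logs do not yet show a per-stage breakdown, one way to produce it is a small timing helper wrapped around each stage. This is a sketch, not the project's latency tracking; the `LatencyTracker` class and its method names are hypothetical.

```python
import time
from contextlib import contextmanager


class LatencyTracker:
    """Collect per-stage wall-clock timings in milliseconds."""

    def __init__(self) -> None:
        self.stages: dict[str, float] = {}

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the stage raised.
            self.stages[name] = (time.perf_counter() - start) * 1000.0

    def report(self) -> str:
        lines = [f"{name}: {ms:.0f}ms" for name, ms in self.stages.items()]
        lines.append(f"Total: {sum(self.stages.values()):.0f}ms")
        return "\n".join(lines)


tracker = LatencyTracker()
with tracker.stage("STT"):
    time.sleep(0.01)  # stand-in for real transcription work
with tracker.stage("TTS"):
    time.sleep(0.01)  # stand-in for real synthesis work
print(tracker.report())
```

Stages are reported in the order they ran, which matches the pipeline order listed above when each stage is wrapped once per turn.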

C. Test Agent Switching

/agent sage
[speak] "Tell me about philosophy"
[expect Sage's voice and personality]

/agent jarvis
[speak] "What's the weather?"
[expect Jarvis's voice and personality]

D. Test Relevance Filtering

/sensitivity low
[speak unrelated conversation]
[expect bot to stay quiet]

[speak "Hey Jarvis, ..." or "Jarvis, ..."]
[expect bot to respond]

/sensitivity high
[speak relevant question without name]
[expect bot to respond]
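For the low-sensitivity case, the expected behavior amounts to "respond only when addressed by name". The sketch below illustrates that single rule with a regular expression. It is not the project's relevance classifier, and the function name `addresses_agent` and the default agent names are assumptions.

```python
import re


def addresses_agent(transcript: str, agent_names=("jarvis", "sage")) -> bool:
    """Return True if the utterance starts with an agent name, optionally after 'hey'."""
    names = "|".join(re.escape(name) for name in agent_names)
    pattern = rf"^\s*(?:hey\s+)?(?:{names})\b"
    return re.search(pattern, transcript, flags=re.IGNORECASE) is not None
```

A high-sensitivity mode would need more than this, since it must judge relevance for questions that never mention the agent.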

📋 Quick Start Checklist

To get openclaw-voice running with your OpenClaw Gateway:

~~Implement OpenClaw Gateway WebSocket client~~ ✅
~~Add websockets dependency~~ ✅
~~Update configuration files~~ ✅
Download Smart Turn v3 model from HuggingFace
Create .env file with your credentials
Modify server/tts.py to use your existing TTS server (Option A)
Wire OpenClawClient into bot initialization (run.py or discord_bot/bot.py)
Create LLM response handler wrapper for orchestrator
Test Gateway connection standalone
Install dependencies: pip install -r requirements.txt
Run end-to-end test with Discord voice

🎯 Next Steps

Complete Task #3: Download the real Smart Turn model
Complete Task #4: Configure TTS (recommend Option A: use the existing server)
Complete Task #5: Create .env with your credentials
Complete Task #2: Wire up the bot by integrating OpenClawClient into Discord bot initialization
Complete Task #6: Test the end-to-end flow

📚 Reference

Session Key Format

agent:<agentId>:discord:dm:<userId>

Examples:

agent:main:discord:dm:123456789 (user 123456789 talking to main agent)
agent:jarvis:discord:dm:987654321 (user 987654321 talking to jarvis agent)
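The format can be expressed as a small helper. This is a sketch: the name `build_session_key` is hypothetical, and rejecting `:` in the agent ID is an added assumption to keep the key unambiguous, not a documented Gateway rule.

```python
def build_session_key(agent_id: str, user_id: int) -> str:
    """Format a session key as agent:<agentId>:discord:dm:<userId>."""
    if not agent_id or ":" in agent_id:
        raise ValueError("agent_id must be non-empty and must not contain ':'")
    return f"agent:{agent_id}:discord:dm:{user_id}"


key = build_session_key("main", 123456789)
```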

Gateway Protocol Summary

1. WebSocket Connect
2. Server sends: connect.challenge (with nonce)
3. Client sends: connect request (with auth token)
4. Server sends: hello-ok response (with server info)
5. Client sends: chat.send (with sessionKey, message, idempotencyKey)
6. Server sends: ack response (with runId)
7. Server sends: delta events (streaming response)
8. Server sends: final event (complete response)
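Steps 6 to 8 can be sketched as a loop that accumulates streamed text until the final event. The event shape used here (dicts with a `"type"` key and an optional `"text"` key) is hypothetical; the real event fields are defined by the Gateway.

```python
from typing import Iterable


def collect_response(events: Iterable[dict]) -> str:
    """Accumulate delta text and return the full response at the final event."""
    chunks: list[str] = []
    for event in events:
        kind = event.get("type")
        if kind == "delta":
            chunks.append(event.get("text", ""))
        elif kind == "final":
            # Prefer the complete text if the final event carries it.
            return event.get("text") or "".join(chunks)
        # Other events, such as the ack with its runId, are skipped here.
    raise RuntimeError("stream ended before a final event was received")


events = [
    {"type": "ack", "runId": "run-1"},
    {"type": "delta", "text": "Hello"},
    {"type": "delta", "text": ", world"},
    {"type": "final"},
]
response = collect_response(events)
```

Raising when the stream ends without a final event gives the caller a clear signal to retry or reconnect instead of returning a truncated answer.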

File Locations

OpenClaw Client: openclaw_client/client.py
Configuration: utils/config.py, config.yaml, .env
Bot Entry: run.py
Discord Bot: discord_bot/bot.py
Voice Sessions: discord_bot/voice_session.py
Pipeline: pipeline/orchestrator.py
TTS: server/tts.py

🐛 Troubleshooting

WebSocket Connection Fails

Verify Gateway is running: ssh Hyriel@192.168.50.9 'sudo /usr/local/bin/docker logs --tail 50 openclaw-gateway'
Check NAS firewall allows port 18789
Verify auth token is correct
Check logs for connection errors
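A quick way to separate network problems from authentication problems is to test whether the port accepts TCP connections at all. This sketch uses only the standard library; `port_is_open` is a hypothetical helper name.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, `port_is_open("192.168.50.9", 18789)` returning False points at the firewall or the Gateway container rather than the auth token.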

Bot Doesn't Respond to Voice

Check VAD is detecting speech (logs should show "speech detected")
Verify STT model is downloaded (first run downloads ~500MB-5GB)
Check OpenClaw Gateway receives messages (NAS logs)
Verify TTS server is reachable: curl http://192.168.50.47:8004/health

Agent Switching Doesn't Work

Verify session management is passing current_agent to LLM handler
Check that session.current_agent is updated by /agent command
Verify Gateway session key uses correct agent ID

Status Summary: 40% Complete (2/5 major tasks done)

Estimated Time to Completion: 2-4 hours (with testing)
