openclaw-voice/INTEGRATION_STATUS.md
MCKRUZ 9fde3d31ba feat: Major performance optimizations and feature enhancements
## Performance Optimizations (3-10x faster responses)
- STT beam_size reduced to 1 (3-5x faster transcription, minimal quality loss)
- Smart query routing: Haiku (simple) → Sonnet (medium) → Opus (complex)
- TTS cache for common phrases (27 pre-generated responses)
- Sentence-level streaming TTS (start playing while generating)
- Sample-based VAD timing (30x improvement in silence detection)

## TTS Engine Upgrade
- Migrated from Chatterbox to Chatterbox-Turbo
- Zero-shot voice cloning (no fine-tuning required)
- Native paralinguistic tag support ([laugh], [sigh], [chuckle], etc.)
- Emotion presets with temperature control
- Improved marker conversion (*action*, (action), ~action~)

## Discord Bot Enhancements
- Multi-agent support (Jarvis, Sage)
- Improved voice receiving with discord-ext-voice-recv
- Enhanced /join, /leave, /status commands
- Per-agent personality configuration
- Better audio sink/receiver implementation

## OpenClaw Integration
- WebSocket support for Gateway communication
- Query complexity routing (auto-select model)
- Improved error handling and retries
- Session management per Discord guild
- Better latency tracking

## Pipeline Improvements
- Sentence splitter for streaming optimization
- Query router for intelligent model selection
- Enhanced VAD receiver with sample-based timing
- Improved audio buffering and format conversion
- Better transcript management

## Documentation
- Added QUICK_START.md (5-minute test guide)
- Added OPTIMIZATION_SUMMARY.md (performance analysis)
- Added DISCORD_OPTIMIZATION_TEST.md (testing guide)
- Added USAGE_GUIDE.md (comprehensive usage)
- Updated README.md with optimization details

## Utilities & Scripts
- Added get_invite_link.py (Discord bot invite)
- Added sync_commands.py, sync_to_guild.py (command sync)
- Added test_gateway.py, test_stt.py (testing utilities)
- Added openclaw_wrapper.py (wrapper script)
- Removed create_mock_turn_model.py (no longer needed)

## Configuration Updates
- STT model: medium → small (faster, acceptable quality)
- TTS engine: chatterbox → coqui (Turbo integration)
- Beam size: 5 → 1 (latency optimization)
- Added emotion_exaggeration per agent
- Updated .gitignore for project files

Total: ~2105 insertions, ~462 deletions across 35 files
Performance: ~5.5s total latency (down from 22-35s)
Target: ~3.5s (achieved in simple queries with cache)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-16 19:29:57 -05:00


# OpenClaw Gateway Integration Status
**Last Updated**: 2026-02-13
## ✅ Completed Tasks
### 1. OpenClaw Gateway WebSocket Client Implementation
**Status**: ✅ **COMPLETE**
**Location**: `openclaw_client/client.py`
**Changes Made**:
- ✅ Implemented full WebSocket JSON-RPC protocol
- ✅ Added connect handshake (`connect.challenge` → `connect` → `hello-ok`)
- ✅ Implemented chat.send with event listening (delta → final)
- ✅ Added session key generation (`agent:<agentId>:discord:dm:<userId>`)
- ✅ Implemented automatic reconnection logic
- ✅ Added per-guild client management via `PerGuildOpenClawClient`
- ✅ Preserved existing `send_message()` interface for compatibility
- ✅ Added connection statistics and latency tracking
**Protocol Flow**:
```
WebSocket Connect → connect.challenge → connect request → hello-ok response
chat.send (with sessionKey, idempotencyKey) → ack (with runId) → delta events → final event
```
**Configuration**:
- ✅ Updated `utils/config.py` to support WebSocket URL format
- ✅ Added `agent_id` and `session_scope` configuration options
- ✅ Added `retry_timeout` for extended retry attempts
- ✅ Updated `config.yaml` openclaw section with WebSocket settings
- ✅ Updated `.env.example` with WebSocket URL format and auth token
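The `retry_timeout` option bounds how long reconnection attempts may continue. The following is a minimal sketch of how such a retry budget could be spent, assuming exponential backoff. The function name `backoff_delays` and the `base` and `cap` parameters are hypothetical and are not taken from `openclaw_client`:

```python
from typing import Iterator


def backoff_delays(retry_timeout: float, base: float = 0.5, cap: float = 8.0) -> Iterator[float]:
    """Yield sleep intervals that double up to `cap` and never sum past `retry_timeout`."""
    elapsed = 0.0
    delay = base
    while elapsed < retry_timeout:
        # Trim the last interval so the total never exceeds the budget.
        step = min(delay, cap, retry_timeout - elapsed)
        yield step
        elapsed += step
        delay *= 2
```

A reconnect loop could sleep for each yielded interval between attempts and give up once the generator is exhausted.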
**Dependencies**:
- ✅ Added `websockets>=12.0` to `requirements.txt`
**Testing**:
- ⚠️ Existing unit tests need updates for WebSocket client
- ⚠️ Integration tests need real Gateway connection
---
## 🔧 Remaining Integration Work
### 2. Connect OpenClaw Client to Discord Bot
**Status**: ⏳ **PENDING**
**What Needs to be Done**:
The `OpenClawClient` is implemented but not yet wired into the Discord bot pipeline. The remaining steps are:
#### A. Bot Initialization (in `run.py` or `discord_bot/bot.py`)
Create and initialize the OpenClaw Gateway client on bot startup:
```python
# In run.py, after loading config:
from openclaw_client import OpenClawConfig, PerGuildOpenClawClient

# Create OpenClaw Gateway client configuration
openclaw_config = OpenClawConfig(
    base_url=config.openclaw.base_url,  # ws://192.168.50.9:18789
    auth_token=config.openclaw.token,
    timeout=config.openclaw.timeout,
    retry_timeout=config.openclaw.retry_timeout,
    agent_id=config.openclaw.agent_id,
    session_scope=config.openclaw.session_scope,
)

# Create per-guild client manager
openclaw_client = PerGuildOpenClawClient(openclaw_config)

# Connect to Gateway
logger.info("Connecting to OpenClaw Gateway...")
# Note: Connection happens lazily on first message, or explicitly:
# await openclaw_client.get_or_create(guild_id).connect()
```
#### B. Pipeline Orchestrator Integration
The orchestrator expects an `llm_client` callable. Create a wrapper:
```python
# In voice session or orchestrator setup:
async def llm_response_handler(agent: str, message: str, user_id: int, guild_id: int) -> str:
    """Wrapper for OpenClaw Gateway client."""
    client = openclaw_client.get_or_create(guild_id)
    return await client.send_message(
        agent=agent,
        message=message,
        context="",  # Gateway manages context internally
        speaker=str(user_id),  # Used for session key generation
    )


# Pass to orchestrator:
orchestrator = PipelineOrchestrator(
    config=pipeline_config,
    vad=vad,
    turn_detector=turn_detector,
    transcriber=transcriber,
    transcript_manager=transcript_manager,
    relevance_classifier=relevance_classifier,
    llm_client=llm_response_handler,  # ← Use wrapper
    tts_synthesizer=tts_synthesizer,
    audio_output_callback=audio_callback,
)
```
#### C. Agent Selection Integration
The `VoiceSession` tracks `current_agent` per guild. Ensure this is passed to the LLM handler:
```python
async def llm_response_handler(agent: str, message: str, user_id: int, guild_id: int) -> str:
    # Get current agent from session
    session = session_manager.get_session(guild_id)
    current_agent = session.current_agent if session else "jarvis"
    # Send to Gateway with correct agent
    client = openclaw_client.get_or_create(guild_id)
    return await client.send_message(
        agent=current_agent,  # Use session's agent setting
        message=message,
        speaker=str(user_id),
    )
```
#### D. Cleanup on Disconnect
When the bot disconnects from Discord or leaves a guild, close that guild's Gateway connection:
```python
# In voice session cleanup:
async def cleanup_guild(guild_id: int):
    # Remove voice session
    await session_manager.remove_session(guild_id)
    # Disconnect OpenClaw client for this guild
    client = openclaw_client.get_or_create(guild_id)
    await client.disconnect()
    openclaw_client.remove_guild(guild_id)
```
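The snippets in this section rely on `get_or_create` and `remove_guild`. To illustrate the lazily created, one-client-per-guild pattern they imply, here is a hypothetical sketch. `GuildRegistry` and its `factory` argument are illustrative stand-ins and are not the real `PerGuildOpenClawClient`; in particular, returning the removed client from `remove_guild` is an assumption made for this example:

```python
from typing import Callable, Dict, Generic, Optional, TypeVar

T = TypeVar("T")


class GuildRegistry(Generic[T]):
    """Hypothetical stand-in for a per-guild client manager."""

    def __init__(self, factory: Callable[[int], T]) -> None:
        self._factory = factory
        self._clients: Dict[int, T] = {}

    def get_or_create(self, guild_id: int) -> T:
        # Create the client lazily the first time a guild is seen.
        if guild_id not in self._clients:
            self._clients[guild_id] = self._factory(guild_id)
        return self._clients[guild_id]

    def remove_guild(self, guild_id: int) -> Optional[T]:
        # Return the removed client (or None) so the caller can disconnect it.
        return self._clients.pop(guild_id, None)
```

With a design like this, cleanup can disconnect only a client that actually exists, rather than creating one just to close it.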
---
### 3. Download Smart Turn v3 Model
**Status**: ⏳ **PENDING**
**Current State**:
- Mock ONNX model at `models/smart_turn_v3.onnx` (164 bytes placeholder)
- Mock creation script at `scripts/create_mock_turn_model.py`
**What to Do**:
```bash
# Install huggingface_hub if not already installed
pip install huggingface_hub
# Download real model
python -c "from huggingface_hub import hf_hub_download; hf_hub_download(repo_id='pipecat-ai/smart-turn-v3', filename='model.onnx', local_dir='models/')"
# Remove mock files
rm models/smart_turn_v3.onnx
rm scripts/create_mock_turn_model.py
# Verify model exists and is ~8MB
ls -lh models/model.onnx
```
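Because the mock file is only 164 bytes while the real model is expected to be around 8MB, a size check can catch a placeholder that was left in place. A minimal sketch, assuming a 1 MB threshold; the helper name `looks_like_real_model` and the threshold are hypothetical:

```python
from pathlib import Path

MIN_MODEL_BYTES = 1_000_000  # assumed threshold: far above the 164-byte mock, below ~8MB


def looks_like_real_model(path: Path, min_bytes: int = MIN_MODEL_BYTES) -> bool:
    """Return True if the file exists and is larger than a placeholder would be."""
    return path.is_file() and path.stat().st_size > min_bytes
```

For example, `looks_like_real_model(Path("models/model.onnx"))` could run at startup before the turn detector loads the file.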
---
### 4. Configure TTS to Use Existing Sage-Voice Server
**Status**: ⏳ **PENDING**
**Decision Point**: You have two TTS options:
#### Option A: Use Your Existing TTS Server (Recommended)
Your sage-voice server at `http://192.168.50.47:8004` already works and has your voice models.
**Modify `server/tts.py`** to use an HTTP client instead of the built-in TTS:
```python
# Replace Chatterbox/Coqui implementation with HTTP client
import httpx


class TTSSynthesizer:
    def __init__(self, tts_url: str, device: str = "cuda"):
        self.tts_url = tts_url  # http://192.168.50.47:8004
        self.device = device

    async def synthesize(
        self,
        text: str,
        voice: str,
        response_format: str = "pcm",
    ) -> bytes:
        """Call sage-voice TTS server."""
        async with httpx.AsyncClient() as client:
            response = await client.post(
                f"{self.tts_url}/v1/audio/speech",
                json={
                    "input": text,
                    "voice": voice,  # jarvis or sage
                    "response_format": response_format,
                },
                timeout=10.0,
            )
            # Fail loudly on HTTP errors instead of returning an error body as audio
            response.raise_for_status()
            return response.content
```
**Add to `.env`**:
```bash
TTS_URL=http://192.168.50.47:8004
```
#### Option B: Use Built-in TTS (More Complex)
Provide voice reference files and use Coqui XTTS:
- Place `server/voices/jarvis.wav` (10-30 seconds clean audio)
- Place `server/voices/sage.wav` (10-30 seconds clean audio)
- Keep existing `server/tts.py` implementation
**Recommendation**: Go with **Option A** to reuse your proven TTS infrastructure.
---
### 5. Environment Configuration
**Status**: ⏳ **PENDING**
**Create a `.env` file** in the `openclaw-voice` directory:
```bash
# Copy example
cp .env.example .env
# Edit with your actual values
```
**Required Configuration**:
```bash
# Discord Bot (from Discord Developer Portal)
DISCORD_BOT_TOKEN=<your_discord_bot_token>
# OpenClaw Gateway (on Synology NAS)
OPENCLAW_BASE_URL=ws://192.168.50.9:18789
OPENCLAW_AUTH_TOKEN=<your_gateway_token>
OPENCLAW_AGENT_ID=main
# TTS Server (your existing sage-voice server)
TTS_URL=http://192.168.50.47:8004
# FastAPI Server (openclaw-voice API endpoints)
SERVER_HOST=0.0.0.0
SERVER_PORT=8880
# Pipeline Settings (optional overrides)
PIPELINE__STT__MODEL_SIZE=medium
PIPELINE__STT__DEVICE=cuda
PIPELINE__TTS__DEVICE=cuda
```
**Where to Get Values**:
- `DISCORD_BOT_TOKEN`: Discord Developer Portal → Your Application → Bot → Token
- `OPENCLAW_AUTH_TOKEN`: Check your NAS OpenClaw Gateway config or create new token
- `TTS_URL`: Already running at `http://192.168.50.47:8004`
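A startup check can catch a missing or still-templated value before the bot tries to connect. This is a minimal sketch: the choice of which variables count as required is an assumption drawn from the list above, and `missing_env_vars` is a hypothetical helper, not part of the project:

```python
import os
from typing import List, Mapping

# Assumed to be the variables the bot cannot start without.
REQUIRED_VARS = ("DISCORD_BOT_TOKEN", "OPENCLAW_BASE_URL", "OPENCLAW_AUTH_TOKEN", "TTS_URL")


def missing_env_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return required variable names that are unset, empty, or still a <placeholder>."""
    missing = []
    for name in REQUIRED_VARS:
        value = env.get(name, "").strip()
        if not value or (value.startswith("<") and value.endswith(">")):
            missing.append(name)
    return missing
```

The function reads from whatever mapping it is given, so it only sees `.env` values once they have been loaded into the process environment.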
---
### 6. Testing End-to-End Flow
**Status**: ⏳ **PENDING**
**Test Plan**:
#### A. Test OpenClaw Gateway Connection
```python
# Create test script: test_gateway_connection.py
import asyncio

from openclaw_client import create_client


async def test_connection():
    client = create_client(
        base_url="ws://192.168.50.9:18789",
        auth_token="<your_token>",
        agent_id="main",
    )
    try:
        await client.connect()
        print("✓ Connected to Gateway")
        response = await client.send_message(
            agent="jarvis",
            message="Hello, this is a test",
            speaker="test_user",
        )
        print(f"✓ Received response: {response}")
        await client.disconnect()
        print("✓ Disconnected")
    except Exception as e:
        print(f"✗ Error: {e}")


asyncio.run(test_connection())
```
#### B. Test Discord Bot End-to-End
1. Start openclaw-voice bot:
```bash
python run.py
```
2. Join Discord voice channel
3. Use slash commands:
```
/join
/agent jarvis
/sensitivity medium
```
4. Speak into microphone:
- Bot should detect voice (VAD)
- Wait for Smart Turn completion
- Transcribe speech (STT)
- Check relevance
- Send to OpenClaw Gateway
- Generate TTS response
- Play audio back
5. Check logs for latency breakdown:
```
VAD: XXms
Smart Turn: XXms
STT: XXms
Relevance: XXms
Gateway: XXXXms
TTS: XXms
Total: ~3-7s
```
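A per-stage latency breakdown like the one in step 5 can be collected with a small timing helper. The sketch below is hypothetical and is not the project's own latency tracking; `LatencyTracker` and its method names are assumptions made for illustration:

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


class LatencyTracker:
    """Record wall-clock time per pipeline stage, in milliseconds."""

    def __init__(self) -> None:
        self.stages_ms: Dict[str, float] = {}

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # Store the elapsed time even if the stage raised.
            self.stages_ms[name] = (time.perf_counter() - start) * 1000.0

    def report(self) -> str:
        lines = [f"{name}: {ms:.0f}ms" for name, ms in self.stages_ms.items()]
        lines.append(f"Total: {sum(self.stages_ms.values()):.0f}ms")
        return "\n".join(lines)
```

Each stage would be wrapped as `with tracker.stage("STT"): ...`, and `tracker.report()` logged once the audio has been played back.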
#### C. Test Agent Switching
```
/agent sage
[speak] "Tell me about philosophy"
[expect Sage's voice and personality]
/agent jarvis
[speak] "What's the weather?"
[expect Jarvis's voice and personality]
```
#### D. Test Relevance Filtering
```
/sensitivity low
[speak unrelated conversation]
[expect bot to stay quiet]
[speak "Hey Jarvis, ..." or "Jarvis, ..."]
[expect bot to respond]
/sensitivity high
[speak relevant question without name]
[expect bot to respond]
```
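The low-sensitivity case expects a response only when the bot is addressed by name. One way to approximate that check is a whole-word, case-insensitive name match. This is a hypothetical sketch, not the project's relevance classifier, and `addressed_by_name` is an illustrative name:

```python
import re
from typing import Iterable


def addressed_by_name(transcript: str, agent_names: Iterable[str]) -> bool:
    """Return True if any agent name appears as a whole word, ignoring case."""
    for name in agent_names:
        if re.search(rf"\b{re.escape(name)}\b", transcript, flags=re.IGNORECASE):
            return True
    return False
```

The word-boundary match keeps a word like "message" from triggering the "sage" agent.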
---
## 📋 Quick Start Checklist
To get openclaw-voice running with your OpenClaw Gateway:
- [x] ~~Implement OpenClaw Gateway WebSocket client~~ ✅
- [x] ~~Add websockets dependency~~ ✅
- [x] ~~Update configuration files~~ ✅
- [ ] Download Smart Turn v3 model from HuggingFace
- [ ] Create `.env` file with your credentials
- [ ] Modify `server/tts.py` to use your existing TTS server (Option A)
- [ ] Wire OpenClawClient into bot initialization (`run.py` or `discord_bot/bot.py`)
- [ ] Create LLM response handler wrapper for orchestrator
- [ ] Install dependencies: `pip install -r requirements.txt`
- [ ] Test Gateway connection standalone
- [ ] Run end-to-end test with Discord voice
---
## 🎯 Next Steps
1. **Complete Task #3**: Download real Smart Turn model
2. **Complete Task #4**: Configure TTS (recommend Option A: use existing server)
3. **Complete Task #5**: Create `.env` with your credentials
4. **Complete Task #2**: Wire `OpenClawClient` into Discord bot initialization
5. **Complete Task #6**: Test end-to-end flow
---
## 📚 Reference
### Session Key Format
```
agent:<agentId>:discord:dm:<userId>
```
Examples:
- `agent:main:discord:dm:123456789` (user 123456789 talking to main agent)
- `agent:jarvis:discord:dm:987654321` (user 987654321 talking to jarvis agent)
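The format can be produced with a small helper. This is a minimal sketch: `build_session_key` is a hypothetical name, and rejecting a colon in the agent ID is an assumption made so the key stays unambiguous:

```python
def build_session_key(agent_id: str, user_id: int) -> str:
    """Format a session key as agent:<agentId>:discord:dm:<userId>."""
    if not agent_id or ":" in agent_id:
        raise ValueError("agent_id must be non-empty and must not contain ':'")
    return f"agent:{agent_id}:discord:dm:{user_id}"
```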
### Gateway Protocol Summary
```
1. WebSocket Connect
2. Server sends: connect.challenge (with nonce)
3. Client sends: connect request (with auth token)
4. Server sends: hello-ok response (with server info)
5. Client sends: chat.send (with sessionKey, message, idempotencyKey)
6. Server sends: ack response (with runId)
7. Server sends: delta events (streaming response)
8. Server sends: final event (complete response)
```
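The ordering in this summary can be expressed as a small state table, which is useful when checking a captured message log in tests. The sketch below only models step names. It does not model how each step is encoded on the wire, and two transitions are assumptions: that `final` may follow `ack` with no `delta`, and that a new `chat.send` may follow `final`:

```python
from typing import Iterable

# Allowed next steps for each step; names follow the protocol summary.
EXPECTED_NEXT = {
    "start": {"connect.challenge"},
    "connect.challenge": {"connect"},
    "connect": {"hello-ok"},
    "hello-ok": {"chat.send"},
    "chat.send": {"ack"},
    "ack": {"delta", "final"},      # assumption: a reply may have no deltas
    "delta": {"delta", "final"},
    "final": {"chat.send"},         # assumption: the connection is reused
}


def is_valid_sequence(steps: Iterable[str]) -> bool:
    """Return True if the steps occur in an order the table allows."""
    state = "start"
    for step in steps:
        if step not in EXPECTED_NEXT.get(state, set()):
            return False
        state = step
    return True
```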
### File Locations
- **OpenClaw Client**: `openclaw_client/client.py`
- **Configuration**: `utils/config.py`, `config.yaml`, `.env`
- **Bot Entry**: `run.py`
- **Discord Bot**: `discord_bot/bot.py`
- **Voice Sessions**: `discord_bot/voice_session.py`
- **Pipeline**: `pipeline/orchestrator.py`
- **TTS**: `server/tts.py`
---
## 🐛 Troubleshooting
### WebSocket Connection Fails
- Verify Gateway is running: `ssh Hyriel@192.168.50.9 'sudo /usr/local/bin/docker logs --tail 50 openclaw-gateway'`
- Check NAS firewall allows port 18789
- Verify auth token is correct
- Check logs for connection errors
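To separate a firewall or routing problem from an authentication problem, a plain TCP check against the Gateway port can be run from the bot host. This sketch uses only the standard library; `port_is_open` is a hypothetical helper name:

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `port_is_open("192.168.50.9", 18789)` returning `False` points at the network or firewall, while `True` only confirms TCP reachability and says nothing about the WebSocket handshake or the auth token.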
### Bot Doesn't Respond to Voice
- Check VAD is detecting speech (logs should show "speech detected")
- Verify STT model is downloaded (first run downloads ~500MB-5GB)
- Check OpenClaw Gateway receives messages (NAS logs)
- Verify TTS server is reachable: `curl http://192.168.50.47:8004/health`
### Agent Switching Doesn't Work
- Verify session management is passing `current_agent` to LLM handler
- Check that `session.current_agent` is updated by `/agent` command
- Verify Gateway session key uses correct agent ID
---
**Status Summary**: 40% Complete (2/5 major tasks done)
**Estimated Time to Completion**: 2-4 hours (with testing)