openclaw-voice/INTEGRATION_STATUS.md

# OpenClaw Gateway Integration Status

**Last Updated**: 2026-02-13

## ✅ Completed Tasks

### 1. OpenClaw Gateway WebSocket Client Implementation

**Status**: ✅ **COMPLETE**

**Location**: `openclaw_client/client.py`

**Changes Made**:
- ✅ Implemented full WebSocket JSON-RPC protocol
- ✅ Added connect handshake (`connect.challenge` → `connect` → `hello-ok`)
- ✅ Implemented `chat.send` with event listening (delta → final)
- ✅ Added session key generation (`agent:<agentId>:discord:dm:<userId>`)
- ✅ Implemented automatic reconnection logic
- ✅ Added per-guild client management via `PerGuildOpenClawClient`
- ✅ Preserved existing `send_message()` interface for compatibility
- ✅ Added connection statistics and latency tracking
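
The reconnection logic listed above is not reproduced here. As a rough illustration only, a retry schedule bounded by a total budget such as `retry_timeout` could look like the sketch below. The helper name `backoff_delays` and the `base` and `cap` parameters are assumptions, and the real client in `openclaw_client/client.py` may use a different strategy.

```python
from typing import Iterator


def backoff_delays(retry_timeout: float, base: float = 0.5, cap: float = 10.0) -> Iterator[float]:
    """Yield sleep intervals that double per attempt and never exceed the total budget."""
    elapsed = 0.0
    delay = base
    while elapsed < retry_timeout:
        # Clamp each wait to the per-attempt cap and to the remaining budget
        step = min(delay, cap, retry_timeout - elapsed)
        yield step
        elapsed += step
        delay *= 2
```

A caller would sleep for each yielded interval between connection attempts and give up once the generator is exhausted.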

**Protocol Flow**:
```
WebSocket Connect → connect.challenge → connect request → hello-ok response
↓
chat.send (with sessionKey, idempotencyKey) → ack (with runId) → delta events → final event
```
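
The `chat.send` step in the flow above can be sketched as a small request builder. The `sessionKey` and `idempotencyKey` fields come from the flow itself. The envelope fields (`id`, `method`, `params`), the `message` field name, and the helper `build_chat_send` are assumptions for illustration, not the confirmed frame format.

```python
import json
import uuid


def build_chat_send(session_key: str, message: str, request_id: int = 1) -> str:
    """Serialize a hypothetical chat.send request frame (envelope shape is assumed)."""
    frame = {
        "id": request_id,
        "method": "chat.send",
        "params": {
            "sessionKey": session_key,
            "message": message,
            # A fresh key per logical message lets the server drop duplicate sends
            "idempotencyKey": str(uuid.uuid4()),
        },
    }
    return json.dumps(frame)
```

If a send is retried after a dropped connection, reusing the same idempotency key for the retry (rather than generating a new one) is what makes deduplication possible.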

**Configuration**:
- ✅ Updated `utils/config.py` to support WebSocket URL format
- ✅ Added `agent_id` and `session_scope` configuration options
- ✅ Added `retry_timeout` for extended retry attempts
- ✅ Updated `config.yaml` openclaw section with WebSocket settings
- ✅ Updated `.env.example` with WebSocket URL format and auth token
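
Since the configuration now expects a WebSocket URL, a startup check along these lines can catch a leftover `http://` value early. This is a sketch under assumptions: `validate_gateway_url` is a hypothetical helper, not part of `utils/config.py`.

```python
from urllib.parse import urlparse


def validate_gateway_url(url: str) -> str:
    """Return the URL unchanged if it looks like a WebSocket endpoint, else raise."""
    parsed = urlparse(url)
    if parsed.scheme not in ("ws", "wss"):
        raise ValueError(f"Gateway URL must start with ws:// or wss://, got {url!r}")
    if not parsed.hostname:
        raise ValueError(f"Gateway URL has no host: {url!r}")
    return url
```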

**Dependencies**:
- ✅ Added `websockets>=12.0` to `requirements.txt`

**Testing**:
- ⚠️ Existing unit tests need updates for WebSocket client
- ⚠️ Integration tests need real Gateway connection

---

## 🔧 Remaining Integration Work

### 2. Connect OpenClaw Client to Discord Bot

**Status**: ⏳ **PENDING**

**What Needs to be Done**:

The OpenClawClient is implemented but not yet wired into the Discord bot pipeline. Here's what needs to happen:

#### A. Bot Initialization (in `run.py` or `discord_bot/bot.py`)

Create and initialize the OpenClaw Gateway client on bot startup:

```python
# In run.py, after loading config:

from openclaw_client import OpenClawConfig, PerGuildOpenClawClient

# Create OpenClaw Gateway client configuration
openclaw_config = OpenClawConfig(
    base_url=config.openclaw.base_url,  # ws://192.168.50.9:18789
    auth_token=config.openclaw.token,
    timeout=config.openclaw.timeout,
    retry_timeout=config.openclaw.retry_timeout,
    agent_id=config.openclaw.agent_id,
    session_scope=config.openclaw.session_scope,
)

# Create per-guild client manager
openclaw_client = PerGuildOpenClawClient(openclaw_config)

# No connection is opened here: it happens lazily on the first message.
# To connect eagerly for a known guild, call this from async startup code:
# await openclaw_client.get_or_create(guild_id).connect()
logger.info("OpenClaw Gateway client manager created")
```

#### B. Pipeline Orchestrator Integration

The orchestrator expects an `llm_client` callable. Create a wrapper:

```python
# In voice session or orchestrator setup:

async def llm_response_handler(agent: str, message: str, user_id: int, guild_id: int) -> str:
    """Wrapper for OpenClaw Gateway client."""
    client = openclaw_client.get_or_create(guild_id)
    return await client.send_message(
        agent=agent,
        message=message,
        context="",  # Gateway manages context internally
        speaker=str(user_id)  # Used for session key generation
    )

# Pass to orchestrator:
orchestrator = PipelineOrchestrator(
    config=pipeline_config,
    vad=vad,
    turn_detector=turn_detector,
    transcriber=transcriber,
    transcript_manager=transcript_manager,
    relevance_classifier=relevance_classifier,
    llm_client=llm_response_handler,  # ← Use wrapper
    tts_synthesizer=tts_synthesizer,
    audio_output_callback=audio_callback,
)
```

#### C. Agent Selection Integration

The `VoiceSession` tracks `current_agent` per guild. Ensure this is passed to the LLM handler:

```python
async def llm_response_handler(agent: str, message: str, user_id: int, guild_id: int) -> str:
    # Get current agent from session
    session = session_manager.get_session(guild_id)
    current_agent = session.current_agent if session else "jarvis"

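    # Send to Gateway with the session's agent (the `agent` argument is ignored here)
    client = openclaw_client.get_or_create(guild_id)
    return await client.send_message(
        agent=current_agent,  # Use session's agent setting
        message=message,
        context="",  # Gateway manages context internally
        speaker=str(user_id)  # Used for session key generation
    )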
```

#### D. Cleanup on Disconnect

When the bot disconnects from Discord or leaves a guild, close the Gateway connection for that guild:

```python
# In voice session cleanup:

async def cleanup_guild(guild_id: int):
    # Remove voice session
    await session_manager.remove_session(guild_id)

    # Disconnect OpenClaw client for this guild
    client = openclaw_client.get_or_create(guild_id)
    await client.disconnect()
    openclaw_client.remove_guild(guild_id)
```

---

### 3. Download Smart Turn v3 Model

**Status**: ⏳ **PENDING**

**Current State**:
- Mock ONNX model at `models/smart_turn_v3.onnx` (164 bytes placeholder)
- Mock creation script at `scripts/create_mock_turn_model.py`

**What to Do**:

```bash
# Install huggingface_hub if not already installed
pip install huggingface_hub

# Download real model
python -c "from huggingface_hub import hf_hub_download; hf_hub_download(repo_id='pipecat-ai/smart-turn-v3', filename='model.onnx', local_dir='models/')"

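# Verify the real model exists and is ~8MB (the mock is only 164 bytes)
ls -lh models/model.onnx

# Remove mock files
rm models/smart_turn_v3.onnx
rm scripts/create_mock_turn_model.py

# Note: the download is saved as models/model.onnx, not models/smart_turn_v3.onnx.
# If the pipeline still points at the mock's path, update that path accordingly.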
```
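
The size check can also be done in code, for example as a guard before loading the model. This is a minimal sketch: `looks_like_real_model` and the 1 MB threshold are assumptions, chosen only because the placeholder is 164 bytes and the real model is roughly 8 MB.

```python
from pathlib import Path


def looks_like_real_model(path: str, min_bytes: int = 1_000_000) -> bool:
    """Return True if the file exists and is far larger than the 164-byte mock."""
    model_path = Path(path)
    return model_path.is_file() and model_path.stat().st_size >= min_bytes
```

A size check only rules out the placeholder. It does not prove the file is a valid ONNX model.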

---

### 4. Configure TTS to Use Existing Sage-Voice Server

**Status**: ⏳ **PENDING**

**Decision Point**: You have two TTS options:

#### Option A: Use Your Existing TTS Server (Recommended)

Your sage-voice server at `http://192.168.50.47:8004` already works and has your voice models.

**Modify `server/tts.py`** to call the sage-voice server over HTTP instead of running the built-in TTS:

```python
# Replace Chatterbox/Coqui implementation with HTTP client

import httpx

class TTSSynthesizer:
    def __init__(self, tts_url: str, device: str = "cuda"):
        self.tts_url = tts_url  # http://192.168.50.47:8004
        self.device = device

    async def synthesize(
        self,
        text: str,
        voice: str,
        response_format: str = "pcm"
    ) -> bytes:
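        """Call the sage-voice TTS server and return the raw audio bytes."""
        async with httpx.AsyncClient() as client:
            response = await client.post(
                f"{self.tts_url}/v1/audio/speech",
                json={
                    "input": text,
                    "voice": voice,  # jarvis or sage
                    "response_format": response_format
                },
                timeout=10.0
            )
            # Fail loudly on HTTP errors instead of returning an error body as audio
            response.raise_for_status()
            return response.content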
```

**Add to `.env`**:
```bash
TTS_URL=http://192.168.50.47:8004
```

#### Option B: Use Built-in TTS (More Complex)

Provide voice reference files and use Coqui XTTS:
- Place `server/voices/jarvis.wav` (10-30 seconds clean audio)
- Place `server/voices/sage.wav` (10-30 seconds clean audio)
- Keep existing `server/tts.py` implementation

**Recommendation**: Go with **Option A** to reuse your proven TTS infrastructure.

---

### 5. Environment Configuration

**Status**: ⏳ **PENDING**

**Create a `.env` file** in the `openclaw-voice` directory:

```bash
# Copy example
cp .env.example .env

# Edit with your actual values
```

**Required Configuration**:

```bash
# Discord Bot (from Discord Developer Portal)
DISCORD_BOT_TOKEN=<your_discord_bot_token>

# OpenClaw Gateway (on Synology NAS)
OPENCLAW_BASE_URL=ws://192.168.50.9:18789
OPENCLAW_AUTH_TOKEN=<your_gateway_token>
OPENCLAW_AGENT_ID=main

# TTS Server (your existing sage-voice server)
TTS_URL=http://192.168.50.47:8004

# FastAPI Server (openclaw-voice API endpoints)
SERVER_HOST=0.0.0.0
SERVER_PORT=8880

# Pipeline Settings (optional overrides)
PIPELINE__STT__MODEL_SIZE=medium
PIPELINE__STT__DEVICE=cuda
PIPELINE__TTS__DEVICE=cuda
```

**Where to Get Values**:
- `DISCORD_BOT_TOKEN`: Discord Developer Portal → Your Application → Bot → Token
- `OPENCLAW_AUTH_TOKEN`: Check your NAS OpenClaw Gateway config or create new token
- `TTS_URL`: Already running at `http://192.168.50.47:8004`
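
Before starting the bot, it can help to confirm that the values above are set and are no longer the `<...>` placeholders. The sketch below is illustrative: `missing_env_vars` is a hypothetical helper, and treating exactly these four variables as required is an assumption (`TTS_URL` only matters for Option A).

```python
import os
from typing import List, Mapping, Optional

REQUIRED_VARS = (
    "DISCORD_BOT_TOKEN",
    "OPENCLAW_BASE_URL",
    "OPENCLAW_AUTH_TOKEN",
    "TTS_URL",
)


def missing_env_vars(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """List required variables that are unset, empty, or still a <placeholder>."""
    env = os.environ if env is None else env
    missing = []
    for name in REQUIRED_VARS:
        value = env.get(name, "")
        if not value or value.startswith("<"):
            missing.append(name)
    return missing
```

Calling `missing_env_vars()` with no arguments checks the current process environment, so it should run after the `.env` file has been loaded.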

---

### 6. Testing End-to-End Flow

**Status**: ⏳ **PENDING**

**Test Plan**:

#### A. Test OpenClaw Gateway Connection

```python
# Create test script: test_gateway_connection.py

import asyncio
from openclaw_client import create_client

async def test_connection():
    client = create_client(
        base_url="ws://192.168.50.9:18789",
        auth_token="<your_token>",
        agent_id="main"
    )

    try:
        await client.connect()
        print("✓ Connected to Gateway")

        response = await client.send_message(
            agent="jarvis",
            message="Hello, this is a test",
            speaker="test_user"
        )
        print(f"✓ Received response: {response}")

        await client.disconnect()
        print("✓ Disconnected")

    except Exception as e:
        print(f"✗ Error: {e}")

asyncio.run(test_connection())
```

#### B. Test Discord Bot End-to-End

1. Start openclaw-voice bot:
   ```bash
   python run.py
   ```

2. Join Discord voice channel

3. Use slash commands:
   ```
   /join
   /agent jarvis
   /sensitivity medium
   ```

4. Speak into microphone:
   - Bot should detect voice (VAD)
   - Wait for Smart Turn completion
   - Transcribe speech (STT)
   - Check relevance
   - Send to OpenClaw Gateway
   - Generate TTS response
   - Play audio back

5. Check logs for latency breakdown:
   ```
   VAD: XXms
   Smart Turn: XXms
   STT: XXms
   Relevance: XXms
   Gateway: XXXXms
   TTS: XXms
   Total: ~3-7s
   ```
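
A per-stage breakdown like the one in step 5 could be produced with a small timing helper. This is a sketch under assumptions: `StageTimer` and its methods are hypothetical names, and the orchestrator may already log latency in its own way.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


class StageTimer:
    """Collect wall-clock durations for named pipeline stages."""

    def __init__(self) -> None:
        self.durations_ms: Dict[str, float] = {}

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the stage raised
            self.durations_ms[name] = (time.perf_counter() - start) * 1000.0

    def summary(self) -> str:
        lines = [f"{name}: {ms:.0f}ms" for name, ms in self.durations_ms.items()]
        lines.append(f"Total: {sum(self.durations_ms.values()):.0f}ms")
        return "\n".join(lines)
```

Each stage would be wrapped as `with timer.stage("STT"): ...`, and `timer.summary()` logged once the response has been played.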

#### C. Test Agent Switching

```
/agent sage
[speak] "Tell me about philosophy"
[expect Sage's voice and personality]

/agent jarvis
[speak] "What's the weather?"
[expect Jarvis's voice and personality]
```

#### D. Test Relevance Filtering

```
/sensitivity low
[speak unrelated conversation]
[expect bot to stay quiet]

[speak "Hey Jarvis, ..." or "Jarvis, ..."]
[expect bot to respond]

/sensitivity high
[speak relevant question without name]
[expect bot to respond]
```

---

## 📋 Quick Start Checklist

To get openclaw-voice running with your OpenClaw Gateway:

- [x] ~~Implement OpenClaw Gateway WebSocket client~~ ✅
- [x] ~~Add websockets dependency~~ ✅
- [x] ~~Update configuration files~~ ✅
- [ ] Install dependencies: `pip install -r requirements.txt`
- [ ] Download Smart Turn v3 model from HuggingFace
- [ ] Create `.env` file with your credentials
- [ ] Modify `server/tts.py` to use your existing TTS server (Option A)
- [ ] Wire OpenClawClient into bot initialization (`run.py` or `discord_bot/bot.py`)
- [ ] Create LLM response handler wrapper for orchestrator
- [ ] Test Gateway connection standalone
- [ ] Run end-to-end test with Discord voice

---

## 🎯 Next Steps

1. **Complete Task #3**: Download the real Smart Turn model
2. **Complete Task #4**: Configure TTS (Option A recommended: use the existing server)
3. **Complete Task #5**: Create `.env` with your credentials
4. **Complete Task #2**: Wire OpenClawClient into Discord bot initialization
5. **Complete Task #6**: Test end-to-end flow

---

## 📚 Reference

### Session Key Format

```
agent:<agentId>:discord:dm:<userId>
```

Examples:
- `agent:main:discord:dm:123456789` (user 123456789 talking to main agent)
- `agent:jarvis:discord:dm:987654321` (user 987654321 talking to jarvis agent)
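
The format above can be expressed as a one-line builder. This is a minimal sketch: `build_session_key` is a hypothetical name, and the colon check is an added safeguard rather than documented Gateway behavior.

```python
def build_session_key(agent_id: str, user_id: int) -> str:
    """Build a session key in the agent:<agentId>:discord:dm:<userId> format."""
    if not agent_id or ":" in agent_id:
        # A colon in the agent ID would make the key ambiguous to split
        raise ValueError(f"Invalid agent ID: {agent_id!r}")
    return f"agent:{agent_id}:discord:dm:{user_id}"
```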

### Gateway Protocol Summary

```
1. WebSocket Connect
2. Server sends: connect.challenge (with nonce)
3. Client sends: connect request (with auth token)
4. Server sends: hello-ok response (with server info)
5. Client sends: chat.send (with sessionKey, message, idempotencyKey)
6. Server sends: ack response (with runId)
7. Server sends: delta events (streaming response)
8. Server sends: final event (complete response)
```
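
Steps 6 to 8 amount to folding a stream of events into one reply. The sketch below shows that idea only. The event shape (a `type` field with `ack`, `delta`, or `final`, plus an optional `text` field) and the assumption that deltas are incremental fragments are both hypothetical, and the real event payloads may differ.

```python
from typing import Iterable, List, Mapping, Optional


def collect_response(events: Iterable[Mapping[str, str]]) -> Optional[str]:
    """Return the reply once a final event arrives, or None if the stream ends early."""
    parts: List[str] = []
    for event in events:
        kind = event.get("type")
        if kind == "delta":
            # Assumed: each delta carries only the newly generated fragment
            parts.append(event.get("text", ""))
        elif kind == "final":
            # Prefer the final event's own text; otherwise join the deltas
            return event.get("text") or "".join(parts)
    return None
```

Returning `None` when no final event is seen lets the caller distinguish a truncated stream from an empty reply.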

### File Locations

- **OpenClaw Client**: `openclaw_client/client.py`
- **Configuration**: `utils/config.py`, `config.yaml`, `.env`
- **Bot Entry**: `run.py`
- **Discord Bot**: `discord_bot/bot.py`
- **Voice Sessions**: `discord_bot/voice_session.py`
- **Pipeline**: `pipeline/orchestrator.py`
- **TTS**: `server/tts.py`

---

## 🐛 Troubleshooting

### WebSocket Connection Fails

- Verify Gateway is running: `ssh Hyriel@192.168.50.9 'sudo /usr/local/bin/docker logs --tail 50 openclaw-gateway'`
- Check NAS firewall allows port 18789
- Verify auth token is correct
- Check logs for connection errors
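
To separate a firewall or network problem from an authentication problem, a plain TCP check against the Gateway port is a useful first step. This is a small sketch using only the standard library, and `can_reach` is a hypothetical helper name.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `can_reach("192.168.50.9", 18789)` returning `False` points at the network or firewall, while `True` followed by a failed handshake points at the token or the protocol.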

### Bot Doesn't Respond to Voice

- Check VAD is detecting speech (logs should show "speech detected")
- Verify STT model is downloaded (first run downloads ~500MB-5GB)
- Check OpenClaw Gateway receives messages (NAS logs)
- Verify TTS server is reachable: `curl http://192.168.50.47:8004/health`

### Agent Switching Doesn't Work

- Verify session management is passing `current_agent` to LLM handler
- Check that `session.current_agent` is updated by `/agent` command
- Verify Gateway session key uses correct agent ID

---

**Status Summary**: 1 of 6 major tasks complete (Task 1 done; Tasks 2-6 pending)

**Estimated Time to Completion**: 2-4 hours (with testing)