diff --git a/AUDIT.md b/AUDIT.md
new file mode 100644
index 0000000..ae880e9
--- /dev/null
+++ b/AUDIT.md
@@ -0,0 +1,383 @@
+# Voice Pipeline Audit
+**Date:** 2026-04-10
+**Branch:** `caroline/cloud-stt-tts`
+**Audited Files:** `server/stt.py`, `server/tts.py`, `openclaw_client/client.py`, `server/voice_ws.py`
+
+---
+
+## Executive Summary
+
+The voice pipeline is **mostly correct** with good async handling. Sample rates and data formats are consistent throughout. The main concerns are the STT API usage pattern (batch rather than streaming) and unused interface parameters.
+
+### ✅ What Works
+
+| Component | Status | Format | Notes |
+|-----------|--------|--------|-------|
+| **DeepgramSTT.transcribe_async()** | ✅ Works | Float32, 16kHz | Batch API (sends 0.8s chunks) |
+| **VeniceKokoroTTS.generate_async()** | ✅ Works | Float32, 16kHz | Returns PCM audio correctly |
+| **OpenClawClient.send_message()** | ✅ Works | String | Returns LLM response text |
+| **Pipeline Integration** | ✅ Works | Consistent | Sample rates match, async correct |
+
+---
+
+## Detailed Findings
+
+### 1. STT: `DeepgramSTT.transcribe_async()`
+
+**File:** `server/stt.py` (lines 104-175)
+
+#### ✅ Correct Behavior
+
+```python
+async def transcribe_async(
+    self,
+    audio: np.ndarray,  # ✅ Accepts numpy array
+    language: Optional[str] = None,
+    beam_size: Optional[int] = None,
+    vad_filter: bool = False,
+) -> "TranscriptionResult":
+```
+
+- ✅ Properly handles NumPy float32 audio (converts if needed)
+- ✅ Converts to int16 WAV format for the Deepgram API
+- ✅ Uses Deepgram's REST (batch) API, not the streaming API
+- ✅ Correctly parses the Deepgram response structure
+- ✅ Returns `TranscriptionResult` with text, segments, language, duration
+
+#### ⚠️ Note: Batch API Usage
+
+- Sends audio in **0.8s chunks** (batch mode)
+- This is acceptable for the current implementation but has higher latency than streaming
+- Consider switching to Deepgram's live streaming API (WebSocket, `wss://api.deepgram.com/v1/listen`) for real-time transcription
+
+#### Sample Rate: 16kHz
+
+```python
+sample_rate: int = 16000  # Default
+```
+
+---
+
+### 2. TTS: `VeniceKokoroTTS.generate_async()`
+
+**File:** `server/tts.py` (lines 625-695)
+
+#### ✅ Correct Behavior
+
+```python
+async def generate_async(
+    self,
+    text: str,
+    voice_ref_path: Optional[Path] = None,  # ⚠️ Not used by Venice
+    emotion_exaggeration: Optional[float] = None,  # ⚠️ Not used by Venice
+) -> np.ndarray:
+```
+
+- ✅ Returns `np.ndarray` (PCM float32 audio)
+- ✅ Correctly handles empty text (returns silence)
+- ✅ Returns float32 dtype
+- ✅ Resamples if Venice returns a different sample rate
+- ✅ Uses default 16kHz sample rate
+
+#### Audio Format Details
+
+```python
+# Chatterbox returns 24kHz, Venice returns 16kHz
+if sr != 16000:
+    from scipy import signal as scipy_signal
+    target_samples = int(len(audio) * 16000 / sr)
+    audio = scipy_signal.resample(audio, target_samples).astype(np.float32)
+```
+
+- **Input from Venice:** 16kHz (Chatterbox, by contrast, returns 24kHz)
+- **Output format:** Float32, 16kHz mono
+- **Browser expectation:** PCM float32 at the TTS output sample rate (16kHz) ✅
+
+#### ⚠️ Unused Parameters
+
+```python
+voice_ref_path: Optional[Path] = None  # Venice doesn't use this
+emotion_exaggeration: Optional[float] = None  # Venice doesn't use this
+```
+
+These parameters exist for interface compatibility with `ChatterboxTTS`. `VeniceKokoroTTS` ignores them.
+
+---
+
+### 3. OpenClaw Client: `send_message()`
+
+**File:** `openclaw_client/client.py` (lines 161-216)
+
+#### ✅ Correct Behavior
+
+```python
+async def send_message(
+    self,
+    agent: str,
+    message: str,
+    context: str = "",
+    speaker: Optional[str] = None,
+    model: Optional[str] = None,
+) -> str:
+```
+
+- ✅ Returns `str` (LLM response text)
+- ✅ Uses WebSocket JSON-RPC protocol
+- ✅ Implements retry logic with extended timeout
+- ✅ Properly handles streaming responses via `_handle_chat_event()`
+- ✅ Validates agent against `AGENT_PERSONALITIES`
+
+#### Return Format
+
+```python
+return response  # ✅ Returns string text
+```
+
+- **Format:** Plain text string
+- **Encoding:** UTF-8 (JSON serialization handles this)
+- **Content:** LLM's response text
+
+---
+
+### 4. Pipeline Integration
+
+**File:** `server/voice_ws.py` (lines 22-217)
+
+#### ✅ Correct Flow
+
+```
+Browser Mic (16kHz PCM) → WebSocket → STT (16kHz) → OpenClaw → TTS (16kHz) → WebSocket → Browser
+```
+
+#### Sample Rate Path
+
+1. **Browser input:** 16kHz PCM
+2. **DeepgramSTT:** 16kHz (accepts 16kHz, converts if needed)
+3. **OpenClaw:** No audio processing (just text)
+4. **VeniceKokoroTTS:** Returns 16kHz PCM
+5. **Browser output:** Expects 16kHz PCM ✅
+
+#### Data Format Path
+
+1. **STT input:** `np.ndarray` (float32)
+2. **STT output:** `TranscriptionResult` (text, segments, language, duration)
+3. **OpenClaw input:** `str` (text)
+4. **OpenClaw output:** `str` (text)
+5. **TTS input:** `str` (text)
+6. **TTS output:** `np.ndarray` (float32) ✅
+
+#### ✅ Async Correctness
+
+- All async methods use `async/await` correctly
+- No blocking operations in the event loop
+- Uses `asyncio.get_event_loop().time()` for timing
+- Uses `run_in_executor()` for CPU-bound work (Chatterbox generation)
+
+---
+
+### 5. Environment Variables
+
+**File:** `.env`
+
+#### Required Environment Variables
+
+```bash
+# Discord Bot
+DISCORD_TOKEN=[REDACTED]
+DISCORD_GUILD_ID=1481863201925758999
+
+# OpenClaw Gateway
+OPENCLAW_BASE_URL=ws://localhost:18789
+OPENCLAW_AUTH_TOKEN=[REDACTED]
+OPENCLAW_AGENT_ID=main  # ⚠️ Defined but ignored: voice_ws.py hardcodes the agent ID
+
+# Cloud STT/TTS API Keys
+DEEPGRAM_API_KEY=[REDACTED]
+VENICE_API_KEY=[REDACTED]
+```
+
+#### ⚠️ Unused Environment Variable
+
+- `OPENCLAW_AGENT_ID` is defined in `.env` but `voice_ws.py` hardcodes `agent_id="main"`:
+
+```python
+# server/voice_ws.py line 135
+self.openclaw = OpenClawClient(
+    config=OpenClawConfig(
+        base_url=openclaw_url,
+        auth_token=openclaw_token,
+        timeout=30.0,
+        agent_id="main",  # ⚠️ Hardcoded, ignores OPENCLAW_AGENT_ID
+    )
+)
+```
+
+---
+
+## Issues and Recommendations
+
+### Critical Issues
+
+None detected. The pipeline works correctly.
+
+### Minor Issues
+
+#### 1. Deepgram Batch API vs Streaming API
+
+**Severity:** Low (works, but not optimal)
+
+**Current:** Sends 0.8s chunks via the REST API
+**Impact:** Higher latency than the streaming API
+
+**Recommendation:** Consider switching to Deepgram's live streaming API, which runs over a WebSocket (`wss://api.deepgram.com/v1/listen`) rather than HTTP POST:
+
+```python
+# Example (not implemented). Assumes `websockets` v14+; auth_headers = {"Authorization": f"Token {api_key}"}
+url = "wss://api.deepgram.com/v1/listen?encoding=linear16&sample_rate=16000"
+async with websockets.connect(url, additional_headers=auth_headers) as ws:
+    await ws.send(pcm_chunk)  # Raw int16 PCM bytes; send further chunks as they arrive
+    async for message in ws:
+        pass  # Each message is a JSON transcript result
+```
+
+#### 2. Unused Interface Parameters
+
+**Severity:** Low (cosmetic)
+
+**Location:** `server/tts.py` lines 625-695
+
+**Issue:** `VeniceKokoroTTS.generate_async()` accepts `voice_ref_path` and `emotion_exaggeration` but doesn't use them (they exist for `ChatterboxTTS` compatibility).
+
+**Recommendation:** Document this in the docstring, or add a comment explaining that they are kept only for `ChatterboxTTS` interface compatibility.
+
+#### 3. Hardcoded Configuration
+
+**Severity:** Low (configuration inconsistency)
+
+**Location:** `server/voice_ws.py` line 135
+
+**Issue:** `agent_id="main"` is hardcoded, ignoring `OPENCLAW_AGENT_ID` from `.env`.
+
+**Recommendation:** Read the value from the environment variable:
+
+```python
+agent_id = os.getenv("OPENCLAW_AGENT_ID", "main")  # Requires `import os`
+self.openclaw = OpenClawClient(
+    config=OpenClawConfig(
+        base_url=openclaw_url,
+        auth_token=openclaw_token,
+        timeout=30.0,
+        agent_id=agent_id,  # ✅ Use env var
+    )
+)
+```
+
+#### 4. Missing Error Handling in VoiceSession
+
+**Severity:** Low (the handler prevents a crash, but may hide errors)
+
+**Location:** `server/voice_ws.py` lines 177-202
+
+**Issue:** `_transcribe_buffered_audio()` catches exceptions but only logs them; it does not notify the client.
+
+**Recommendation:** Send an error notification to the client via WebSocket:
+
+```python
+await websocket.send_json({  # Inside the existing `except Exception as e:` handler
+    "type": "error",
+    "message": f"Transcription failed: {e}"
+})
+```
+
+### Performance Considerations
+
+#### Sample Rate Processing
+
+- **STT:** 16kHz audio in, text out ✅
+- **TTS:** 16kHz output ✅
+- **No sample rate conversion needed** (Venice returns 16kHz)
+
+#### Memory Usage
+
+- Audio buffers stored in `bytearray`
+- `buffer_duration` tracks accumulated audio
+- Buffer cleared after transcription ✅
+
+---
+
+## Format Summary
+
+### Audio Formats
+
+| Component | Input Format | Output Format | Sample Rate |
+|-----------|--------------|---------------|-------------|
+| **Browser Mic** | PCM | Float32 | 16kHz |
+| **DeepgramSTT** | Float32 (16kHz) | JSON | 16kHz |
+| **OpenClaw** | String (text) | String (text) | N/A |
+| **VeniceKokoroTTS** | String (text) | Float32 PCM | 16kHz |
+| **Browser Speaker** | Float32 PCM | Float32 | 16kHz |
+
+### Data Types
+
+- **Audio arrays:** `np.ndarray` (float32)
+- **STT response:** `TranscriptionResult` object
+- **TTS response:** `np.ndarray` (float32)
+- **OpenClaw response:** `str` (text)
+
+### API Endpoints
+
+- **Deepgram:** `POST https://api.deepgram.com/v1/listen` (batch)
+- **OpenClaw Gateway:** `ws://` URL (JSON-RPC)
+- **Venice:** `POST https://api.venice.ai/api/v1/audio/speech`
+
+---
+
+## Testing Recommendations
+
+### Unit Tests
+
+```python
+# Test STT audio conversion (needs `import asyncio`; `stt` and `tts` are assumed instances)
+def test_stt_float32_conversion():
+    audio = np.random.randn(16000).astype(np.float32)  # 1 second at 16kHz
+    result = asyncio.run(stt.transcribe_async(audio))  # transcribe_async is a coroutine
+    assert result.text is not None
+    assert abs(result.duration - 1.0) < 1e-3
+
+# Test TTS audio format
+def test_tts_returns_float32_pcm():
+    audio = asyncio.run(tts.generate_async("Hello", voice_ref_path=None))
+    assert audio.dtype == np.float32
+    assert audio.ndim == 1  # Mono
+    # Sample rate is implicit (16kHz)
+```
+
+### Integration Tests
+
+- Test full pipeline: Mic → STT → OpenClaw → TTS → Speaker
+- Test error handling: Invalid API keys, network failures
+- Test retry logic: OpenClaw timeout and retry
+- Test concurrent sessions: Multiple WebSocket connections
+
+### Performance Tests
+
+- Measure latency: Mic → STT → Response → TTS
+- Measure RTF (Real-Time Factor): TTS generation time vs audio duration
+- Measure queue performance: Concurrent transcription requests
+
+---
+
+## Conclusion
+
+The voice pipeline is **functionally correct** with proper async handling and consistent data formats. The main improvement opportunities are:
+
+1. Consider the Deepgram streaming API for lower latency
+2. Replace the hardcoded `agent_id` with the environment variable
+3. Document unused interface parameters
+4. Add WebSocket error notifications to clients
+
+**Overall Status:** ✅ **WORKING** (no blocking issues)
+
+---
+
+*Audit completed by Caroline ⚙️*
diff --git a/server/static/voice.html b/server/static/voice.html
index a316727..7f00cdf 100644
--- a/server/static/voice.html
+++ b/server/static/voice.html
@@ -72,6 +72,28 @@
             50% { opacity: 0.5; }
         }
 
+        .thinking {
+            display: inline-flex;
+            align-items: center;
+            gap: 8px;
+            padding: 8px 16px;
+            border-radius: 20px;
+            font-size: 14px;
+            font-weight: 500;
+            margin-bottom: 20px;
+            background: #8b5cf6;
+            color: white;
+        }
+
+        .thinking .status-dot {
+            animation: bounce 1s infinite;
+        }
+
+        @keyframes bounce {
+            0%, 100% { transform: translateY(0); }
+            50% { transform: translateY(-4px); }
+        }
+
         .transcript {
             background: rgba(255, 255, 255, 0.1);
             border-radius: 12px;
@@ -188,6 +210,11 @@
 Disconnected
+
+