# Voice Pipeline Audit

**Date:** 2026-04-10

**Branch:** `caroline/cloud-stt-tts`

**Audited Files:** `server/stt.py`, `server/tts.py`, `openclaw_client/client.py`, `server/voice_ws.py`

---

## Executive Summary

The voice pipeline is **mostly correct** with good async handling. Sample rates and data formats are consistent throughout. The main concerns are the use of Deepgram's batch API instead of streaming, unused interface parameters, a hardcoded `agent_id`, and transcription errors that are logged but never reported to the client.

### ✅ What Works

| Component | Status | Format | Notes |
|-----------|--------|--------|-------|
| **DeepgramSTT.transcribe_async()** | ✅ Works | Float32, 16kHz | Batch API (sends 0.8s chunks) |
| **VeniceKokoroTTS.generate_async()** | ✅ Works | Float32, 16kHz | Returns PCM audio correctly |
| **OpenClawClient.send_message()** | ✅ Works | String | Returns LLM response text |
| **Pipeline Integration** | ✅ Works | Consistent | Sample rates match, async correct |

---

## Detailed Findings

### 1. STT: `DeepgramSTT.transcribe_async()`

**File:** `server/stt.py` (lines 104-175)

#### ✅ Correct Behavior

```python
async def transcribe_async(
    self,
    audio: np.ndarray,  # ✅ Accepts numpy array
    language: Optional[str] = None,
    beam_size: Optional[int] = None,
    vad_filter: bool = False,
) -> "TranscriptionResult":
    ...  # body omitted
```

- ✅ Accepts NumPy audio and converts it to float32 if needed
- ✅ Converts to int16 WAV format for the Deepgram API
- ✅ Uses the Deepgram REST API (NOT the streaming API)
- ✅ Correctly parses the Deepgram response structure
- ✅ Returns `TranscriptionResult` with text, segments, language, duration

#### ⚠️ Note: Batch API Usage

- Sends audio in **0.8s chunks** (batch mode)
- This is acceptable for the current implementation but has higher latency than streaming
- Consider switching to Deepgram's live-streaming API (served over a WebSocket) for real-time transcription

#### Sample Rate: 16kHz

```python
sample_rate: int = 16000  # Default
```

---

### 2. TTS: `VeniceKokoroTTS.generate_async()`

**File:** `server/tts.py` (lines 625-695)

#### ✅ Correct Behavior

```python
async def generate_async(
    self,
    text: str,
    voice_ref_path: Optional[Path] = None,  # ⚠️ Not used by Venice
    emotion_exaggeration: Optional[float] = None,  # ⚠️ Not used by Venice
) -> np.ndarray:
    ...  # body omitted
```

- ✅ Returns `np.ndarray` of PCM audio with float32 dtype
- ✅ Correctly handles empty text (returns silence)
- ✅ Resamples if Venice returns a different sample rate
- ✅ Uses the default 16kHz sample rate

#### Audio Format Details

```python
# Chatterbox returns 24kHz, Venice returns 16kHz
if sr != 16000:
    from scipy import signal as scipy_signal
    target_samples = int(len(audio) * 16000 / sr)
    audio = scipy_signal.resample(audio, target_samples).astype(np.float32)
```

- **Input from Venice:** 16kHz (Chatterbox, by contrast, returns 24kHz)
- **Output format:** Float32, 16kHz mono
- **Browser expectation:** PCM float32 at the TTS output sample rate (16kHz) ✅

#### ⚠️ Unused Parameters

```python
voice_ref_path: Optional[Path] = None  # Venice doesn't use this
emotion_exaggeration: Optional[float] = None  # Venice doesn't use this
```

These parameters exist only for interface compatibility with `ChatterboxTTS`. `VeniceKokoroTTS` ignores them.

---

### 3. OpenClaw Client: `send_message()`

**File:** `openclaw_client/client.py` (lines 161-216)

#### ✅ Correct Behavior

```python
async def send_message(
    self,
    agent: str,
    message: str,
    context: str = "",
    speaker: Optional[str] = None,
    model: Optional[str] = None,
) -> str:
    ...  # body omitted
```

- ✅ Returns `str` (LLM response text)
- ✅ Uses the WebSocket JSON-RPC protocol
- ✅ Implements retry logic with an extended timeout
- ✅ Properly handles streaming responses via `_handle_chat_event()`
- ✅ Validates the agent against `AGENT_PERSONALITIES`

#### Return Format

```python
return response  # ✅ Returns string text
```

- **Format:** Plain text string
- **Encoding:** UTF-8 (JSON serialization handles this)
- **Content:** LLM's response text

---

### 4. Pipeline Integration

**File:** `server/voice_ws.py` (lines 22-217)

#### ✅ Correct Flow

```
Browser Mic (16kHz PCM) → WebSocket → STT (16kHz) → OpenClaw → TTS (16kHz) → WebSocket → Browser
```

#### Sample Rate Path

1. **Browser input:** 16kHz PCM
2. **DeepgramSTT:** 16kHz (accepts 16kHz, converts if needed)
3. **OpenClaw:** No audio processing (just text)
4. **VeniceKokoroTTS:** Returns 16kHz PCM
5. **Browser output:** Expects 16kHz PCM ✅

#### Data Format Path

1. **STT input:** `np.ndarray` (float32)
2. **STT output:** `TranscriptionResult` (its text is passed on)
3. **OpenClaw input:** `str` (text)
4. **OpenClaw output:** `str` (text)
5. **TTS input:** `str` (text)
6. **TTS output:** `np.ndarray` (float32) ✅

#### ✅ Async Correctness

- All async methods use `async/await` correctly
- No blocking operations in the event loop
- Uses `asyncio.get_event_loop().time()` for timing
- Uses `run_in_executor()` for CPU-bound work (Chatterbox generation)

---

### 5. Environment Variables

**File:** `.env`

#### Required Environment Variables

```bash
# Discord Bot
DISCORD_TOKEN=REDACTED
DISCORD_GUILD_ID=1481863201925758999

# OpenClaw Gateway
OPENCLAW_BASE_URL=ws://localhost:18789
OPENCLAW_AUTH_TOKEN=REDACTED
OPENCLAW_AGENT_ID=main  # ⚠️ Defined but not read by voice_ws.py

# Cloud STT/TTS API Keys
DEEPGRAM_API_KEY=REDACTED
VENICE_API_KEY=REDACTED
```

#### ⚠️ Unused Environment Variable

- `OPENCLAW_AGENT_ID` is defined in `.env` but `voice_ws.py` hardcodes `agent_id="main"`:

```python
# server/voice_ws.py line 135
self.openclaw = OpenClawClient(
    config=OpenClawConfig(
        base_url=openclaw_url,
        auth_token=openclaw_token,
        timeout=30.0,
        agent_id="main",  # ⚠️ Hardcoded, ignores OPENCLAW_AGENT_ID
    )
)
```

---

## Issues and Recommendations

### Critical Issues

None detected. Pipeline works correctly.

### Minor Issues

#### 1. Deepgram Batch API vs Streaming API

**Severity:** Low (works, but not optimal)

**Current:** Sends 0.8s chunks via the REST API

**Impact:** Higher latency than the streaming API

**Recommendation:** Consider switching to Deepgram's live-streaming API for real-time transcription. It is served over a WebSocket rather than an HTTP route, so an HTTP client such as `httpx` is not sufficient:

```python
# Example (not implemented). Sends all audio before reading, for brevity.
import json
import websockets  # third-party; older releases use `extra_headers`

async def stream_transcripts(api_key, pcm_chunks):
    url = "wss://api.deepgram.com/v1/listen?encoding=linear16&sample_rate=16000"
    auth = {"Authorization": f"Token {api_key}"}
    async with websockets.connect(url, additional_headers=auth) as ws:
        async for chunk in pcm_chunks:  # raw int16 PCM bytes
            await ws.send(chunk)
        await ws.send(json.dumps({"type": "CloseStream"}))
        async for message in ws:  # JSON transcript events
            print(json.loads(message))
```

#### 2. Unused Interface Parameters

**Severity:** Low (cosmetic)

**Location:** `server/tts.py` lines 625-695

**Issue:** `VeniceKokoroTTS.generate_async()` accepts `voice_ref_path` and `emotion_exaggeration` but doesn't use them (they exist for `ChatterboxTTS` compatibility).

**Recommendation:** Document this in the docstring, or add a comment explaining that the parameters are accepted only for interface compatibility.

#### 3. Hardcoded Configuration

**Severity:** Low (configuration inconsistency)

**Location:** `server/voice_ws.py` line 135

**Issue:** `agent_id="main"` is hardcoded, ignoring `OPENCLAW_AGENT_ID` from `.env`.

**Recommendation:** Use the environment variable:

```python
import os

agent_id = os.getenv("OPENCLAW_AGENT_ID", "main")
self.openclaw = OpenClawClient(
    config=OpenClawConfig(
        base_url=openclaw_url,
        auth_token=openclaw_token,
        timeout=30.0,
        agent_id=agent_id,  # ✅ Use env var
    )
)
```

#### 4. Missing Error Notification in VoiceSession

**Severity:** Low (prevents a crash, but may hide errors)

**Location:** `server/voice_ws.py` lines 177-202

**Issue:** `_transcribe_buffered_audio()` catches exceptions but only logs them; the client is never notified.
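
To make the behavior described in this issue concrete, here is a minimal sketch of a wrapper that both logs a transcription failure and reports it to the client. The names `transcribe_or_notify`, `FakeWebSocket`, and `failing_stt` are hypothetical and do not appear in `server/voice_ws.py`; the only assumption about the real code is that the WebSocket object exposes an async `send_json()` method.

```python
import asyncio
import logging

logger = logging.getLogger("voice_ws")


async def transcribe_or_notify(websocket, transcribe, audio):
    """Await `transcribe(audio)`; on failure, log it and send an error frame."""
    try:
        return await transcribe(audio)
    except Exception as exc:  # broad on purpose: every STT failure is reported
        logger.exception("Transcription failed")
        await websocket.send_json({
            "type": "error",
            "message": f"Transcription failed: {exc}",
        })
        return None


class FakeWebSocket:
    """Hypothetical test double that records outgoing JSON frames."""

    def __init__(self):
        self.sent = []

    async def send_json(self, payload):
        self.sent.append(payload)


async def failing_stt(audio):
    """Hypothetical transcriber that always fails."""
    raise RuntimeError("simulated Deepgram outage")


ws = FakeWebSocket()
result = asyncio.run(transcribe_or_notify(ws, failing_stt, b"\x00\x00"))
print(result, ws.sent)
```

Returning `None` keeps the session alive, matching the current "log and continue" behavior. Forwarding the raw exception text is convenient during development, though a deployment may prefer a generic message so that internal details are not exposed to the browser.
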

**Recommendation:** Send an error notification to the client via WebSocket from the `except` block:

```python
await websocket.send_json({
    "type": "error",
    "message": f"Transcription failed: {e}",
})
```

### Performance Considerations

#### Sample Rate Processing

- **STT:** 16kHz input, consumed without resampling ✅
- **TTS:** 16kHz output ✅
- **No sample rate conversion needed** (Venice returns 16kHz)

#### Memory Usage

- Audio buffers are stored in a `bytearray`
- `buffer_duration` tracks accumulated audio
- Buffer is cleared after transcription ✅

---

## Format Summary

### Audio Formats

| Component | Input Format | Output Format | Sample Rate |
|-----------|--------------|---------------|-------------|
| **Browser Mic** | PCM | Float32 | 16kHz |
| **DeepgramSTT** | Float32 (16kHz) | JSON, parsed into `TranscriptionResult` | 16kHz |
| **OpenClaw** | String (text) | String (text) | N/A |
| **VeniceKokoroTTS** | String (text) | Float32 PCM | 16kHz |
| **Browser Speaker** | Float32 PCM | Float32 | 16kHz |

### Data Types

- **Audio arrays:** `np.ndarray` (float32)
- **STT response:** `TranscriptionResult` object
- **TTS response:** `np.ndarray` (float32)
- **OpenClaw response:** `str` (text)

### API Endpoints

- **Deepgram:** `POST https://api.deepgram.com/v1/listen` (batch)
- **OpenClaw Gateway:** `ws://` URL (JSON-RPC)
- **Venice:** `POST https://api.venice.ai/api/v1/audio/speech`

---

## Testing Recommendations

### Unit Tests

```python
import numpy as np
import pytest  # async tests assume the pytest-asyncio plugin is installed

# `stt` and `tts` are assumed to be already-configured instances.

# Test STT audio conversion
@pytest.mark.asyncio
async def test_stt_float32_conversion():
    audio = np.random.randn(16000).astype(np.float32)  # 1 second at 16kHz
    result = await stt.transcribe_async(audio)
    assert result.text is not None
    assert result.duration == pytest.approx(1.0)

# Test TTS audio format
@pytest.mark.asyncio
async def test_tts_returns_float32_pcm():
    audio = await tts.generate_async("Hello", voice_ref_path=None)
    assert audio.dtype == np.float32
    assert audio.ndim == 1  # Mono
    # Sample rate is implicit (16kHz)
```

### Integration Tests

- Test full pipeline: Mic → STT → OpenClaw → TTS → Speaker
- Test error handling: Invalid API keys, network failures
- Test retry logic: OpenClaw timeout and retry
- Test concurrent sessions: Multiple WebSocket connections

### Performance Tests

- Measure latency: Mic → STT → Response → TTS
- Measure RTF (Real-Time Factor): TTS generation time vs audio duration
- Measure queue performance: Concurrent transcription requests

---

## Conclusion

The voice pipeline is **functionally correct** with proper async handling and consistent data formats. The main improvement opportunities are:

1. Consider the Deepgram streaming API for lower latency
2. Fix the hardcoded `agent_id` to use the environment variable
3. Document the unused interface parameters
4. Add WebSocket error notifications to clients

**Overall Status:** ✅ **WORKING**. No blocking issues.

---

*Audit completed by Caroline ⚙️*
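
---

## Appendix: Real-Time Factor Sketch

The performance tests above call for measuring RTF, defined there as TTS generation time relative to audio duration. As a rough illustration of that arithmetic, the sketch below computes audio duration from a sample count and divides generation time by it. The function names and the `SAMPLE_RATE_HZ` constant are hypothetical; only the 16kHz rate comes from this audit.

```python
SAMPLE_RATE_HZ = 16_000  # TTS output rate reported in this audit


def audio_duration_s(num_samples: int, sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Duration of a mono buffer in seconds."""
    return num_samples / sample_rate_hz


def real_time_factor(
    generation_s: float,
    num_samples: int,
    sample_rate_hz: int = SAMPLE_RATE_HZ,
) -> float:
    """Generation time divided by audio duration; below 1.0 is faster than real time."""
    duration_s = audio_duration_s(num_samples, sample_rate_hz)
    if duration_s <= 0:
        raise ValueError("audio must contain at least one sample")
    return generation_s / duration_s


# Worked example: 0.5 s spent generating 32,000 samples (2 s of audio at 16kHz).
rtf = real_time_factor(generation_s=0.5, num_samples=32_000)
print(f"RTF = {rtf:.2f}")  # prints "RTF = 0.25"
```

In a real measurement, `generation_s` would come from timing the `generate_async()` call (for example with `time.perf_counter()`), and `num_samples` from `len()` of the returned array.
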