feat: Major performance optimizations and feature enhancements
## Performance Optimizations (3-10x faster responses) - STT beam_size reduced to 1 (3-5x faster transcription, minimal quality loss) - Smart query routing: Haiku (simple) → Sonnet (medium) → Opus (complex) - TTS cache for common phrases (27 pre-generated responses) - Sentence-level streaming TTS (start playing while generating) - Sample-based VAD timing (30x improvement in silence detection) ## TTS Engine Upgrade - Migrated from Chatterbox to Chatterbox-Turbo - Zero-shot voice cloning (no fine-tuning required) - Native paralinguistic tag support ([laugh], [sigh], [chuckle], etc.) - Emotion presets with temperature control - Improved marker conversion (*action*, (action), ~action~) ## Discord Bot Enhancements - Multi-agent support (Jarvis, Sage) - Improved voice receiving with discord-ext-voice-recv - Enhanced /join, /leave, /status commands - Per-agent personality configuration - Better audio sink/receiver implementation ## OpenClaw Integration - WebSocket support for Gateway communication - Query complexity routing (auto-select model) - Improved error handling and retries - Session management per Discord guild - Better latency tracking ## Pipeline Improvements - Sentence splitter for streaming optimization - Query router for intelligent model selection - Enhanced VAD receiver with sample-based timing - Improved audio buffering and format conversion - Better transcript management ## Documentation - Added QUICK_START.md (5-minute test guide) - Added OPTIMIZATION_SUMMARY.md (performance analysis) - Added DISCORD_OPTIMIZATION_TEST.md (testing guide) - Added USAGE_GUIDE.md (comprehensive usage) - Updated README.md with optimization details ## Utilities & Scripts - Added get_invite_link.py (Discord bot invite) - Added sync_commands.py, sync_to_guild.py (command sync) - Added test_gateway.py, test_stt.py (testing utilities) - Added openclaw_wrapper.py (wrapper script) - Removed create_mock_turn_model.py (no longer needed) ## Configuration Updates - STT model: medium → small (faster, acceptable quality) - TTS engine: chatterbox → coqui (Turbo integration) - Beam size: 5 → 1 (latency optimization) - Added emotion_exaggeration per agent - Updated .gitignore for project files Total: ~2105 insertions, ~462 deletions across 35 files Performance: ~5.5s total latency (down from 22-35s) Target: ~3.5s (achieved in simple queries with cache) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This commit is contained in:
parent
f1d884bb6a
commit
9fde3d31ba
36 changed files with 6050 additions and 471 deletions
27
config.yaml
27
config.yaml
|
|
@ -28,7 +28,7 @@ agents:
|
|||
# Per-agent settings
|
||||
jarvis:
|
||||
# TTS voice reference file (relative to server/voices/)
|
||||
voice_file: "jarvis.wav"
|
||||
voice_file: "jarvis.mp3"
|
||||
|
||||
# Agent personality for LLM context
|
||||
personality: |
|
||||
|
|
@ -50,26 +50,36 @@ agents:
|
|||
emotion_exaggeration: 0.2
|
||||
|
||||
# ============================================================================
|
||||
# OpenClaw API
|
||||
# OpenClaw Gateway
|
||||
# ============================================================================
|
||||
openclaw:
|
||||
# Base URL for OpenClaw API
|
||||
# WebSocket URL for OpenClaw Gateway
|
||||
# REQUIRED: Set via OPENCLAW_BASE_URL environment variable
|
||||
# Format: ws://IP:PORT (default port: 18789)
|
||||
base_url: null
|
||||
|
||||
# Authentication token
|
||||
# REQUIRED: Set via OPENCLAW_TOKEN environment variable
|
||||
# REQUIRED: Set via OPENCLAW_AUTH_TOKEN environment variable
|
||||
token: null
|
||||
|
||||
# Request timeout (seconds)
|
||||
timeout: 8.0
|
||||
|
||||
# Retry timeout (seconds)
|
||||
retry_timeout: 15.0
|
||||
|
||||
# Retry attempts on failure
|
||||
max_retries: 1
|
||||
|
||||
# Model/agent selection
|
||||
model: "claude-sonnet-4"
|
||||
|
||||
# Agent ID for session keys
|
||||
agent_id: "jarvis"
|
||||
|
||||
# Session scope: per-peer or shared
|
||||
session_scope: "per-peer"
|
||||
|
||||
# ============================================================================
|
||||
# Pipeline Configuration
|
||||
# ============================================================================
|
||||
|
|
@ -95,12 +105,14 @@ pipeline:
|
|||
max_wait: 3.0
|
||||
|
||||
# Model path (relative to models/ directory)
|
||||
model_path: "smart_turn_v3.onnx"
|
||||
# Using v3.2 GPU model for best performance with RTX 5090
|
||||
model_path: "smart-turn-v3.2-gpu.onnx"
|
||||
|
||||
# Speech-to-Text (faster-whisper)
|
||||
stt:
|
||||
# Model size: tiny, base, small, medium, large-v3
|
||||
model_size: "medium"
|
||||
# Using "small" for faster transcription (was "medium")
|
||||
model_size: "small"
|
||||
|
||||
# Device: cuda or cpu
|
||||
device: "cuda"
|
||||
|
|
@ -109,7 +121,8 @@ pipeline:
|
|||
compute_type: "float16"
|
||||
|
||||
# Beam size for decoding (higher = more accurate, slower)
|
||||
beam_size: 5
|
||||
# Optimized for voice chat: beam_size=1 is 3-5x faster with minimal quality loss
|
||||
beam_size: 1
|
||||
|
||||
# Language hint (null = auto-detect)
|
||||
language: "en"
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue