openclaw-voice/KANI_TTS_EVALUATION.md
MCKRUZ 2f17d4847d docs: Add Kani-TTS-2 evaluation and RTX 5090 compatibility analysis
## Kani-TTS-2 Research
- Evaluated Kani-TTS-2 as potential TTS upgrade (3-4x faster, RTF 0.2)
- Documented benefits: zero-shot voice cloning, Apache 2.0 license, 3GB VRAM
- Identified Windows compatibility issues (pynini compilation failures)
- Created test script for future evaluation when Windows support improves

## RTX 5090 Critical Finding
- Discovered RTX 5090 (Blackwell sm_120) not supported by PyTorch
- Tested stable (2.6.0) and nightly (2.7.0.dev) - both lack sm_120 support
- Documented impact: GPU acceleration unavailable for STT/TTS
- Performance degradation: 3.5s target → 10-15s actual (CPU-only)

## Files Added
- KANI_TTS_EVALUATION.md - Comprehensive Kani-TTS-2 analysis
- RTX_5090_BLOCKER.md - GPU compatibility report with solutions
- test_kani_tts.py - Benchmark script for future testing
- fix_pytorch_cuda.bat - GPU setup script (for when support lands)

## Recommendations
- Wait 1-3 months for PyTorch sm_120 support
- Monitor PyTorch releases weekly
- Alternative: Cloud GPU (RTX 4090) or different local GPU
- Current: CPU-only mode functional but slow

## Next Steps
- Monitor: https://github.com/pytorch/pytorch/releases
- Test when available: pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu124
- Re-evaluate Kani-TTS-2 after GPU support

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-16 19:53:52 -05:00


Kani-TTS-2 Evaluation Report

Date: February 16, 2026
System: Windows 11, RTX 5090 (32GB VRAM)


Summary

Status: Cannot test Kani-TTS-2 on Windows (compilation issues)

Installing Kani-TTS-2 on Windows failed with dependency compilation errors. In addition, the current environment has a CPU-only PyTorch build installed, so the RTX 5090 is not being used.


Issues Discovered

1. PyTorch CPU-Only Installation

Current Status:

PyTorch: 2.10.0+cpu
CUDA available: False
CUDA version: N/A

Impact:

  • Current TTS (Coqui XTTS v2) may not be using GPU acceleration
  • Kani-TTS-2 requires CUDA-enabled PyTorch
  • STT (faster-whisper) may not be using GPU acceleration

Required: PyTorch with CUDA 12.x support
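
The status above can be reproduced with a short diagnostic. The following is a minimal sketch, not the script used for this report: the helper name `describe_torch` is hypothetical, and it relies only on `torch.__version__`, `torch.version.cuda`, and the standard `torch.cuda` query functions. It returns a result even when PyTorch is missing or CPU-only.

```python
import importlib.util


def describe_torch():
    """Return a dict describing the local PyTorch build and GPU visibility."""
    # Avoid an ImportError when PyTorch is not installed at all.
    if importlib.util.find_spec("torch") is None:
        return {"installed": False}

    import torch

    info = {
        "installed": True,
        "version": torch.__version__,      # e.g. "2.10.0+cpu"
        "cuda_build": torch.version.cuda,  # None for CPU-only wheels
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        major, minor = torch.cuda.get_device_capability(0)
        info["device"] = torch.cuda.get_device_name(0)
        info["compute_capability"] = f"sm_{major}{minor}"
        # Architectures the installed wheel was compiled for.
        info["arch_list"] = torch.cuda.get_arch_list()
    return info


if __name__ == "__main__":
    for key, value in describe_torch().items():
        print(f"{key}: {value}")
```

If `cuda_available` is true but the reported `compute_capability` does not appear in `arch_list`, the installed wheel was most likely not built for that GPU generation.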

2. Kani-TTS-2 Installation Failure

Error:

Failed building wheel for pynini
error: command 'cl.exe' failed with exit code 2

Root Cause:

  • nemo-toolkit dependency requires pynini
  • pynini compilation uses GCC/Clang flags (-Wno-register) incompatible with MSVC compiler
  • No pre-built Windows wheels available for pynini==2.1.6.post1

Dependency Chain:

kani-tts-2 → nemo-toolkit[tts]==2.4.0 → pynini → [COMPILATION FAILED]
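
Because the failure sits two levels down the dependency chain, it can help to check for the problematic modules before attempting a full install. A minimal pre-flight sketch follows; the function names are hypothetical, and the import names `pynini` and `nemo` (for nemo-toolkit) are assumptions about how those packages are imported.

```python
import importlib.util
import platform


def missing_modules(names=("pynini", "nemo")):
    """Return the top-level import names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def preflight():
    """Summarize whether the Kani-TTS-2 prerequisites look installable."""
    missing = missing_modules()
    on_windows = platform.system() == "Windows"
    if "pynini" in missing and on_windows:
        # No pre-built wheel: pip will fall back to a source build with cl.exe.
        return "pynini missing on Windows: expect a source build with cl.exe"
    if missing:
        return "missing: " + ", ".join(missing)
    return "ok"


if __name__ == "__main__":
    print(preflight())
```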

Kani-TTS-2 Pros & Cons (Based on Documentation)

Potential Benefits

  • 3-4x faster generation - RTF of 0.2 vs current 0.78
  • Zero-shot voice cloning - No fine-tuning needed
  • Comparable VRAM usage - 3GB vs current 2-3GB
  • Simple API - Clean Python interface
  • Commercial-friendly license - Apache 2.0
  • Fast training - 10k hours of audio in 6 hours on 8x H100
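
The "3-4x faster" figure follows from the two RTF values. Assuming the usual definition of real-time factor (synthesis time divided by the duration of the audio produced, so lower is better), the arithmetic can be sketched as below; the function names are illustrative.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup(current_rtf: float, candidate_rtf: float) -> float:
    """How many times faster the candidate is than the current engine."""
    return current_rtf / candidate_rtf


if __name__ == "__main__":
    # Figures from this report: current engine 0.78, Kani-TTS-2 (documented) 0.2.
    print(f"speedup: {speedup(0.78, 0.2):.1f}x")
```

Note that the 0.2 figure comes from the Kani-TTS-2 documentation and was not measured on this system.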

Challenges

  • Windows compatibility - Compilation issues with dependencies
  • Requires nemo-toolkit - Heavy dependency with C++ compilation
  • English-only - Current version limited to English
  • Quality unknown - Cannot test without successful installation
  • Streaming support - Not documented, unclear if supported


Alternative Solutions

Option 1: Fix PyTorch CUDA

Goal: Get the current system using the GPU properly and enable future testing

Steps:

  1. Uninstall CPU PyTorch:

    pip uninstall torch torchaudio torchvision
    
  2. Install CUDA PyTorch:

    # Blackwell (sm_120) needs a CUDA 12.8+ build; cu121 wheels do not target it
    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
    
  3. Verify:

    import torch
    print(torch.cuda.is_available())  # Should be True
    print(torch.cuda.get_device_name(0))  # Should show RTX 5090
    

Impact:

  • Current Coqui XTTS v2 will use GPU (faster)
  • faster-whisper STT will use GPU (faster)
  • Enables future Kani-TTS-2 testing

Option 2: Use WSL2 or Docker (Linux Environment)

Goal: Run Kani-TTS-2 in Linux where dependencies compile properly

Setup WSL2:

# Install WSL2 with Ubuntu
wsl --install -d Ubuntu-24.04

# Install CUDA in WSL
# Follow: https://docs.nvidia.com/cuda/wsl-user-guide/

# Clone repo and test in WSL
cd /mnt/c/Users/kruz7/...
python test_kani_tts.py

Pros:

  • Native Linux environment, better compatibility
  • Access to Windows GPU via WSL-CUDA
  • Can test Kani-TTS-2 properly

Cons:

  • Additional setup complexity
  • Need to manage two environments

Option 3: Wait for Windows Support

Goal: Wait for Kani-TTS-2 to release Windows pre-built wheels

Timeline:

Meanwhile:

  • Stick with current Coqui XTTS v2
  • Focus on other optimizations (query routing, caching, streaming)

Option 4: Alternative TTS Engines

Consider other fast TTS options with better Windows support:

A. Piper TTS

  • Very fast (RTF ~0.1)
  • Lightweight, runs on CPU
  • Pre-built Windows binaries
  • Good quality
  • Con: Limited voice cloning

B. Bark

  • High quality
  • Good voice cloning
  • Con: Slower than current setup

C. StyleTTS2

  • Excellent quality
  • Zero-shot voice cloning
  • Con: Slower, complex setup

Recommendation

Immediate Action: Fix PyTorch CUDA

Priority: HIGH - This affects current system performance

# From project root with venv activated
pip uninstall torch torchaudio torchvision -y
# Blackwell (sm_120) needs a CUDA 12.8+ build; cu121 wheels do not target it
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

Verify:

python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}')"

Expected Improvement:

  • Current TTS latency: 1.63s → ~0.8-1.0s (using GPU)
  • STT latency: 0.55s → ~0.3-0.4s (faster on GPU)
  • Total: ~5.5s → ~4.5s (closer to the 3.5s target; sum of the per-stage estimates)

Kani-TTS-2 Strategy

Short-term (Next Week):

  • Focus on optimizing current Coqui XTTS v2 with GPU
  • Implement additional TTS caching
  • Optimize streaming chunk size
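
As an illustration of the caching item above, repeated phrases (greetings, confirmations, error messages) can skip synthesis entirely. The sketch below is a small LRU cache and not the project's implementation: `TTSCache` and the `synthesize(text, voice) -> bytes` callable are hypothetical stand-ins for whatever wraps the TTS engine.

```python
import hashlib
from collections import OrderedDict


class TTSCache:
    """Small LRU cache mapping (voice, text) to synthesized audio bytes."""

    def __init__(self, synthesize, max_entries=256):
        self._synthesize = synthesize  # hypothetical: (text, voice) -> bytes
        self._max = max_entries
        self._store = OrderedDict()
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(text, voice):
        # Collapse whitespace and case so trivially different inputs share an entry.
        normalized = " ".join(text.lower().split())
        return hashlib.sha256(f"{voice}\x00{normalized}".encode("utf-8")).hexdigest()

    def get(self, text, voice="default"):
        key = self._key(text, voice)
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._store[key]
        self.misses += 1
        audio = self._synthesize(text, voice)
        self._store[key] = audio
        if len(self._store) > self._max:
            self._store.popitem(last=False)  # evict least recently used
        return audio
```

Lowercasing the key is a design choice that trades a small risk of prosody differences for a higher hit rate; drop it if the engine treats capitalization as meaningful.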

Medium-term (Next Month):

  • Monitor Kani-TTS-2 for Windows wheel releases
  • Test in WSL2 if critical for evaluation
  • Evaluate Piper TTS as alternative

Long-term (Next Quarter):

  • Revisit Kani-TTS-2 when Windows support matures
  • Consider migration to Linux host if TTS performance critical

Current Performance Baseline

Based on README.md:

| Stage | Current | Target | Status |
|---|---|---|---|
| VAD silence detection | 800ms | 800ms | |
| STT (medium) | 550ms | 300ms | ⚠️ (CPU-only) |
| OpenClaw/LLM | 2470ms | 2000ms | |
| TTS first chunk | 1630ms | 300ms | (CPU-only?) |
| Total | ~5.5s | ~3.5s | ⚠️ |

With GPU PyTorch (estimated):

| Stage | With CUDA | Improvement |
|---|---|---|
| STT | ~350ms | 1.6x faster |
| TTS | ~900ms | 1.8x faster |
| Total | ~4.5s | 1.2x faster |

Still short of 3.5s target, but closer. Kani-TTS-2 could bridge the gap if Windows support improves.
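
The totals are sums of the per-stage latencies, so they can be checked mechanically. A minimal sketch using the stage figures from this report (the dictionary keys are illustrative names, not identifiers from the codebase):

```python
TARGET_MS = 3500

# Measured baseline, from the table based on README.md.
current = {"vad": 800, "stt": 550, "llm": 2470, "tts_first_chunk": 1630}

# Estimated figures with CUDA-enabled PyTorch; VAD and LLM are unchanged.
with_cuda = {**current, "stt": 350, "tts_first_chunk": 900}


def total_ms(stages):
    """Sum the per-stage latencies in milliseconds."""
    return sum(stages.values())


def report(label, stages):
    """Format the total and its distance from the target."""
    total = total_ms(stages)
    return f"{label}: {total} ms ({total - TARGET_MS:+d} ms vs target)"


if __name__ == "__main__":
    print(report("current", current))
    print(report("with CUDA (estimated)", with_cuda))
```

With these figures the estimated CUDA total comes to 4520 ms, about 1.0s over the target. VAD and the LLM call together account for 3270 ms, which leaves roughly 230 ms for STT and TTS combined if the 3.5s target is to be met without changes to those two stages.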


Next Steps

  1. Fix PyTorch CUDA (see Option 1 above)
  2. 🔄 Re-benchmark current system with GPU acceleration
  3. 📊 Measure actual improvement in TTS latency
  4. 🔍 Evaluate if the estimated ~4.5s total latency is acceptable
  5. 🕐 Monitor Kani-TTS-2 for Windows support
  6. 🧪 Test Piper TTS as lightweight alternative
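
For the re-benchmarking steps, a small timing harness keeps before/after numbers comparable. This is a sketch under stated assumptions: `benchmark` is a hypothetical helper, and `fn` stands for any zero-argument callable that runs one STT or TTS request.

```python
import statistics
import time


def benchmark(fn, runs=5, warmup=1):
    """Time fn() over several runs and return latency statistics in ms."""
    for _ in range(warmup):
        fn()  # warm-up runs absorb model loading and cache effects
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "runs": runs,
        "median_ms": statistics.median(samples),
        "min_ms": min(samples),
        "max_ms": max(samples),
    }


if __name__ == "__main__":
    # Placeholder workload; replace with a real synthesis or transcription call.
    print(benchmark(lambda: sum(range(100_000))))
```

When timing GPU work, the callable should block until the result is ready (for PyTorch, by calling `torch.cuda.synchronize()` before returning), since CUDA kernels launch asynchronously and would otherwise be under-measured.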

Conclusion: Kani-TTS-2 shows promise (3-4x faster) but Windows compatibility issues prevent testing. Immediate priority should be fixing PyTorch CUDA to improve current system performance, then revisit Kani-TTS-2 when Windows support improves or via WSL2.