## Kani-TTS-2 Research
- Evaluated Kani-TTS-2 as potential TTS upgrade (3-4x faster, RTF 0.2)
- Documented benefits: zero-shot voice cloning, Apache 2.0 license, 3GB VRAM
- Identified Windows compatibility issues (pynini compilation failures)
- Created test script for future evaluation when Windows support improves

## RTX 5090 Critical Finding
- Discovered RTX 5090 (Blackwell sm_120) not supported by PyTorch
- Tested stable (2.6.0) and nightly (2.7.0.dev) - both lack sm_120 support
- Documented impact: GPU acceleration unavailable for STT/TTS
- Performance degradation: 3.5s target → 10-15s actual (CPU-only)

## Files Added
- KANI_TTS_EVALUATION.md - Comprehensive Kani-TTS-2 analysis
- RTX_5090_BLOCKER.md - GPU compatibility report with solutions
- test_kani_tts.py - Benchmark script for future testing
- fix_pytorch_cuda.bat - GPU setup script (for when support lands)

## Recommendations
- Wait 1-3 months for PyTorch sm_120 support
- Monitor PyTorch releases weekly
- Alternative: Cloud GPU (RTX 4090) or different local GPU
- Current: CPU-only mode functional but slow

## Next Steps
- Monitor: https://github.com/pytorch/pytorch/releases
- Test when available: pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu124
- Re-evaluate Kani-TTS-2 after GPU support

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Kani-TTS-2 Evaluation Report
Date: February 16, 2026
System: Windows 11, RTX 5090 (32GB VRAM)
Summary
Status: ❌ Cannot test Kani-TTS-2 on Windows (compilation issues)
Installing Kani-TTS-2 on Windows failed with critical dependency compilation errors. In addition, the current environment has a CPU-only PyTorch build installed, even though an RTX 5090 is present.
Issues Discovered
1. PyTorch CPU-Only Installation
Current Status:
PyTorch: 2.10.0+cpu
CUDA available: False
CUDA version: N/A
Impact:
- Current TTS (Coqui XTTS v2) may not be using GPU acceleration
- Kani-TTS-2 requires CUDA-enabled PyTorch
- STT (faster-whisper) may not be using GPU acceleration
Required: PyTorch with CUDA 12.x support
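The status lines above can be reproduced with a short diagnostic. The following is a minimal sketch: the helper name `torch_status` is hypothetical, and it reports a missing install rather than raising if PyTorch is absent.

```python
import importlib


def torch_status():
    """Return a small dict describing the local PyTorch install."""
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # PyTorch is not installed at all in this environment.
        return {
            "installed": False,
            "version": None,
            "cuda_available": False,
            "cuda_version": None,
        }
    return {
        "installed": True,
        "version": torch.__version__,                 # a "+cpu" suffix marks a CPU-only wheel
        "cuda_available": torch.cuda.is_available(),
        "cuda_version": torch.version.cuda,           # None on CPU-only builds
    }


if __name__ == "__main__":
    for key, value in torch_status().items():
        print(f"{key}: {value}")
```

Running this before and after any reinstall gives a like-for-like record of what changed.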
2. Kani-TTS-2 Installation Failure
Error:
Failed building wheel for pynini
error: command 'cl.exe' failed with exit code 2
Root Cause:
- nemo-toolkit dependency requires pynini
- pynini compilation uses GCC/Clang flags (-Wno-register) incompatible with the MSVC compiler
- No pre-built Windows wheels available for pynini==2.1.6.post1
Dependency Chain:
kani-tts-2 → nemo-toolkit[tts]==2.4.0 → pynini → [COMPILATION FAILED]
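Because the failure sits two levels down the dependency chain, a benchmark script can check for the blocking packages up front and exit with a readable message instead of a build log. This is a sketch only: `has_module` and `kani_preflight` are hypothetical helper names, and the module names checked (`pynini`, `torch`) are assumed to match the package names above.

```python
import importlib.util
import platform


def has_module(name):
    """True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(name) is not None


def kani_preflight(system=None):
    """Return a list of human-readable blockers; an empty list means none were found."""
    system = system or platform.system()
    blockers = []
    if not has_module("pynini"):
        message = "pynini is not installed"
        if system == "Windows":
            # Per this report: no pre-built Windows wheel, and the source build failed under MSVC.
            message += " (no Windows wheel; the source build failed under MSVC)"
        blockers.append(message)
    if not has_module("torch"):
        blockers.append("torch is not installed")
    return blockers


if __name__ == "__main__":
    problems = kani_preflight()
    print("OK to attempt Kani-TTS-2 import" if not problems else "\n".join(problems))
```

The check only looks for installed modules; it does not confirm that Kani-TTS-2 itself runs.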
Kani-TTS-2 Pros & Cons (Based on Documentation)
Potential Benefits
✅ 3-4x faster generation - RTF of 0.2 vs current 0.78
✅ Zero-shot voice cloning - No fine-tuning needed
✅ Comparable VRAM usage - 3GB vs current 2-3GB
✅ Simple API - Clean Python interface
✅ Commercial license - Apache 2.0
✅ Fast training - 10k hours in 6 hours on 8x H100
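For reference, real-time factor (RTF) is conventionally synthesis time divided by the duration of the audio produced, so lower is better. The sketch below works through the two RTF figures quoted above; the function names are illustrative and the 10-second clip is an assumed example, not a measurement.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of the audio produced."""
    return synthesis_seconds / audio_seconds


def speedup(current_rtf, candidate_rtf):
    """How many times faster the candidate is at the same audio length."""
    return current_rtf / candidate_rtf


if __name__ == "__main__":
    clip_seconds = 10.0  # assumed clip length, for illustration only
    print(f"current:   {0.78 * clip_seconds:.1f}s to synthesize {clip_seconds:.0f}s of audio")
    print(f"candidate: {0.2 * clip_seconds:.1f}s to synthesize {clip_seconds:.0f}s of audio")
    print(f"speedup:   {speedup(0.78, 0.2):.1f}x")
```

The ratio 0.78 / 0.2 comes to about 3.9, which is where the "3-4x faster" figure comes from.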
Challenges
❌ Windows compatibility - Compilation issues with dependencies
❌ Requires nemo-toolkit - Heavy dependency with C++ compilation
❌ English-only - Current version limited to English
❓ Quality unknown - Cannot test without successful installation
❓ Streaming support - Not documented, unclear if supported
Alternative Solutions
Option 1: Fix PyTorch CUDA Installation (Recommended)
Goal: Get current system using GPU properly + enable future testing
Steps:
- Uninstall CPU PyTorch:
pip uninstall torch torchaudio torchvision
- Install CUDA PyTorch:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
- Verify:
import torch
print(torch.cuda.is_available())  # Should be True
print(torch.cuda.get_device_name(0))  # Should show RTX 5090
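A `True` from `torch.cuda.is_available()` does not by itself confirm that the installed wheel includes kernels for the GPU's architecture. Since the commit notes flag Blackwell `sm_120` as the sticking point, an extra check against `torch.cuda.get_arch_list()` may be worth adding. The helper `build_supports_arch` below is a hypothetical name, and treating `sm_120` as the string to look for is an assumption based on those notes.

```python
def build_supports_arch(arch_list, arch="sm_120"):
    """True if the given compute architecture appears in a build's arch list."""
    return arch in arch_list


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed")
    else:
        # get_arch_list() returns the architectures the wheel was compiled for;
        # it is an empty list when CUDA is unavailable (for example, CPU-only builds).
        archs = torch.cuda.get_arch_list()
        print(f"compiled architectures: {archs}")
        print(f"sm_120 supported: {build_supports_arch(archs)}")
```

If `sm_120` is missing from the list, the wheel from the chosen index was most likely built without support for this GPU, and a different index or a newer release would need to be tried.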
Impact:
- Current Coqui XTTS v2 will use GPU (faster)
- faster-whisper STT will use GPU (faster)
- Enables future Kani-TTS-2 testing
Option 2: Use WSL2 or Docker (Linux Environment)
Goal: Run Kani-TTS-2 in Linux where dependencies compile properly
Setup WSL2:
# Install WSL2 with Ubuntu
wsl --install -d Ubuntu-24.04
# Install CUDA in WSL
# Follow: https://docs.nvidia.com/cuda/wsl-user-guide/
# Clone repo and test in WSL
cd /mnt/c/Users/kruz7/...
python test_kani_tts.py
Pros:
- Native Linux environment, better compatibility
- Access to Windows GPU via WSL-CUDA
- Can test Kani-TTS-2 properly
Cons:
- Additional setup complexity
- Need to manage two environments
Option 3: Wait for Windows Support
Goal: Wait for Kani-TTS-2 to release Windows pre-built wheels
Timeline:
- Kani-TTS-2 is very new (Feb 2025)
- Windows wheels may be released in future versions
- Monitor: https://pypi.org/project/kani-tts-2/
Meanwhile:
- Stick with current Coqui XTTS v2
- Focus on other optimizations (query routing, caching, streaming)
Option 4: Alternative TTS Engines
Consider other fast TTS options with better Windows support:
A. Piper TTS
- Very fast (RTF ~0.1)
- Lightweight, runs on CPU
- Pre-built Windows binaries
- Good quality
- Con: Limited voice cloning
B. Bark
- High quality
- Good voice cloning
- Con: Slower than current setup
C. StyleTTS2
- Excellent quality
- Zero-shot voice cloning
- Con: Slower, complex setup
Recommendation
Immediate Action: Fix PyTorch CUDA
Priority: HIGH - This affects current system performance
# From project root with venv activated
pip uninstall torch torchaudio torchvision -y
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
Verify:
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}')"
Expected Improvement:
- Current TTS latency: 1.63s → ~0.8-1.0s (using GPU)
- STT latency: 0.55s → ~0.3-0.4s (faster on GPU)
- Total: ~5.5s → ~4.0s (closer to 3.5s target)
Kani-TTS-2 Strategy
Short-term (Next Week):
- Focus on optimizing current Coqui XTTS v2 with GPU
- Implement additional TTS caching
- Optimize streaming chunk size
Medium-term (Next Month):
- Monitor Kani-TTS-2 for Windows wheel releases
- Test in WSL2 if critical for evaluation
- Evaluate Piper TTS as alternative
Long-term (Next Quarter):
- Revisit Kani-TTS-2 when Windows support matures
- Consider migration to a Linux host if TTS performance is critical
Current Performance Baseline
Based on README.md:
| Stage | Current | Target | Status |
|---|---|---|---|
| VAD silence detection | 800ms | 800ms | ✅ |
| STT (medium) | 550ms | 300ms | ⚠️ (CPU-only) |
| OpenClaw/LLM | 2470ms | 2000ms | ✅ |
| TTS first chunk | 1630ms | 300ms | ❌ (CPU-only?) |
| Total | ~5.5s | ~3.5s | ⚠️ |
With GPU PyTorch (estimated):
| Stage | With CUDA | Improvement |
|---|---|---|
| STT | ~350ms | 1.6x faster |
| TTS | ~900ms | 1.8x faster |
| Total | ~4.0s | 1.4x faster |
Still short of 3.5s target, but closer. Kani-TTS-2 could bridge the gap if Windows support improves.
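As a sanity check on the totals, the per-stage figures from the two tables can be summed directly. This is a rough sketch: the dictionary keys are illustrative, and it assumes the VAD and LLM stages are unchanged by the move to GPU.

```python
# Per-stage latencies in milliseconds, copied from the baseline table.
BASELINE_MS = {"vad": 800, "stt": 550, "llm": 2470, "tts": 1630}

# Same pipeline with the estimated CUDA figures substituted for STT and TTS.
CUDA_ESTIMATE_MS = {**BASELINE_MS, "stt": 350, "tts": 900}


def total_seconds(stages):
    """Sum stage latencies (ms) and return the total in seconds."""
    return sum(stages.values()) / 1000


if __name__ == "__main__":
    baseline = total_seconds(BASELINE_MS)
    estimated = total_seconds(CUDA_ESTIMATE_MS)
    print(f"baseline total:  {baseline:.2f}s")
    print(f"estimated total: {estimated:.2f}s")
    print(f"overall speedup: {baseline / estimated:.2f}x")
```

Summing only the listed stages gives about 5.45s for the baseline and about 4.52s with the CUDA estimates. The quoted ~4.0s total therefore appears to assume savings beyond the STT and TTS rows, which is worth confirming when re-benchmarking.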
Next Steps
- ✅ Fix PyTorch CUDA (see Option 1 above)
- 🔄 Re-benchmark current system with GPU acceleration
- 📊 Measure actual improvement in TTS latency
- 🔍 Evaluate if 4.0s total latency is acceptable
- 🕐 Monitor Kani-TTS-2 for Windows support
- 🧪 Test Piper TTS as lightweight alternative
References
- Kani-TTS-2 GitHub
- Kani-TTS-2 HuggingFace
- PyTorch CUDA Installation
- WSL CUDA Setup
- Piper TTS
- StyleTTS2
Conclusion: Kani-TTS-2 shows promise (3-4x faster) but Windows compatibility issues prevent testing. Immediate priority should be fixing PyTorch CUDA to improve current system performance, then revisit Kani-TTS-2 when Windows support improves or via WSL2.