## Kani-TTS-2 Research
- Evaluated Kani-TTS-2 as potential TTS upgrade (3-4x faster, RTF 0.2)
- Documented benefits: zero-shot voice cloning, Apache 2.0 license, 3GB VRAM
- Identified Windows compatibility issues (pynini compilation failures)
- Created test script for future evaluation when Windows support improves

## RTX 5090 Critical Finding
- Discovered RTX 5090 (Blackwell sm_120) not supported by PyTorch
- Tested stable (2.6.0) and nightly (2.7.0.dev) - both lack sm_120 support
- Documented impact: GPU acceleration unavailable for STT/TTS
- Performance degradation: 3.5s target → 10-15s actual (CPU-only)

## Files Added
- KANI_TTS_EVALUATION.md - Comprehensive Kani-TTS-2 analysis
- RTX_5090_BLOCKER.md - GPU compatibility report with solutions
- test_kani_tts.py - Benchmark script for future testing
- fix_pytorch_cuda.bat - GPU setup script (for when support lands)

## Recommendations
- Wait 1-3 months for PyTorch sm_120 support
- Monitor PyTorch releases weekly
- Alternative: Cloud GPU (RTX 4090) or different local GPU
- Current: CPU-only mode functional but slow

## Next Steps
- Monitor: https://github.com/pytorch/pytorch/releases
- Test when available: pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu124
- Re-evaluate Kani-TTS-2 after GPU support

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

# Kani-TTS-2 Evaluation Report

**Date:** February 16, 2026
**System:** Windows 11, RTX 5090 (32GB VRAM)

---

## Summary

**Status:** ❌ **Cannot test Kani-TTS-2 on Windows** (compilation issues)

Installing Kani-TTS-2 on Windows failed with dependency compilation errors. In addition, the current environment has a CPU-only PyTorch build installed, even though the machine has an RTX 5090.

---

## Issues Discovered

### 1. PyTorch CPU-Only Installation

**Current Status:**
```
PyTorch: 2.10.0+cpu
CUDA available: False
CUDA version: N/A
```

**Impact:**
- Current TTS (Coqui XTTS v2) depends on PyTorch, so it runs on CPU with this build
- Kani-TTS-2 requires CUDA-enabled PyTorch
- STT (faster-whisper) may not be using GPU acceleration

**Required:** PyTorch built with CUDA 12.8 or newer (the RTX 5090 is a Blackwell GPU, sm_120, which older CUDA 12.x builds do not target)

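A small diagnostic along these lines could reproduce the status block above and also show whether the installed build was compiled for the GPU's architecture. This is a sketch under stated assumptions: `arch_supported` and `describe_torch` are hypothetical helper names, and the check does a plain string match against the `sm_XY` entries returned by `torch.cuda.get_arch_list()`, ignoring PTX forward compatibility.

```python
def arch_supported(capability, arch_list):
    """Return True if a (major, minor) compute capability appears in arch_list.

    Example: (12, 0) is looked up as "sm_120". PTX forward compatibility is ignored.
    """
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


def describe_torch():
    """Collect the fields shown in the status block; safe to call without torch installed."""
    try:
        import torch
    except ImportError:
        return {"installed": False}

    info = {
        "installed": True,
        "version": torch.__version__,
        "cuda_available": torch.cuda.is_available(),
        "cuda_version": torch.version.cuda,  # None on CPU-only builds
    }
    if info["cuda_available"]:
        capability = torch.cuda.get_device_capability(0)
        info["device"] = torch.cuda.get_device_name(0)
        info["arch_supported"] = arch_supported(capability, torch.cuda.get_arch_list())
    return info


if __name__ == "__main__":
    for key, value in describe_torch().items():
        print(f"{key}: {value}")
```

On a CPU-only build the `device` and `arch_supported` fields are simply absent, which matches the "CUDA version: N/A" line in the status above.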
### 2. Kani-TTS-2 Installation Failure

**Error:**
```
Failed building wheel for pynini
error: command 'cl.exe' failed with exit code 2
```

**Root Cause:**
- `nemo-toolkit` dependency requires `pynini`
- `pynini` compilation uses GCC/Clang flags (`-Wno-register`) incompatible with MSVC compiler
- No pre-built Windows wheels available for `pynini==2.1.6.post1`

**Dependency Chain:**
```
kani-tts-2 → nemo-toolkit[tts]==2.4.0 → pynini → [COMPILATION FAILED]
```
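Because the failure comes from a transitive dependency, a pre-flight check before attempting the install can save a long failed build. The sketch below is only an illustration: `pynini_preflight` is a hypothetical helper, and it checks nothing more than whether `pynini` is already importable and whether the platform is Windows, where this report found no pre-built wheel.

```python
import importlib.util
import sys


def pynini_preflight(platform=sys.platform, module_name="pynini"):
    """Return a short status string describing whether the install is likely to work."""
    if importlib.util.find_spec(module_name) is not None:
        return "ok: pynini already importable"
    if platform.startswith("win"):
        # Per this report: no pre-built Windows wheel, and the source build fails under MSVC.
        return "blocked: pynini needs a source build that fails under MSVC; try WSL2 or Linux"
    return "unknown: pynini not installed; pip will try a wheel or a source build"


if __name__ == "__main__":
    print(pynini_preflight())
```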
---

## Kani-TTS-2 Pros & Cons (Based on Documentation)

### Potential Benefits

- ✅ **3-4x faster generation** - RTF of 0.2 vs current 0.78
- ✅ **Zero-shot voice cloning** - No fine-tuning needed
- ✅ **Similar VRAM usage** - 3GB vs current 2-3GB
- ✅ **Simple API** - Clean Python interface
- ✅ **Commercial license** - Apache 2.0
- ✅ **Fast training** - 10k hours in 6 hours on 8x H100

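Real-time factor (RTF) is synthesis time divided by the duration of the audio produced, so lower is better and values below 1.0 are faster than real time. The sketch below shows how the 3-4x figure follows from the two RTF values quoted above. `measure_rtf` and its `synthesize` callable are hypothetical stand-ins for whatever engine is being benchmarked, not part of any TTS library's API.

```python
import time


def rtf(synthesis_seconds, audio_seconds):
    """Real-time factor: time spent generating / duration of generated audio."""
    return synthesis_seconds / audio_seconds


def speedup(current_rtf, candidate_rtf):
    """How many times faster the candidate produces the same amount of audio."""
    return current_rtf / candidate_rtf


def measure_rtf(synthesize, text, sample_rate):
    """Time one call to synthesize(text), assumed to return a sequence of audio samples."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    return rtf(elapsed, len(samples) / sample_rate)


if __name__ == "__main__":
    # Figures quoted in this report: current 0.78, Kani-TTS-2 0.2 (from its documentation).
    print(f"speedup: {speedup(0.78, 0.2):.1f}x")
```

An RTF of 0.2 means ten seconds of speech takes about two seconds to generate.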
### Challenges

- ❌ **Windows compatibility** - Compilation issues with dependencies
- ❌ **Requires nemo-toolkit** - Heavy dependency with C++ compilation
- ❌ **English-only** - Current version limited to English
- ❓ **Quality unknown** - Cannot test without successful installation
- ❓ **Streaming support** - Not documented, unclear if supported

---

## Alternative Solutions

### Option 1: Fix PyTorch CUDA Installation (Recommended)

**Goal:** Get the current system using the GPU properly and enable future testing

**Steps:**
1. Uninstall CPU PyTorch:
```bash
pip uninstall torch torchaudio torchvision
```

2. Install CUDA PyTorch:
```bash
# The RTX 5090 (Blackwell, sm_120) needs a CUDA 12.8+ build; cu121 wheels do not include sm_120 kernels
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
```

3. Verify:
```python
import torch
print(torch.cuda.is_available())  # Should be True
print(torch.cuda.get_device_name(0))  # Should show RTX 5090
```

**Impact:**
- Current Coqui XTTS v2 will use GPU (faster)
- faster-whisper STT can use GPU (faster), provided its own CUDA libraries are present, since it runs on CTranslate2 rather than PyTorch
- Enables future Kani-TTS-2 testing

### Option 2: Use WSL2 or Docker (Linux Environment)

**Goal:** Run Kani-TTS-2 in Linux where dependencies compile properly

**Setup WSL2:**
```bash
# Install WSL2 with Ubuntu (run from a Windows terminal)
wsl --install -d Ubuntu-24.04

# Install CUDA in WSL
# Follow: https://docs.nvidia.com/cuda/wsl-user-guide/

# Inside WSL: go to the project checkout on the Windows drive and run the test
cd /mnt/c/Users/kruz7/...
python test_kani_tts.py
```

**Pros:**
- Native Linux environment, better compatibility
- Access to Windows GPU via WSL-CUDA
- Can test Kani-TTS-2 properly

**Cons:**
- Additional setup complexity
- Need to manage two environments

### Option 3: Wait for Windows Support

**Goal:** Wait for Kani-TTS-2 to release Windows pre-built wheels

**Timeline:**
- Kani-TTS-2 is very new (Feb 2025)
- Windows wheels may be released in future versions
- Monitor: https://pypi.org/project/kani-tts-2/

**Meanwhile:**
- Stick with current Coqui XTTS v2
- Focus on other optimizations (query routing, caching, streaming)

### Option 4: Alternative TTS Engines

Consider other fast TTS options with better Windows support:

**A. Piper TTS**
- Very fast (RTF ~0.1)
- Lightweight, runs on CPU
- Pre-built Windows binaries
- Good quality
- Con: Limited voice cloning

**B. Bark**
- High quality
- Good voice cloning
- Con: Slower than current setup

**C. StyleTTS2**
- Excellent quality
- Zero-shot voice cloning
- Con: Slower, complex setup

---

## Recommendation

### Immediate Action: Fix PyTorch CUDA

**Priority: HIGH** - This affects current system performance

```bash
# From project root with venv activated
pip uninstall torch torchaudio torchvision -y
# cu128 build: the RTX 5090 (sm_120) is not covered by the cu121 wheels
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
```

**Verify:**
```bash
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}')"
```

**Expected Improvement:**
- Current TTS latency: 1.63s → ~0.8-1.0s (using GPU)
- STT latency: 0.55s → ~0.3-0.4s (faster on GPU)
- Total: ~5.5s → ~4.5s (closer to 3.5s target)

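The expected total can be checked by summing per-stage latencies. The sketch below uses the per-stage figures from this report's performance baseline table, with VAD and LLM left unchanged and mid-range GPU estimates for STT and TTS. It assumes the stages run strictly one after another, and the dictionary keys are illustrative labels only.

```python
# Per-stage latencies in milliseconds, taken from the performance baseline table.
CURRENT_MS = {"vad": 800, "stt": 550, "llm": 2470, "tts_first_chunk": 1630}
TARGET_MS = {"vad": 800, "stt": 300, "llm": 2000, "tts_first_chunk": 300}
# Estimated GPU figures for the two stages expected to change.
GPU_ESTIMATE_MS = {"stt": 350, "tts_first_chunk": 900}


def total_ms(stages):
    """End-to-end latency when stages run sequentially."""
    return sum(stages.values())


def with_overrides(stages, overrides):
    """Copy of stages with some entries replaced by new estimates."""
    return {**stages, **overrides}


if __name__ == "__main__":
    gpu = with_overrides(CURRENT_MS, GPU_ESTIMATE_MS)
    print(f"current: {total_ms(CURRENT_MS)} ms")
    print(f"target:  {total_ms(TARGET_MS)} ms")
    print(f"gpu est: {total_ms(gpu)} ms ({total_ms(CURRENT_MS) / total_ms(gpu):.1f}x faster)")
    for stage, ms in gpu.items():
        gap = ms - TARGET_MS[stage]
        if gap > 0:
            print(f"  {stage}: {gap} ms over target")
```

With these inputs the stages sum to 5450 ms today and 4520 ms with the GPU estimates, so the remaining gap to the target comes from the LLM and TTS stages.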
### Kani-TTS-2 Strategy

**Short-term (Next Week):**
- Focus on optimizing current Coqui XTTS v2 with GPU
- Implement additional TTS caching
- Optimize streaming chunk size

**Medium-term (Next Month):**
- Monitor Kani-TTS-2 for Windows wheel releases
- Test in WSL2 if critical for evaluation
- Evaluate Piper TTS as alternative

**Long-term (Next Quarter):**
- Revisit Kani-TTS-2 when Windows support matures
- Consider migration to Linux host if TTS performance critical

---

## Current Performance Baseline

Based on README.md:

| Stage | Current | Target | Status |
|-------|---------|--------|--------|
| VAD silence detection | 800ms | 800ms | ✅ |
| STT (medium) | 550ms | 300ms | ⚠️ (CPU-only) |
| OpenClaw/LLM | 2470ms | 2000ms | ⚠️ |
| TTS first chunk | 1630ms | 300ms | ❌ (CPU-only?) |
| **Total** | **~5.5s** | **~3.5s** | ⚠️ |

**With GPU PyTorch (estimated):**

| Stage | With CUDA | Improvement |
|-------|-----------|-------------|
| STT | ~350ms | 1.6x faster |
| TTS | ~900ms | 1.8x faster |
| **Total** | **~4.5s** | **1.2x faster** |

The estimated total is the sum of the unchanged VAD (800ms) and LLM (2470ms) stages plus the GPU estimates for STT and TTS. That is still about 1s short of the 3.5s target, but closer. Kani-TTS-2 could help close the gap if Windows support improves.

---

## Next Steps

1. ✅ **Fix PyTorch CUDA** (see Option 1 above)
2. 🔄 **Re-benchmark current system** with GPU acceleration
3. 📊 **Measure actual improvement** in TTS latency
4. 🔍 **Evaluate if ~4.5s total latency** is acceptable
5. 🕐 **Monitor Kani-TTS-2** for Windows support
6. 🧪 **Test Piper TTS** as lightweight alternative

---

## References

- [Kani-TTS-2 GitHub](https://github.com/nineninesix-ai/kani-tts-2)
- [Kani-TTS-2 HuggingFace](https://huggingface.co/nineninesix/kani-tts-2-en)
- [PyTorch CUDA Installation](https://pytorch.org/get-started/locally/)
- [WSL CUDA Setup](https://docs.nvidia.com/cuda/wsl-user-guide/)
- [Piper TTS](https://github.com/rhasspy/piper)
- [StyleTTS2](https://github.com/yl4579/StyleTTS2)

---

**Conclusion:** Kani-TTS-2 shows promise (3-4x faster) but Windows compatibility issues prevent testing. **Immediate priority should be fixing PyTorch CUDA** to improve current system performance, then revisit Kani-TTS-2 when Windows support improves or via WSL2.