MCKRUZ 2f17d4847d docs: Add Kani-TTS-2 evaluation and RTX 5090 compatibility analysis

## Kani-TTS-2 Research
- Evaluated Kani-TTS-2 as potential TTS upgrade (3-4x faster, RTF 0.2)
- Documented benefits: zero-shot voice cloning, Apache 2.0 license, 3GB VRAM
- Identified Windows compatibility issues (pynini compilation failures)
- Created test script for future evaluation when Windows support improves

## RTX 5090 Critical Finding
- Discovered RTX 5090 (Blackwell sm_120) not supported by PyTorch
- Tested stable (2.6.0) and nightly (2.7.0.dev) - both lack sm_120 support
- Documented impact: GPU acceleration unavailable for STT/TTS
- Performance degradation: 3.5s target → 10-15s actual (CPU-only)

## Files Added
- KANI_TTS_EVALUATION.md - Comprehensive Kani-TTS-2 analysis
- RTX_5090_BLOCKER.md - GPU compatibility report with solutions
- test_kani_tts.py - Benchmark script for future testing
- fix_pytorch_cuda.bat - GPU setup script (for when support lands)

## Recommendations
- Wait 1-3 months for PyTorch sm_120 support
- Monitor PyTorch releases weekly
- Alternative: Cloud GPU (RTX 4090) or different local GPU
- Current: CPU-only mode functional but slow

## Next Steps
- Monitor: https://github.com/pytorch/pytorch/releases
- Test when available: pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu124
- Re-evaluate Kani-TTS-2 after GPU support

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2026-02-16 19:53:52 -05:00


Kani-TTS-2 Evaluation Report

Date: February 16, 2026
System: Windows 11, RTX 5090 (32GB VRAM)

Summary

Status: ❌ Cannot test Kani-TTS-2 on Windows (compilation issues)

Installing Kani-TTS-2 on Windows failed with dependency compilation errors. In addition, the current environment has a CPU-only PyTorch build installed even though the machine has an RTX 5090.

Issues Discovered

1. PyTorch CPU-Only Installation

Current Status:

PyTorch: 2.10.0+cpu
CUDA available: False
CUDA version: N/A

Impact:

Current TTS (Coqui XTTS v2) is PyTorch-based, so with a CPU-only build it is running without GPU acceleration
Kani-TTS-2 requires CUDA-enabled PyTorch
STT (faster-whisper) may not be using GPU acceleration

Required: a CUDA-enabled PyTorch build (built against CUDA 12.8 or newer, which is what the RTX 5090's sm_120 architecture needs)

2. Kani-TTS-2 Installation Failure

Error:

Failed building wheel for pynini
error: command 'cl.exe' failed with exit code 2

Root Cause:

nemo-toolkit dependency requires pynini
pynini compilation uses GCC/Clang flags (-Wno-register) incompatible with MSVC compiler
No pre-built Windows wheels available for pynini==2.1.6.post1

Dependency Chain:

kani-tts-2 → nemo-toolkit[tts]==2.4.0 → pynini → [COMPILATION FAILED]
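One way to double-check the missing-wheel diagnosis is to look at the platform tags of the wheel files published for a release. The sketch below parses wheel filenames and reports whether any of them targets 64-bit Windows. The filenames in the example are hypothetical placeholders, not the actual PyPI listing for pynini, and `has_windows_wheel` is an illustrative helper rather than part of any packaging tool.

```python
def platform_tag(wheel_filename: str) -> str:
    """Return the platform tag, which is the last dash-separated field of a wheel name."""
    if not wheel_filename.endswith(".whl"):
        raise ValueError(f"not a wheel: {wheel_filename}")
    stem = wheel_filename[: -len(".whl")]
    return stem.split("-")[-1]


def has_windows_wheel(wheel_filenames: list[str]) -> bool:
    """True if any wheel targets 64-bit Windows or is platform-independent."""
    for name in wheel_filenames:
        # Compressed tag sets join several platform tags with "."
        tags = platform_tag(name).split(".")
        if any(tag in ("win_amd64", "any") for tag in tags):
            return True
    return False


# Hypothetical file listing, for illustration only.
example_files = [
    "examplepkg-1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
    "examplepkg-1.0-cp311-cp311-macosx_11_0_arm64.whl",
]

if __name__ == "__main__":
    print("Windows wheel available:", has_windows_wheel(example_files))
```

When no Windows wheel exists, pip falls back to building from the source distribution, which is where the `cl.exe` failure above comes from.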

Kani-TTS-2 Pros & Cons (Based on Documentation)

Potential Benefits

✅ 3-4x faster generation - RTF of 0.2 vs current 0.78
✅ Zero-shot voice cloning - No fine-tuning needed
✅ Comparable VRAM usage - 3GB vs current 2-3GB
✅ Simple API - Clean Python interface
✅ Commercial license - Apache 2.0
✅ Fast training - 10k hours in 6 hours on 8x H100
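The "3-4x faster" figure follows from the two RTF values. The sketch below assumes the usual definition of real-time factor (synthesis time divided by the duration of the audio produced); the function names are illustrative and not taken from Kani-TTS-2 or Coqui.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup(baseline_rtf: float, candidate_rtf: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_rtf / candidate_rtf


CURRENT_RTF = 0.78  # current setup, per this report
KANI_RTF = 0.2      # Kani-TTS-2, per its documentation

if __name__ == "__main__":
    print(f"Speedup: {speedup(CURRENT_RTF, KANI_RTF):.1f}x")
    # Time to synthesize a 10-second utterance at each RTF
    for label, rtf in (("current", CURRENT_RTF), ("kani-tts-2", KANI_RTF)):
        print(f"{label}: {rtf * 10:.1f}s for 10s of audio")
```

An RTF below 1.0 means audio is generated faster than it plays back, which is the precondition for gap-free streaming.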

Challenges

❌ Windows compatibility - Compilation issues with dependencies
❌ Requires nemo-toolkit - Heavy dependency with C++ compilation
❌ English-only - Current version limited to English
❓ Quality unknown - Cannot test without successful installation
❓ Streaming support - Not documented, unclear if supported

Alternative Solutions

Option 1: Fix PyTorch CUDA Installation (Recommended)

Goal: Get current system using GPU properly + enable future testing

Steps:

Uninstall CPU PyTorch:

pip uninstall torch torchaudio torchvision

Install CUDA PyTorch:

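# RTX 5090 (Blackwell, sm_120) needs wheels built against CUDA 12.8 or newer; cu121 builds do not include sm_120 kernels
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128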

Verify:

import torch
print(torch.cuda.is_available())  # Should be True
print(torch.cuda.get_device_name(0))  # Should show RTX 5090
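`torch.cuda.is_available()` alone does not show whether the installed build ships kernels for the card's architecture. A slightly fuller check, sketched below, compares the device's compute capability against the architectures the wheel was compiled for. `arch_supported` and `cuda_report` are illustrative helper names; the check looks for an exact `sm_XY` entry and does not account for any PTX fallback.

```python
import importlib.util


def arch_supported(arch_list, capability):
    """True if the build lists an exact sm_XY entry for the given (major, minor)."""
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


def cuda_report():
    """Collect PyTorch/CUDA facts; degrade gracefully if torch is not installed."""
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}
    import torch

    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "cuda_build": torch.version.cuda,  # None on CPU-only wheels
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        capability = torch.cuda.get_device_capability(0)  # e.g. (12, 0) for sm_120
        report["device_name"] = torch.cuda.get_device_name(0)
        report["capability"] = capability
        report["arch_supported"] = arch_supported(
            torch.cuda.get_arch_list(), capability
        )
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

On a CPU-only install this prints `cuda_build: None` and `cuda_available: False`, matching the status shown earlier in this report.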

Impact:

Current Coqui XTTS v2 will use GPU (faster)
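faster-whisper STT can also use the GPU, but it runs on CTranslate2 rather than PyTorch, so it additionally needs device="cuda" and the CUDA libraries CTranslate2 expects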
Enables future Kani-TTS-2 testing

Option 2: Use WSL2 or Docker (Linux Environment)

Goal: Run Kani-TTS-2 in Linux where dependencies compile properly

Setup WSL2:

# Install WSL2 with Ubuntu
wsl --install -d Ubuntu-24.04

# Install CUDA in WSL
# Follow: https://docs.nvidia.com/cuda/wsl-user-guide/

# Clone repo and test in WSL
cd /mnt/c/Users/kruz7/...
python test_kani_tts.py
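If the same test script is run from both Windows and WSL2, it can help to record which environment produced a given result. The sketch below uses a common heuristic (WSL kernels report "microsoft" in their release string); the helper names are illustrative, and the heuristic is an assumption rather than an official detection API.

```python
import platform


def looks_like_wsl(kernel_release: str) -> bool:
    """WSL kernels include 'microsoft' (any casing) in the release string."""
    return "microsoft" in kernel_release.lower()


def running_in_wsl() -> bool:
    """True when the interpreter runs on Linux under a WSL kernel."""
    return platform.system() == "Linux" and looks_like_wsl(platform.uname().release)


if __name__ == "__main__":
    environment = "WSL" if running_in_wsl() else platform.system()
    print(f"Benchmark environment: {environment}")
```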

Pros:

Native Linux environment, better compatibility
Access to Windows GPU via WSL-CUDA
Can test Kani-TTS-2 properly

Cons:

Additional setup complexity
Need to manage two environments

Option 3: Wait for Windows Support

Goal: Wait for Kani-TTS-2 to release Windows pre-built wheels

Timeline:

Kani-TTS-2 is very new (Feb 2026)
Windows wheels may be released in future versions
Monitor: https://pypi.org/project/kani-tts-2/

Meanwhile:

Stick with current Coqui XTTS v2
Focus on other optimizations (query routing, caching, streaming)
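Of the optimizations listed, caching is the simplest to prototype independently of the TTS engine. The sketch below is a minimal in-memory cache keyed on the voice and normalized text. `TTSCache` and `fake_synthesize` are hypothetical names; a real integration would pass in whatever function wraps the XTTS call.

```python
import hashlib
from typing import Callable, Dict


class TTSCache:
    """Memoize synthesized audio by (voice, normalized text)."""

    def __init__(self, synthesize: Callable[[str, str], bytes]) -> None:
        self._synthesize = synthesize
        self._store: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(text: str, voice: str) -> str:
        # Collapse whitespace and case so trivially different inputs share an entry
        normalized = " ".join(text.lower().split())
        return hashlib.sha256(f"{voice}\x00{normalized}".encode("utf-8")).hexdigest()

    def get(self, text: str, voice: str = "default") -> bytes:
        key = self._key(text, voice)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        audio = self._synthesize(text, voice)
        self._store[key] = audio
        return audio


def fake_synthesize(text: str, voice: str) -> bytes:
    """Stand-in for the real TTS call (hypothetical)."""
    return f"{voice}:{text}".encode("utf-8")


if __name__ == "__main__":
    cache = TTSCache(fake_synthesize)
    cache.get("One moment please.")
    cache.get("one moment   please.")  # served from cache
    print(f"hits={cache.hits} misses={cache.misses}")
```

A cache like this mostly helps with short, frequently repeated phrases such as acknowledgements and error messages; free-form LLM output will rarely hit.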

Option 4: Alternative TTS Engines

Consider other fast TTS options with better Windows support:

A. Piper TTS

Very fast (RTF ~0.1)
Lightweight, runs on CPU
Pre-built Windows binaries
Good quality
Con: Limited voice cloning

B. Bark

High quality
Good voice cloning
Con: Slower than current setup

C. StyleTTS2

Excellent quality
Zero-shot voice cloning
Con: Slower, complex setup

Recommendation

Immediate Action: Fix PyTorch CUDA

Priority: HIGH - This affects current system performance

# From project root with venv activated
pip uninstall torch torchaudio torchvision -y
# Use a CUDA 12.8+ build: cu121 wheels do not include kernels for the RTX 5090 (sm_120)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

Verify:

python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}')"

Expected Improvement:

Current TTS latency: 1.63s → ~0.8-1.0s (using GPU)
STT latency: 0.55s → ~0.3-0.4s (faster on GPU)
Total: ~5.5s → ~4.4-4.7s (the TTS and STT savings above add up to roughly 0.8-1.1s; still above the 3.5s target)

Kani-TTS-2 Strategy

Short-term (Next Week):

Focus on optimizing current Coqui XTTS v2 with GPU
Implement additional TTS caching
Optimize streaming chunk size
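Chunk size trades first-audio latency against prosody: smaller chunks reach the speaker sooner, but very short ones can sound choppy. The sketch below shows one simple way to experiment with that trade-off by splitting text at sentence boundaries and merging sentences up to a character limit. `split_into_chunks` and the 200-character default are assumptions for illustration, not values taken from the current pipeline.

```python
import re
from typing import List


def split_into_chunks(text: str, max_chars: int = 200) -> List[str]:
    """Split at sentence boundaries, merging sentences while under max_chars."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit: emit what we have
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    reply = "Hello there. How are you? Fine."
    for limit in (15, 200):
        print(limit, split_into_chunks(reply, max_chars=limit))
```

A single sentence longer than `max_chars` is kept whole rather than cut mid-sentence.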

Medium-term (Next Month):

Monitor Kani-TTS-2 for Windows wheel releases
Test in WSL2 if critical for evaluation
Evaluate Piper TTS as alternative

Long-term (Next Quarter):

Revisit Kani-TTS-2 when Windows support matures
Consider migration to Linux host if TTS performance critical

Current Performance Baseline

Based on README.md:

Stage	Current	Target	Status
VAD silence detection	800ms	800ms	✅
STT (medium)	550ms	300ms	⚠️ (CPU-only)
OpenClaw/LLM	2470ms	2000ms	⚠️
TTS first chunk	1630ms	300ms	❌ (CPU-only?)
Total	~5.5s	~3.5s	⚠️

With GPU PyTorch (estimated):

Stage	With CUDA	Improvement
STT	~350ms	1.6x faster
TTS	~900ms	1.8x faster
Total	~4.5s	1.2x faster

Still short of the 3.5s target, but closer. Kani-TTS-2 could bridge the gap if Windows support improves.
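The totals can be sanity-checked by summing the per-stage figures from the two tables. The sketch below does that and lists which stages remain over target. It assumes the VAD and LLM stages are unchanged by the PyTorch fix, since the estimate table only lists STT and TTS; the dictionary keys are illustrative names.

```python
# Per-stage latencies in milliseconds, copied from the tables above
current_ms = {"vad": 800, "stt": 550, "llm": 2470, "tts_first_chunk": 1630}
target_ms = {"vad": 800, "stt": 300, "llm": 2000, "tts_first_chunk": 300}
# Assumption: only STT and TTS change with GPU PyTorch
estimated_gpu_ms = {**current_ms, "stt": 350, "tts_first_chunk": 900}


def total_seconds(stages: dict) -> float:
    """Sum stage latencies and convert to seconds."""
    return sum(stages.values()) / 1000


def over_budget(stages: dict, targets: dict) -> dict:
    """Milliseconds by which each stage exceeds its target (only stages that do)."""
    return {
        name: stages[name] - targets[name]
        for name in stages
        if stages[name] > targets[name]
    }


if __name__ == "__main__":
    for label, stages in (
        ("current", current_ms),
        ("estimated with GPU", estimated_gpu_ms),
        ("target", target_ms),
    ):
        print(f"{label}: {total_seconds(stages):.2f}s")
    print("over target with GPU:", over_budget(estimated_gpu_ms, target_ms))
```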

Next Steps

✅ Fix PyTorch CUDA (see Option 1 above)
🔄 Re-benchmark current system with GPU acceleration
📊 Measure actual improvement in TTS latency
🔍 Evaluate whether the re-benchmarked total latency is acceptable against the 3.5s target
🕐 Monitor Kani-TTS-2 for Windows support
🧪 Test Piper TTS as lightweight alternative
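For the re-benchmarking steps above, time-to-first-chunk is the figure that corresponds to the "TTS first chunk" row in the baseline table. The sketch below times it for any callable that returns an iterable of audio chunks. `fake_tts_stream` is a hypothetical stand-in; the real benchmark would pass a function that wraps the engine under test.

```python
import time
from statistics import mean
from typing import Callable, Iterable


def first_chunk_latency(stream_factory: Callable[[], Iterable[bytes]]) -> float:
    """Seconds from starting synthesis until the first audio chunk arrives."""
    start = time.perf_counter()
    for _chunk in stream_factory():
        return time.perf_counter() - start
    raise RuntimeError("stream produced no chunks")


def benchmark(stream_factory: Callable[[], Iterable[bytes]],
              runs: int = 5, warmup: int = 1) -> dict:
    """Run warmup iterations (discarded), then collect latency samples."""
    for _ in range(warmup):
        first_chunk_latency(stream_factory)
    samples = [first_chunk_latency(stream_factory) for _ in range(runs)]
    return {
        "runs": runs,
        "mean_s": mean(samples),
        "min_s": min(samples),
        "max_s": max(samples),
    }


def fake_tts_stream():
    """Stand-in generator that simulates a streaming TTS engine (hypothetical)."""
    time.sleep(0.01)
    yield b"\x00" * 320
    time.sleep(0.01)
    yield b"\x00" * 320


if __name__ == "__main__":
    print(benchmark(fake_tts_stream, runs=5, warmup=1))
```

The warmup run matters for GPU measurements in particular, since the first call typically includes one-off model loading and kernel initialization.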

Conclusion: Kani-TTS-2 shows promise (3-4x faster) but Windows compatibility issues prevent testing. Immediate priority should be fixing PyTorch CUDA to improve current system performance, then revisit Kani-TTS-2 when Windows support improves or via WSL2.
