This project implements a high-performance, real-time streaming TTS system built on top of the Zonos-v0.1 model. Our implementation delivers low-latency speech synthesis with natural phrasing and intonation, optimized specifically for NVIDIA GH200 GPUs.
- Real-time streaming: Audio is generated and delivered in chunks for immediate playback
- Ultra-low latency: First audio output in ~500ms, with continuous natural-sounding speech
- Natural speech patterns: Smart text segmentation for conversational flow
- Web interface: Interactive HTML/JS player with WebAudio API integration
- Voice cloning: Use any voice sample for personalized speech synthesis
- GH200 optimizations: Special optimizations for the NVIDIA Grace Hopper architecture
The WebSocket-based streaming server (`streaming_server.py`) intelligently processes text inputs and delivers audio in optimized chunks:
```python
import asyncio

class StreamingTTSSession:
    # Manages an individual client connection with concurrent generation control
    MAX_CONCURRENT_GENERATIONS = 3
    generation_semaphore = asyncio.Semaphore(MAX_CONCURRENT_GENERATIONS)

    async def add_text(self, text):
        """Add text to the buffer and trigger generation if not already running."""
        self.text_buffer += text
        # Only start generation if we have a complete sentence and are not already generating
        if not self.is_generating and self.has_complete_sentence():
            self.is_generating = True
            self.generation_task = asyncio.create_task(self.generate_from_buffer())
```
The server implements:
- Intelligent text chunking for natural phrase boundaries (sketched after this list)
- Dynamic buffer management for smooth audio delivery
- Advanced audio processing with crossfades between chunks
- Efficient tensor-to-WAV conversion for streaming
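To illustrate the chunking logic, here is a minimal sketch of sentence-boundary detection; the regex and this standalone `has_complete_sentence` helper are hypothetical stand-ins for the actual implementation:

```python
import re

# Hypothetical sketch: treat the buffer as "complete" when it ends with
# sentence-final punctuation, optionally followed by closing quotes/brackets.
SENTENCE_END = re.compile(r"[.!?]['\")\]\s]*$")

def has_complete_sentence(text_buffer: str) -> bool:
    """Return True if the buffered text ends at a natural phrase boundary."""
    return bool(SENTENCE_END.search(text_buffer.strip()))
```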
One of the key features of our implementation is the advanced crossfade process that creates smooth transitions between audio chunks:
```python
import torch

def crossfade_chunks(previous_chunk, current_chunk, overlap_samples=1024):
    """Apply a linear crossfade between audio chunks to reduce discontinuities."""
    if previous_chunk is None:
        return current_chunk

    # Ensure the fade weights live on the same device as the audio
    device = current_chunk.device

    # Create crossfade weights: fade_in ramps 0 -> 1, fade_out ramps 1 -> 0
    fade_in = torch.linspace(0, 1, overlap_samples, device=device)
    fade_out = 1 - fade_in

    # Blend the overlapping region of the two chunks
    overlap_region = (previous_chunk[:, -overlap_samples:] * fade_out +
                      current_chunk[:, :overlap_samples] * fade_in)

    # Concatenate: previous chunk minus its tail, blended region, rest of current chunk
    result = torch.cat([previous_chunk[:, :-overlap_samples],
                        overlap_region,
                        current_chunk[:, overlap_samples:]], dim=1)
    return result
```
This crossfade function ensures seamless transitions between audio chunks, eliminating the audible "pops" or discontinuities that can occur with naive concatenation.
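A hypothetical usage pattern, stitching a stream of generated chunks into one waveform (`generated_chunks` is an illustrative placeholder for the generation loop's output):

```python
# Stitch a stream of (channels, samples) tensors together with crossfades
stitched = None
for chunk in generated_chunks:
    stitched = crossfade_chunks(stitched, chunk)
```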
The WebSocket handler manages connections and data flow:
```python
import json
import websockets

async def handle_websocket(websocket):
    try:
        session = StreamingTTSSession(websocket, "en-us")
        # ... the session consumes incoming text and streams audio back ...
        # Binary audio data is sent as raw chunks
        await websocket.send(audio_bytes)
        # Metadata and control messages are sent as JSON
        await websocket.send(json.dumps({"type": "metadata", "sampling_rate": 44100}))
    except websockets.exceptions.ConnectionClosed:
        print("Connection closed")
```
Our implementation includes specific optimizations for NVIDIA Grace Hopper (GH200) GPUs:
- BFloat16 precision: Using BF16 for optimal performance on GH200
- Tensor Cores utilization: TF32 math operations for matrix multiplications
- CUDA graphs: Pre-compiled CUDA graphs for repetitive operations
- Dynamic chunk scheduling: Optimized chunk sizes for GH200 memory architecture
- Memory pre-allocation: Strategic memory allocation to reduce fragmentation
- Stream-based processing: Parallel CUDA streams for audio processing and conversion
- JIT compilation: Critical paths are JIT-compiled for maximum throughput
These optimizations result in significantly faster inference, lower latency, and higher overall throughput on GH200 hardware; the precision settings from the first two items are sketched below.
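A minimal sketch of how those precision settings are typically enabled in PyTorch (illustrative only, not the project's exact startup code; assumes `model` is already loaded):

```python
import torch

# Allow TF32 math on Tensor Cores for matrix multiplications
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Run the model in BFloat16 on the GPU
model = model.to(device="cuda", dtype=torch.bfloat16)
```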
- Clone the repository:

  ```bash
  git clone https://github.com/Zyphra/Zonos.git
  cd Zonos
  ```
- Choose your installation method:

  Using Docker (recommended):

  ```bash
  # For the Gradio interface
  docker compose up

  # For development
  docker build -t zonos .
  docker run -it --gpus=all --net=host -v /path/to/Zonos:/Zonos -t zonos
  cd /Zonos
  python sample.py  # Generates a sample.wav in /Zonos
  ```

  Manual installation:

  ```bash
  # Install espeak-ng (a system package required by the phonemizer, not a pip package)
  sudo apt install -y espeak-ng

  # Install the core Python dependencies (not an exhaustive list)
  pip install websockets torch torchaudio numpy kanjize phonemizer

  # Start the WebSocket server
  python streaming_server.py
  ```
- Starting the server:

  ```bash
  python streaming_server.py
  # The server listens on 0.0.0.0:8765 by default
  # You should see output showing that the model is loading
  ```
- Starting the client:

  ```bash
  # Open tts_player.html in your browser,
  # using a simple HTTP server if needed:
  python -m http.server 8000
  # Then navigate to http://localhost:8000/tts_player.html
  ```
- Connect to the TTS server:
  - In the web interface, click "Connect to TTS Server"
  - The default connection string is `ws://localhost:8765`
  - Start entering text, or try the demo button
  - You should see audio chunks arriving and playing in real time
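For quick testing outside the browser, a minimal Python client might look like this (a hypothetical sketch, assuming the server accepts plain-text messages and streams raw audio bytes, as in the handler above):

```python
import asyncio
import json
import websockets

async def main():
    async with websockets.connect("ws://localhost:8765") as ws:
        await ws.send("Hello from the streaming TTS client!")
        with open("out.raw", "wb") as f:
            # Runs until the server closes the connection
            async for message in ws:
                if isinstance(message, bytes):   # raw audio chunk
                    f.write(message)
                else:                            # JSON metadata / control message
                    print(json.loads(message))

asyncio.run(main())
```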
To connect to a remote server securely, use SSH tunneling:
```bash
# Basic SSH tunnel
ssh -N -L 8765:localhost:8765 user@remote-server

# More robust SSH tunnel with keep-alive settings
ssh -N -L 8765:localhost:8765 user@remote-server -o ServerAliveInterval=60 -o ExitOnForwardFailure=yes

# With compression for lower bandwidth
ssh -N -C -L 8765:localhost:8765 user@remote-server -o ServerAliveInterval=60

# Background process (note the trailing &)
ssh -N -L 8765:localhost:8765 user@remote-server -o ServerAliveInterval=60 &
```
This creates a secure tunnel where:
- Local port 8765 forwards to the remote server's localhost:8765
- `-N` prevents executing a remote command (tunnel only)
- `-L` specifies the port forwarding
- `ServerAliveInterval` keeps the connection active
After setting up the tunnel, connect to `ws://localhost:8765` in the client interface to access the remote server.
To deploy on a Lambda Cloud instance, copy your local project files over with `scp`:

```bash
scp -r /path/to/local/directory user@lambda-instance-ip:/path/on/remote/server
```
This step is critical: directly cloning the repository on Lambda Cloud instances has proven unreliable with this project (the root cause is unknown), so always copy the files over with `scp` as shown above.
Our system is specifically optimized for deployment on Lambda Cloud's GH200 instances:
- Docker setup:

  ```bash
  docker build -t zonos-streaming -f Dockerfile.streaming .
  docker run -p 8765:8765 --gpus all zonos-streaming
  ```
- LambdaLabs GPU configuration (environment settings are sketched after this list):
  - Provision a GH200 GPU instance (24GB+ VRAM recommended)
  - Configure `CUDA_VISIBLE_DEVICES=0` for single-GPU inference
  - Set `TORCH_CUDA_ARCH_LIST="9.0"` for the GH200 (Hopper) architecture
  - Enable persistent storage for model and voice sample caching
  - Configure automatic restart via systemd or Docker restart policies
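A minimal sketch of applying those settings in Python at startup (illustrative only; in practice these are usually exported in the shell or a systemd unit, and all three environment variables must be set before `torch` initializes CUDA):

```python
import os

# Must be set before importing torch / initializing CUDA
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["TORCH_CUDA_ARCH_LIST"] = "9.0"
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch

# Cap this process's share of GPU memory (see the monitoring step below)
torch.cuda.set_per_process_memory_fraction(0.98)
```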
- Monitoring and optimization:
  - Configure NVIDIA SMI monitoring: `nvidia-smi dmon -s pucvmet -o TD`
  - Set memory fractions: `torch.cuda.set_per_process_memory_fraction(0.98)`
  - Track tensor allocation: `PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128`
  - Performance metrics collection: RTF (real-time factor), latency, and throughput (an RTF helper is sketched below)
  - Automatic alerts via the Lambda Cloud monitoring API
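Since RTF is the headline metric, here is a hypothetical helper for measuring it (the `generate_fn` callable and default sampling rate are illustrative assumptions):

```python
import time

def measure_rtf(generate_fn, sampling_rate=44100):
    """Real-time factor: generation time / audio duration (< 1.0 is faster than real time)."""
    start = time.perf_counter()
    audio = generate_fn()                      # returns a (channels, samples) tensor
    elapsed = time.perf_counter() - start
    audio_seconds = audio.shape[-1] / sampling_rate
    return elapsed / audio_seconds
```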
Zonos follows a straightforward architecture: text normalization and phonemization via eSpeak, followed by DAC token prediction through a transformer or hybrid backbone.
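For reference, non-streaming synthesis with Zonos follows this shape (adapted from the upstream Zonos README; API names may differ across versions):

```python
import torchaudio
from zonos.model import Zonos
from zonos.conditioning import make_cond_dict

# Load the transformer backbone (the hybrid variant loads the same way)
model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device="cuda")

# Voice cloning: build a speaker embedding from any reference sample
wav, sampling_rate = torchaudio.load("assets/exampleaudio.mp3")
speaker = model.make_speaker_embedding(wav, sampling_rate)

# Condition on text + speaker, generate DAC tokens, and decode to audio
cond_dict = make_cond_dict(text="Hello, world!", speaker=speaker, language="en-us")
conditioning = model.prepare_conditioning(cond_dict)
codes = model.generate(conditioning)
wavs = model.autoencoder.decode(codes).cpu()
torchaudio.save("sample.wav", wavs[0], model.autoencoder.sampling_rate)
```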