
Zonos-v0.1 is a leading open-weight text-to-speech model trained on more than 200k hours of varied multilingual speech, delivering expressiveness and quality on par with—or even surpassing—top TTS providers.


Real-Time Streaming TTS Implementation

This project implements a high-performance, real-time streaming TTS system built on top of the Zonos-v0.1 model. Our implementation delivers low-latency speech synthesis with natural phrasing and intonation, optimized specifically for NVIDIA GH200 GPUs.

Key Features

  • Real-time streaming: Audio is generated and delivered in chunks for immediate playback
  • Ultra-low latency: First audio output in ~500ms, with continuous natural-sounding speech
  • Natural speech patterns: Smart text segmentation for conversational flow
  • Web interface: Interactive HTML/JS player with WebAudio API integration
  • Voice cloning: Use any voice sample for personalized speech synthesis
  • GH200 optimizations: Special optimizations for the NVIDIA Grace Hopper architecture

System Architecture

Streaming Server

The WebSocket-based streaming server (streaming_server.py) intelligently processes text inputs and delivers audio in optimized chunks:

import asyncio

class StreamingTTSSession:
    """Manages an individual client connection with concurrent generation control."""

    # Cap how many sessions may generate audio at the same time
    MAX_CONCURRENT_GENERATIONS = 3
    generation_semaphore = asyncio.Semaphore(MAX_CONCURRENT_GENERATIONS)

    async def add_text(self, text):
        """Add text to the buffer and trigger generation if not already running."""
        self.text_buffer += text

        # Only start generating once the buffer holds a complete phrase
        # and no generation task is already in flight
        if not self.is_generating and self.has_complete_sentence():
            self.is_generating = True
            self.generation_task = asyncio.create_task(self.generate_from_buffer())

The server implements:

  • Intelligent text chunking for natural phrase boundaries (see the sketch after this list)
  • Dynamic buffer management for smooth audio delivery
  • Advanced audio processing with crossfades between chunks
  • Efficient tensor-to-WAV conversion for streaming
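
As a concrete illustration of the phrase-boundary check used by add_text() above, here is a minimal sketch of what a has_complete_sentence() helper could look like. The regex and minimum length are assumptions for illustration, not the project's actual heuristics:

import re

# Hypothetical phrase-boundary heuristic; the real server's rules may differ
SENTENCE_END = re.compile(r'[.!?;:]\s*$')

def has_complete_sentence(text_buffer, min_chars=20):
    """Return True once the buffered text ends on a natural phrase boundary."""
    text = text_buffer.rstrip()
    return len(text) >= min_chars and bool(SENTENCE_END.search(text))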

Crossfade Implementation

One of the key features of our implementation is the advanced crossfade process that creates smooth transitions between audio chunks:

import torch

def crossfade_chunks(previous_chunk, current_chunk, overlap_samples=1024):
    """Apply a linear crossfade between audio chunks to reduce discontinuities."""
    if previous_chunk is None:
        return current_chunk

    # Guard against chunks shorter than the requested overlap
    overlap_samples = min(overlap_samples, previous_chunk.shape[1], current_chunk.shape[1])

    # Create crossfade weights on the same device as the audio
    device = current_chunk.device
    fade_in = torch.linspace(0, 1, overlap_samples, device=device)
    fade_out = 1 - fade_in

    # Blend the trailing samples of the previous chunk with the leading
    # samples of the current chunk
    overlap_region = (previous_chunk[:, -overlap_samples:] * fade_out +
                      current_chunk[:, :overlap_samples] * fade_in)

    # Stitch together: untouched head, blended overlap, untouched tail
    return torch.cat([previous_chunk[:, :-overlap_samples],
                      overlap_region,
                      current_chunk[:, overlap_samples:]], dim=1)

This crossfade function ensures seamless transitions between audio chunks, eliminating the audible "pops" or discontinuities that can occur with naive concatenation.
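
For reference, a quick toy use of the function above (tensor shapes are [channels, samples]; the 4096-sample chunks are arbitrary):

import torch

prev = torch.randn(1, 4096)  # previous audio chunk
cur = torch.randn(1, 4096)   # incoming audio chunk
merged = crossfade_chunks(prev, cur, overlap_samples=1024)

# The overlap is blended rather than appended, so 1024 samples are shared
assert merged.shape == (1, 4096 + 4096 - 1024)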

WebSocket Handler

The WebSocket handler manages connections and data flow:

import json

import websockets

async def handle_websocket(websocket):
    try:
        session = StreamingTTSSession(websocket, "en-us")

        # ... per-connection receive/generate loop elided; it feeds incoming
        # text to the session and streams results back as they are produced ...

        # Binary audio data is sent as raw chunks
        await websocket.send(audio_bytes)

        # Metadata and control messages are sent as JSON
        await websocket.send(json.dumps({"type": "metadata", "sampling_rate": 44100}))
    except websockets.exceptions.ConnectionClosed:
        print("Connection closed")

GH200 Optimizations

Our implementation includes specific optimizations for NVIDIA Grace Hopper (GH200) GPUs:

  • BFloat16 precision: Using BF16 for optimal performance on GH200
  • Tensor Cores utilization: TF32 math operations for matrix multiplications
  • CUDA graphs: Pre-compiled CUDA graphs for repetitive operations
  • Dynamic chunk scheduling: Optimized chunk sizes for GH200 memory architecture
  • Memory pre-allocation: Strategic memory allocation to reduce fragmentation
  • Stream-based processing: Parallel CUDA streams for audio processing and conversion
  • JIT compilation: Critical paths are JIT-compiled for maximum throughput

These optimizations result in significantly faster inference, lower latency, and better overall throughput on GH200 hardware.
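
As a rough sketch of the two precision-related bullets, this is how BF16 and TF32 are typically enabled in PyTorch; the stand-in module is illustrative, not the project's model-loading code:

import torch

# Allow TF32 on Tensor Cores for matmuls and cuDNN kernels
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Stand-in module; the real server loads the Zonos model here instead
model = torch.nn.Linear(16, 16).cuda().to(dtype=torch.bfloat16)

x = torch.randn(1, 16, device="cuda", dtype=torch.bfloat16)
with torch.inference_mode():
    y = model(x)  # runs in BF16 on the GPU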

Getting Started

Installation

  1. Clone the repository:

    git clone https://github.com/Zyphra/Zonos.git
    cd Zonos
  2. Choose your installation method:

    Using Docker (recommended):

    # For Gradio interface
    docker compose up
    
    # For development
    docker build -t zonos .
    docker run -it --gpus=all --net=host -v /path/to/Zonos:/Zonos -t zonos
    cd /Zonos
    python sample.py # Generates a sample.wav in /Zonos

    Manual installation:

    # Install the eSpeak system dependency (phonemizer requires it;
    # it is not a pip package)
    sudo apt install -y espeak-ng
    
    # Install the core Python dependencies, then the project itself
    # to pull in the remaining requirements
    pip install websockets torch torchaudio numpy kanjize phonemizer
    pip install -e .
  3. Starting the server:

    # Start the server
    python streaming_server.py
    
    # Server will run on 0.0.0.0:8765 by default
    # You should see output showing that the model is loading
  4. Starting the client:

    # Open tts_player.html in your browser
    # Using a simple HTTP server if needed:
    python -m http.server 8000
    # Then navigate to http://localhost:8000/tts_player.html
  5. Connect to the TTS server:

    • In the web interface, click "Connect to TTS Server"
    • The default connection string is "ws://localhost:8765"
    • Start entering text or try the demo button
    • You should see audio chunks arriving and playing in real-time

SSH Tunneling Setup

To connect to a remote server securely, use SSH tunneling:

# Basic SSH tunnel
ssh -N -L 8765:localhost:8765 user@remote-server

# More robust SSH tunnel with keep-alive settings
ssh -N -L 8765:localhost:8765 user@remote-server -o ServerAliveInterval=60 -o ExitOnForwardFailure=yes

# With compression for lower bandwidth
ssh -N -C -L 8765:localhost:8765 user@remote-server -o ServerAliveInterval=60

# Background process (add & at the end)
ssh -N -L 8765:localhost:8765 user@remote-server -o ServerAliveInterval=60 &

This creates a secure tunnel where:

  • Local port 8765 forwards to the remote server's localhost:8765
  • -N prevents executing a remote command (tunnel only)
  • -L specifies the port forwarding
  • ServerAliveInterval keeps the connection active

After setting up the tunnel, connect to ws://localhost:8765 in the client interface to access the remote server.
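
To sanity-check the tunnel without opening the web UI, a small Python client can connect through the forwarded port and print whatever arrives first. This helper is illustrative and assumes the server sends either JSON metadata or binary audio, as described above:

import asyncio
import websockets

async def check_tunnel():
    async with websockets.connect("ws://localhost:8765") as ws:
        msg = await ws.recv()
        if isinstance(msg, bytes):
            print(f"received {len(msg)} bytes of audio")
        else:
            print(f"received message: {msg}")

asyncio.run(check_tunnel())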

File Transfer to Lambda Cloud

⚠️ Important: transfer files directly with scp; do not git clone on the instance

scp -r /path/to/local/directory user@lambda-instance-ip:/path/on/remote/server

The scp command securely copies your local files to the Lambda Cloud instance. This step is critical: direct git cloning on Lambda Cloud instances has proven unreliable with this project, so copy the files over instead.

LambdaLabs Deployment

Our system is specifically optimized for deployment on Lambda Cloud's GH200 instances:

  1. Docker setup:

    docker build -t zonos-streaming -f Dockerfile.streaming .
    docker run -p 8765:8765 --gpus all zonos-streaming
  2. LambdaLabs GPU configuration:

    • Provision a GH200 GPU instance
    • Configure CUDA_VISIBLE_DEVICES=0 for single-GPU inference
    • Set TORCH_CUDA_ARCH_LIST="9.0" for the GH200 (Hopper) architecture
    • Enable persistent storage for model and voice sample caching
    • Configure automatic restart via systemd or Docker restart policies
  3. Monitoring and optimization:

    • Configure NVIDIA SMI monitoring: nvidia-smi dmon -s pucvmet -o TD
    • Set memory fractions: torch.cuda.set_per_process_memory_fraction(0.98)
    • Track tensor allocation: PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
    • Performance metrics collection: RTF (real-time factor), latency, and throughput
    • Automatic alerts via Lambda Cloud monitoring API
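
A minimal sketch of the memory-related settings from the list above; PYTORCH_CUDA_ALLOC_CONF must be set before CUDA is initialized, so the environment variable comes first:

import os

# Must be set before torch initializes CUDA
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch

# Reserve nearly the whole device for this process (value from the list above)
torch.cuda.set_per_process_memory_fraction(0.98, device=0)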

Model Architecture

Zonos follows a straightforward architecture: text normalization and phonemization via eSpeak, followed by DAC token prediction through a transformer or hybrid backbone.

[Zonos architecture diagram]
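
The first stage of that pipeline can be exercised on its own with the phonemizer package; this is a standalone illustration (the DAC token prediction stage is the model itself and is omitted), and it assumes espeak-ng is installed as a system package:

from phonemizer import phonemize

# eSpeak-backed phonemization, the front end of the Zonos pipeline
phonemes = phonemize("Hello, world!", language="en-us", backend="espeak")
print(phonemes)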
