
Zonos-v0.1 is a leading open-weight text-to-speech model trained on more than 200k hours of varied multilingual speech, delivering expressiveness and quality on par with—or even surpassing—top TTS providers.


Real-Time Streaming TTS Implementation

This project implements a high-performance, real-time streaming TTS system built on top of the Zonos-v0.1 model. Our implementation delivers low-latency speech synthesis with natural phrasing and intonation, optimized specifically for NVIDIA GH200 GPUs.

Key Features

  • Real-time streaming: Audio is generated and delivered in chunks for immediate playback
  • Ultra-low latency: First audio output in ~500ms, with continuous natural-sounding speech
  • Natural speech patterns: Smart text segmentation for conversational flow
  • Web interface: Interactive HTML/JS player with WebAudio API integration
  • Voice cloning: Use any voice sample for personalized speech synthesis
  • GH200 optimizations: Special optimizations for the NVIDIA Grace Hopper architecture

System Architecture

Streaming Server

The WebSocket-based streaming server (streaming_server.py) intelligently processes text inputs and delivers audio in optimized chunks:

import asyncio

class StreamingTTSSession:
    """Manages an individual client connection with concurrent generation control."""

    # Cap how many sessions may generate audio at the same time
    MAX_CONCURRENT_GENERATIONS = 3
    generation_semaphore = asyncio.Semaphore(MAX_CONCURRENT_GENERATIONS)

    async def add_text(self, text):
        """Add text to the buffer and trigger generation if not already running."""
        self.text_buffer += text

        # Only start generating once the buffer holds a complete phrase
        # and no generation task is already in flight
        if not self.is_generating and self.has_complete_sentence():
            self.is_generating = True
            self.generation_task = asyncio.create_task(self.generate_from_buffer())

The server implements:

  • Intelligent text chunking for natural phrase boundaries (see the sketch after this list)
  • Dynamic buffer management for smooth audio delivery
  • Advanced audio processing with crossfades between chunks
  • Efficient tensor-to-WAV conversion for streaming
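
As a concrete illustration of the phrase-boundary check used by add_text() above, here is a minimal sketch of what a has_complete_sentence() helper could look like. The regex and minimum length are assumptions for illustration, not the project's actual heuristics:

import re

# Hypothetical phrase-boundary heuristic; the real server's rules may differ
SENTENCE_END = re.compile(r'[.!?;:]\s*$')

def has_complete_sentence(text_buffer, min_chars=20):
    """Return True once the buffered text ends on a natural phrase boundary."""
    text = text_buffer.rstrip()
    return len(text) >= min_chars and bool(SENTENCE_END.search(text))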

Crossfade Implementation

One of the key features of our implementation is the advanced crossfade process that creates smooth transitions between audio chunks:

import torch

def crossfade_chunks(previous_chunk, current_chunk, overlap_samples=1024):
    """Apply a linear crossfade between audio chunks to reduce discontinuities."""
    if previous_chunk is None:
        return current_chunk

    # Guard against chunks shorter than the requested overlap
    overlap_samples = min(overlap_samples, previous_chunk.shape[1], current_chunk.shape[1])

    # Create crossfade weights on the same device as the audio
    device = current_chunk.device
    fade_in = torch.linspace(0, 1, overlap_samples, device=device)
    fade_out = 1 - fade_in

    # Blend the trailing samples of the previous chunk with the leading
    # samples of the current chunk
    overlap_region = (previous_chunk[:, -overlap_samples:] * fade_out +
                      current_chunk[:, :overlap_samples] * fade_in)

    # Stitch together: untouched head, blended overlap, untouched tail
    return torch.cat([previous_chunk[:, :-overlap_samples],
                      overlap_region,
                      current_chunk[:, overlap_samples:]], dim=1)

This crossfade function ensures seamless transitions between audio chunks, eliminating the audible "pops" or discontinuities that can occur with naive concatenation.
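
For reference, a quick toy use of the function above (tensor shapes are [channels, samples]; the 4096-sample chunks are arbitrary):

import torch

prev = torch.randn(1, 4096)  # previous audio chunk
cur = torch.randn(1, 4096)   # incoming audio chunk
merged = crossfade_chunks(prev, cur, overlap_samples=1024)

# The overlap is blended rather than appended, so 1024 samples are shared
assert merged.shape == (1, 4096 + 4096 - 1024)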

WebSocket Handler

The WebSocket handler manages connections and data flow:

import json

import websockets

async def handle_websocket(websocket):
    try:
        session = StreamingTTSSession(websocket, "en-us")

        # ... per-connection receive/generate loop elided; it feeds incoming
        # text to the session and streams results back as they are produced ...

        # Binary audio data is sent as raw chunks
        await websocket.send(audio_bytes)

        # Metadata and control messages are sent as JSON
        await websocket.send(json.dumps({"type": "metadata", "sampling_rate": 44100}))
    except websockets.exceptions.ConnectionClosed:
        print("Connection closed")

GH200 Optimizations

Our implementation includes specific optimizations for NVIDIA Grace Hopper (GH200) GPUs:

  • BFloat16 precision: Using BF16 for optimal performance on GH200
  • Tensor Cores utilization: TF32 math operations for matrix multiplications
  • CUDA graphs: Pre-compiled CUDA graphs for repetitive operations
  • Dynamic chunk scheduling: Optimized chunk sizes for GH200 memory architecture
  • Memory pre-allocation: Strategic memory allocation to reduce fragmentation
  • Stream-based processing: Parallel CUDA streams for audio processing and conversion
  • JIT compilation: Critical paths are JIT-compiled for maximum throughput

These optimizations result in significantly faster inference, lower latency, and better overall throughput on GH200 hardware.
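
As a rough sketch of the two precision-related bullets, this is how BF16 and TF32 are typically enabled in PyTorch; the stand-in module is illustrative, not the project's model-loading code:

import torch

# Allow TF32 on Tensor Cores for matmuls and cuDNN kernels
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Stand-in module; the real server loads the Zonos model here instead
model = torch.nn.Linear(16, 16).cuda().to(dtype=torch.bfloat16)

x = torch.randn(1, 16, device="cuda", dtype=torch.bfloat16)
with torch.inference_mode():
    y = model(x)  # runs in BF16 on the GPU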

Getting Started

Installation

  1. Clone the repository:

    git clone https://github.com/Zyphra/Zonos.git
    cd Zonos
  2. Choose your installation method:

    Using Docker (recommended):

    # For Gradio interface
    docker compose up
    
    # For development
    docker build -t zonos .
    docker run -it --gpus=all --net=host -v /path/to/Zonos:/Zonos -t zonos
    cd /Zonos
    python sample.py # Generates a sample.wav in /Zonos

    Manual installation:

    # Install the eSpeak system dependency (phonemizer requires it;
    # it is not a pip package)
    sudo apt install -y espeak-ng
    
    # Install the core Python dependencies, then the project itself
    # to pull in the remaining requirements
    pip install websockets torch torchaudio numpy kanjize phonemizer
    pip install -e .
  3. Starting the server:

    # Start the server
    python streaming_server.py
    
    # Server will run on 0.0.0.0:8765 by default
    # You should see output showing that the model is loading
  4. Starting the client:

    # Open tts_player.html in your browser
    # Using a simple HTTP server if needed:
    python -m http.server 8000
    # Then navigate to http://localhost:8000/tts_player.html
  5. Connect to the TTS server:

    • In the web interface, click "Connect to TTS Server"
    • The default connection string is "ws://localhost:8765"
    • Start entering text or try the demo button
    • You should see audio chunks arriving and playing in real-time

SSH Tunneling Setup

To connect to a remote server securely, use SSH tunneling:

# Basic SSH tunnel
ssh -N -L 8765:localhost:8765 user@remote-server

# More robust SSH tunnel with keep-alive settings
ssh -N -L 8765:localhost:8765 user@remote-server -o ServerAliveInterval=60 -o ExitOnForwardFailure=yes

# With compression for lower bandwidth
ssh -N -C -L 8765:localhost:8765 user@remote-server -o ServerAliveInterval=60

# Background process (add & at the end)
ssh -N -L 8765:localhost:8765 user@remote-server -o ServerAliveInterval=60 &

This creates a secure tunnel where:

  • Local port 8765 forwards to the remote server's localhost:8765
  • -N prevents executing a remote command (tunnel only)
  • -L specifies the port forwarding
  • ServerAliveInterval keeps the connection active

After setting up the tunnel, connect to ws://localhost:8765 in the client interface to access the remote server.
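
To sanity-check the tunnel without opening the web UI, a small Python client can connect through the forwarded port and print whatever arrives first. This helper is illustrative and assumes the server sends either JSON metadata or binary audio, as described above:

import asyncio
import websockets

async def check_tunnel():
    async with websockets.connect("ws://localhost:8765") as ws:
        msg = await ws.recv()
        if isinstance(msg, bytes):
            print(f"received {len(msg)} bytes of audio")
        else:
            print(f"received message: {msg}")

asyncio.run(check_tunnel())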

File Transfer to Lambda Cloud

⚠️ Important: transfer files directly with scp; do not git clone on the instance

scp -r /path/to/local/directory user@lambda-instance-ip:/path/on/remote/server

The scp command securely copies your local files to the Lambda Cloud instance. This step is critical: direct git cloning on Lambda Cloud instances has proven unreliable with this project, so copy the files over instead.

LambdaLabs Deployment

Our system is specifically optimized for deployment on Lambda Cloud's GH200 instances:

  1. Docker setup:

    docker build -t zonos-streaming -f Dockerfile.streaming .
    docker run -p 8765:8765 --gpus all zonos-streaming
  2. LambdaLabs GPU configuration:

    • Provision a GH200 GPU instance
    • Configure CUDA_VISIBLE_DEVICES=0 for single-GPU inference
    • Set TORCH_CUDA_ARCH_LIST="9.0" for the GH200 (Hopper) architecture
    • Enable persistent storage for model and voice sample caching
    • Configure automatic restart via systemd or Docker restart policies
  3. Monitoring and optimization:

    • Configure NVIDIA SMI monitoring: nvidia-smi dmon -s pucvmet -o TD
    • Set memory fractions: torch.cuda.set_per_process_memory_fraction(0.98)
    • Track tensor allocation: PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
    • Performance metrics collection: RTF (real-time factor), latency, and throughput
    • Automatic alerts via Lambda Cloud monitoring API
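
A minimal sketch of the memory-related settings from the list above; PYTORCH_CUDA_ALLOC_CONF must be set before CUDA is initialized, so the environment variable comes first:

import os

# Must be set before torch initializes CUDA
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch

# Reserve nearly the whole device for this process (value from the list above)
torch.cuda.set_per_process_memory_fraction(0.98, device=0)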

Model Architecture

Zonos follows a straightforward architecture: text normalization and phonemization via eSpeak, followed by DAC token prediction through a transformer or hybrid backbone.

[Zonos architecture diagram]
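
The first stage of that pipeline can be exercised on its own with the phonemizer package; this is a standalone illustration (the DAC token prediction stage is the model itself and is omitted), and it assumes espeak-ng is installed as a system package:

from phonemizer import phonemize

# eSpeak-backed phonemization, the front end of the Zonos pipeline
phonemes = phonemize("Hello, world!", language="en-us", backend="espeak")
print(phonemes)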
