
Streaming Support #49

Open · wants to merge 7 commits into main
Conversation

@uetuluk uetuluk commented Feb 12, 2025

This pull request introduces a new feature for streaming audio generation using the Zonos model. The changes include updates to the sample code, a new streaming sample script, and modifications to the Zonos model to support streaming.

New Feature: Streaming Audio Generation

  • sample.py: Updated the text prompt to "Hello, world! This is a test of streaming generation from Zonos." and assigned it to the text variable before creating the conditioning dictionary.

  • streaming-sample.py: Added a new script to demonstrate streaming audio generation. This script includes loading the model, preparing conditioning, and generating audio in chunks, which are then saved to a WAV file.
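The PR describes the new script but does not show its code, so here is only a hedged sketch of what consuming a chunked audio generator and saving the result to a WAV file could look like. The `save_stream` helper, the sample rate, and the synthetic chunks standing in for `model.stream(...)` are all assumptions for illustration, not the branch's actual API.

```python
# Hypothetical sketch: consume a generator of float32 audio chunks and
# append them to a single 16-bit mono WAV file. The generator here is a
# stand-in for whatever the PR's model.stream(...) yields.
import wave

import numpy as np

SAMPLE_RATE = 44100  # assumed output rate, not confirmed by the PR


def save_stream(chunks, path="stream.wav", rate=SAMPLE_RATE):
    """Write each float32 chunk in [-1, 1] to a 16-bit mono WAV file."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 16-bit PCM
        wf.setframerate(rate)
        for chunk in chunks:
            pcm = (np.clip(chunk, -1.0, 1.0) * 32767).astype(np.int16)
            wf.writeframes(pcm.tobytes())


# Synthetic chunks standing in for model.stream(cond_dict):
fake_chunks = (np.zeros(1024, dtype=np.float32) for _ in range(4))
save_stream(fake_chunks, "stream.wav")
```

In a real client you would play or forward each chunk as it arrives instead of (or in addition to) writing it to disk.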

Model Enhancements

  • zonos/model.py: Added the stream method to the Zonos model for streaming audio generation. This method generates audio in chunks and yields them as they are produced.

Additional Imports

  • zonos/model.py: Added the Generator type from the typing module and the numpy library to support the new streaming functionality.
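Putting the two bullets together, the shape of such a `stream` method is a generator that yields decoded audio chunks as they are produced rather than returning one final array. The toy class below is only a sketch under that assumption; the actual Zonos code generation and DAC decoding are replaced with a sine-wave stub.

```python
# Hedged sketch of a generator-style stream method. Everything inside
# _decode_next is a stand-in for decoding one window of generated codes.
from typing import Generator

import numpy as np


class ToyStreamer:
    chunk_frames = 1024

    def _decode_next(self, step: int) -> np.ndarray:
        # Stub: produce one chunk of a 220 Hz tone instead of real audio.
        t = np.arange(self.chunk_frames) + step * self.chunk_frames
        return np.sin(2 * np.pi * 220 * t / 44100).astype(np.float32)

    def stream(self, steps: int = 4) -> Generator[np.ndarray, None, None]:
        """Yield audio chunks incrementally instead of one final array."""
        for step in range(steps):
            yield self._decode_next(step)


chunks = list(ToyStreamer().stream(steps=3))
```

The generator form is what lets a caller start playback after the first chunk instead of waiting for the whole utterance.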

@uetuluk uetuluk mentioned this pull request Feb 12, 2025
@nestor-rod

This is a great feature. Are there plans to review it?


constan1 commented Feb 14, 2025

This is great. Unfortunately, I can't test it due to having an older GPU (1080Ti).

Are there any plans to make this work with batches? Maybe return (B, sampling_rate, audio_chunk), batching different audio responses from different texts across different speakers? model.prepare_conditioning would also have to be set up for it. Would it at least theoretically be feasible?

Author

uetuluk commented Feb 15, 2025

There is a random clicking noise at the seams of the chunks, but I cannot figure out the cause.

Parler-TTS streaming also uses DAC, so this approach should work.

I would be glad if someone smarter takes a look.

@constan1 That sounds like a good idea. I don't understand the code well enough yet to determine if it will work, but I can give it a try. I'm not sure it will actually improve speed, though; you might be better off just running multiple instances at the same time.

Author

uetuluk commented Feb 15, 2025

The audio problem is still there after the updates from the devs.

It seems batching is not supported in the generate function yet.


grocedj commented Feb 18, 2025

It sounds like a zero-crossing issue.
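Clicks at chunk seams are consistent with amplitude discontinuities at the boundaries. A common general mitigation (not something this PR implements) is a short linear cross-fade where consecutive chunks overlap; the sketch below assumes each chunk is at least `fade` samples long.

```python
# Generic click-mitigation sketch: overlap-add consecutive chunks with a
# short linear cross-fade so the seam has no amplitude discontinuity.
import numpy as np


def crossfade_chunks(chunks, fade: int = 256):
    """Merge chunks, blending `fade` samples at each seam."""
    out = None
    ramp = np.linspace(0.0, 1.0, fade, dtype=np.float32)
    for chunk in chunks:
        chunk = np.asarray(chunk, dtype=np.float32)
        if out is None:
            out = chunk.copy()
        else:
            # Fade the tail of what we have into the head of the new chunk.
            out[-fade:] = out[-fade:] * (1.0 - ramp) + chunk[:fade] * ramp
            out = np.concatenate([out, chunk[fade:]])
    return out


merged = crossfade_chunks([np.ones(1000, np.float32),
                           np.ones(1000, np.float32)])
```

Note the trade-off: cross-fading consumes `fade` samples per seam, so the model would need to generate chunks with a small overlap for this to preserve total duration.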


SreevaatsavB commented Feb 21, 2025

I have tested this model on a few phrases, and it takes around 2-4 seconds to get an output WAV for a sentence of 10-15 words (tested on an RTX 4090). For smaller phrases, the latency was roughly <=1 second.

(Note: these observations are for audio generation with the streaming path enabled.)

Coming to streaming, the clicking sounds are still present, but my main doubt is this: if the model takes around 2-4 seconds for a sentence input, I don't think we can ensure a real-time experience, since there will be some latency before the first phrase's output. Subsequent phrases can then be processed and streamed to the client, but each turn may add some latency depending on the sentence length.

Is there any way to solve this?

I tried splitting the text into words, but the output was inconsistent: each chunk had a different tone and didn't sound natural.
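Word-level splitting tends to break prosody; splitting at sentence boundaries is a lighter-weight compromise that keeps each generation request self-contained. This is a generic sketch, not part of the PR, and the regex is a deliberately simple assumption.

```python
# Sketch: split input text at sentence boundaries so each piece can be
# generated (and its chunks streamed) while later sentences are queued.
import re


def split_sentences(text: str) -> list[str]:
    """Split on sentence-ending punctuation, keeping the punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


sents = split_sentences("Hello there. How are you? Fine!")
# Each sentence would then be fed to the model in turn, so the first
# sentence's audio can play while the rest are still generating.
```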

Author

uetuluk commented Feb 21, 2025

Did you try my streaming-sample.py?

You need to use something like Gradio to actually see the streaming happening.


SreevaatsavB commented Feb 21, 2025

Yeah, I did use streaming-sample.py. For smaller text inputs it worked fine, but for some long texts there were clicking sounds in between, and the first audio output took some time.

5 participants