Streaming Support #49
base: main
Conversation
This is a great feature. Are there plans to review it?
This is great. Unfortunately, I can't test it because I have an older GPU (1080Ti). Are there any plans to make this work with batches? Maybe return (B, sampling_rate, audio_chunk), batching audio responses for different texts across different speakers? The model.prepare_conditioning would also have to be set up for it. Would it at least theoretically be feasible?
There is a random clicking noise at the seams of the chunks, but I cannot figure out the cause. Parler-TTS streaming also uses DAC, so this approach should work. I would be glad if someone smarter took a look. @constan1 That sounds like a good idea. I don't understand the code well enough yet to determine whether it will work, but I can give it a try. I'm not sure it will actually improve speed, though; you might be better off just running multiple instances at the same time.
The audio problem is still there after the updates from the devs. It seems batch is not supported in
It sounds like a zero-crossings issue.
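A common remedy for clicks at chunk seams is to overlap consecutive chunks and blend them with a short crossfade, so the waveform never jumps discontinuously. The sketch below is illustrative only and not part of this PR; the function name and the overlap length are assumptions:

```python
import numpy as np

def crossfade_chunks(chunks, overlap=256):
    """Stitch 1-D audio chunks together, linearly crossfading the last
    `overlap` samples of the running output with the first `overlap`
    samples of each new chunk to avoid clicks at the seams."""
    fade_in = np.linspace(0.0, 1.0, overlap)
    fade_out = 1.0 - fade_in
    out = None
    for chunk in chunks:
        chunk = np.asarray(chunk, dtype=np.float64)
        if out is None:
            out = chunk
            continue
        # Blend the tail of the accumulated audio with the head of the new chunk.
        out[-overlap:] = out[-overlap:] * fade_out + chunk[:overlap] * fade_in
        out = np.concatenate([out, chunk[overlap:]])
    return out
```

Because the two fade ramps always sum to 1, a constant signal passes through unchanged; only the discontinuity at the boundary is smoothed.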
I have tested this model on a few phrases, and it takes around 2-4 seconds to get an output WAV for a sentence of 10-15 words (tested on an RTX 4090). For shorter phrases, the latency was roughly <=1 second. (Note: these observations include the streaming part.) Coming to the streaming: the clicking sounds are still present, but my main doubt is that if the model takes around 2-4 seconds per sentence, we can't ensure a real-time experience, since there will be some latency before the first phrase's output. Subsequent phrases can then be processed and streamed to the client, but each turn may incur some latency depending on the sentence length. Is there any way to solve this? I tried splitting the text into words, but the output was inconsistent: each chunk had a different tone and didn't sound natural.
Did you try my streaming-sample.py? You need to use something like gradio to actually see the streaming happening. |
Yeah, I did use streaming-sample.py. For smaller text inputs it worked fine, but for some long texts there were clicking sounds in between, and the first audio output took some time.
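For the latency concern above, splitting at sentence boundaries (rather than per word) keeps enough context per piece for a consistent tone, while letting the first sentence start streaming before the rest is generated. A rough sketch of the splitting step; the regex and function name are illustrative, not from this PR:

```python
import re

def split_into_sentences(text):
    """Split text at sentence-ending punctuation, keeping the delimiter,
    so each piece can be synthesized and streamed as soon as it is ready."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]
```

Each returned sentence would then be fed to the model in turn, with earlier audio streamed to the client while later sentences are still being generated.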
This pull request introduces a new feature for streaming audio generation using the Zonos model. The changes include updates to the sample code, a new streaming sample script, and modifications to the Zonos model to support streaming.
New Feature: Streaming Audio Generation
- sample.py: Updated the text prompt to "Hello, world! This is a test of streaming generation from Zonos." and assigned it to the text variable before creating the conditioning dictionary.
- streaming-sample.py: Added a new script to demonstrate streaming audio generation. This script includes loading the model, preparing conditioning, and generating audio in chunks, which are then saved to a WAV file.

Model Enhancements

- zonos/model.py: Added the stream method to the Zonos model for streaming audio generation. This method generates audio in chunks and yields them as they are produced.

Additional Imports

- zonos/model.py: Added the Generator type from the typing module and the numpy library to support the new streaming functionality.
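The "generate audio in chunks, then save to a WAV file" step described above can be sketched as follows using only the standard library's wave module. This is a sketch under assumptions, not the PR's actual code: it assumes the stream yields mono int16 NumPy chunks at a known sampling rate, and the function and parameter names are hypothetical:

```python
import wave
import numpy as np

def save_stream_to_wav(chunks, path, sampling_rate=44100):
    """Append streamed mono int16 audio chunks to a single WAV file
    as they are produced, instead of buffering the whole utterance."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)            # mono output
        wf.setsampwidth(2)            # int16 -> 2 bytes per sample
        wf.setframerate(sampling_rate)
        for chunk in chunks:
            wf.writeframes(np.asarray(chunk, dtype=np.int16).tobytes())
```

In a real client, the same loop could instead push each chunk to an audio sink (e.g. a Gradio streaming component, as suggested earlier in the thread) rather than a file.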