
Update on model development #26

Open
ClarissaGazalaEvanthe opened this issue Jul 31, 2023 · 11 comments
Comments

@ClarissaGazalaEvanthe

Please make a simple model for this test program that can be used immediately. I'm not very good at Python; sorry to bother you.

@PABannier
Owner

Hi @gwcangtip! The first model supported is the basic 24 kHz model presented by Suno in their demo. It should be available by the end of this week.

@planatscher

This will be really awesome. Can't wait to use it!

@ClarissaGazalaEvanthe
Author

> Hi @gwcangtip! The first model supported is the basic 24 kHz model presented by Suno in their demo. It should be available by the end of this week.

When will ready-to-use models be available?

@PABannier
Owner

Hi @gwcangtip! Thanks for your interest in the repo. I'm posting a quick update for anyone interested in bark.cpp. I've spent the past week cleaning up the repo and making sure the implementations of the 3 encoders were right. I have yet to integrate encodec.cpp (already implemented here) into bark.cpp. I'm on the final stretch of work this week.

@PABannier
Owner

PABannier commented Aug 11, 2023

Hello everyone! Quick update on the recent progress made in the last week.

All the components (the 3 encoders and the codec model) are now implemented and working. The end-to-end pipeline works fine, and I do obtain high-quality audio as output. That said, I have spotted 2 remaining bugs (one in the tokenizer, one in the fine encoder) which make the model produce nonsense for some inputs. After fixing these two bugs, we should have a first working version of bark.

Regarding performance, the model takes 17 seconds on my MacBook Pro M2 to generate 2 seconds of audio. There are still a lot of improvements to be made to the codebase (eliminating unnecessary memory copies, for instance). Furthermore, I expect a significant improvement in speed once we support mixed precision and quantization. We have a dedicated issue (#46) for benchmarks, and I'll publish them in the README once the aforementioned bugs are fixed.
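The quantization mentioned above can be illustrated with a minimal sketch. The block below shows symmetric 8-bit block quantization with a per-block scale, similar in spirit to (but not identical to) ggml's quantized formats; the `BlockQ8` layout and block size of 32 are illustrative assumptions, not bark.cpp's actual scheme.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Illustrative sketch: symmetric 8-bit block quantization with a per-block
// scale. Each block of 32 floats is stored as one float scale + 32 int8s,
// shrinking memory roughly 4x versus float32.
struct BlockQ8 {
    float scale;   // dequantize: x ≈ q * scale
    int8_t q[32];  // 32 quantized values per block
};

static std::vector<BlockQ8> quantize(const std::vector<float>& x) {
    assert(x.size() % 32 == 0);
    std::vector<BlockQ8> out(x.size() / 32);
    for (size_t b = 0; b < out.size(); ++b) {
        float amax = 0.0f;  // largest magnitude in this block sets the scale
        for (int i = 0; i < 32; ++i)
            amax = std::max(amax, std::fabs(x[b * 32 + i]));
        out[b].scale = amax / 127.0f;
        for (int i = 0; i < 32; ++i)
            out[b].q[i] = out[b].scale > 0.0f
                ? (int8_t)std::lround(x[b * 32 + i] / out[b].scale)
                : (int8_t)0;
    }
    return out;
}

static std::vector<float> dequantize(const std::vector<BlockQ8>& blocks) {
    std::vector<float> out;
    for (const BlockQ8& b : blocks)
        for (int i = 0; i < 32; ++i)
            out.push_back(b.q[i] * b.scale);
    return out;
}
```

The round-trip error per value is bounded by half the block scale, which is why quantization trades a small accuracy loss for large memory and bandwidth savings.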

@PABannier changed the title from "Can this be used already? how to get models?" to "Update on model development" on Aug 11, 2023
@kskelm

kskelm commented Aug 11, 2023

Thanks for all the work you've put into this, @PABannier ! I can't wait to see this evolve as it gets more efficient.

@jzeiber
Contributor

jzeiber commented Aug 12, 2023

> Regarding performance, the model takes 17 seconds on my MacBook Pro M2 to generate 2 seconds of audio.

On a Ryzen 3600 using 6 threads, I see about 2 minutes for the "this is an audio" prompt. That's with AVX2 enabled for GGML. I tried with OpenBLAS, but that was even slower. I'm not sure why it's so slow.

Also, what needs to be done to reuse the model for subsequent calls to bark_generate_audio? I can call bark_generate_audio in a loop with the already-loaded model, but after 5 or so calls it crashes because it can't allocate any more memory. I'm not sure what needs to be cleaned up between calls.

I've also tried different seed values, and most of them sound terrible or are not spoken audio at all.

@PABannier
Owner

Hi @jzeiber! Thanks for the info. As for the nonsense output: I have yet to fix a bug in the fine encoder, which is why we get poor output for most prompts.

As for memory allocation, have you tried re-creating a GGML context for each model every time you generate a prompt?

As for speed, I'm sure there are some memory leaks or unnecessary copies that I'll need to track down. But first I'm focusing on fixing the aforementioned bug in the fine encoder.
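The pattern suggested above can be sketched in a few lines. The block below is a hypothetical illustration, not bark.cpp's actual API: the `Arena` type and `generate_once` function are stand-ins showing why giving each generation its own short-lived scratch context (as one would with a per-call ggml context) prevents intermediate allocations from piling up across calls.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of per-generation scratch memory. Names are
// illustrative, not bark.cpp's API: the model weights would live in a
// long-lived context, while each call builds its intermediates in a fresh
// arena that is destroyed when the call returns.
struct Arena {
    std::vector<unsigned char> buf;
    size_t used = 0;
    explicit Arena(size_t cap) : buf(cap) {}
    void* alloc(size_t n) {
        if (used + n > buf.size()) return nullptr;  // out of scratch memory
        void* p = buf.data() + used;
        used += n;
        return p;
    }
};

// One generation: all intermediates come from a freshly created arena.
static bool generate_once(size_t work_bytes) {
    Arena scratch(16 * 1024);  // recreated per call, like a per-call ggml ctx
    return scratch.alloc(work_bytes) != nullptr;
}  // arena destroyed here: every byte of per-call memory is released
```

Calling `generate_once` in a loop never exhausts memory, whereas a single long-lived arena reused across calls would fill up after a few generations, which matches the crash described above.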

@jzeiber
Contributor

jzeiber commented Aug 12, 2023

> Hi @jzeiber! Thanks for the info. As for the nonsense output: I have yet to fix a bug in the fine encoder, which is why we get poor output for most prompts.

Alright, that makes sense. It was just quite curious that seed value 0 seemed to work best across different prompts. I'm not sure what's special about that seed.

> As for memory allocation, have you tried re-creating a GGML context for each model every time you generate a prompt?

I haven't tried that. I was trying to avoid reloading the entire model each time, but if I can just recreate the model contexts each time, that should work.

> As for speed, I'm sure there are some memory leaks or unnecessary copies that I'll need to track down. But first I'm focusing on fixing the aforementioned bug in the fine encoder.

Yes, that sounds good. Get the basics done first to get good output, then improve what's there. Great work so far!

@PABannier
Owner

PABannier commented Aug 14, 2023

Quick update: I wrote 3 unit tests comparing the output of the fine encoder against the original bark implementation.

```
./data/fine/test_fine_1.bin
run_test_on_codes : failed test
       abs_tol=0.0100, rel_tol=0.0100, abs max viol=0.0917, viol=80.0%
   TEST 1 FAILED.
./data/fine/test_fine_2.bin
run_test_on_codes : failed test
       abs_tol=0.0100, rel_tol=0.0100, abs max viol=89.0242, viol=100.0%
   TEST 2 FAILED.
./data/fine/test_fine_3.bin
run_test_on_codes : failed test
       abs_tol=0.0100, rel_tol=0.0100, abs max viol=0.1022, viol=89.4%
   TEST 3 FAILED.
```

All tests are currently failing, meaning that the fine encoder is not correctly implemented. More interestingly, the absolute difference in the fine encoder's logits is small for some token sequences (e.g. test 1, with an abs max violation of only 0.0917). In practice, this produces noisy output or missing words in the generated audio. However, when the difference is large (e.g. test 2, with an abs max violation of 89.0242), the model spews out nonsense.
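The tolerance check behind the log above can be expressed roughly as follows. This is a sketch, not the repo's actual test harness: the `compare_logits` function and `CompareResult` struct are illustrative, assuming an allclose-style rule where a value is a violation when its absolute difference exceeds `abs_tol + rel_tol * |ref|`.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of an allclose-style comparison (illustrative, not the repo's exact
// harness). Reports the maximum absolute difference among violating entries
// and the fraction of entries that violate the combined tolerance.
struct CompareResult {
    float abs_max_viol;  // largest |ours - ref| among violations
    float viol_frac;     // fraction of entries violating the tolerance
};

static CompareResult compare_logits(const std::vector<float>& ours,
                                    const std::vector<float>& ref,
                                    float abs_tol, float rel_tol) {
    CompareResult r{0.0f, 0.0f};
    size_t n_viol = 0;
    for (size_t i = 0; i < ref.size(); ++i) {
        float diff = std::fabs(ours[i] - ref[i]);
        // A value passes if it is within the absolute OR relative tolerance.
        if (diff > abs_tol + rel_tol * std::fabs(ref[i])) {
            ++n_viol;
            if (diff > r.abs_max_viol) r.abs_max_viol = diff;
        }
    }
    r.viol_frac = ref.empty() ? 0.0f : (float)n_viol / (float)ref.size();
    return r;
}
```

Under this reading, test 1's small abs max violation with an 80% violation rate suggests a systematic but mild numerical drift, while test 2's huge violation points at a structural bug rather than precision loss.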

After investigation, the bug is in the non-causal self-attention block. Although the queries and keys are identical, KQ is completely different from q @ k.transpose(-2, -1) and is full of near-zero values. This is strange: I've checked the dimensions, the strides (making the key and query tensors contiguous did not change anything), and obviously the coefficients, as stated previously.
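A naive reference for those attention scores makes the symptom easy to check. The block below is a sketch (not bark.cpp's code) of the unmasked score matrix S = Q·Kᵀ; when Q and K are identical, S must be symmetric and each diagonal entry equals ‖q_i‖², so a score matrix full of near-zero values is immediately suspicious.

```cpp
#include <vector>

// Naive reference for non-causal attention scores S = Q * K^T (no mask).
// Q and K are n x d row-major matrices; the result S is n x n.
static std::vector<float> qk_scores(const std::vector<float>& Q,
                                    const std::vector<float>& K,
                                    int n, int d) {
    std::vector<float> S(n * n, 0.0f);
    for (int i = 0; i < n; ++i)          // each query row
        for (int j = 0; j < n; ++j)      // against each key row
            for (int t = 0; t < d; ++t)  // dot product over head dim
                S[i * n + j] += Q[i * d + t] * K[j * d + t];
    return S;
}
```

With Q == K, asserting symmetry (S[i][j] == S[j][i]) and positive diagonals (S[i][i] > 0 for nonzero rows) against the ggml output is a cheap way to localize this kind of bug.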

Pinging @jzeiber @Green-Sky @kskelm @jmtatsch as they are following the updates on the model development.

@PABannier
Owner

For those interested in Bark,

We now have a first stable working version of bark.cpp that supports quantization, thanks to #139!
Make sure to pull the latest versions of Encodec and Bark by following the instructions.

Feel free to send me any feedback :)
