Add AudioLDM 2 (huggingface#4549)
* from audioldm

* unet down + mid

* vae, clap, flan-t5

* start sequence audio mae

* iterate on audioldm encoder

* finish encoder

* finish weight conversion

* text pre-processing

* gpt2 pre-processing

* fix projection model

* working

* unet equivalence

* finish in base

* add unet cond

* finish unet

* finish custom unet

* start clean-up

* revert base unet changes

* refactor pre-processing

* tests: from audioldm

* fix some tests

* more fixes

* iterate on tests

* make fix copies

* harden fast tests

* slow integration tests

* finish tests

* update checkpoint

* update copyright

* docs

* remove outdated method

* add docstring

* make style

* remove decode latents

* enable cpu offload

* (text_encoder_1, tokenizer_1) -> (text_encoder, tokenizer)

* more clean up

* more refactor

* build pr docs

* Update docs/source/en/api/pipelines/audioldm2.md

Co-authored-by: Sayak Paul <[email protected]>

* small clean

* tidy conversion

* update for large checkpoint

* generate -> generate_language_model

* full clap model

* shrink clap-audio in tests

* fix large integration test

* fix fast tests

* use generation config

* make style

* update docs

* finish docs

* finish doc

* update tests

* fix last test

* syntax

* finalise tests

* refactor projection model in prep for TTS

* fix fast tests

* style

---------

Co-authored-by: Sayak Paul <[email protected]>
sanchit-gandhi and sayakpaul authored Aug 21, 2023
1 parent 74d902e commit 7a24977
Showing 12 changed files with 4,350 additions and 0 deletions.
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -190,6 +190,8 @@
title: Audio Diffusion
- local: api/pipelines/audioldm
title: AudioLDM
- local: api/pipelines/audioldm2
title: AudioLDM 2
- local: api/pipelines/auto_pipeline
title: AutoPipeline
- local: api/pipelines/consistency_models
116 changes: 116 additions & 0 deletions docs/source/en/api/pipelines/audioldm2.md
@@ -0,0 +1,116 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# AudioLDM 2

AudioLDM 2 was proposed in [AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining](https://arxiv.org/abs/2308.05734)
by Haohe Liu et al. AudioLDM 2 takes a text prompt as input and predicts the corresponding audio. It can generate
text-conditional sound effects, human speech and music.

Inspired by [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview), AudioLDM 2
is a text-to-audio _latent diffusion model (LDM)_ that learns continuous audio representations from text embeddings. Two
text encoder models are used to compute the text embeddings from a prompt input: the text-branch of [CLAP](https://huggingface.co/docs/transformers/main/en/model_doc/clap)
and the encoder of [Flan-T5](https://huggingface.co/docs/transformers/main/en/model_doc/flan-t5). These text embeddings
are then projected to a shared embedding space by an [AudioLDM2ProjectionModel](https://huggingface.co/docs/diffusers/api/pipelines/audioldm2/AudioLDM2ProjectionModel).
A [GPT2](https://huggingface.co/docs/transformers/main/en/model_doc/gpt2) _language model (LM)_ is used to auto-regressively
predict eight new embedding vectors, conditional on the projected CLAP and Flan-T5 embeddings. The generated embedding
vectors and Flan-T5 text embeddings are used as cross-attention conditioning in the LDM. The [UNet](https://huggingface.co/docs/diffusers/api/pipelines/audioldm2/AudioLDM2UNet2DConditionModel)
of AudioLDM 2 is unique in that it takes **two** sets of cross-attention embeddings, as opposed to the single
cross-attention conditioning used in most other LDMs.
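
The stages described above map onto the sub-models of the `AudioLDM2Pipeline`. As a rough sketch (the attribute names below are assumptions based on how the pipeline is assembled in this PR and may differ across versions), you can load a checkpoint and inspect which class backs each stage:

```python
from diffusers import AudioLDM2Pipeline

# Load the base text-to-audio checkpoint; weights are downloaded from the Hugging Face Hub.
pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2")

# Print the class backing each stage of the pipeline described above.
for name in ["text_encoder", "text_encoder_2", "projection_model", "language_model", "unet", "vae", "vocoder"]:
    component = getattr(pipe, name, None)
    print(f"{name}: {type(component).__name__ if component is not None else 'not present'}")
```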

The abstract of the paper is the following:

*Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific objectives and biases that can significantly differ from those of other types. To bring us closer to a unified perspective of audio generation, this paper proposes a framework that utilizes the same learning method for speech, music, and sound effect generation. Our framework introduces a general representation of audio, called language of audio (LOA). Any audio can be translated into LOA based on AudioMAE, a self-supervised pre-trained representation learning model. In the generation process, we translate any modalities into LOA by using a GPT-2 model, and we perform self-supervised audio generation learning with a latent diffusion model conditioned on LOA. The proposed framework naturally brings advantages such as in-context learning abilities and reusable self-supervised pretrained AudioMAE and latent diffusion models. Experiments on the major benchmarks of text-to-audio, text-to-music, and text-to-speech demonstrate new state-of-the-art or competitive performance to previous approaches.*

This pipeline was contributed by [sanchit-gandhi](https://huggingface.co/sanchit-gandhi). The original codebase can be
found at [haoheliu/audioldm2](https://github.com/haoheliu/audioldm2).

## Tips

### Choosing a checkpoint

AudioLDM 2 comes in three variants. Two of these checkpoints are applicable to the general task of text-to-audio generation, while the third checkpoint is trained exclusively on text-to-music generation. See the table below for details on the three official checkpoints:

| Checkpoint                                                      | Task          | Model Size | Training Data (h) |
|-----------------------------------------------------------------|---------------|------------|-------------------|
| [audioldm2](https://huggingface.co/cvssp/audioldm2) | Text-to-audio | 1.1B | 1150k |
| [audioldm2-music](https://huggingface.co/cvssp/audioldm2-music) | Text-to-music | 1.1B | 665k |
| [audioldm2-large](https://huggingface.co/cvssp/audioldm2-large) | Text-to-audio | 1.5B | 1150k |
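
Switching between the checkpoints above only requires changing the repository id passed to `from_pretrained`. A minimal sketch, here loading the larger text-to-audio variant (half precision is illustrative, chosen to reduce memory use):

```python
import torch
from diffusers import AudioLDM2Pipeline

# Swap the repo id for "cvssp/audioldm2" or "cvssp/audioldm2-music" as needed.
pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2-large", torch_dtype=torch.float16)
pipe = pipe.to("cuda")
```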

### Constructing a prompt

* Descriptive prompt inputs work best: use adjectives to describe the sound (e.g. "high quality" or "clear") and make the prompt context specific (e.g. "water stream in a forest" instead of "stream").
* It's best to use general terms like "cat" or "dog" instead of specific names or abstract objects the model may not be familiar with.
* Using a **negative prompt** can significantly improve the quality of the generated waveform, by guiding the generation away from terms that correspond to poor quality audio. Try using a negative prompt of "Low quality."

### Controlling inference

* The _quality_ of the predicted audio sample can be controlled by the `num_inference_steps` argument; more steps give higher-quality audio at the expense of slower inference.
* The _length_ of the predicted audio sample can be controlled by varying the `audio_length_in_s` argument.

### Evaluating generated waveforms

* The quality of the generated waveforms can vary significantly based on the seed. Try generating with different seeds until you find a satisfactory generation (a minimal seed-sweep sketch follows this list).
* Multiple waveforms can be generated in one go: set `num_waveforms_per_prompt` to a value greater than 1. Automatic scoring is performed between the generated waveforms and the prompt text, and the audios are ranked from best to worst accordingly.
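
As a minimal sketch of the seed exploration mentioned above (the prompt and step count here are illustrative), you can fix everything except the generator seed and save one candidate per seed for listening comparison:

```python
import scipy
import torch
from diffusers import AudioLDM2Pipeline

pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2", torch_dtype=torch.float16).to("cuda")
prompt = "The sound of a hammer hitting a wooden surface"  # illustrative prompt

# Re-run the same prompt with different seeds and keep each candidate.
for seed in range(4):
    generator = torch.Generator("cuda").manual_seed(seed)
    audio = pipe(prompt, num_inference_steps=200, audio_length_in_s=10.0, generator=generator).audios
    scipy.io.wavfile.write(f"candidate_seed_{seed}.wav", rate=16000, data=audio[0])
```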

The following example demonstrates how to generate good-quality music using the tips above:

```python
import scipy
import torch
from diffusers import AudioLDM2Pipeline

# load the best weights for music generation
repo_id = "cvssp/audioldm2-music"
pipe = AudioLDM2Pipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# define the prompts
prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
negative_prompt = "Low quality."

# set the seed
generator = torch.Generator("cuda").manual_seed(0)

# run the generation
audio = pipe(
    prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=200,
    audio_length_in_s=10.0,
    num_waveforms_per_prompt=3,
    generator=generator,
).audios

# save the best audio sample (index 0) as a .wav file
scipy.io.wavfile.write("techno.wav", rate=16000, data=audio[0])
```
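
This PR also enables CPU offload for the pipeline. If GPU memory is tight, a minimal sketch (assuming `accelerate` is installed) is to offload sub-models rather than moving the whole pipeline to the GPU:

```python
import torch
from diffusers import AudioLDM2Pipeline

pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2-music", torch_dtype=torch.float16)
# Move each sub-model to the GPU only while it is needed; this replaces the
# pipe.to("cuda") call in the example above.
pipe.enable_model_cpu_offload()
```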

<Tip>

Make sure to check out the Schedulers [guide](/using-diffusers/schedulers) to learn how to explore the tradeoff between
scheduler speed and quality, and see the [reuse components across pipelines](/using-diffusers/loading#reuse-components-across-pipelines)
section to learn how to efficiently load the same components into multiple pipelines.

</Tip>
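
As one concrete illustration of that trade-off (the choice of `DPMSolverMultistepScheduler` here is an example rather than a recommendation from this PR), a different scheduler can be constructed from the existing pipeline config:

```python
from diffusers import AudioLDM2Pipeline, DPMSolverMultistepScheduler

pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2")
# Re-use the existing scheduler config so model-specific settings are preserved.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
```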

## AudioLDM2Pipeline
[[autodoc]] AudioLDM2Pipeline
- all
- __call__

## AudioLDM2ProjectionModel
[[autodoc]] AudioLDM2ProjectionModel
- forward

## AudioLDM2UNet2DConditionModel
[[autodoc]] AudioLDM2UNet2DConditionModel
- forward