Dev emo (fishaudio#171)
* SYNC CHANGE TO EMO BRANCH (fishaudio#162)

* Update README.md

* Update bert_models.json

* fix

* Update data_utils.py

* Update infer.py

* improve performance

* Feat: support auto split in webui (fishaudio#158)

* Feat: support auto split in webui

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Fix: change /voice api to post (fishaudio#160)

* Fix: change /voice api to post

* Fix: support /voice api get

* Fix: Add missing torch.cuda.empty_cache() (fishaudio#161)

---------

Co-authored-by: Sora <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Artrajz <[email protected]>

* sync (fishaudio#163)

* Update README.md

* Update bert_models.json

* fix

* Update data_utils.py

* Update infer.py

* improve performance

* Feat: support auto split in webui (fishaudio#158)

* Feat: support auto split in webui

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Fix: change /voice api to post (fishaudio#160)

* Fix: change /voice api to post

* Fix: support /voice api get

* Fix: Add missing torch.cuda.empty_cache() (fishaudio#161)

* del emo

* del emo

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: Sora <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Artrajz <[email protected]>

* Add files via upload

* Update infer.py

* add emo

* add emo

* Update default_config.yml

* Fix slice segments GPU perf (fishaudio#165)

* Fix slice segments GPU perf

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update commons.py

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Update infer.py

* Update models.py

* Update infer.py

* remove spec cache

* Update data_utils.py

* Update data_utils.py

* Update train_ms.py

* Revert "Fix slice segments GPU perf (fishaudio#165)" (fishaudio#169)

This reverts commit 28430fc.

* Update train_ms.py

* Update train_ms.py

* Update data_utils.py

* Update data_utils.py

* Update train_ms.py

* Update train_ms.py

* Update train_ms.py

* Update train_ms.py

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update default_config.yml

* Switch to Japanese wwm DeBERTa (fishaudio#172)

* Switch to Japanese wwm DeBERTa

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Fix wrong ellipsis g2p (fishaudio#173)

* Switch to Japanese wwm DeBERTa

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix ellipsis g2p

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Add files via upload

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix English phones not aligned with BERT features (fishaudio#174)

* Fix English phones not aligned with BERT features

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Fix english bert gen (fishaudio#175)

* Update webui.py

* Update webui.py

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add NCCL timeout

* Update train_ms.py

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update train_ms.py

* Update default_config.yml

* Update infer.py

* Update models.py

* Update train_ms.py

* Update infer.py

* Update emo_gen.py

* Feat: Support load and infer 2.0 models (fishaudio#178)

* Feat: Support load and infer 2.0 models

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Reuse the same logic and fix incorrect silence insertion (fishaudio#181)

* Refactor: reuse the same part of voice api.

* Fix: server_fastapi.py

* Update train_ms.py

* Update data_utils.py

* Update data_utils.py

* Update train_ms.py

* Update train_ms.py

* Update train_ms.py

* Update train_ms.py

* Update data_utils.py

* Update data_utils.py

* Add files via upload

* Update train_ms.py

* Update train_ms.py

* Update train_ms.py

* Update default_config.yml

* Update utils.py

* Update train_ms.py

* Update utils.py

* Update default_config.yml

* Update data_utils.py

* Update default_config.yml

* Update train_ms.py

* Update train_ms.py

* Update config.py

* Update utils.py

* Update train_ms.py

* Update train_ms.py

* feat: add voice mix and tone mix (fishaudio#187)

* feat: add voice mix and tone mix

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Stardust·减 <[email protected]>

* Add files via upload

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: Sora <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Artrajz <[email protected]>
Co-authored-by: Leng Yue <[email protected]>
Co-authored-by: OedoSoldier <[email protected]>
Co-authored-by: 潮幻Mark <[email protected]>
7 people authored Nov 25, 2023
1 parent ec9145c commit b186499
Showing 38 changed files with 156,672 additions and 839 deletions.
34 changes: 34 additions & 0 deletions bert/deberta-v2-large-japanese-char-wwm/.gitattributes
@@ -0,0 +1,34 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
89 changes: 89 additions & 0 deletions bert/deberta-v2-large-japanese-char-wwm/README.md
@@ -0,0 +1,89 @@
---
language: ja
license: cc-by-sa-4.0
library_name: transformers
tags:
- deberta
- deberta-v2
- fill-mask
- character
- wwm
datasets:
- wikipedia
- cc100
- oscar
metrics:
- accuracy
mask_token: "[MASK]"
widget:
- text: "京都大学で自然言語処理を[MASK][MASK]する。"
---

# Model Card for Japanese character-level DeBERTa V2 large

## Model description

This is a Japanese DeBERTa V2 large model pre-trained on Japanese Wikipedia, the Japanese portion of CC-100, and the Japanese portion of OSCAR.
This model is trained with character-level tokenization and whole word masking.

## How to use

You can use this model for masked language modeling as follows:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained('ku-nlp/deberta-v2-large-japanese-char-wwm')
model = AutoModelForMaskedLM.from_pretrained('ku-nlp/deberta-v2-large-japanese-char-wwm')

sentence = '京都大学で自然言語処理を[MASK][MASK]する。'
encoding = tokenizer(sentence, return_tensors='pt')
# Predict the most likely character for each [MASK] position (filling in the elided steps).
with torch.no_grad():
    logits = model(**encoding).logits
mask_positions = (encoding.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
for pos in mask_positions:
    print(tokenizer.decode(logits[0, pos].argmax(-1)))
```

You can also fine-tune this model on downstream tasks.
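
As a minimal fine-tuning sketch (the CSV file, column names, and hyperparameters below are illustrative placeholders, not part of this model card), binary text classification might look like:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = 'ku-nlp/deberta-v2-large-japanese-char-wwm'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Assumes a hypothetical train.csv with a 'text' column and an integer 'label' column.
dataset = load_dataset('csv', data_files={'train': 'train.csv'})
dataset = dataset.map(
    lambda batch: tokenizer(batch['text'], truncation=True, max_length=512),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir='out', num_train_epochs=3),
    train_dataset=dataset['train'],
    tokenizer=tokenizer,
)
trainer.train()
```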

## Tokenization

There is no need to tokenize texts in advance, and you can give raw texts to the tokenizer.
The texts are tokenized into character-level tokens by [sentencepiece](https://github.com/google/sentencepiece).
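
For instance, the character-level behavior can be checked directly (a minimal sketch; the commented output is what character-level tokenization should yield for in-vocabulary characters, not output quoted from this card):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('ku-nlp/deberta-v2-large-japanese-char-wwm')
# Raw text goes in directly; the tokenizer splits it into single characters.
print(tokenizer.tokenize('自然言語処理'))
# Expected: ['自', '然', '言', '語', '処', '理']
```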

## Training data

We used the following corpora for pre-training:

- Japanese Wikipedia (as of 20221020, 3.2GB, 27M sentences, 1.3M documents)
- Japanese portion of CC-100 (85GB, 619M sentences, 66M documents)
- Japanese portion of OSCAR (54GB, 326M sentences, 25M documents)

Note that we filtered out documents annotated with "header", "footer", or "noisy" tags in OSCAR.
Also note that Japanese Wikipedia was duplicated 10 times to make the total size of the corpus comparable to that of CC-100 and OSCAR. As a result, the total size of the training data is 171GB (10 × 3.2GB + 85GB + 54GB).

## Training procedure

We first segmented texts in the corpora into words using [Juman++ 2.0.0-rc3](https://github.com/ku-nlp/jumanpp/releases/tag/v2.0.0-rc3) for whole word masking.
Then, we built a sentencepiece model with 22,012 tokens including all characters that appear in the training corpus.

We tokenized the raw corpora into character-level subwords using the sentencepiece model and trained the Japanese DeBERTa model using the [transformers](https://github.com/huggingface/transformers) library.
The training took 26 days using 16 NVIDIA A100-SXM4-40GB GPUs.

The following hyperparameters were used during pre-training:

- learning_rate: 1e-4
- per_device_train_batch_size: 26
- distributed_type: multi-GPU
- num_devices: 16
- gradient_accumulation_steps: 8
- total_train_batch_size: 3,328 (26 per device × 16 devices × 8 gradient-accumulation steps)
- max_seq_length: 512
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-06
- lr_scheduler_type: linear schedule with warmup (lr = 0 at 300k steps)
- training_steps: 260,000
- warmup_steps: 10,000

The accuracy of the trained model on the masked language modeling task was 0.795.
The evaluation set consists of 5,000 randomly sampled documents from each of the training corpora.

## Acknowledgments

This work was supported by Joint Usage/Research Center for Interdisciplinary Large-scale Information Infrastructures (JHPCN) through General Collaboration Project no. jh221004, "Developing a Platform for Constructing and Sharing of Large-Scale Japanese Language Models".
For training models, we used the mdx ("a platform for the data-driven future").
37 changes: 37 additions & 0 deletions bert/deberta-v2-large-japanese-char-wwm/config.json
@@ -0,0 +1,37 @@
{
"architectures": [
"DebertaV2ForMaskedLM"
],
"attention_head_size": 64,
"attention_probs_dropout_prob": 0.1,
"conv_act": "gelu",
"conv_kernel_size": 3,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 1024,
"initializer_range": 0.02,
"intermediate_size": 4096,
"layer_norm_eps": 1e-07,
"max_position_embeddings": 512,
"max_relative_positions": -1,
"model_type": "deberta-v2",
"norm_rel_ebd": "layer_norm",
"num_attention_heads": 16,
"num_hidden_layers": 24,
"pad_token_id": 0,
"pooler_dropout": 0,
"pooler_hidden_act": "gelu",
"pooler_hidden_size": 1024,
"pos_att_type": [
"p2c",
"c2p"
],
"position_biased_input": false,
"position_buckets": 256,
"relative_attention": true,
"share_att_key": true,
"torch_dtype": "float16",
"transformers_version": "4.25.1",
"type_vocab_size": 0,
"vocab_size": 22012
}
7 changes: 7 additions & 0 deletions bert/deberta-v2-large-japanese-char-wwm/special_tokens_map.json
@@ -0,0 +1,7 @@
{
"cls_token": "[CLS]",
"mask_token": "[MASK]",
"pad_token": "[PAD]",
"sep_token": "[SEP]",
"unk_token": "[UNK]"
}
19 changes: 19 additions & 0 deletions bert/deberta-v2-large-japanese-char-wwm/tokenizer_config.json
@@ -0,0 +1,19 @@
{
"cls_token": "[CLS]",
"do_lower_case": false,
"do_subword_tokenize": true,
"do_word_tokenize": true,
"jumanpp_kwargs": null,
"mask_token": "[MASK]",
"mecab_kwargs": null,
"model_max_length": 1000000000000000019884624838656,
"never_split": null,
"pad_token": "[PAD]",
"sep_token": "[SEP]",
"special_tokens_map_file": null,
"subword_tokenizer_type": "character",
"sudachi_kwargs": null,
"tokenizer_class": "BertJapaneseTokenizer",
"unk_token": "[UNK]",
"word_tokenizer_type": "basic"
}
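
Taken together, the vendored files above are enough for transformers to resolve the model locally. A quick sanity check (a sketch, assuming the remaining files in this directory, such as the LFS-tracked weights and the vocabulary, have been fetched):

```python
from transformers import AutoConfig, AutoTokenizer

local_dir = 'bert/deberta-v2-large-japanese-char-wwm'
config = AutoConfig.from_pretrained(local_dir)
tokenizer = AutoTokenizer.from_pretrained(local_dir)

print(config.model_type, config.hidden_size, config.vocab_size)  # deberta-v2 1024 22012
print(type(tokenizer).__name__)  # BertJapaneseTokenizer, per tokenizer_config.json
```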