forked from huggingface/transformers
[marian] Automate Tatoeba-Challenge conversion (huggingface#7709)
Showing 5 changed files with 1,347 additions and 165 deletions.
@@ -0,0 +1,22 @@
import tempfile
import unittest

from transformers.convert_marian_tatoeba_to_pytorch import TatoebaConverter
from transformers.file_utils import cached_property
from transformers.testing_utils import slow


class TatoebaConversionTester(unittest.TestCase):
    @cached_property
    def resolver(self):
        tmp_dir = tempfile.mkdtemp()
        return TatoebaConverter(save_dir=tmp_dir)

    @slow
    def test_resolver(self):
        self.resolver.convert_models(["heb-eng"])

    @slow
    def test_model_card(self):
        content, mmeta = self.resolver.write_model_card("opus-mt-he-en", dry_run=True)
        assert mmeta["long_pair"] == "heb-eng"
@@ -0,0 +1,44 @@
Set up transformers following the instructions in README.md (I would fork first).
```bash
git clone git@github.com:huggingface/transformers.git
cd transformers
pip install -e .
pip install pandas
```

Get the required metadata:
```bash
curl https://cdn-datasets.huggingface.co/language_codes/language-codes-3b2.csv > language-codes-3b2.csv
curl https://cdn-datasets.huggingface.co/language_codes/iso-639-3.csv > iso-639-3.csv
```
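If `curl` is not available, the same two files can be fetched from Python. This is a minimal sketch using only the standard library; `fetch_metadata` and `METADATA_URLS` are hypothetical names introduced here, not part of transformers. The URLs are the ones given in the `curl` commands, and the files are saved under their original names in the current directory, as the `curl` redirects do.

```python
import urllib.request
from pathlib import Path

# URLs copied from the curl commands in this README.
METADATA_URLS = [
    "https://cdn-datasets.huggingface.co/language_codes/language-codes-3b2.csv",
    "https://cdn-datasets.huggingface.co/language_codes/iso-639-3.csv",
]


def fetch_metadata(urls=METADATA_URLS, dest_dir="."):
    """Download each URL into dest_dir, skipping files that already exist.

    Hypothetical helper for illustration; not part of transformers.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        # Keep the remote file name, e.g. "iso-639-3.csv".
        target = dest_dir / url.rsplit("/", 1)[-1]
        if not target.exists():
            urllib.request.urlretrieve(url, str(target))
        saved.append(target)
    return saved
```

Calling `fetch_metadata()` from the root of the transformers checkout should leave both CSV files where the `curl` commands would have put them.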

Clone the Tatoeba-Challenge repo inside the transformers directory:
```bash
git clone git@github.com:Helsinki-NLP/Tatoeba-Challenge.git
```

To convert a few models, call the conversion script from the command line:
```bash
python src/transformers/convert_marian_tatoeba_to_pytorch.py --models heb-eng eng-heb --save_dir converted
```

To convert many models, pass your list of Tatoeba model names to `resolver.convert_models` from a Python session or script.

```python
from transformers.convert_marian_tatoeba_to_pytorch import TatoebaConverter
resolver = TatoebaConverter(save_dir='converted')
resolver.convert_models(['heb-eng', 'eng-heb'])
```
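For long lists it can help to convert one model at a time so that a single bad name does not abort the whole run. The sketch below assumes only what the snippet above shows, namely that the converter has a `convert_models` method accepting a list of names. `convert_each` is a hypothetical helper, and whether `convert_models` actually raises on a failed conversion is an assumption.

```python
def convert_each(converter, model_names):
    """Convert models one at a time and collect failures.

    `converter` is any object with a `convert_models(list_of_names)` method,
    such as the `resolver` above. Hypothetical helper, not part of transformers.
    """
    converted, failed = [], {}
    for name in model_names:
        try:
            converter.convert_models([name])
        except Exception as err:  # broad on purpose: keep the batch going
            failed[name] = repr(err)
        else:
            converted.append(name)
    return converted, failed
```

Usage would look like `converted, failed = convert_each(resolver, ['heb-eng', 'eng-heb'])`, after which `failed` maps each problematic name to the error it produced.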


### Upload converted models
```bash
cd converted
transformers-cli login
for FILE in *; do transformers-cli upload "$FILE"; done
```
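The upload loop can also be driven from Python, which makes it easy to preview the commands before running them. This is a sketch that mirrors the shell loop above and uses only the `transformers-cli upload` invocation shown there; `upload_commands` and `upload_all` are hypothetical helpers, and `transformers-cli login` is assumed to have been run already.

```python
import subprocess
from pathlib import Path


def upload_commands(save_dir):
    """Build one `transformers-cli upload <name>` command per entry in save_dir.

    Like the shell glob `*`, hidden entries are skipped and names are sorted.
    """
    names = sorted(
        p.name for p in Path(save_dir).iterdir() if not p.name.startswith(".")
    )
    return [["transformers-cli", "upload", name] for name in names]


def upload_all(save_dir, dry_run=True):
    """Print each command; run it from inside save_dir when dry_run is False."""
    for cmd in upload_commands(save_dir):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, cwd=save_dir, check=True)
```

`upload_all("converted")` only prints the commands; pass `dry_run=False` to execute them.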


### Modifications
- To change naming logic, change the code near `os.rename`. The model card creation code may also need to change.
- To change model card content, modify `TatoebaConverter.write_model_card`.
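After editing the model card code, a dry run is a quick way to inspect the result. The sketch below relies on the call used in this commit's test, `write_model_card(name, dry_run=True)`, which returns a `(content, metadata)` pair. It assumes that `content` is a string and that `dry_run=True` skips writing files; neither is confirmed here. `preview_model_card` is a hypothetical helper.

```python
def preview_model_card(converter, hf_model_name, max_lines=20):
    """Return the first lines of the generated card plus its metadata.

    Assumes `converter.write_model_card(name, dry_run=True)` returns
    (content, metadata) with `content` a string, as the test suggests.
    Hypothetical helper, not part of transformers.
    """
    content, metadata = converter.write_model_card(hf_model_name, dry_run=True)
    head = "\n".join(content.splitlines()[:max_lines])
    return head, metadata
```

For example, `head, meta = preview_model_card(resolver, "opus-mt-he-en")` followed by `print(head)` shows the top of the card, and `meta["long_pair"]` is the field the test checks.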