If you are accessing this repo via GitHub, please see the project page on DagsHub for data files, pipelines, and more.
First install:
- Conda
- Rust compiler:
  curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
- Then reopen your shell or run:
  source $HOME/.cargo/env
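To confirm the prerequisites are available before continuing (a quick sanity check, not part of the original instructions), you can run:

# Conda and the Rust toolchain should both be on your PATH
conda --version
rustc --version
cargo --version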
Then create and activate the UNIKUD environment with:
conda env create -f environment.yml
conda activate unikud
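Assuming environment.yml installs DVC into the environment (it is needed for the data and pipeline steps below), you can verify that activation worked:

# The active environment is marked with an asterisk
conda env list
# DVC should now be available inside the unikud environment
dvc --version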
You may then download the required data files using DVC:
dvc remote add origin https://dagshub.com/morrisalp/unikud.dvc
dvc pull -r origin
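If the remote requires authentication to pull (this depends on the repository's access settings, so treat the credentials below as placeholders), you can store them locally before pulling:

# Store DagsHub credentials locally (kept out of git) for the HTTP remote
dvc remote modify origin --local auth basic
dvc remote modify origin --local user <your-dagshub-username>
dvc remote modify origin --local password <your-dagshub-token>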
Sources of data:
- Public-domain works from the Ben-Yehuda Project
- Wikimedia sources:
  - Hebrew Wikipedia
  - Hebrew Wikisource (ויקיטקסט)
  - Hebrew Wiktionary (ויקימילון)
To reproduce the training pipeline, perform the following steps:
- Preprocess data:
  dvc repro preprocessing
- Train the ktiv male model:
  dvc repro train-ktiv-male
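DVC can also display the stage graph and reproduce every stage defined in dvc.yaml in dependency order; the commands below are standard DVC usage rather than project-specific steps:

# Inspect the pipeline's stages and their dependencies
dvc dag
# Run all stages of the pipeline in order
dvc repro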
Training steps will automatically log to MLflow (via the Hugging Face Trainer object) if the following environment variables are set: MLFLOW_TRACKING_URI, MLFLOW_TRACKING_USERNAME, and MLFLOW_TRACKING_PASSWORD.
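For example, you might export them before training (the tracking URI and credentials below are placeholders; use the values from your own MLflow or DagsHub setup):

# Point the Hugging Face Trainer's MLflow integration at your tracking server
export MLFLOW_TRACKING_URI=https://dagshub.com/<user>/<repo>.mlflow
export MLFLOW_TRACKING_USERNAME=<your-username>
export MLFLOW_TRACKING_PASSWORD=<your-token>
# Training will now log metrics and parameters to MLflow
dvc repro train-ktiv-male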