To use this repository, clone it and install the required dependencies:
git clone https://github.com/your-username/DyDTS.git
cd DyDTS
We recommend using a virtual environment (e.g., venv, conda) to install the dependencies.
pip install -r requirements.txt
DialSeg711 is a real-world dataset consisting of 711 English dialogues, sourced from MultiWOZ and KVRET. It exhibits an average of 4.9 topic segments and 5.6 utterances per segment. Doc2Dial is a synthetic dataset comprising over 4,100 English conversations grounded in 450+ documents across four domains. It presents an average of 3.7 topic segments and 3.5 utterances per segment.
Datasets | DialSeg711 | Doc2Dial |
---|---|---|
#samples | 711 | 4100 |
#Avg. Topic Segments/Dialogue | 4.9 | 3.7 |
#Avg. Utterances/Topic Segment | 5.6 | 3.5 |
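The averages in the table can be recomputed from the segment annotations. The snippet below is a minimal sketch, assuming each dialogue is given as a list of segment lengths (number of utterances per topic segment); this representation and the function name `segmentation_stats` are illustrative assumptions, not the format used by this repository's data files.

```python
from typing import Dict, List


def segmentation_stats(dialogues: List[List[int]]) -> Dict[str, float]:
    """Compute dataset-level statistics from per-dialogue segment lengths.

    `dialogues[i]` is a hypothetical representation: a list holding the
    number of utterances in each topic segment of dialogue i.
    """
    if not dialogues:
        raise ValueError("need at least one dialogue")
    n_dialogues = len(dialogues)
    n_segments = sum(len(segments) for segments in dialogues)
    n_utterances = sum(sum(segments) for segments in dialogues)
    if n_segments == 0:
        raise ValueError("need at least one segment")
    return {
        "samples": n_dialogues,
        "avg_segments_per_dialogue": n_segments / n_dialogues,
        "avg_utterances_per_segment": n_utterances / n_segments,
    }


# Toy input: two dialogues with 3 and 2 topic segments respectively.
toy = [[3, 2, 4], [5, 5]]
stats = segmentation_stats(toy)
print(stats)
```

Note that the utterances-per-segment figure here is a micro-average over all segments; averaging per dialogue first would give a slightly different number.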
Prepare your dialogue data in the required format: each dialogue is represented as a sequence of utterances, where each utterance is a text string. The dataset is available right here
python data_prepare.py --data_dir data/dialseg711 --file_name 711.pkl --output_dir processed_711_data --model_name sup-simcse-bert-base-uncased
To train the model on your dataset:
python train.py --data_dir processed_711_data --model_name sup-simcse-bert-base-uncased --output_dir model_711_trained
To evaluate the model's performance, we provide evaluation scripts and a model for computing metrics such as Pk and WD (WindowDiff) on the segmented output:
python inference.py --data_dir data/dialseg711 --model_name sup-simcse-bert-base-uncased --output_dir model_711
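For readers unfamiliar with the two metrics, the following is a minimal sketch of the standard Pk and WindowDiff definitions, not the code in `inference.py`. It assumes each segmentation is given as a list of segment lengths, and it uses the common convention of setting the window size `k` to half the average reference segment length; the function names and input format are illustrative assumptions. Lower is better for both metrics.

```python
from typing import List, Optional, Tuple


def segment_ids(seg_lengths: List[int]) -> List[int]:
    """Expand segment lengths into one segment index per utterance."""
    ids: List[int] = []
    for seg_index, length in enumerate(seg_lengths):
        ids.extend([seg_index] * length)
    return ids


def _prepare(ref_lengths: List[int], hyp_lengths: List[int],
             k: Optional[int]) -> Tuple[List[int], List[int], int]:
    ref, hyp = segment_ids(ref_lengths), segment_ids(hyp_lengths)
    if not ref or len(ref) != len(hyp):
        raise ValueError("reference and hypothesis must cover the same, non-empty dialogue")
    if k is None:
        # Assumed convention: half the average reference segment length.
        n_ref_segments = ref[-1] + 1
        k = max(1, round(len(ref) / (2 * n_ref_segments)))
    return ref, hyp, k


def pk(ref_lengths: List[int], hyp_lengths: List[int], k: Optional[int] = None) -> float:
    """Fraction of windows where ref and hyp disagree on 'same segment or not'."""
    ref, hyp, k = _prepare(ref_lengths, hyp_lengths, k)
    n_windows = len(ref) - k
    if n_windows <= 0:
        return 0.0
    errors = sum(
        (ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k])
        for i in range(n_windows)
    )
    return errors / n_windows


def window_diff(ref_lengths: List[int], hyp_lengths: List[int], k: Optional[int] = None) -> float:
    """Fraction of windows where ref and hyp contain a different number of boundaries."""
    ref, hyp, k = _prepare(ref_lengths, hyp_lengths, k)
    n_windows = len(ref) - k
    if n_windows <= 0:
        return 0.0
    # Segment ids increase by one at each boundary, so the id difference
    # across a window equals the number of boundaries inside it.
    errors = sum(
        (ref[i + k] - ref[i]) != (hyp[i + k] - hyp[i])
        for i in range(n_windows)
    )
    return errors / n_windows


# Toy example: a 6-utterance dialogue with a reference boundary after utterance 3.
reference = [3, 3]
print(pk(reference, [3, 3]), window_diff(reference, [3, 3]))  # perfect match
print(pk(reference, [6]), window_diff(reference, [6]))        # boundary missed
```

Pk only checks whether the two ends of a window fall in the same segment, while WindowDiff compares boundary counts, so WindowDiff also penalizes extra boundaries that Pk can overlook.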
We welcome contributions to improve the ATBR method. Feel free to fork the repository and submit pull requests for:
- Bug fixes
- Feature enhancements
- Improvements to the documentation
For any questions, feel free to open an issue or contact the project maintainers.