
Commit 7ae834e: update docs

meta-tabchen committed Aug 19, 2022
1 parent 60c42b0
Showing 7 changed files with 227 additions and 170 deletions.
86 changes: 10 additions & 76 deletions README.md
@@ -22,39 +22,6 @@ source activate pykt
pip install -U pykt-toolkit -i https://pypi.python.org/simple
```

## Development
1. Clone the pykt repository

```shell
git clone https://github.com/pykt-team/pykt-toolkit
```

2. Switch to the dev branch

```shell
cd pykt-toolkit
git checkout dev
```

**Do not** work on the main branch.

3. Editable install

Use the following command to install the pykt library in editable mode.

```shell
pip install -e .
```
In this mode, every modification in the `pykt` directory takes effect immediately; you do not need to reinstall the package.

4. Push to the remote dev branch

After developing models or fixing bugs, push your code to the dev branch.


Pushing directly to the main branch is **not allowed** (the push will fail). Instead, open a Pull Request to merge your code from the **dev** branch into the main branch. Pull Requests from any other branch to main will be rejected; merge into dev first.


## References
### Projects

@@ -91,48 +58,15 @@ The main branch is **not allowed** to push codes (the push will be failed). You



<!--
# How to use?
CUDA_VISIBLE_DEVICES=3 python wandb_akt_train.py
# description
## preprocess:
The preprocessing code for each dataset.
* assist2015_preprocess.py
The preprocessing code for the assist2015 dataset.
If you want to add a new dataset, please write your own dataset preprocessing code to convert the data to this format:
```
uid,seq_len
question ids / names
concept ids / names
timestamps
usetimes
```
an example like this:
```
50121,4
106101,106102,106103,106104
7014,7012,7014,7013
0,1,1,1
1647409594,1647409601,1647409666,1647409694
123,234,456,789
```
* split_datasets.py
Split the data into 5 folds for training and testing.
## data
The directory where the data for each dataset is saved.
## datasets
Includes a data_loader.py to prepare data for training models.
## Citation

## models
Includes the following models: dkt, dkt+, dkvmn, sakt, saint, akt, kqn, atkt.
We now have a [paper](https://arxiv.org/abs/2206.11460?context=cs.CY) you can cite for our pyKT library:

## others
train.py: training code. -->
```bibtex
@article{liu2022pykt,
title={pyKT: A Python Library to Benchmark Deep Learning based Knowledge Tracing Models},
author={Liu, Zitao and Liu, Qiongqiong and Chen, Jiahao and Huang, Shuyan and Tang, Jiliang and Luo, Weiqi},
journal={arXiv preprint arXiv:2206.11460},
year={2022}
}
```
Binary file added docs/pics/dataset-add_data_path.jpg
Binary file added docs/pics/dataset-import.jpg
160 changes: 160 additions & 0 deletions docs/source/contribute.md
@@ -0,0 +1,160 @@
# How to Contribute to pyKT?
Everyone is welcome to contribute, and we value everybody's contribution.


## You can contribute in so many ways!
There are several ways you can contribute to pyKT:
1. Report bugs by creating an issue.
2. Add new datasets.
3. Implement new models.

## Install for Development
1. Clone the pykt repository

```shell
git clone https://github.com/pykt-team/pykt-toolkit
```

2. Switch to the dev branch

```shell
cd pykt-toolkit
git checkout dev
```

**Do not** work on the main branch.

3. Editable install

Use the following command to install the pykt library in editable mode.

```shell
pip install -e .
```
In this mode, every modification in the `pykt` directory takes effect immediately; you do not need to reinstall the package.
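
As a quick sanity check (a minimal sketch; it only assumes the editable install above succeeded), you can confirm that Python resolves `pykt` to your working copy:

```python
# Quick sanity check: with an editable install, the package should resolve to
# your local clone rather than site-packages.
import pykt
print(pykt.__file__)
```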

4. Push to the remote dev branch

After developing models or fixing bugs, push your code to the dev branch.


Pushing directly to the main branch is **not allowed** (the push will fail). Instead, open a Pull Request to merge your code from the **dev** branch into the main branch. Pull Requests from any other branch to main will be rejected; merge into dev first.



## Add Your Datasets

In this section, we use the `ASSISTments2015` dataset to walk through the procedure for adding a dataset. We use `assist2015` as the dataset name; replace it with the name of your own dataset.

### Add Data Files
1. Add a new folder named after the dataset in the `data` directory.

2. Store the raw files in this directory. Here is the `assist2015` file structure:

```shell
$tree data/assist2015/
├── 2015_100_skill_builders_main_problems.csv
```

3. Add the data path to `dname2paths` in `examples/data_preprocess.py`, as shown below.

![](../pics/dataset-add_data_path.jpg)
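
For illustration, the new entry might look like the following sketch (the exact relative path depends on where `examples/data_preprocess.py` is run from; see the figure above for the real code):

```python
# Illustrative sketch: register the raw file placed under data/assist2015/
# in the dname2paths dictionary of examples/data_preprocess.py.
dname2paths = {
    # ... existing datasets ...
    "assist2015": "../data/assist2015/2015_100_skill_builders_main_problems.csv",
}
```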

### Write Python Script

1. Create the preprocessing script `assist2015_preprocess.py` in the `pykt/preprocess` directory. Before writing the preprocessing script, you are advised to read the [Data Preprocess Standards](#data-preprocess-standards), which contain guidelines for processing a dataset. The script below shows the main steps; the full code can be found in `pykt/preprocess/algebra2005_preprocess.py`.

<!--
```python
import pandas as pd
from pykt.utils import write_txt, change2timestamp, replace_text

def read_data_from_csv(read_file, write_file):
    # load the original data
    df = pd.read_table(read_file, encoding="utf-8", dtype=str, low_memory=False)
    df["Problem Name"] = df["Problem Name"].apply(replace_text)
    df["Step Name"] = df["Step Name"].apply(replace_text)
    df["Questions"] = df.apply(lambda x: f"{x['Problem Name']}----{x['Step Name']}", axis=1)
    df["index"] = range(df.shape[0])
    df = df.dropna(subset=["Anon Student Id", "Questions", "KC(Default)", "First Transaction Time", "Correct First Attempt"])
    df = df[df["Correct First Attempt"].isin([str(0), str(1)])]  # keep only interactions whose response is in [0, 1]
    df = df[["index", "Anon Student Id", "Questions", "KC(Default)", "First Transaction Time", "Correct First Attempt"]]
    df["KC(Default)"] = df["KC(Default)"].apply(replace_text)

    data = []
    ui_df = df.groupby(['Anon Student Id'], sort=False)

    for ui in ui_df:
        u, curdf = ui[0], ui[1]
        curdf.loc[:, "First Transaction Time"] = curdf.loc[:, "First Transaction Time"].apply(lambda t: change2timestamp(t))
        curdf = curdf.sort_values(by=["First Transaction Time", "index"])
        curdf["First Transaction Time"] = curdf["First Transaction Time"].astype(str)

        seq_skills = [x.replace("~~", "_") for x in curdf["KC(Default)"].values]
        seq_ans = curdf["Correct First Attempt"].values
        seq_start_time = curdf["First Transaction Time"].values
        seq_problems = curdf["Questions"].values
        seq_len = len(seq_ans)
        seq_use_time = ["NA"]

        data.append(
            [[u, str(seq_len)], seq_problems, seq_skills, seq_ans, seq_start_time, seq_use_time])

    write_txt(write_file, data)
``` -->

2. Import the preprocessing script in `pykt/preprocess/data_proprocess.py`, as shown below.


![](../pics/dataset-import.jpg)
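
The wiring might look roughly like the following sketch (hypothetical names; the actual structure of `pykt/preprocess/data_proprocess.py` is shown in the figure above and may differ):

```python
# Hypothetical sketch of dispatching a new dataset's preprocessing function by
# name; the real data_proprocess.py may be organized differently.
from pykt.preprocess.assist2015_preprocess import read_data_from_csv

dname2func = {
    "assist2015": read_data_from_csv,
}

def preprocess(dataset_name, read_file, write_file):
    # Look up the dataset-specific reader and run it.
    dname2func[dataset_name](read_file, write_file)
```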



### Data Preprocess Standards
#### Field Extraction

For each dataset, we mainly extract 6 fields: user ID, question ID (name), skill ID (name), answer status, answer submission time, and answer duration (if a field does not exist in the dataset, it is represented by NA).

#### Data Filtering

For each answer record, if any of the following five fields is empty, the record is deleted: user ID, question ID (name), skill ID (name), answer status, and answer submission time.

#### Data Sorting

Each student's answer sequence is sorted by the order in which the student answered. If different records of the same student share the same ordering value, their original order is preserved; that is, the sort is stable with respect to the order of the records in the original dataset.
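
A minimal pandas sketch of this stable ordering, mirroring the `sort_values(by=["First Transaction Time", "index"])` pattern in the script above (column names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "user": ["u1", "u1", "u1"],
    "submit_time": [1647409601000, 1647409594000, 1647409601000],
})
# Record the original row order so records with equal timestamps keep
# their order from the raw dataset (a stable sort).
df["index"] = range(df.shape[0])
df = df.sort_values(by=["submit_time", "index"])
print(df)
```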

#### Character Processing

- **Field concatenation:** Use `----` as the connecting symbol. For example, Algebra2005 needs to concatenate `Problem Name` and `Step Name` as the final problem name.
- **Character replacement:** If a question or skill name in the original data contains an underscore `_`, replace it with `####`. If it contains a comma `,`, replace it with `@@@@`.
- **Multi-skill separator:** If a question has multiple skills, separate the skills with an underscore `_`.
- **Time format:** The answer submission time is a timestamp in milliseconds (ms), and the answer duration is also in milliseconds (ms).
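
A minimal illustrative sketch of these rules (the script above applies `replace_text` from `pykt.utils` for this purpose; the helper names below are hypothetical):

```python
def clean_name(name):
    """Character replacement: '_' -> '####', ',' -> '@@@@'."""
    return name.replace("_", "####").replace(",", "@@@@")

def join_skills(skills):
    """Clean each skill, then join multiple skills of one question with '_'."""
    return "_".join(clean_name(s) for s in skills)

print(join_skills(["add_fractions", "common,denominator"]))
# -> add####fractions_common@@@@denominator
```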

#### Output data format

After the above preprocessing is complete, each dataset produces a `data.txt` file in its own folder under the `data` directory. Each student sequence consists of 6 rows, as follows:

```
User ID, sequence length
Question ID (name)
Skill ID (name)
Answer status
Answer submission time
Answer duration
```

Example:

```
50121, 4
106101, 106102, 106103, 106104
7014, 7012, 7014, 7013
0, 1, 1, 1
1647409594000, 1647409601000, 1647409666000, 1647409694000
123, 234, 456, 789
```
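
For reference, a minimal sketch of how one student sequence could be written in this 6-row format (the script above uses `write_txt` from `pykt.utils` for this; the helper below is only illustrative):

```python
def write_sequences(path, sequences):
    """Write each student sequence as 6 comma-separated rows:
    [user_id, seq_len], questions, skills, answers, submit times, use times."""
    with open(path, "w", encoding="utf-8") as f:
        for rows in sequences:
            for row in rows:
                f.write(",".join(str(x) for x in row) + "\n")

# One student with 4 interactions, matching the example above.
seq = [
    ["50121", 4],
    ["106101", "106102", "106103", "106104"],
    ["7014", "7012", "7014", "7013"],
    ["0", "1", "1", "1"],
    ["1647409594000", "1647409601000", "1647409666000", "1647409694000"],
    ["123", "234", "456", "789"],
]
write_sequences("data/assist2015/data.txt", [seq])
```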


<!-- ## Add Your Models(todo) -->
21 changes: 11 additions & 10 deletions docs/source/datasets.md
@@ -1,55 +1,56 @@
### Statics2011
# Datasets
## Statics2011
This dataset is collected from an engineering statics course taught at Carnegie Mellon University during Fall 2011. A unique question is constructed by concatenating the problem name and step name; the dataset has 194,947 interactions, 333 students, and 1,224 questions.

https://pslcdatashop.web.cmu.edu/DatasetInfo?datasetId=507

### ASSISTments2009
## ASSISTments2009

This dataset is made up of math exercises collected from the free online tutoring ASSISTments platform during the 2009-2010 school year. It consists of 346,860 interactions, 4,217 students, and 26,688 questions, and has been a widely used standard benchmark for KT methods over the last decade.

https://sites.google.com/site/assistmentsdata/home/2009-2010-assistment-data/skill-builder-data-2009-2010

### ASSISTments2012
## ASSISTments2012
This is the ASSISTments data for the school year 2012~2013 with affect predictions. The dataset consists of 2,541,201 interactions, 27,066 students, and 45,716 questions.

https://sites.google.com/site/assistmentsdata/datasets/2012-13-school-data-with-affect

### ASSISTments2015
## ASSISTments2015

Similar to ASSISTments2009, this dataset is collected from the ASSISTments platform in 2015. It includes 708,631 interactions on 100 distinct KCs from 19,917 students, the largest number of students among the ASSISTments datasets.

https://sites.google.com/site/assistmentsdata/datasets/2015-assistments-skill-builder-data

### ASSISTments2017
## ASSISTments2017

This dataset is from the 2017 data mining competition. It consists of 942,816 interactions, 686 students, and 102 questions.

https://sites.google.com/view/assistmentsdatamining/dataset?authuser=0

### Algebra2005
## Algebra2005
This dataset is from the KDD Cup 2010 EDM Challenge and contains 13- to 14-year-old students' responses to algebra questions at a detailed step level. The unique question construction is similar to the process used in Statics2011, which ends up with 809,694 interactions, 574 students, 210,710 questions, and 112 KCs.

https://pslcdatashop.web.cmu.edu/KDDCup/

### Bridge2006
## Bridge2006

This dataset is also from the KDD Cup 2010 EDM Challenge and the unique question construction is similar to the process used in Statics2011. There are 3,679,199 interactions, 1,146 students, 207,856 questions and 493 KCs in the dataset.

https://pslcdatashop.web.cmu.edu/KDDCup/

### Ednet
## Ednet

This large-scale hierarchical student activity dataset, collected by Santa (an artificial intelligence tutoring system), contains 131,317,236 interactions from 784,309 students, making it the largest publicly released dataset from an interactive education system to date.

https://github.com/riiid/ednet

### NIPS34
## NIPS34

This dataset is from the Tasks 3 & 4 at the NeurIPS 2020 Education Challenge. It contains students’ answers to multiple-choice diagnostic math questions and is collected from the Eedi platform. For each question, we choose to use the leaf nodes from the subject tree as its KCs, which ends up with 1,382,727 interactions, 948 questions, and 57 KCs.

https://eedi.com/projects/neurips-education-challenge

### POJ
## POJ
This dataset consists of programming exercises collected from the Peking coding practice online platform. It was originally scraped by Pandey and Srivastava. In total, it has 996,240 interactions, 22,916 students, and 2,750 questions.

https://drive.google.com/drive/folders/1LRljqWfODwTYRMPw6wEJ_mMt1KZ4xBDk
4 changes: 2 additions & 2 deletions docs/source/index.rst
@@ -11,11 +11,11 @@ More details about the academic information can be read in our paper at https://
:maxdepth: 2
:caption: Home

Installation <installation>
Official Website <https://pykt-team.github.io/>
Quick Start <quick_start>
Quick Start (cn) <quick_start_cn>
Models <models>
Datasets <datasets>
Contribute <contribute>

.. toctree::
:maxdepth: 1
