
Commit 7ae834e: update docs

meta-tabchen committed Aug 19, 2022
1 parent 60c42b0
Showing 7 changed files with 227 additions and 170 deletions.
86 changes: 10 additions & 76 deletions README.md
@@ -22,39 +22,6 @@ source activate pykt
pip install -U pykt-toolkit -i https://pypi.python.org/simple
```

## Development
1. Clone the pykt repository

```shell
git clone https://github.com/pykt-team/pykt-toolkit
```

2. Switch to the dev branch

```shell
cd pykt-toolkit
git checkout dev
```

**Do not** work on the main branch.

3. Editable install

Use the following command to install the pykt library in editable mode.

```shell
pip install -e .
```
In this mode, every modification in the `pykt` directory takes effect immediately; you do not need to reinstall the package.

4. Push to the remote dev branch

After developing models or fixing bugs, push your code to the dev branch.


Pushing directly to the main branch is **not allowed** (the push will fail). Instead, open a Pull Request to merge your code from the **dev** branch into the main branch. Pull Requests from any other branch to main will be rejected; merge into dev first.


## References
### Projects

@@ -91,48 +58,15 @@ The main branch is **not allowed** to push codes (the push will be failed). You



<!--
# How to use?
CUDA_VISIBLE_DEVICES=3 python wandb_akt_train.py
# description
## preprocess:
The preprocessing code for each dataset.
* assist2015_preprocess.py
The preprocessing code for the assist2015 dataset.
If you want to add a new dataset, please write your own dataset preprocessing code to convert the data to this format:
```
uid,seq_len
question ids / names
concept ids / names
timestamps
usetimes
```
an example like this:
```
50121,4
106101,106102,106103,106104
7014,7012,7014,7013
0,1,1,1
1647409594,1647409601,1647409666,1647409694
123,234,456,789
```
* split_datasets.py
Split the data into 5 folds for training and testing.
## data
The directory where the data for each dataset is saved.
## datasets
Includes a data_loader.py to prepare data for training models.
## Citation

## models
Includes the following models: dkt, dkt+, dkvmn, sakt, saint, akt, kqn, atkt.
We now have a [paper](https://arxiv.org/abs/2206.11460?context=cs.CY) you can cite for our pyKT library:

## others
train.py: training code. -->
```bibtex
@article{liu2022pykt,
title={pyKT: A Python Library to Benchmark Deep Learning based Knowledge Tracing Models},
author={Liu, Zitao and Liu, Qiongqiong and Chen, Jiahao and Huang, Shuyan and Tang, Jiliang and Luo, Weiqi},
journal={arXiv preprint arXiv:2206.11460},
year={2022}
}
```
Binary file added docs/pics/dataset-add_data_path.jpg
Binary file added docs/pics/dataset-import.jpg
160 changes: 160 additions & 0 deletions docs/source/contribute.md
@@ -0,0 +1,160 @@
# How to Contribute to pyKT?
Everyone is welcome to contribute, and we value everybody's contribution.


## You can contribute in so many ways!
There are several ways you can contribute to pyKT:
1. Report bugs by creating an issue.
2. Add new datasets.
3. Implement new models.

## Install for Development
1. Clone the pykt repository

```shell
git clone https://github.com/pykt-team/pykt-toolkit
```

2. Switch to the dev branch

```shell
cd pykt-toolkit
git checkout dev
```

**Do not** work on the main branch.

3. Editable install

Use the following command to install the pykt library in editable mode.

```shell
pip install -e .
```
In this mode, every modification in the `pykt` directory takes effect immediately; you do not need to reinstall the package.
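
As a quick sanity check (a minimal sketch; it only assumes the editable install above succeeded), you can confirm that Python resolves `pykt` to your working copy:

```python
# Quick sanity check: with an editable install, the package should resolve to
# your local clone rather than site-packages.
import pykt
print(pykt.__file__)
```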

4. Push to the remote dev branch

After developing models or fixing bugs, push your code to the dev branch.


Pushing directly to the main branch is **not allowed** (the push will fail). Instead, open a Pull Request to merge your code from the **dev** branch into the main branch. Pull Requests from any other branch to main will be rejected; merge into dev first.



## Add Your Datasets

In this section, we use the `ASSISTments2015` dataset to walk through the procedure for adding a dataset. We use `assist2015` as the dataset name; replace it with the name of your own dataset.

### Add Data Files
1. Add a new folder named after the dataset in the `data` directory.

2. Store the raw files in this directory. Here is the `assist2015` file structure:

```shell
$tree data/assist2015/
├── 2015_100_skill_builders_main_problems.csv
```

3. Add the data path to `dname2paths` in `examples/data_preprocess.py`, as shown below.

![](../pics/dataset-add_data_path.jpg)
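
For illustration, the new entry might look like the following sketch (the exact relative path depends on where `examples/data_preprocess.py` is run from; see the figure above for the real code):

```python
# Illustrative sketch: register the raw file placed under data/assist2015/
# in the dname2paths dictionary of examples/data_preprocess.py.
dname2paths = {
    # ... existing datasets ...
    "assist2015": "../data/assist2015/2015_100_skill_builders_main_problems.csv",
}
```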

### Write Python Script

1. Create the preprocessing script `assist2015_preprocess.py` in the `pykt/preprocess` directory. Before writing the preprocessing script, you are advised to read the [Data Preprocess Standards](#data-preprocess-standards), which contain guidelines for processing a dataset. The script below shows the main steps; the full code can be found in `pykt/preprocess/algebra2005_preprocess.py`.

<!--
```python
import pandas as pd
from pykt.utils import write_txt, change2timestamp, replace_text

def read_data_from_csv(read_file, write_file):
    # load the original data
    df = pd.read_table(read_file, encoding="utf-8", dtype=str, low_memory=False)
    df["Problem Name"] = df["Problem Name"].apply(replace_text)
    df["Step Name"] = df["Step Name"].apply(replace_text)
    df["Questions"] = df.apply(lambda x: f"{x['Problem Name']}----{x['Step Name']}", axis=1)
    df["index"] = range(df.shape[0])
    df = df.dropna(subset=["Anon Student Id", "Questions", "KC(Default)", "First Transaction Time", "Correct First Attempt"])
    df = df[df["Correct First Attempt"].isin([str(0), str(1)])]  # keep only interactions whose response is in [0, 1]
    df = df[["index", "Anon Student Id", "Questions", "KC(Default)", "First Transaction Time", "Correct First Attempt"]]
    df["KC(Default)"] = df["KC(Default)"].apply(replace_text)

    data = []
    ui_df = df.groupby(['Anon Student Id'], sort=False)

    for ui in ui_df:
        u, curdf = ui[0], ui[1]
        curdf.loc[:, "First Transaction Time"] = curdf.loc[:, "First Transaction Time"].apply(lambda t: change2timestamp(t))
        curdf = curdf.sort_values(by=["First Transaction Time", "index"])
        curdf["First Transaction Time"] = curdf["First Transaction Time"].astype(str)

        seq_skills = [x.replace("~~", "_") for x in curdf["KC(Default)"].values]
        seq_ans = curdf["Correct First Attempt"].values
        seq_start_time = curdf["First Transaction Time"].values
        seq_problems = curdf["Questions"].values
        seq_len = len(seq_ans)
        seq_use_time = ["NA"]

        data.append(
            [[u, str(seq_len)], seq_problems, seq_skills, seq_ans, seq_start_time, seq_use_time])

    write_txt(write_file, data)
``` -->

2. Import the preprocessing script in `pykt/preprocess/data_proprocess.py`, as shown below.


![](../pics/dataset-import.jpg)
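
The wiring might look roughly like the following sketch (hypothetical names; the actual structure of `pykt/preprocess/data_proprocess.py` is shown in the figure above and may differ):

```python
# Hypothetical sketch of dispatching a new dataset's preprocessing function by
# name; the real data_proprocess.py may be organized differently.
from pykt.preprocess.assist2015_preprocess import read_data_from_csv

dname2func = {
    "assist2015": read_data_from_csv,
}

def preprocess(dataset_name, read_file, write_file):
    # Look up the dataset-specific reader and run it.
    dname2func[dataset_name](read_file, write_file)
```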



### Data Preprocess Standards
#### Field Extraction

For each dataset, we mainly extract 6 fields: user ID, question ID (name), skill ID (name), answer status, answer submission time, and answer duration (if a field does not exist in the dataset, it is represented by NA).

#### Data Filtering

For each answer record, if any of the following five fields is empty, the record is deleted: user ID, question ID (name), skill ID (name), answer status, and answer submission time.

#### Data Sorting

Each student's answer sequence is sorted by the order in which the student answered. If different records of the same student share the same ordering value, their original order is preserved; that is, the sort is stable with respect to the order of the records in the original dataset.
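
A minimal pandas sketch of this stable ordering, mirroring the `sort_values(by=["First Transaction Time", "index"])` pattern in the script above (column names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "user": ["u1", "u1", "u1"],
    "submit_time": [1647409601000, 1647409594000, 1647409601000],
})
# Record the original row order so records with equal timestamps keep
# their order from the raw dataset (a stable sort).
df["index"] = range(df.shape[0])
df = df.sort_values(by=["submit_time", "index"])
print(df)
```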

#### Character Processing

- **Field concatenation:** Use `----` as the connecting symbol. For example, Algebra2005 needs to concatenate `Problem Name` and `Step Name` as the final problem name.
- **Character replacement:** If a question or skill name in the original data contains an underscore `_`, replace it with `####`. If it contains a comma `,`, replace it with `@@@@`.
- **Multi-skill separator:** If a question has multiple skills, separate the skills with an underscore `_`.
- **Time format:** The answer submission time is a timestamp in milliseconds (ms), and the answer duration is also in milliseconds (ms).
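
A minimal illustrative sketch of these rules (the script above applies `replace_text` from `pykt.utils` for this purpose; the helper names below are hypothetical):

```python
def clean_name(name):
    """Character replacement: '_' -> '####', ',' -> '@@@@'."""
    return name.replace("_", "####").replace(",", "@@@@")

def join_skills(skills):
    """Clean each skill, then join multiple skills of one question with '_'."""
    return "_".join(clean_name(s) for s in skills)

print(join_skills(["add_fractions", "common,denominator"]))
# -> add####fractions_common@@@@denominator
```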

#### Output data format

After the above preprocessing is complete, each dataset produces a `data.txt` file in its own folder under the `data` directory. Each student sequence consists of 6 rows, as follows:

```
User ID, sequence length
Question ID (name)
Skill ID (name)
Answer status
Answer submission time
Answer duration
```

Example:

```
50121, 4
106101, 106102, 106103, 106104
7014, 7012, 7014, 7013
0, 1, 1, 1
1647409594000, 1647409601000, 1647409666000, 1647409694000
123, 234, 456, 789
```
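
For reference, a minimal sketch of how one student sequence could be written in this 6-row format (the script above uses `write_txt` from `pykt.utils` for this; the helper below is only illustrative):

```python
def write_sequences(path, sequences):
    """Write each student sequence as 6 comma-separated rows:
    [user_id, seq_len], questions, skills, answers, submit times, use times."""
    with open(path, "w", encoding="utf-8") as f:
        for rows in sequences:
            for row in rows:
                f.write(",".join(str(x) for x in row) + "\n")

# One student with 4 interactions, matching the example above.
seq = [
    ["50121", 4],
    ["106101", "106102", "106103", "106104"],
    ["7014", "7012", "7014", "7013"],
    ["0", "1", "1", "1"],
    ["1647409594000", "1647409601000", "1647409666000", "1647409694000"],
    ["123", "234", "456", "789"],
]
write_sequences("data/assist2015/data.txt", [seq])
```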


<!-- ## Add Your Models(todo) -->
21 changes: 11 additions & 10 deletions docs/source/datasets.md
@@ -1,55 +1,56 @@
### Statics2011
# Datasets
## Statics2011
This dataset is collected from an engineering statics course taught at Carnegie Mellon University during Fall 2011. A unique question is constructed by concatenating the problem name and step name; the dataset has 194,947 interactions, 333 students, and 1,224 questions.

https://pslcdatashop.web.cmu.edu/DatasetInfo?datasetId=507

### ASSISTments2009
## ASSISTments2009

This dataset is made up of math exercises collected from the free online tutoring ASSISTments platform during the 2009-2010 school year. It consists of 346,860 interactions, 4,217 students, and 26,688 questions, and has been a widely used standard benchmark for KT methods over the last decade.

https://sites.google.com/site/assistmentsdata/home/2009-2010-assistment-data/skill-builder-data-2009-2010

### ASSISTments2012
## ASSISTments2012
This is the ASSISTments data for the school year 2012~2013 with affect predictions. The dataset consists of 2,541,201 interactions, 27,066 students, and 45,716 questions.

https://sites.google.com/site/assistmentsdata/datasets/2012-13-school-data-with-affect

### ASSISTments2015
## ASSISTments2015

Similar to ASSISTments2009, this dataset is collected from the ASSISTments platform in 2015. It includes 708,631 interactions on 100 distinct KCs from 19,917 students, the largest number of students among the ASSISTments datasets.

https://sites.google.com/site/assistmentsdata/datasets/2015-assistments-skill-builder-data

### ASSISTments2017
## ASSISTments2017

This dataset is from the 2017 data mining competition. It consists of 942,816 interactions, 686 students, and 102 questions.

https://sites.google.com/view/assistmentsdatamining/dataset?authuser=0

### Algebra2005
## Algebra2005
This dataset is from the KDD Cup 2010 EDM Challenge and contains 13- to 14-year-old students' responses to algebra questions at a detailed step level. The unique question construction is similar to the process used in Statics2011, which ends up with 809,694 interactions, 574 students, 210,710 questions, and 112 KCs.

https://pslcdatashop.web.cmu.edu/KDDCup/

### Bridge2006
## Bridge2006

This dataset is also from the KDD Cup 2010 EDM Challenge and the unique question construction is similar to the process used in Statics2011. There are 3,679,199 interactions, 1,146 students, 207,856 questions and 493 KCs in the dataset.

https://pslcdatashop.web.cmu.edu/KDDCup/

### Ednet
## Ednet

This large-scale hierarchical student activity dataset, collected by Santa (an artificial intelligence tutoring system), contains 131,317,236 interactions from 784,309 students, making it the largest publicly released dataset from an interactive education system to date.

https://github.com/riiid/ednet

### NIPS34
## NIPS34

This dataset is from the Tasks 3 & 4 at the NeurIPS 2020 Education Challenge. It contains students’ answers to multiple-choice diagnostic math questions and is collected from the Eedi platform. For each question, we choose to use the leaf nodes from the subject tree as its KCs, which ends up with 1,382,727 interactions, 948 questions, and 57 KCs.

https://eedi.com/projects/neurips-education-challenge

### POJ
## POJ
This dataset consists of programming exercises collected from the Peking coding practice online platform. It was originally scraped by Pandey and Srivastava. In total, it has 996,240 interactions, 22,916 students, and 2,750 questions.

https://drive.google.com/drive/folders/1LRljqWfODwTYRMPw6wEJ_mMt1KZ4xBDk
4 changes: 2 additions & 2 deletions docs/source/index.rst
@@ -11,11 +11,11 @@ More details about the academic information can be read in our paper at https://
:maxdepth: 2
:caption: Home

Installation <installation>
Official Website <https://pykt-team.github.io/>
Quick Start <quick_start>
Quick Start (cn) <quick_start_cn>
Models <models>
Datasets <datasets>
Contribute <contribute>

.. toctree::
:maxdepth: 1
