Skip to content

An Unbiased Sequential Recommendation Dataset with Randomly Exposed Videos

License

Notifications You must be signed in to change notification settings

Jemma007/KuaiRand

 
 

Repository files navigation

KuaiRand: An Unbiased Sequential Recommendation Dataset with Randomly Exposed Videos

LICENSE

KuaiRand is an unbiased sequential recommendation dataset collected from the recommendation logs of the video-sharing mobile app, Kuaishou (快手). It is the first recommendation dataset with millions of intervened interactions of randomly exposed items inserted in the standard recommendation feeds!

Another related open-sourced dataset is KuaiRec.

Overview:

The following figure gives an example of the dataset. It illustrates a user interaction sequence along with the user's rich feedback signals.

kuaidata

These feedback signals are collected from the two main user interfaces (UI) in Kuaishou APP shown as follows:

kuaishou-app

In the random exposure stage, each recommended video in the dataset has an equal probability being replaced by a random video sampled from an item pools. About $0.37%$ Interactions are replaced in the final results.

Advantages:

Compared with other datasets with random exposure, KuaiRand has the following advantages:

  • ✅ It is the first sequential recommendation dataset with millions of intervened interactions of randomly exposed items inserted in the standard recommendation feeds.
  • ✅ It has the most comprehensive side information including explicit user IDs, interaction timestamps, and rich features for users and items.
  • ✅ It has 15 policies with each catered for a special recommendation scenario in Kuaishou App.
  • ✅ We introduce 12 feedback signals (e.g., click, like, and view time) for each interaction to describe the user's comprehensive feedback.
  • ✅ Each user has thousands of historical interactions on average.
  • ✅ It has three versions to support various research directions in recommendation.

If you find it helpful, please cite our paper: LINK PDF

@inproceedings{gao2022kuairand,
  title = {KuaiRand: An Unbiased Sequential Recommendation Dataset with Randomly Exposed Videos},
  author = {Gao, Chongming and Li, Shijun and Zhang, Yuan and Chen, Jiawei and Li, Biao and Lei, Wenqiang and Jiang, Peng and He, Xiangnan},
  url = {https://doi.org/10.1145/3511808.3557624},
  doi = {10.1145/3511808.3557624},
  booktitle = {Proceedings of the 31st ACM International Conference on Information and Knowledge Management},
  series = {CIKM '22},
  location = {Atlanta, GA, USA},
  numpages = {5},
  year = {2022},
  pages = {3953–3957}
}

News!

2022.08.23: Cyberspace Administration of China (CAC) released a series of measures to control cross-border data transfer on July 7, 2022. The full text of the Measures for Security Assessment of Cross-border Data Transfer (《数据出境安全评估办法》) can be found here (currently available only in Mandarin Chinese and there are some English posts such as this). The Measures will take effect on September 1, 2022. Therefore, for now, we cannot provide this dataset to institutions outside China until CAC completes our assessment and gives us permission to share the data overseas. We sincerely apologize for the inconvenience.

2022.08.23: 国家互联网信息办公室于2022年7月7日出台了保护国家安全及个人隐私的《数据出境安全评估办法》,于2022年9月1日开始执行。其规定公司数据不能跨境自由流动。于是我们按照规定中止对国外机构提供该数据集,目前仅能将该数据提供给中国境内机构使用。在国家互联网信息办公室批准我们的数据出境请求后,我们将马上向所有人公开该数据。

2022.08.02: Our CIKM '22 paper is accepted.

2022.06.02: The paper introducing this dataset is submitted to CIKM '22 Resource Track.


Download the data:

Due to regulations of the Chinese government (as mentioned above), for now, we have to use email to make sure that you are a researcher or practitioner in China. Then you have to sign some files to promise that you will NOT disclose this dataset. Please send us your name, institution (which must lie in China) via this email address: [email protected]

We are very sorry for the inconvenience! We will try to overcome this problem soon.

根据国家最近出台的个人隐私信息保护法以及数据出境安全评估办法,我们目前只提供数据集给中国机构(大学、研究所、公司),且需要您将您的姓名机构信息发给我方,我们将回复您相关的保密协议。在签署协议后,才能给您发送本数据集。联系邮箱:[email protected]

对此造成的不便,我们深感抱歉。我们将尽快推进向所有人公开数据的进程。


Three Versions and Suggestions:

We release three versions of KuaiRand for different usage:

  • KuaiRand-27K (23GB logs +23GB features): the complete KuaiRand dataset that has over 27K users and 32 millions videos.
  • KuaiRand-1K (829MB logs + 3.5GB features): randomly sample 1,000 users from KuaiRand-27K, then remove all irrelevant videos. There are 4 millions videos rest.
  • KuaiRand-Pure (184MB logs + 10MB features): only keeps the logs for the 7583 videos in the candidate pool.

The user_id and video_id are re-indexed. A visualization of their ID spaces is shown as follows.

kuaidata

The basic statistics of the three versions are summarized as follows:
Dataset Collection Policy #Users #Items #Interactions #User features #Item features Feedback Timestamp
KuaiRand-27K 15 policies 27,285 32,038,725 322,278,385 30 62 12 signals ✔️
Random policy 27,285 7,583 1,186,059 30 62 12 signals ✔️
KuaiRand-1K 15 policies 1,000 4,369,953 11,713,045 30 62 12 signals ✔️
Random policy 1,000 7,388 43,028 30 62 12 signals ✔️
KuaiRand-Pure 15 policies 27,285 7,551 1,436,609 30 62 12 signals ✔️
Random policy 27,285 7,583 1,186,059 30 62 12 signals ✔️

Which version should I use?

  • Reasons to use KuaiRand-27K or KuaiRand-1K:
    • Your research needs rigorous sequential logs, such as off-policy evaluation (OPE), Reinforcement learning (RL), or long sequential recommendation.
  • Reasons to use KuaiRand-Pure:
    • The sequential information is not necessary in your research OR If you are OK with the incomplete sequential logs. For example, if you are studying debiasing in collaborative filtering models or multi-task modeling in recommendation.
    • If your model can only run with small-size data.

Data Descriptions

The file structure of the three datasets are listed as follows:

KuaiRand-27K (46GB)
KuaiRand-27K
├── data (46GB)
│   ├── log_random_4_22_to_5_08_27k.csv (83MB)
│   ├── log_standard_4_08_to_4_21_27k_part1.csv (4.8GB)
│   ├── log_standard_4_08_to_4_21_27k_part2.csv (4.8GB)
│   ├── log_standard_4_22_to_5_08_27k_part1.csv (6.6GB)
│   ├── log_standard_4_22_to_5_08_27k_part2.csv (6.6GB)
│   ├── user_features_27k.csv (3.4MB)
│   ├── video_features_basic_27k.csv (2.6GB)
│   ├── video_features_statistic_27k_part1.csv (6.7GB)
│   ├── video_features_statistic_27k_part2.csv (6.7GB)
│   └── video_features_statistic_27k_part3.csv (6.7GB)
└── load_data_27k.py
KuaiRand-1K (4.3GB)
KuaiRand-1K
├── data (4.3GB)
│   ├── log_random_4_22_to_5_08_1k.csv (2.9MB)
│   ├── log_standard_4_08_to_4_21_1k.csv (368MB)
│   ├── log_standard_4_22_to_5_08_1k.csv (481MB)
│   ├── user_features_1k.csv (132KB)
│   ├── video_features_basic_1k.csv (368MB)
│   └── video_features_statistic_1k.csv (3.1GB)
└── load_data_1k.py
KuaiRand-Pure (194MB)
KuaiRand-Pure
├── data (194MB)
│   ├── log_random_4_22_to_5_08_pure.csv (83MB)
│   ├── log_standard_4_08_to_4_21_pure.csv (80MB)
│   ├── log_standard_4_22_to_5_08_pure.csv (21MB)
│   ├── user_features_pure.csv (3.4MB)
│   ├── video_features_basic_pure.csv (612KB)
│   └── video_features_statistic_pure.csv (6.3MB)
└── load_data_pure.py

1️⃣ Descriptions of the fields in log_xxx.csv

There are three log files:

  • log_random_4_22_to_5_08.csv contains all interactions resulted from random intervention.
  • log_standard_4_22_to_5_08.csv contains all interactions of standard recommendation.
  • log_standard_4_08_to_4_21.csv contains all interactions of standard recommendation for the same users in the previous two weeks (2022.04.08 ~ 2022.04.21).
Field Name: Description Type Example
user_id The ID of the video. int64 17387
video_id The ID of the video. int64 1123453
date The date of this interaction int64 20220421
hourmin The time of this interaction (format: HHSS). int64 400
time_ms The timestamp of this interaction in millisecond. int64 1650485801301
is_click A binary feedback signal. In the two-column UI, it indicates a click; In the single-column UI, it means valid_play: which equals 1 when: play_time_ms >= duration_ms if duration_ms <= 7,000 ms, or play_time_ms > 7,000 ms if duration_ms > 7,000 ms. int64 1
is_like A binary feedback signal indicating the user hit the like button. int64 0
is_follow A binary feedback signal indicating the user hit the follow the author button. int64 0
is_comment A binary feedback signal indicating the user writed comment in the comments section of this video int64 0
is_forward A binary feedback signal indicating the user forwarded this video. int64 0
is_hate A binary feedback signal indicating the user hated this video. int64 0
long_view A binary feedback signal. It equals 1 when: play_time_ms >= duration_ms if duration_ms <= 18,000 ms, or play_time_ms >=18,000 ms if duration_ms > 18,000 ms. int64 1
play_time_ms The user's view time in millisecond. int64 151024
duration_ms The video's duration time in millisecond. int64 104400
profile_stay_time The time that the user stayed in this author's profile. int64 0
comment_stay_time The time that the user stayed in the comments section of this video int64 0
is_profile_enter A binary feedback signal indicating the user enters the author profile int64 0
is_rand A binary signal indicating if this video is generated by the random intervention (i.e., a random exposed video). int64 0
tab indicating the scenario of this interaction, e.g., in the recommendation page or the main page of the App. In range of [0,14]. int64 1

2️⃣ Descriptions of the fields in user_features.csv

Field Name: Description Type Example
user_id The ID of the user. int64 25621
user_active_degree In the set of {'high_active', 'full_active', 'middle_active', 'UNKNOWN'}. str "full_active"
is_lowactive_period Is this user in its low active period int64 0
is_live_streamer Is this user a live streamer? int64 0
is_video_author Has this user uploaded any video? int64 1
follow_user_num The number of users that this user follows. int64 5
follow_user_num_range The range of the number of users that this user follows. In the set of {'0', '(0,10]', '(10,50]', '(100,150]', '(150,250]', '(250,500]', '(50,100]', '500+'} str "(0,10]"
fans_user_num The number of the fans of this user. int64 312
fans_user_num_range The range of the number of fans of this user. In the set of {'0', '[1,10)', '[10,100)', '[100,1k)', '[1k,5k)', '[5k,1w)', '[1w,10w)'} str "[100,1k)"
friend_user_num The number of friends that this user has. int64 0
friend_user_num_range The range of the number of friends that this user has. In the set of {'0', '[1,5)', '[5,30)', '[30,60)', '[60,120)', '[120,250)', '250+'} str "0"
register_days The days since this user has registered. int64 3624
register_days_range The range of the registered days. In the set of {'15-30', '31-60', '61-90', '91-180', '181-365', '366-730', '730+'}. str "730+"
onehot_feat0 An encrypted feature of the user. Each value indicate the position of "1" in the one-hot vector. Range: {0,1} int64 1
onehot_feat1 An encrypted feature. Range: {0, 1, ..., 6} int64 2
onehot_feat2 An encrypted feature. Range: {0, 1, ..., 49} int64 2
onehot_feat3 An encrypted feature. Range: {0, 1, ..., 1470} int64 1153
onehot_feat4 An encrypted feature. Range: {0, 1, ..., 14} int64 4
onehot_feat5 An encrypted feature. Range: {0, 1, ..., 33} int64 0
onehot_feat6 An encrypted feature. Range: {0, 1, 2} int64 0
onehot_feat7 An encrypted feature. Range: {0, 1, ..., 117} int64 31
onehot_feat8 An encrypted feature. Range: {0, 1, ..., 453} int64 354
onehot_feat9 An encrypted feature. Range: {0, 1, ..., 6} int64 3
onehot_feat10 An encrypted feature. Range: {0, 1, ..., 4} int64 3
onehot_feat11 An encrypted feature. Range: {0, 1, ..., 4} int64 2
onehot_feat12 An encrypted feature. Range: {0, 1} int64 1
onehot_feat13 An encrypted feature. Range: {0, 1} int64 0
onehot_feat14 An encrypted feature. Range: {0, 1} int64 0
onehot_feat15 An encrypted feature. Range: {0, 1} int64 0
onehot_feat16 An encrypted feature. Range: {0, 1} int64 0
onehot_feat17 An encrypted feature. Range: {0, 1} int64 0

3️⃣ Descriptions of the fields in video_features_basic.csv

Field Name: Description Type Example
video_id The ID of the video. int64 3784
author_id The ID of the author of this video. In the range of [0, 8839734] int64 441
video_type Type of this video (NORMAL or AD). str "NORMAL"
upload_dt Upload date of this video. str "2020-07-08"
upload_type The upload type of this video. str "ShortImport"
visible_status The visible state of this video on the APP now. int 1
video_duration The time duration of this duration (in millisecond). Int64 17200.0
server_width The width of this video on the server. int64 720
server_height The height of this video on the server. int64 1280
music_id Background music ID of this video. int64 989206467
music_type Background music type of this video. int64 4
tag A list of key categories (labels) of this video. str "12,65"

4️⃣ Descriptions of the fields in video_features_statistic.csv

‼️ Different from the basic features, the statistical features are the average statistics of the video each day over the one month. For example, in the following table, the video 9288071 have 66 counts over this one month (a video can have multiple counts each day on different scenarios, e.g., on April 8, show_cnt=80 on the main page and show_cnt=65 on the recommendation page of the App)

Field Name: Description Type Example
video_id The ID of the video. int64 9288071
counts The number of statistics. int64 66
show_cnt The number of shows of this video (averaged on each day and each scenario over the one month. This applies to all the following fields) float64 75.212
show_user_num The number of users who received the recommendation of this video. float64 66.985
play_cnt The number of plays. float64 9.409
play_user_num The number of users who plays this video. float64 8.121
play_duration The total time duration of playing this video (in millisecond). float64 93700.333
complete_play_cnt The number of complete plays. complete play: finishing playing the whole video, i.e., #(play_duration >= video_duration). float64 0.182
complete_play_user_num The number of users who perform the complete play. float64 0.182
valid_play_cnt valid play: play_duration >= video_duration if video_duration <= 7s, or play_duration > 7 if video_duration > 7s. float64 3.545
valid_play_user_num The number of users who perform the complete play. float64 3.136
long_time_play_cnt long time play: play_duration >= video_duration if video_duration <= 18s, or play_duration >=18 if video_duration > 18s. float64 1.909
long_time_play_user_num The number of users who perform the long time play. float64 1.848
short_time_play_cnt short time play: play_duration < min(3s, video_duration). float64 5.015
short_time_play_user_num The number of users who perform the short time play. float64 4.545
play_progress The average video playing ratio (=play_duration/video_duration) float64 0.016
comment_stay_duration Total time of staying in the comments section float64 2302.712
like_cnt Total likes float64 0.303
like_user_num The number of users who hit the "like" button. float64 0.303
click_like_cnt The number of the "like" resulted from double click float64 0.030
double_click_cnt The number of users who double click the video. float64 0.273
cancel_like_cnt The number of likes that are cancelled by users. float64 0.485
cancel_like_user_num The number of users who cancel their like. float64 0.485
comment_cnt The number of comments within this day. float64 0.015
comment_user_num The number of users who comment this video. float64 0.015
direct_comment_cnt The number of direct comments (depth=1). float64 0.015
reply_comment_cnt The number of reply comments (depth>1). float64 0.000
delete_comment_cnt The number of deleted comments. float64 0.015
delete_comment_user_num The number of users who delete their comments. float64 0.015
comment_like_cnt The number of comment likes. float64 0.000
comment_like_user_num The number of users who like the comments. float64 0.000
follow_cnt The number of increased follows from this video. float64 0.000
follow_user_num The number of users who follow the author of this video due to this video. float64 0.000
cancel_follow_cnt The number of decreased follows from this video. float64 0.000
cancel_follow_user_num The number of users who cancel their following of the author of this video due to this video. float64 0.000
share_cnt The times of successfully sharing this video. float64 0.000
share_user_num The number of users who succeed to share this video. float64 0.000
download_cnt The times of downloading this video. float64 0.030
download_user_num The number of users who download this video. float64 0.030
report_cnt The times of reporting this video. float64 0.000
report_user_num The number of users who report this video. float64 0.000
reduce_similar_cnt The times of reducing similar content of this video. float64 0.015
reduce_similar_user_num The number of users who choose to reduce similar content of this video. float64 0.015
collect_cnt The times of adding this video to favorite videos. float64 0.061
collect_user_num The number of users who add this video to their favorite videos. float64 0.061
cancel_collect_cnt The times of removing this video from favorite videos. float64 0.091
cancel_collect_user_num The number of users who remove this video from their favorite videos float64 0.091
direct_comment_user_num The number of users who write comments directly under this video (level=1). float64 0.015
reply_comment_user_num The number of users who reply the existing comments (level>1). float64 0.000
share_all_cnt The times of sharing this video (no need to be successful). float64 0.015
share_all_user_num The number of users who share this video (no need to be successful). float64 0.015
outsite_share_all_cnt The times of sharing this video outside Kuaishou App. float64 0.000

About

An Unbiased Sequential Recommendation Dataset with Randomly Exposed Videos

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • HTML 74.4%
  • SCSS 25.6%