Skip to content
forked from st-tech/zr-obp

Open Bandit Pipeline: a python library for bandit algorithms and off-policy evaluation

License

Notifications You must be signed in to change notification settings

TPNguyen/zr-obp

 
 

Repository files navigation

Open Bandit Pipeline: a research framework for bandit algorithms and off-policy evaluation

Docs | Google Group | Installation | Usage | Slides | Quickstart | Open Bandit Dataset | 日本語

Table of Contents

Overview

Open Bandit Dataset (OBD)

Open Bandit Dataset is a public real-world logged bandit feedback data. The dataset is provided by ZOZO, Inc., the largest Japanese fashion e-commerce company with over 5 billion USD market capitalization (as of May 2020). The company uses multi-armed bandit algorithms to recommend fashion items to users in a large-scale fashion e-commerce platform called ZOZOTOWN. The following figure presents examples of displayed fashion items as actions.

Recommended fashion items as actions in ZOZOTOWN

We collected the data in a 7-days experiment in late November 2019 on three “campaigns,” corresponding to all, men's, and women's items, respectively. Each campaign randomly used either the Uniform Random policy or the Bernoulli Thompson Sampling (Bernoulli TS) policy, which was pre-trained for about a month before the data collection period.

The small size version of our data is available at ./obd. This can be used for running some examples. We release the full size version of our data at https://research.zozo.com/data.html. Please download the full size version for research uses. Please see ./obd/README.md for the description of the dataset.

Open Bandit Pipeline (OBP)

Open Bandit Pipeline is a series of implementations of dataset preprocessing, policy learning methods, OPE estimators, and the evaluation of OPE protocols. This pipeline allows researchers to focus on building their own OPE estimator and easily compare it with others’ methods in realistic and reproducible ways. Thus, it facilitates reproducible research on bandit algorithms and off-policy evaluation.

Structure of Open Bandit Pipeline

Open Bandit Pipeline consists of the following main modules.

  • dataset module: This module provides a data loader for Open Bandit Dataset and a flexible interface for handling logged bandit feedback. It also provides tools to generate synthetic bandit datasets.
  • policy module: This module provides interfaces for training online and offline bandit policies. It also implements several standard policy learning methods.
  • simulator module: This module provides functions for conducting offline bandit simulation.
  • ope module: This module provides interfaces for OPE estimators. It also implements several standard and advanced OPE estimators.

Algorithms and OPE Estimators Supported

In addition to the above algorithms and estimators, the pipeline also provides flexible interfaces. Therefore, researchers can easily implement their own algorithms or estimators and evaluate them with our data and pipeline. Moreover, the pipeline provides an interface for handling real-world logged bandit feedback datasets. Thus, practitioners can combine their own datasets with the pipeline and easily evaluate bandit algorithms' performances in their settings.

Topics and Tasks

Currently, Open Bandit Dataset & Pipeline facilitate evaluation and comparison related to the following research topics.

  • Bandit Algorithms: Our data include large-scale logged bandit feedback collected by the uniform random policy. Therefore, it enables the evaluation of new online bandit algorithms, including contextual and combinatorial algorithms, in a large real-world setting.

  • Off-Policy Evaluation: We present implementations of behavior policies used when collecting datasets as a part of our pipeline. Our open data also contains logged bandit feedback data generated by multiple different bandit policies. Therefore, it enables the evaluation of off-policy evaluation with ground-truth for the performance of evaluation policies.

Please refer to to our documentation for the basic formulation of OPE.

Installation

You can install OBP using Python's package manager pip.

pip install obp

You can install OBP from source.

git clone https://github.com/st-tech/zr-obp
cd zr-obp
python setup.py install

Requirements

  • python>=3.7.0
  • matplotlib>=3.2.2
  • numpy>=1.18.1
  • pandas>=0.25.1
  • pyyaml>=5.1
  • seaborn>=0.10.1
  • scikit-learn>=0.23.1
  • scipy>=1.4.1
  • tqdm>=4.41.1

Usage

We show an example of conducting offline evaluation of the performance of BernoulliTS as an evaluation policy using Inverse Probability Weighting (IPW) and logged bandit feedback generated by the Random policy (behavior policy). We see that only ten lines of code are sufficient to complete OPE from scratch.

# a case for implementing OPE of the BernoulliTS policy using log data generated by the Random policy
from obp.dataset import OpenBanditDataset
from obp.policy import BernoulliTS
from obp.ope import OffPolicyEvaluation, InverseProbabilityWeighting as IPW

# (1) Data loading and preprocessing
dataset = OpenBanditDataset(behavior_policy='random', campaign='all')
bandit_feedback = dataset.obtain_batch_bandit_feedback()

# (2) Off-Policy Learning
evaluation_policy = BernoulliTS(
    n_actions=dataset.n_actions,
    len_list=dataset.len_list,
    is_zozotown_prior=True, # replicate the policy in the ZOZOTOWN production
    campaign="all",
    random_state=12345
)
action_dist = evaluation_policy.compute_batch_action_dist(
    n_sim=100000, n_rounds=bandit_feedback["n_rounds"]
)

# (3) Off-Policy Evaluation
ope = OffPolicyEvaluation(bandit_feedback=bandit_feedback, ope_estimators=[IPW()])
estimated_policy_value = ope.estimate_policy_values(action_dist=action_dist)

# estimated performance of BernoulliTS relative to the ground-truth performance of Random
relative_policy_value_of_bernoulli_ts = estimated_policy_value['ipw'] / bandit_feedback['reward'].mean()
print(relative_policy_value_of_bernoulli_ts)
1.198126...

A formal introduction with the same example can be found at quickstart. Below, we explain some important features in the example.

(1) Data loading and preprocessing

We prepare an easy-to-use data loader for Open Bandit Dataset.

# load and preprocess raw data in "All" campaign collected by the Random policy
dataset = OpenBanditDataset(behavior_policy='random', campaign='all')
# obtain logged bandit feedback
bandit_feedback = dataset.obtain_batch_bandit_feedback()

print(bandit_feedback.keys())
dict_keys(['n_rounds', 'n_actions', 'action', 'position', 'reward', 'pscore', 'context', 'action_context'])

Users can implement their own feature engineering in the pre_process method of obp.dataset.OpenBanditDataset class. We show an example of implementing some new feature engineering processes in custom_dataset.py.

Moreover, by following the interface of obp.dataset.BaseBanditDataset class, one can handle future open datasets for bandit algorithms other than our Open Bandit Dataset. dataset module also provide a class to generate synthetic bandit datasets.

(2) Off-Policy Learning

After preparing a dataset, we now compute the action choice probability of BernoulliTS in the ZOZOTOWN production. Then, we can use it as the evaluation policy.

# define evaluation policy (the Bernoulli TS policy here)
# by activating the `is_zozotown_prior` argument of BernoulliTS, we can replicate BernoulliTS used in ZOZOTOWN production.
evaluation_policy = BernoulliTS(
    n_actions=dataset.n_actions,
    len_list=dataset.len_list,
    is_zozotown_prior=True, # replicate the BernoulliTS policy in the ZOZOTOWN production
    campaign="all",
    random_state=12345
)
# compute the action choice probabilities by the evaluation policy using Monte Carlo simulation
# action_dist is an array of shape (n_rounds, n_actions, len_list)
# representing the distribution over actions made by the evaluation policy
action_dist = evaluation_policy.compute_batch_action_dist(
    n_sim=100000, n_rounds=bandit_feedback["n_rounds"]
)

The compute_batch_action_dist method of BernoulliTS computes the action choice probabilities based on given hyperparameters of the beta distribution. When is_zozotown_prior=True, hyperparameters used during the data collection process on the ZOZOTOWN platform are set. Otherwise, non-informative prior hyperparameters are used. action_dist is an array representing the action choice probabilities made by the evaluation policy.

Users can implement their own bandit algorithms by following the interfaces implemented in ./obp/policy/base.py.

(3) Off-Policy Evaluation

Our final step is off-policy evaluation (OPE), which attempts to estimate the performance of a new decision making policy using log data generated by a behavior, past policy. Our pipeline also provides an easy procedure for doing OPE as follows.

# estimate the policy value of BernoulliTS based on its action choice probabilities
# it is possible to set multiple OPE estimators to the `ope_estimators` argument
ope = OffPolicyEvaluation(bandit_feedback=bandit_feedback, ope_estimators=[IPW()])
estimated_policy_value = ope.estimate_policy_values(action_dist=action_dist)
print(estimated_policy_value)
{'ipw': 0.004553...} # dictionary containing policy values estimated by each OPE estimator.

# compare the estimated performance of BernoulliTS (evaluation policy)
# with the ground-truth performance of the Random policy (behavior policy)
policy_value_improvement = estimated_policy_value['ipw'] / bandit_feedback['reward'].mean()
# our OPE procedure suggests that BernoulliTS improves Random by 19.81%
print(policy_value_improvement)
1.198126...

Users can implement their own OPE estimator by following the interface of obp.ope.BaseOffPolicyEstimator class. obp.ope.OffPolicyEvaluation class summarizes and compares the policy values estimated by several different estimators. A detailed usage of this class can be found at quickstart. bandit_feedback['reward'].mean() is the empirical mean of factual rewards (on-policy estimate of the policy value) in the log and thus is the ground-truth performance of the behavior policy (the Random policy in this example.).

Citation

If you use our dataset and pipeline in your work, please cite our paper:

Yuta Saito, Shunsuke Aihara, Megumi Matsutani, Yusuke Narita. Large-scale Open Dataset, Pipeline, and Benchmark for Bandit Algorithms https://arxiv.org/abs/2008.07146

Bibtex:

@article{saito2020large,
  title={Large-scale Open Dataset, Pipeline, and Benchmark for Bandit Algorithms},
  author={Saito, Yuta and Shunsuke Aihara and Megumi Matsutani and Yusuke Narita},
  journal={arXiv preprint arXiv:2008.07146},
  year={2020}
}

Project Team

  • Yuta Saito (Main Contributor; Hanjuku-kaso Co., Ltd. / Tokyo Institute of Technology)
  • Shunsuke Aihara (ZOZO Technologies, Inc.)
  • Megumi Matsutani (ZOZO Technologies, Inc.)
  • Yusuke Narita (Hanjuku-kaso Co., Ltd. / Yale University)

Google Group

If you are interested in the Open Bandit Project, we can follow the updates at its google group: https://groups.google.com/g/open-bandit-project

Contact

For any question about the paper, data, and pipeline, feel free to contact: [email protected]

License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.

References

Papers

  1. Alina Beygelzimer and John Langford. The offset tree for learning with partial labels. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 129–138, 2009.

  2. Olivier Chapelle and Lihong Li. An empirical evaluation of thompson sampling. In Advances in neural information processing systems, pages 2249–2257, 2011.

  3. Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased Offline Evaluation of Contextual-bandit-based News Article Recommendation Algorithms. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pages 297–306, 2011.

  4. Alex Strehl, John Langford, Lihong Li, and Sham M Kakade. Learning from Logged Implicit Exploration Data. In Advances in Neural Information Processing Systems, pages 2217–2225, 2010.

  5. Doina Precup, Richard S. Sutton, and Satinder Singh. Eligibility Traces for Off-Policy Policy Evaluation. In Proceedings of the 17th International Conference on Machine Learning, 759–766. 2000.

  6. Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. Doubly Robust Policy Evaluation and Optimization. Statistical Science, 29:485–511, 2014.

  7. Adith Swaminathan and Thorsten Joachims. The Self-normalized Estimator for Counterfactual Learning. In Advances in Neural Information Processing Systems, pages 3231–3239, 2015.

  8. Dhruv Kumar Mahajan, Rajeev Rastogi, Charu Tiwari, and Adway Mitra. LogUCB: An Explore-Exploit Algorithm for Comments Recommendation. In Proceedings of the 21st ACM international conference on Information and knowledge management, 6–15. 2012.

  9. Lihong Li, Wei Chu, John Langford, Taesup Moon, and Xuanhui Wang. An Unbiased Offline Evaluation of Contextual Bandit Algorithms with Generalized Linear Models. In Journal of Machine Learning Research: Workshop and Conference Proceedings, volume 26, 19–36. 2012.

  10. Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudik. Optimal and Adaptive Off-policy Evaluation in Contextual Bandits. In Proceedings of the 34th International Conference on Machine Learning, 3589–3597. 2017.

  11. Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. More Robust Doubly Robust Off-policy Evaluation. In Proceedings of the 35th International Conference on Machine Learning, 1447–1456. 2018.

  12. Nathan Kallus and Masatoshi Uehara. Intrinsically Efficient, Stable, and Bounded Off-Policy Evaluation for Reinforcement Learning. In Advances in Neural Information Processing Systems. 2019.

  13. Yi Su, Maria Dimakopoulou, Akshay Krishnamurthy, and Miroslav Dudík. Doubly Robust Off-policy Evaluation with Shrinkage. In Proceedings of the 37th International Conference on Machine Learning, 2020.

  14. Yusuke Narita, Shota Yasui, and Kohei Yata. Off-policy Bandit and Reinforcement Learning. arXiv preprint arXiv:2002.08536, 2020.

  15. Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open Graph Benchmark: Datasets for Machine Learning on Graphs. arXiv preprint arXiv:2005.00687, 2020.

Projects

This project is strongly inspired by Open Graph Benchmark --a collection of benchmark datasets, data loaders, and evaluators for graph machine learning: [github] [project page] [paper].

About

Open Bandit Pipeline: a python library for bandit algorithms and off-policy evaluation

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • Python 100.0%