Open Bandit Dataset & Pipeline

Overview | Installation | Usage | References | Quickstart | Open Bandit Dataset | 日本語

Overview

Open Bandit Dataset (OBD)

Open Bandit Dataset is public real-world logged bandit feedback data. The dataset is provided by ZOZO, Inc., the largest Japanese fashion e-commerce company with over 5 billion USD market capitalization (as of May 2020). The company uses multi-armed bandit algorithms to recommend fashion items to users in a large-scale fashion e-commerce platform called ZOZOTOWN. The following figure presents examples of displayed fashion items as actions.

We collected the data in a 7-day experiment in late November 2019 on three “campaigns,” corresponding to “all,” “men's,” and “women's” items, respectively. Each campaign randomly uses either the Random algorithm or the Bernoulli Thompson Sampling (Bernoulli TS) algorithm for each user impression.

Open Bandit Pipeline (OBP)

Open Bandit Pipeline is a series of implementations of dataset preprocessing, offline bandit simulation, and evaluation of off-policy evaluation (OPE) estimators. This pipeline allows researchers to focus on building their OPE estimator and easily compare it with others’ methods in realistic and reproducible ways. Thus, it facilitates reproducible research on bandit algorithms and off-policy evaluation.

Topics and Tasks

Currently, Open Bandit Dataset & Pipeline facilitate evaluation and comparison related to the following research topics.

  • Bandit Algorithms: Our data include large-scale logged bandit feedback collected by the uniform random policy. Therefore, they enable the evaluation of new online bandit algorithms, including contextual and combinatorial algorithms, in a large real-world setting.

  • Off-Policy Evaluation: We present implementations of the behavior policies used when collecting the dataset as a part of our pipeline. Our open data also contain logged bandit feedback generated by multiple behavior policies. Therefore, they enable the evaluation of OPE estimators against the ground-truth performance of counterfactual policies.

Installation

  • You can install OBP using Python's package manager pip.
pip install obp
  • You can install OBP from source.
git clone https://github.com/st-tech/zr-obp
cd zr-obp
python setup.py install

Requirements

  • python>=3.7.0
  • numpy>=1.18.1
  • pandas>=0.25.1
  • scikit-learn>=0.23.1
  • tqdm>=4.41.1
  • pyyaml>=5.1

Usage

We show an example of conducting an offline evaluation of the performance of BernoulliTS using Inverse Probability Weighting as an OPE estimator and the Random policy as a behavior policy. We see that only ten lines of code are sufficient to complete OPE from scratch.

# a case for implementing OPE of the BernoulliTS policy using log data generated by the Random policy
from obp.dataset import OpenBanditDataset
from obp.policy import BernoulliTS
from obp.simulator import OfflineBanditSimulator

# (1) Data loading and preprocessing
dataset = OpenBanditDataset(behavior_policy='random', campaign='women')
train, test = dataset.split_data(test_size=0.3, random_state=42)

# (2) Offline Bandit Simulation
simulator = OfflineBanditSimulator(train=train)
counterfactual_policy = BernoulliTS(
  n_actions=dataset.n_actions,
  len_list=dataset.len_list,
  random_state=42)
simulator.simulate(policy=counterfactual_policy)

# (3) Off-Policy Evaluation
estimated_policy_value = simulator.inverse_probability_weighting()

# estimated performance of BernoulliTS relative to the ground-truth performance of Random
relative_policy_value_of_bernoulli_ts = estimated_policy_value / test['reward'].mean()
print(relative_policy_value_of_bernoulli_ts) # 1.21428...

A gentle introduction with the same example can be found at quickstart. Below, we explain some important features in the example flow.

(1) Data loading and preprocessing

We prepare an easy-to-use data loader for Open Bandit Dataset. It handles dataset preprocessing as well as standardized train/test splitting.

# Load and preprocess raw data in "Women" campaign collected by the Random policy
dataset = OpenBanditDataset(behavior_policy='random', campaign='women')
# Split the data into 70% training and 30% test sets
train, test = dataset.split_data(test_size=0.3, random_state=0)

print(train.keys())
# dict_keys(['n_data', 'n_actions', 'action', 'position', 'reward', 'pscore', 'X_policy', 'X_reg', 'X_user'])

Users can implement their own feature engineering in the pre_process method of the OpenBanditDataset class, as sketched below. We show an example of implementing some new feature engineering processes in ./examples/obd/dataset.py.
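
For illustration, a customized dataset could look like the following minimal sketch. It assumes the raw log is exposed as a pandas DataFrame attribute (called data here) containing a timestamp column; both names are illustrative, so check ./obp/dataset.py and ./examples/obd/dataset.py for the actual attributes and the outputs that pre_process is expected to produce.

import pandas as pd

from obp.dataset import OpenBanditDataset

# Illustrative sketch only. Assumptions (not taken from this README): the loaded
# raw log is available as a pandas DataFrame attribute named `data`, and it has
# a `timestamp` column.
class MyOpenBanditDataset(OpenBanditDataset):
    def pre_process(self) -> None:
        # keep the default preprocessing
        super().pre_process()
        # add a hypothetical raw feature: a coarse hour-of-day bucket per impression;
        # a real implementation would fold it into the generated feature matrices
        # (e.g., X_policy / X_reg)
        self.data['hour_bucket'] = pd.to_datetime(self.data['timestamp']).dt.hour // 6

dataset = MyOpenBanditDataset(behavior_policy='random', campaign='women')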

Moreover, by following the interface of the BaseBanditDataset class in ./obp/dataset.py, one can handle future open datasets for bandit algorithms other than our OBD.
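
As a rough sketch, a new dataset only needs to provide the pieces of the interface that the example above relies on: n_actions, len_list, and a split_data method returning train/test dictionaries with (a subset of) the keys shown earlier. The class below generates synthetic uniformly logged feedback; it does not subclass BaseBanditDataset because that class's abstract methods are not spelled out in this README, so in practice you would derive from it and implement whatever it requires.

import numpy as np

# Illustrative sketch of a dataset exposing the interface used in the example above.
# In practice, subclass BaseBanditDataset (./obp/dataset.py); its exact abstract
# methods are an assumption here.
class SyntheticLoggedBanditDataset:
    def __init__(self, n_rounds: int = 10000, n_actions: int = 10, len_list: int = 1, random_state: int = 0):
        random_ = np.random.RandomState(random_state)
        self.n_actions = n_actions
        self.len_list = len_list
        # uniformly random behavior policy with Bernoulli rewards
        self.action = random_.randint(n_actions, size=n_rounds)
        self.position = random_.randint(len_list, size=n_rounds)
        self.reward = random_.binomial(n=1, p=0.05, size=n_rounds)
        self.pscore = np.full(n_rounds, 1.0 / n_actions)

    def split_data(self, test_size: float = 0.3, random_state: int = 0):
        # chronological split into train/test dictionaries, mimicking the keys used above
        n_rounds = self.action.shape[0]
        n_train = int((1.0 - test_size) * n_rounds)
        keys = ['action', 'position', 'reward', 'pscore']
        train = {key: getattr(self, key)[:n_train] for key in keys}
        test = {key: getattr(self, key)[n_train:] for key in keys}
        train['n_data'], test['n_data'] = n_train, n_rounds - n_train
        train['n_actions'] = test['n_actions'] = self.n_actions
        return train, test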

(2) Bandit Simulation

After preparing a dataset, we now run an offline bandit simulation on the logged bandit feedback as follows.

# Define a simulator object
simulator = OfflineBanditSimulator(train=train)
# Define a counterfactual policy, which is the Bernoulli TS policy here
counterfactual_policy = BernoulliTS(
  n_actions=dataset.n_actions,
  len_list=dataset.len_list,
  random_state=42)
# Run an offline bandit simulation on the training set
simulator.simulate(policy=counterfactual_policy)

The simulation takes a bandit policy (such as BernoulliTS above) and train (a dictionary storing the training set of the bandit feedback) as inputs and runs an offline bandit simulation of the given policy.

Users can implement their own bandit algorithms by following the interface of BaseContextFreePolicy in ./obp/policy/contextfree.py or BaseContextualPolicy in ./obp/policy/contextual.py; a sketch follows below. We show an example of implementing new bandit algorithms in ./examples/obd/logistic_bandit.py.
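
As a simplistic illustration, the sketch below implements a softmax policy over empirical mean rewards. It is written as a standalone class rather than a subclass because the abstract methods of BaseContextFreePolicy are not spelled out in this README; the constructor arguments and the select_action / update_params method names simply mirror how BernoulliTS is used in the simulation example above, so treat them as assumptions and check ./obp/policy/contextfree.py for the real interface.

import numpy as np

# Illustrative sketch of a context-free policy. In practice, subclass
# BaseContextFreePolicy (./obp/policy/contextfree.py); the method names here
# only mirror how BernoulliTS is used above and are assumptions.
class SoftmaxPolicy:
    def __init__(self, n_actions: int, len_list: int, temperature: float = 1.0, random_state: int = 42):
        self.n_actions = n_actions
        self.len_list = len_list
        self.temperature = temperature
        self.random_ = np.random.RandomState(random_state)
        self.trial_counts = np.zeros(n_actions)
        self.reward_sums = np.zeros(n_actions)

    def select_action(self) -> np.ndarray:
        # softmax over empirical mean rewards (untried actions get mean zero)
        means = self.reward_sums / np.maximum(self.trial_counts, 1)
        logits = means / self.temperature
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        # return a ranked list of `len_list` distinct actions
        return self.random_.choice(self.n_actions, size=self.len_list, p=probs, replace=False)

    def update_params(self, action: int, reward: float) -> None:
        # update the empirical statistics of the chosen action
        self.trial_counts[action] += 1
        self.reward_sums[action] += reward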

(3) Off-Policy Evaluation

Our final step is off-policy evaluation (OPE), which attempts to estimate the performance of bandit algorithms using log data generated by offline bandit simulations. Our pipeline also provides an easy procedure for doing OPE as follows.

# Estimate the policy value of BernoulliTS based on actions selected by that policy
estimated_policy_value = simulator.inverse_probability_weighting()

# Compare the estimated performance of BernoulliTS (counterfactual policy)
# with the ground-truth performance of Random (behavior policy)
relative_policy_value_of_bernoulli_ts = estimated_policy_value / test['reward'].mean()
# Our OPE procedure estimates that BernoulliTS improves Random by 21.4%
print(relative_policy_value_of_bernoulli_ts) # 1.21428...

Users can implement their own OPE estimator as a method of the OfflineBanditSimulator class. test['reward'].mean() is the empirical mean of factual rewards in the log and thus is the ground-truth performance of the behavior policy (the Random policy in this example).
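
For reference, the self-normalized variant of inverse probability weighting (Swaminathan and Joachims, reference 8 below) can be written as a small standalone function. Inside the pipeline it would be added as a method of the OfflineBanditSimulator class with direct access to the simulation results, but since the simulator's internal attribute names are not documented here, this sketch takes the required arrays explicitly.

import numpy as np

# Self-normalized IPW estimate of a counterfactual policy's value.
#   reward:       observed rewards in the logged data
#   action_match: 1 if the counterfactual policy selects the logged action (at the logged position), else 0
#   pscore:       propensity score of the logged action under the behavior policy
def self_normalized_ipw(reward: np.ndarray, action_match: np.ndarray, pscore: np.ndarray) -> float:
    weights = action_match / pscore
    return float(np.sum(weights * reward) / np.sum(weights))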

License

This project is licensed under the Apache License - see the LICENSE file for details.

Main Contributor

References

Papers

  1. Alina Beygelzimer and John Langford. The offset tree for learning with partial labels. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 129–138, 2009.

  2. Olivier Chapelle and Lihong Li. An Empirical Evaluation of Thompson Sampling. In Advances in Neural Information Processing Systems, pages 2249–2257, 2011.

  3. Miroslav Dudík, Dumitru Erhan, John Langford, and Lihong Li. Doubly Robust Policy Evaluation and Optimization. Statistical Science, 29:485–511, 2014.

  4. Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. arXiv preprint arXiv:2005.00687, 2020.

  5. Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased Offline Evaluation of Contextual-bandit-based News Article Recommendation Algorithms. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pages 297–306, 2011.

  6. Yusuke Narita, Shota Yasui, and Kohei Yata. Off-policy Bandit and Reinforcement Learning. arXiv preprint arXiv:2002.08536, 2020.

  7. Alex Strehl, John Langford, Lihong Li, and Sham M Kakade. Learning from Logged Implicit Exploration Data. In Advances in Neural Information Processing Systems, pages 2217–2225, 2010.

  8. Adith Swaminathan and Thorsten Joachims. The Self-normalized Estimator for Counterfactual Learning. In Advances in Neural Information Processing Systems, pages 3231–3239, 2015.

Projects

This project is strongly inspired by Open Graph Benchmark, a collection of benchmark datasets, data loaders, and evaluators for graph machine learning: [github] [project page] [paper].
