Merge pull request st-tech#67 from Kurorororo/online-examples
Examples for Online Bandit Algorithms with Replay Method
usaito authored Feb 9, 2021
2 parents 0e2257f + 4cd6cf2 commit 1254058
Showing 9 changed files with 1,218 additions and 23 deletions.
1 change: 1 addition & 0 deletions examples/README.md
@@ -5,4 +5,5 @@ This page contains a list of example codes written with the Open Bandit Pipeline
- [`obd/`](./obd/): example implementations for evaluating standard off-policy estimators with the small sample Open Bandit Dataset.
- [`synthetic/`](./synthetic/): example implementations for evaluating several off-policy estimators with synthetic bandit datasets.
- [`multiclass/`](./multiclass/): example implementations for evaluating several off-policy estimators with multi-class classification datasets.
- [`online/`](./online/): example implementations for evaluating the Replay Method with online bandit algorithms.
- [`quickstart/`](./quickstart/): some quickstart notebooks to guide the usage of the Open Bandit Pipeline.
65 changes: 65 additions & 0 deletions examples/online/README.md
@@ -0,0 +1,65 @@
# Example with Online Bandit Algorithms


## Description

Here, we use synthetic bandit datasets to evaluate OPE of online bandit algorithms.
Specifically, we evaluate the estimation performance of a well-known off-policy estimator using the ground-truth policy value of an evaluation policy, which is calculable with synthetic data.
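
To sketch the idea (an illustrative snippet, not the obp implementation): with synthetic data, the expected reward function q(x, a) is known, so the value of a *fixed* evaluation policy can be approximated by averaging q under the policy's action-choice probabilities over the sampled contexts. Because the evaluation policies in this example learn online, the script below instead relies on `obp.simulator.calc_ground_truth_policy_value`, which repeats the simulation `n_sim` times.

```python
import numpy as np

# Simplified sketch with illustrative names: approximate the value of a *fixed*
# evaluation policy when the expected reward q(x, a) is known (synthetic data).
def ground_truth_value_sketch(expected_reward: np.ndarray, action_dist: np.ndarray) -> float:
    # expected_reward: (n_rounds, n_actions) array of q(x_t, a)
    # action_dist: (n_rounds, n_actions) array of pi_e(a | x_t)
    return float((expected_reward * action_dist).sum(axis=1).mean())
```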


## Evaluating Off-Policy Estimators

In the following, we evaluate the estimation performance of the Replay Method (RM).
RM uses the subset of the logged bandit feedback data in which the actions selected by the behavior policy match those selected by the evaluation policy.
Theoretically, RM is unbiased when the behavior policy is uniformly random and the evaluation policy is fixed.
Even so, RM empirically works well when the evaluation policy is an online learning algorithm.
Please refer to https://arxiv.org/abs/1003.5956 for the details of RM.
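
As a rough sketch of the estimator (illustrative code, not the obp implementation; `obp.ope.ReplayMethod` is what the script below actually uses), RM keeps only the rounds where the logged action matches the evaluation policy's action and averages the observed rewards over those rounds:

```python
import numpy as np

def replay_method_sketch(
    reward: np.ndarray,             # rewards observed under the behavior policy
    behavior_action: np.ndarray,    # actions logged by the (uniformly random) behavior policy
    evaluation_action: np.ndarray,  # actions the evaluation policy selects in the same rounds
) -> float:
    # keep only the rounds where both policies choose the same action
    match = behavior_action == evaluation_action
    # RM estimates the policy value as the average observed reward on the matched rounds
    return float(reward[match].mean()) if match.any() else 0.0
```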


### Files
- [`./evaluate_off_policy_estimators.py`](./evaluate_off_policy_estimators.py) implements the evaluation of RM as an OPE estimator using synthetic bandit feedback data.

### Scripts

```bash
# run evaluation of OPE estimators with synthetic bandit data
python evaluate_off_policy_estimators.py\
--n_runs $n_runs\
--n_rounds $n_rounds\
--n_actions $n_actions\
--n_sim $n_sim\
--dim_context $dim_context\
--evaluation_policy_name $evaluation_policy_name\
--n_jobs $n_jobs\
--random_state $random_state
```
- `$n_runs` specifies the number of simulation runs in the experiment to estimate standard deviations of the performance of OPE estimators.
- `$n_rounds` and `$n_actions` specify the number of rounds (or samples) and the number of actions of the synthetic bandit data.
- `$dim_context` specifies the dimension of context vectors.
- `$n_sim` specifies the number of simulations in the Monte Carlo simulation used to compute the ground-truth policy value.
- `$evaluation_policy_name` specifies the evaluation policy and should be one of "bernoulli_ts", "epsilon_greedy", "lin_epsilon_greedy", "lin_ts", "lin_ucb", "logistic_epsilon_greedy", "logistic_ts", or "logistic_ucb".
- `$n_jobs` is the maximum number of concurrently running jobs.

For example, the following command evaluates the estimation performance (relative estimation error; relative-ee) of RM using synthetic bandit feedback data with 1,000 rounds, 30 actions, and five-dimensional context vectors, with Bernoulli Thompson Sampling as the evaluation policy.

```bash
python evaluate_off_policy_estimators.py\
--n_runs 20\
--n_rounds 1000\
--n_actions 30\
--dim_context 5\
--evaluation_policy_name bernoulli_ts\
--n_sim 3\
--n_jobs -1\
--random_state 12345

# relative-ee of OPE estimators and their standard deviations (lower means more accurate).
# =============================================
# random_state=12345
# ---------------------------------------------
# mean std
# rm 0.202387 0.11685
# =============================================
```

The above results may vary depending on the experimental settings.
You can easily try the evaluation of OPE with other experimental settings.
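
For reference, the relative estimation error reported above is, as we understand it, the absolute difference between the estimated and ground-truth policy values normalized by the ground-truth value; a minimal sketch:

```python
def relative_estimation_error(estimated_policy_value: float, ground_truth_policy_value: float) -> float:
    # relative-ee = |V_hat(pi_e) - V(pi_e)| / V(pi_e)
    return abs(estimated_policy_value - ground_truth_policy_value) / ground_truth_policy_value
```
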
177 changes: 177 additions & 0 deletions examples/online/evaluate_off_policy_estimators.py
@@ -0,0 +1,177 @@
import argparse
from pathlib import Path

import numpy as np
from pandas import DataFrame
from joblib import Parallel, delayed

from obp.dataset import (
SyntheticBanditDataset,
logistic_reward_function,
)
from obp.policy import (
BernoulliTS,
EpsilonGreedy,
LinEpsilonGreedy,
LinTS,
LinUCB,
LogisticEpsilonGreedy,
LogisticTS,
LogisticUCB,
)
from obp.ope import OffPolicyEvaluation, ReplayMethod
from obp.simulator import calc_ground_truth_policy_value, run_bandit_simulation


ope_estimators = [ReplayMethod()]

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="evaluate off-policy estimators with online bandit algorithms and synthetic bandit data."
    )
    parser.add_argument(
        "--n_runs", type=int, default=1, help="number of simulations in the experiment."
    )
    parser.add_argument(
        "--n_rounds",
        type=int,
        default=10000,
        help="number of rounds for synthetic bandit feedback.",
    )
    parser.add_argument(
        "--n_actions",
        type=int,
        default=10,
        help="number of actions for synthetic bandit feedback.",
    )
    parser.add_argument(
        "--dim_context",
        type=int,
        default=5,
        help="dimensions of context vectors characterizing each round.",
    )
    parser.add_argument(
        "--n_sim",
        type=int,
        default=1,
        help="number of simulations to calculate ground truth policy values",
    )
    parser.add_argument(
        "--evaluation_policy_name",
        type=str,
        choices=[
            "bernoulli_ts",
            "epsilon_greedy",
            "lin_epsilon_greedy",
            "lin_ts",
            "lin_ucb",
            "logistic_epsilon_greedy",
            "logistic_ts",
            "logistic_ucb",
        ],
        required=True,
        help="the name of evaluation policy, bernoulli_ts, epsilon_greedy, lin_epsilon_greedy, lin_ts, lin_ucb, logistic_epsilon_greedy, logistic_ts, or logistic_ucb",
    )
    parser.add_argument(
        "--n_jobs",
        type=int,
        default=1,
        help="the maximum number of concurrently running jobs.",
    )
    parser.add_argument("--random_state", type=int, default=12345)
    args = parser.parse_args()
    print(args)

    # configurations
    n_runs = args.n_runs
    n_rounds = args.n_rounds
    n_actions = args.n_actions
    dim_context = args.dim_context
    n_sim = args.n_sim
    evaluation_policy_name = args.evaluation_policy_name
    n_jobs = args.n_jobs
    random_state = args.random_state
    np.random.seed(random_state)

    # synthetic data generator with uniformly random policy
    dataset = SyntheticBanditDataset(
        n_actions=n_actions,
        dim_context=dim_context,
        reward_function=logistic_reward_function,
        behavior_policy_function=None,  # uniformly random
        random_state=random_state,
    )
    # define evaluation policy
    evaluation_policy_dict = dict(
        bernoulli_ts=BernoulliTS(n_actions=n_actions, random_state=random_state),
        epsilon_greedy=EpsilonGreedy(
            n_actions=n_actions, epsilon=0.1, random_state=random_state
        ),
        lin_epsilon_greedy=LinEpsilonGreedy(
            dim=dim_context, n_actions=n_actions, epsilon=0.1, random_state=random_state
        ),
        lin_ts=LinTS(dim=dim_context, n_actions=n_actions, random_state=random_state),
        lin_ucb=LinUCB(dim=dim_context, n_actions=n_actions, random_state=random_state),
        logistic_epsilon_greedy=LogisticEpsilonGreedy(
            dim=dim_context, n_actions=n_actions, epsilon=0.1, random_state=random_state
        ),
        logistic_ts=LogisticTS(
            dim=dim_context, n_actions=n_actions, random_state=random_state
        ),
        logistic_ucb=LogisticUCB(
            dim=dim_context, n_actions=n_actions, random_state=random_state
        ),
    )
    evaluation_policy = evaluation_policy_dict[evaluation_policy_name]

    def process(i: int):
        # sample new data of synthetic logged bandit feedback
        bandit_feedback = dataset.obtain_batch_bandit_feedback(n_rounds=n_rounds)
        # simulate the evaluation policy
        action_dist = run_bandit_simulation(
            bandit_feedback=bandit_feedback, policy=evaluation_policy
        )
        # estimate the ground-truth policy values of the evaluation policy
        # by Monte-Carlo Simulation using p(r|x,a), the reward distribution
        ground_truth_policy_value = calc_ground_truth_policy_value(
            bandit_feedback=bandit_feedback,
            reward_sampler=dataset.sample_reward,  # p(r|x,a)
            policy=evaluation_policy,
            n_sim=n_sim,  # the number of simulations
        )
        # evaluate estimators' performances using relative estimation error (relative-ee)
        ope = OffPolicyEvaluation(
            bandit_feedback=bandit_feedback,
            ope_estimators=ope_estimators,
        )
        relative_ee_i = ope.evaluate_performance_of_estimators(
            ground_truth_policy_value=ground_truth_policy_value,
            action_dist=action_dist,
        )

        return relative_ee_i

    processed = Parallel(
        backend="multiprocessing",
        n_jobs=n_jobs,
        verbose=50,
    )([delayed(process)(i) for i in np.arange(n_runs)])
    relative_ee_dict = {est.estimator_name: dict() for est in ope_estimators}
    for i, relative_ee_i in enumerate(processed):
        for (
            estimator_name,
            relative_ee_,
        ) in relative_ee_i.items():
            relative_ee_dict[estimator_name][i] = relative_ee_
    relative_ee_df = DataFrame(relative_ee_dict).describe().T.round(6)

    print("=" * 45)
    print(f"random_state={random_state}")
    print("-" * 45)
    print(relative_ee_df[["mean", "std"]])
    print("=" * 45)

    # save results of the evaluation of off-policy estimators in './logs' directory.
    log_path = Path("./logs")
    log_path.mkdir(exist_ok=True, parents=True)
    relative_ee_df.to_csv(log_path / "relative_ee_of_ope_estimators.csv")
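
After a run finishes, the saved summary can be loaded back for further inspection (an illustrative snippet; the path follows the `log_path` used above):

```python
import pandas as pd

# load the saved summary of relative-ee results written by the script
relative_ee_df = pd.read_csv("./logs/relative_ee_of_ope_estimators.csv", index_col=0)
print(relative_ee_df[["mean", "std"]])
```
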
1 change: 1 addition & 0 deletions examples/quickstart/README.md
@@ -5,3 +5,4 @@ This page contains a list of quickstart notebooks written with the Open Bandit Pipeline
- [`obd.ipynb`](./obd.ipynb): a quickstart guide of the Open Bandit Dataset and Pipeline.
- [`synthetic.ipynb`](./synthetic.ipynb): a quickstart guide to implement the standard off-policy learning, off-policy evaluation (OPE), and the evaluation of OPE procedures with the Open Bandit Pipeline.
- [`multiclass.ipynb`](./multiclass.ipynb): a quickstart guide to handle multi-class classification data as logged bandit feedback data for the standard off-policy learning, off-policy evaluation (OPE), and the evaluation of OPE procedures with the Open Bandit Pipeline.
- [`online.ipynb`](./online.ipynb): a quickstart guide to implement off-policy evaluation (OPE) and the evaluation of OPE procedures for online bandit algorithms with the Open Bandit Pipeline.