forked from st-tech/zr-obp
Merge pull request st-tech#67 from Kurorororo/online-examples
Examples for Online Bandit Algorithms with Replay Method
Showing 9 changed files with 1,218 additions and 23 deletions.
README.md
# Example with Online Bandit Algorithms
## Description

Here, we use synthetic bandit datasets to evaluate OPE of online bandit algorithms.
Specifically, we evaluate the estimation performance of well-known off-policy estimators using the ground-truth policy value of an evaluation policy, which is calculable with synthetic data.
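For concreteness, the following is a minimal sketch of how such synthetic logged bandit feedback is generated with `obp`, mirroring the full script below; the specific hyperparameter values here are illustrative only.

```python
from obp.dataset import SyntheticBanditDataset, logistic_reward_function

# synthetic data generator; passing behavior_policy_function=None
# makes the behavior policy uniformly random
dataset = SyntheticBanditDataset(
    n_actions=10,  # illustrative value
    dim_context=5,  # illustrative value
    reward_function=logistic_reward_function,  # defines the expected reward E[r|x,a]
    behavior_policy_function=None,
    random_state=12345,
)
# sample a batch of logged bandit feedback (contexts, actions, rewards, ...)
bandit_feedback = dataset.obtain_batch_bandit_feedback(n_rounds=10000)
```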
## Evaluating Off-Policy Estimators

In the following, we evaluate the estimation performance of the Replay Method (RM).
RM uses the subset of the logged bandit feedback data where the actions selected by the behavior policy match those selected by the evaluation policy.
Theoretically, RM is unbiased when the behavior policy is uniformly random and the evaluation policy is fixed.
Empirically, however, RM also works well when the evaluation policy is a learning algorithm.
Please refer to https://arxiv.org/abs/1003.5956 for the details of RM.
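To make the estimator concrete, here is a minimal, self-contained sketch of the RM estimate (an illustration, not the library implementation): it keeps only the rounds where the evaluation policy's action agrees with the logged action and averages the rewards observed there.

```python
import numpy as np


def replay_method_estimate(
    logged_actions: np.ndarray,  # actions chosen by the behavior policy
    logged_rewards: np.ndarray,  # rewards observed for those actions
    eval_actions: np.ndarray,  # actions the evaluation policy would choose
) -> float:
    """Average reward over the rounds where both policies agree."""
    match = logged_actions == eval_actions
    if not match.any():
        return 0.0  # no matched rounds, so the estimate is undefined
    return float(logged_rewards[match].mean())


# toy usage with 3 actions and binary rewards
rng = np.random.default_rng(12345)
logged_actions = rng.integers(3, size=1000)
logged_rewards = rng.binomial(1, 0.3, size=1000).astype(float)
eval_actions = rng.integers(3, size=1000)
print(replay_method_estimate(logged_actions, logged_rewards, eval_actions))
```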
### Files
- [`./evaluate_off_policy_estimators.py`](./evaluate_off_policy_estimators.py) implements the evaluation of OPE estimators by RM using synthetic bandit feedback data.
### Scripts

```bash
# run evaluation of OPE estimators with synthetic bandit data
python evaluate_off_policy_estimators.py \
    --n_runs $n_runs \
    --n_rounds $n_rounds \
    --n_actions $n_actions \
    --dim_context $dim_context \
    --evaluation_policy_name $evaluation_policy_name \
    --n_sim $n_sim \
    --n_jobs $n_jobs \
    --random_state $random_state
```
- `$n_runs` specifies the number of simulation runs in the experiment, used to estimate the standard deviation of the performance of OPE estimators.
- `$n_rounds` and `$n_actions` specify the number of rounds (or samples) and the number of actions of the synthetic bandit data.
- `$dim_context` specifies the dimension of context vectors.
- `$evaluation_policy_name` specifies the evaluation policy and should be one of "bernoulli_ts", "epsilon_greedy", "lin_epsilon_greedy", "lin_ts", "lin_ucb", "logistic_epsilon_greedy", "logistic_ts", or "logistic_ucb".
- `$n_sim` specifies the number of simulations in the Monte Carlo simulation used to compute the ground-truth policy value.
- `$n_jobs` is the maximum number of concurrently running jobs.

For example, the following command compares the estimation performance (relative estimation error; relative-ee) of the OPE estimators using synthetic bandit feedback data with 1,000 rounds, 30 actions, and five-dimensional context vectors.

```bash
python evaluate_off_policy_estimators.py \
    --n_runs 20 \
    --n_rounds 1000 \
    --n_actions 30 \
    --dim_context 5 \
    --evaluation_policy_name bernoulli_ts \
    --n_sim 3 \
    --n_jobs -1 \
    --random_state 12345

# relative-ee of OPE estimators and their standard deviations (lower means more accurate).
# =============================================
# random_state=12345
# ---------------------------------------------
#                mean      std
# rm         0.202387  0.11685
# =============================================
```

The above result may change under different experimental conditions.
You can easily try the evaluation of OPE with other experimental settings.
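For instance, to evaluate RM with a different (learning) evaluation policy such as LinUCB, one could run the following; this particular combination of values is illustrative, but every flag is defined in the script.

```bash
python evaluate_off_policy_estimators.py \
    --n_runs 20 \
    --n_rounds 10000 \
    --n_actions 10 \
    --dim_context 5 \
    --evaluation_policy_name lin_ucb \
    --n_sim 3 \
    --n_jobs -1 \
    --random_state 12345
```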
evaluate_off_policy_estimators.py
import argparse
from pathlib import Path

import numpy as np
from pandas import DataFrame
from joblib import Parallel, delayed

from obp.dataset import (
    SyntheticBanditDataset,
    logistic_reward_function,
)
from obp.policy import (
    BernoulliTS,
    EpsilonGreedy,
    LinEpsilonGreedy,
    LinTS,
    LinUCB,
    LogisticEpsilonGreedy,
    LogisticTS,
    LogisticUCB,
)
from obp.ope import OffPolicyEvaluation, ReplayMethod
from obp.simulator import calc_ground_truth_policy_value, run_bandit_simulation


ope_estimators = [ReplayMethod()]
|
||
if __name__ == "__main__": | ||
parser = argparse.ArgumentParser( | ||
description="evaluate off-policy estimators with online bandit algorithms and synthetic bandit data." | ||
) | ||
parser.add_argument( | ||
"--n_runs", type=int, default=1, help="number of simulations in the experiment." | ||
) | ||
parser.add_argument( | ||
"--n_rounds", | ||
type=int, | ||
default=10000, | ||
help="number of rounds for synthetic bandit feedback.", | ||
) | ||
parser.add_argument( | ||
"--n_actions", | ||
type=int, | ||
default=10, | ||
help="number of actions for synthetic bandit feedback.", | ||
) | ||
parser.add_argument( | ||
"--dim_context", | ||
type=int, | ||
default=5, | ||
help="dimensions of context vectors characterizing each round.", | ||
) | ||
parser.add_argument( | ||
"--n_sim", | ||
type=int, | ||
default=1, | ||
help="number of simulations to calculate ground truth policy values", | ||
) | ||
parser.add_argument( | ||
"--evaluation_policy_name", | ||
type=str, | ||
choices=[ | ||
"bernoulli_ts", | ||
"epsilon_greedy", | ||
"lin_epsilon_greedy", | ||
"lin_ts", | ||
"lin_ucb", | ||
"logistic_epsilon_greedy", | ||
"logistic_ts", | ||
"logistic_ucb", | ||
], | ||
required=True, | ||
help="the name of evaluation policy, bernoulli_ts, epsilon_greedy, lin_epsilon_greedy, lin_ts, lin_ucb, logistic_epsilon_greedy, logistic_ts, or logistic_ucb", | ||
) | ||
parser.add_argument( | ||
"--n_jobs", | ||
type=int, | ||
default=1, | ||
help="the maximum number of concurrently running jobs.", | ||
) | ||
parser.add_argument("--random_state", type=int, default=12345) | ||
args = parser.parse_args() | ||
print(args) | ||
|
||
# configurations | ||
n_runs = args.n_runs | ||
n_rounds = args.n_rounds | ||
n_actions = args.n_actions | ||
dim_context = args.dim_context | ||
n_sim = args.n_sim | ||
evaluation_policy_name = args.evaluation_policy_name | ||
n_jobs = args.n_jobs | ||
random_state = args.random_state | ||
np.random.seed(random_state) | ||
|
||
# synthetic data generator with uniformly random policy | ||
dataset = SyntheticBanditDataset( | ||
n_actions=n_actions, | ||
dim_context=dim_context, | ||
reward_function=logistic_reward_function, | ||
behavior_policy_function=None, # uniformly random | ||
random_state=random_state, | ||
) | ||
# define evaluation policy | ||
evaluation_policy_dict = dict( | ||
bernoulli_ts=BernoulliTS(n_actions=n_actions, random_state=random_state), | ||
epsilon_greedy=EpsilonGreedy( | ||
n_actions=n_actions, epsilon=0.1, random_state=random_state | ||
), | ||
lin_epsilon_greedy=LinEpsilonGreedy( | ||
dim=dim_context, n_actions=n_actions, epsilon=0.1, random_state=random_state | ||
), | ||
lin_ts=LinTS(dim=dim_context, n_actions=n_actions, random_state=random_state), | ||
lin_ucb=LinUCB(dim=dim_context, n_actions=n_actions, random_state=random_state), | ||
logistic_epsilon_greedy=LogisticEpsilonGreedy( | ||
dim=dim_context, n_actions=n_actions, epsilon=0.1, random_state=random_state | ||
), | ||
logistic_ts=LogisticTS( | ||
dim=dim_context, n_actions=n_actions, random_state=random_state | ||
), | ||
logistic_ucb=LogisticUCB( | ||
dim=dim_context, n_actions=n_actions, random_state=random_state | ||
), | ||
) | ||
evaluation_policy = evaluation_policy_dict[evaluation_policy_name] | ||
|

    def process(i: int):
        # sample a new batch of synthetic logged bandit feedback
        bandit_feedback = dataset.obtain_batch_bandit_feedback(n_rounds=n_rounds)
        # simulate the evaluation policy on the logged data
        action_dist = run_bandit_simulation(
            bandit_feedback=bandit_feedback, policy=evaluation_policy
        )
        # estimate the ground-truth policy value of the evaluation policy
        # by Monte Carlo simulation using p(r|x,a), the reward distribution
        ground_truth_policy_value = calc_ground_truth_policy_value(
            bandit_feedback=bandit_feedback,
            reward_sampler=dataset.sample_reward,  # p(r|x,a)
            policy=evaluation_policy,
            n_sim=n_sim,  # the number of Monte Carlo simulations
        )
        # evaluate the estimators' performance using relative estimation error (relative-ee)
        ope = OffPolicyEvaluation(
            bandit_feedback=bandit_feedback,
            ope_estimators=ope_estimators,
        )
        relative_ee_i = ope.evaluate_performance_of_estimators(
            ground_truth_policy_value=ground_truth_policy_value,
            action_dist=action_dist,
        )

        return relative_ee_i

    processed = Parallel(
        backend="multiprocessing",
        n_jobs=n_jobs,
        verbose=50,
    )([delayed(process)(i) for i in np.arange(n_runs)])
    relative_ee_dict = {est.estimator_name: dict() for est in ope_estimators}
    for i, relative_ee_i in enumerate(processed):
        for (
            estimator_name,
            relative_ee_,
        ) in relative_ee_i.items():
            relative_ee_dict[estimator_name][i] = relative_ee_
    relative_ee_df = DataFrame(relative_ee_dict).describe().T.round(6)

    print("=" * 45)
    print(f"random_state={random_state}")
    print("-" * 45)
    print(relative_ee_df[["mean", "std"]])
    print("=" * 45)

    # save results of the evaluation of off-policy estimators in the './logs' directory
    log_path = Path("./logs")
    log_path.mkdir(exist_ok=True, parents=True)
    relative_ee_df.to_csv(log_path / "relative_ee_of_ope_estimators.csv")