Merge pull request st-tech#67 from Kurorororo/online-examples
Examples for Online Bandit Algorithms with Replay Method
usaito authored Feb 9, 2021
2 parents 0e2257f + 4cd6cf2 commit 1254058
Showing 9 changed files with 1,218 additions and 23 deletions.
1 change: 1 addition & 0 deletions examples/README.md
@@ -5,4 +5,5 @@ This page contains a list of example codes written with the Open Bandit Pipeline
- [`obd/`](./obd/): example implementations for evaluating standard off-policy estimators with the small sample Open Bandit Dataset.
- [`synthetic/`](./synthetic/): example implementations for evaluating several off-policy estimators with synthetic bandit datasets.
- [`multiclass/`](./multiclass/): example implementations for evaluating several off-policy estimators with multi-class classification datasets.
- [`online/`](./online/): example implementations for evaluating the Replay Method with online bandit algorithms.
- [`quickstart/`](./quickstart/): some quickstart notebooks to guide the usage of the Open Bandit Pipeline.
65 changes: 65 additions & 0 deletions examples/online/README.md
@@ -0,0 +1,65 @@
# Example with Online Bandit Algorithms


## Description

Here, we use synthetic bandit datasets to evaluate OPE of online bandit algorithms.
Specifically, we evaluate the estimation performance of a well-known off-policy estimator using the ground-truth policy value of an evaluation policy, which is calculable with synthetic data.
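
To sketch the idea (an illustrative snippet, not the obp implementation): with synthetic data, the expected reward function q(x, a) is known, so the value of a *fixed* evaluation policy can be approximated by averaging q under the policy's action-choice probabilities over the sampled contexts. Because the evaluation policies in this example learn online, the script below instead relies on `obp.simulator.calc_ground_truth_policy_value`, which repeats the simulation `n_sim` times.

```python
import numpy as np

# Simplified sketch with illustrative names: approximate the value of a *fixed*
# evaluation policy when the expected reward q(x, a) is known (synthetic data).
def ground_truth_value_sketch(expected_reward: np.ndarray, action_dist: np.ndarray) -> float:
    # expected_reward: (n_rounds, n_actions) array of q(x_t, a)
    # action_dist: (n_rounds, n_actions) array of pi_e(a | x_t)
    return float((expected_reward * action_dist).sum(axis=1).mean())
```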


## Evaluating Off-Policy Estimators

In the following, we evaluate the estimation performance of the Replay Method (RM).
RM uses the subset of the logged bandit feedback data in which the actions selected by the behavior policy match those selected by the evaluation policy.
Theoretically, RM is unbiased when the behavior policy is uniformly random and the evaluation policy is fixed.
Even so, RM empirically works well when the evaluation policy is an online learning algorithm.
Please refer to https://arxiv.org/abs/1003.5956 for the details of RM.
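
As a rough sketch of the estimator (illustrative code, not the obp implementation; `obp.ope.ReplayMethod` is what the script below actually uses), RM keeps only the rounds where the logged action matches the evaluation policy's action and averages the observed rewards over those rounds:

```python
import numpy as np

def replay_method_sketch(
    reward: np.ndarray,             # rewards observed under the behavior policy
    behavior_action: np.ndarray,    # actions logged by the (uniformly random) behavior policy
    evaluation_action: np.ndarray,  # actions the evaluation policy selects in the same rounds
) -> float:
    # keep only the rounds where both policies choose the same action
    match = behavior_action == evaluation_action
    # RM estimates the policy value as the average observed reward on the matched rounds
    return float(reward[match].mean()) if match.any() else 0.0
```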


### Files
- [`./evaluate_off_policy_estimators.py`](./evaluate_off_policy_estimators.py) implements the evaluation of RM as an OPE estimator using synthetic bandit feedback data.

### Scripts

```bash
# run evaluation of OPE estimators with synthetic bandit data
python evaluate_off_policy_estimators.py\
--n_runs $n_runs\
--n_rounds $n_rounds\
--n_actions $n_actions\
--n_sim $n_sim\
--dim_context $dim_context\
--evaluation_policy_name $evaluation_policy_name\
--n_jobs $n_jobs\
--random_state $random_state
```
- `$n_runs` specifies the number of simulation runs in the experiment to estimate standard deviations of the performance of OPE estimators.
- `$n_rounds` and `$n_actions` specify the number of rounds (or samples) and the number of actions of the synthetic bandit data.
- `$dim_context` specifies the dimension of context vectors.
- `$n_sim` specifies the number of simulations in the Monte Carlo simulation used to compute the ground-truth policy value.
- `$evaluation_policy_name` specifies the evaluation policy and should be one of "bernoulli_ts", "epsilon_greedy", "lin_epsilon_greedy", "lin_ts", "lin_ucb", "logistic_epsilon_greedy", "logistic_ts", or "logistic_ucb".
- `$n_jobs` is the maximum number of concurrently running jobs.

For example, the following command evaluates the estimation performance (relative estimation error; relative-ee) of RM using synthetic bandit feedback data with 1,000 rounds, 30 actions, and five-dimensional context vectors, with Bernoulli Thompson Sampling as the evaluation policy.

```bash
python evaluate_off_policy_estimators.py\
--n_runs 20\
--n_rounds 1000\
--n_actions 30\
--dim_context 5\
--evaluation_policy_name bernoulli_ts\
--n_sim 3\
--n_jobs -1\
--random_state 12345

# relative-ee of OPE estimators and their standard deviations (lower means more accurate).
# =============================================
# random_state=12345
# ---------------------------------------------
# mean std
# rm 0.202387 0.11685
# =============================================
```

The above results may vary depending on the experimental settings.
You can easily try the evaluation of OPE with other experimental settings.
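
For reference, the relative estimation error reported above is, as we understand it, the absolute difference between the estimated and ground-truth policy values normalized by the ground-truth value; a minimal sketch:

```python
def relative_estimation_error(estimated_policy_value: float, ground_truth_policy_value: float) -> float:
    # relative-ee = |V_hat(pi_e) - V(pi_e)| / V(pi_e)
    return abs(estimated_policy_value - ground_truth_policy_value) / ground_truth_policy_value
```
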
177 changes: 177 additions & 0 deletions examples/online/evaluate_off_policy_estimators.py
@@ -0,0 +1,177 @@
import argparse
from pathlib import Path

import numpy as np
from pandas import DataFrame
from joblib import Parallel, delayed

from obp.dataset import (
SyntheticBanditDataset,
logistic_reward_function,
)
from obp.policy import (
BernoulliTS,
EpsilonGreedy,
LinEpsilonGreedy,
LinTS,
LinUCB,
LogisticEpsilonGreedy,
LogisticTS,
LogisticUCB,
)
from obp.ope import OffPolicyEvaluation, ReplayMethod
from obp.simulator import calc_ground_truth_policy_value, run_bandit_simulation


ope_estimators = [ReplayMethod()]

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="evaluate off-policy estimators with online bandit algorithms and synthetic bandit data."
    )
    parser.add_argument(
        "--n_runs", type=int, default=1, help="number of simulations in the experiment."
    )
    parser.add_argument(
        "--n_rounds",
        type=int,
        default=10000,
        help="number of rounds for synthetic bandit feedback.",
    )
    parser.add_argument(
        "--n_actions",
        type=int,
        default=10,
        help="number of actions for synthetic bandit feedback.",
    )
    parser.add_argument(
        "--dim_context",
        type=int,
        default=5,
        help="dimensions of context vectors characterizing each round.",
    )
    parser.add_argument(
        "--n_sim",
        type=int,
        default=1,
        help="number of simulations to calculate ground truth policy values",
    )
    parser.add_argument(
        "--evaluation_policy_name",
        type=str,
        choices=[
            "bernoulli_ts",
            "epsilon_greedy",
            "lin_epsilon_greedy",
            "lin_ts",
            "lin_ucb",
            "logistic_epsilon_greedy",
            "logistic_ts",
            "logistic_ucb",
        ],
        required=True,
        help="the name of evaluation policy, bernoulli_ts, epsilon_greedy, lin_epsilon_greedy, lin_ts, lin_ucb, logistic_epsilon_greedy, logistic_ts, or logistic_ucb",
    )
    parser.add_argument(
        "--n_jobs",
        type=int,
        default=1,
        help="the maximum number of concurrently running jobs.",
    )
    parser.add_argument("--random_state", type=int, default=12345)
    args = parser.parse_args()
    print(args)

    # configurations
    n_runs = args.n_runs
    n_rounds = args.n_rounds
    n_actions = args.n_actions
    dim_context = args.dim_context
    n_sim = args.n_sim
    evaluation_policy_name = args.evaluation_policy_name
    n_jobs = args.n_jobs
    random_state = args.random_state
    np.random.seed(random_state)

    # synthetic data generator with uniformly random policy
    dataset = SyntheticBanditDataset(
        n_actions=n_actions,
        dim_context=dim_context,
        reward_function=logistic_reward_function,
        behavior_policy_function=None,  # uniformly random
        random_state=random_state,
    )
    # define evaluation policy
    evaluation_policy_dict = dict(
        bernoulli_ts=BernoulliTS(n_actions=n_actions, random_state=random_state),
        epsilon_greedy=EpsilonGreedy(
            n_actions=n_actions, epsilon=0.1, random_state=random_state
        ),
        lin_epsilon_greedy=LinEpsilonGreedy(
            dim=dim_context, n_actions=n_actions, epsilon=0.1, random_state=random_state
        ),
        lin_ts=LinTS(dim=dim_context, n_actions=n_actions, random_state=random_state),
        lin_ucb=LinUCB(dim=dim_context, n_actions=n_actions, random_state=random_state),
        logistic_epsilon_greedy=LogisticEpsilonGreedy(
            dim=dim_context, n_actions=n_actions, epsilon=0.1, random_state=random_state
        ),
        logistic_ts=LogisticTS(
            dim=dim_context, n_actions=n_actions, random_state=random_state
        ),
        logistic_ucb=LogisticUCB(
            dim=dim_context, n_actions=n_actions, random_state=random_state
        ),
    )
    evaluation_policy = evaluation_policy_dict[evaluation_policy_name]

    def process(i: int):
        # sample new data of synthetic logged bandit feedback
        bandit_feedback = dataset.obtain_batch_bandit_feedback(n_rounds=n_rounds)
        # simulate the evaluation policy
        action_dist = run_bandit_simulation(
            bandit_feedback=bandit_feedback, policy=evaluation_policy
        )
        # estimate the ground-truth policy values of the evaluation policy
        # by Monte-Carlo Simulation using p(r|x,a), the reward distribution
        ground_truth_policy_value = calc_ground_truth_policy_value(
            bandit_feedback=bandit_feedback,
            reward_sampler=dataset.sample_reward,  # p(r|x,a)
            policy=evaluation_policy,
            n_sim=n_sim,  # the number of simulations
        )
        # evaluate estimators' performances using relative estimation error (relative-ee)
        ope = OffPolicyEvaluation(
            bandit_feedback=bandit_feedback,
            ope_estimators=ope_estimators,
        )
        relative_ee_i = ope.evaluate_performance_of_estimators(
            ground_truth_policy_value=ground_truth_policy_value,
            action_dist=action_dist,
        )

        return relative_ee_i

    processed = Parallel(
        backend="multiprocessing",
        n_jobs=n_jobs,
        verbose=50,
    )([delayed(process)(i) for i in np.arange(n_runs)])
    relative_ee_dict = {est.estimator_name: dict() for est in ope_estimators}
    for i, relative_ee_i in enumerate(processed):
        for (
            estimator_name,
            relative_ee_,
        ) in relative_ee_i.items():
            relative_ee_dict[estimator_name][i] = relative_ee_
    relative_ee_df = DataFrame(relative_ee_dict).describe().T.round(6)

    print("=" * 45)
    print(f"random_state={random_state}")
    print("-" * 45)
    print(relative_ee_df[["mean", "std"]])
    print("=" * 45)

    # save results of the evaluation of off-policy estimators in './logs' directory.
    log_path = Path("./logs")
    log_path.mkdir(exist_ok=True, parents=True)
    relative_ee_df.to_csv(log_path / "relative_ee_of_ope_estimators.csv")
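
After a run finishes, the saved summary can be loaded back for further inspection (an illustrative snippet; the path follows the `log_path` used above):

```python
import pandas as pd

# load the saved summary of relative-ee results written by the script
relative_ee_df = pd.read_csv("./logs/relative_ee_of_ope_estimators.csv", index_col=0)
print(relative_ee_df[["mean", "std"]])
```
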
1 change: 1 addition & 0 deletions examples/quickstart/README.md
@@ -5,3 +5,4 @@ This page contains a list of quickstart notebooks written with the Open Bandit Pipeline
- [`obd.ipynb`](./obd.ipynb): a quickstart guide of the Open Bandit Dataset and Pipeline.
- [`synthetic.ipynb`](./synthetic.ipynb): a quickstart guide to implement the standard off-policy learning, off-policy evaluation (OPE), and the evaluation of OPE procedures with the Open Bandit Pipeline.
- [`multiclass.ipynb`](./multiclass.ipynb): a quickstart guide to handle multi-class classification data as logged bandit feedback data for the standard off-policy learning, off-policy evaluation (OPE), and the evaluation of OPE procedures with the Open Bandit Pipeline.
- [`online.ipynb`](./online.ipynb): a quickstart guide to implement off-policy evaluation (OPE) and the evaluation of OPE procedures for online bandit algorithms with the Open Bandit Pipeline.