We consider a general contextual bandit setting. Let r \in [0, R_{\mathrm{max}}] denote a reward or outcome variable (e.g., whether a fashion item as an action results in a click). We let x \in \calX be a context vector (e.g., the user's demographic profile) that the decision maker observes when picking an action. Rewards and contexts are sampled from the unknown probability distributions p (r \mid x, a) and p(x), respectively. Let \calA:=\{0,\ldots,m\} be a finite set of m+1 actions. We call a function \pi: \calX \rightarrow \Delta(\calA) a policy. It maps each context x \in \calX into a distribution over actions, where \pi (a \mid x) is the probability of taking action a given x.
Let \calD := \{(x_t,a_t,r_t)\}_{t=1}^{T} rounds of observations. a_t is a discrete variable indicating which action in \calA is chosen in round t. r_t and x_t denote the reward and the context observed in round t, respectively. We assume that a logged bandit feedback is generated by a behavior policy \pi_b as follows:
\{(x_t,a_t,r_t)\}_{i=1}^{T} \sim \prod_{i=1}^{T} p(x_t) \pi_b (a_t \mid x_t) p(r_t \mid x_t, a_t),
where each context-action-reward triplets are sampled independently from the product distribution. Note that we assume a_t is independent of r_t conditional on x_t.
We let \pi(x,a,r) := p(x) \pi (a \mid x) p(r \mid x, a) be the product distribution by a policy \pi. For a function f(x,a,r), we use \E_{\calD} [f] := |\calD|^{-1} \sum_{(x_t, a_t, r_t) \in \calD} f(x_t, a_t, r_t) to denote its empirical expectation over T observations in \calD. Then, for a function g(x,a), we let g(x,\pi) := \E_{a \sim \pi(a|x)}[g(x,a) \mid x]. We also use q(x,a) := \E_{r \sim p(r|x,a)} [ r \mid x, a ] to denote the mean reward function.
We are interested in using the historical logged bandit data to estimate the following policy value of any given evaluation policy \pi_e which might be different from \pi_b:
V (\pi_e) := \E_{(x,a,r) \sim \pi_e (x,a,r)} [r] .
where the last equality uses the independence of A and Y(\cdot) conditional on X and the definition of \pi_b(\cdot|X). We allow the evaluation policy \pi_e to be degenerate, i.e., it may choose a particular action with probability 1. Estimating V(\pi_e) before implementing \pi_e in an online environment is valuable because \pi_e may perform poorly and damage user satisfaction. Additionally, it is possible to select an evaluation policy that maximizes the policy value by comparing their estimated performances without having additional implementation cost.