Zhiyu He, December 5, 2024
Adapted from
If an arm is not pulled, its state stays frozen, so the reward it would give remains fixed.
How to sequentially select a machine (i.e., an arm) to maximize the discounted cumulative reward?
A myopic solution: $8 + 4a + 5a^2 + 3a^3 + 10a^4 + \ldots$
An arguably better solution: $8 + 3a + 10a^2 + 12a^3 + 4a^4 + \ldots$
(For instance, if the discount factor is $a=0.9$, then the latter discounted sum is larger than the former)
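A quick numerical check of this comparison, as a minimal sketch: only the five displayed terms of each sum are compared, at $a = 0.9$.

```python
# Compare the two (truncated) discounted sums from above at a = 0.9.
a = 0.9
myopic = [8, 4, 5, 3, 10]     # rewards of the myopic selection (displayed terms only)
better = [8, 3, 10, 12, 4]    # rewards of the alternative selection

def discounted_sum(rewards, a):
    """Return sum_t a^t * rewards[t] over a finite reward sequence."""
    return sum(r * a**t for t, r in enumerate(rewards))

print(discounted_sum(myopic, a))  # ~24.40
print(discounted_sum(better, a))  # ~30.17, larger than the myopic one
```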
Simple family of alternative bandit processes
$n$ independent bandit processes $B_i$, $i=1,\ldots,n$
state $\xi_i$ lies in a countable state space $E_i$
At decision time $t$: apply control to a specific $B_i$
Goal: maximize the infinite-horizon expected discounted sum of rewards (i.e., the payoff)
\[\sup_{\pi} \mathbb{E} \left[ \sum_{t=0}^{\infty} a^t r_{i_t} (\xi_{i_t} (t))\right]\]
$a \in (0,1)$: discount factor
$i_t$: the index of the bandit process chosen at time $t$
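To make the objective concrete, here is a minimal sketch of estimating one sample of the discounted payoff for a given policy in the finite-state case; `Ps`, `rs`, `policy`, and the truncation horizon `T` are illustrative placeholders, not objects defined in these notes.

```python
import numpy as np

def simulate_discounted_payoff(Ps, rs, policy, x0, a, T=500, seed=None):
    """Simulate one sample path of sum_t a^t r_{i_t}(x_{i_t}(t)), truncated
    at T steps (the neglected tail is geometrically small in a)."""
    rng = np.random.default_rng(seed)
    x = list(x0)                 # joint state; only the chosen arm's entry moves
    total = 0.0
    for t in range(T):
        i = policy(x)            # arm chosen at time t
        total += a**t * rs[i][x[i]]
        x[i] = rng.choice(len(rs[i]), p=Ps[i][x[i]])
    return total
```

Averaging such sample paths over many seeds approximates the expected payoff that $\pi$ aims to maximize.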
Value function
\[V(\xi) = \sup_{\pi} \mathbb{E} \left[ \sum_{t=0}^{\infty} a^t r_{i_t} (\xi_{i_t} (t)) \mathrel{\Big|} \xi(0) = \xi \right]\]
Bellman’s equation
\[V(\xi) = \max_{i \in \{1, \ldots, n\}} \left\{ r_i(\xi_i) + a \sum_{y \in E_i} P_i(y|\xi_i)V(\xi_1, \ldots, \xi_{i-1}, y, \xi_{i+1}, \ldots, \xi_n) \right\}\]
number of states: $\prod_i k_i \triangleq k$, where $k_i$ is the number of states in $E_i$
for each state, we need to choose from $n$ actions
number of stationary policies: $n^k$
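For a sense of scale (hypothetical numbers): with $n = 5$ arms and $k_i = 10$ states each,
\[k = \prod_{i=1}^{5} k_i = 10^{5}, \qquad n^{k} = 5^{10^{5}} \approx 10^{69\,897},\]
so enumerating stationary policies is hopeless even for modest instances; this motivates the index-based approach below.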
Key idea: we aim to find an index that summarizes how valuable it is to continue a bandit process, blending immediate rewards and future benefits.
Point of attack: construct a simple reference bandit process
Consider a discrete-time standard bandit process $\Lambda$, which gives a known reward $\lambda$ each time it is continued
If it is selected at every time, then the cumulative discounted reward is
\[\lambda(1+a+a^2+\ldots) = \frac{\lambda}{1-a}\]
We want to compare a bandit process $B_i$ with $\Lambda$. Consider the following setting: we first apply control to $B_i$ and thereafter follow an optimal policy.
This policy can switch to $\Lambda$ at some future time $\tau$ ($\tau > 0$)
Observation: if an optimal policy causes us to switch at time $\tau$, we will never switch back
Reason: the information on $B_i$ at time $\tau + 1$ is the same as at time $\tau$. If applying control to $\Lambda$ is optimal at time $\tau$, then it is also optimal at time $\tau + 1$ and afterwards.
The maximal payoff is
\[\sup_{\tau > 0} \mathbb{E} \left[ \sum_{t=0}^{\tau-1} a^t r_i(x_i(t)) + a^{\tau} \frac{\lambda}{1-a} \mathrel{\Big|} x_i(0) = x_i \right]\]
We want to find $\lambda$ such that it is equally optimal to apply control to either of the two bandit processes initially, i.e., the above payoff equals $\frac{\lambda}{1-a}$, the payoff of always playing $\Lambda$. Subtracting $\frac{\lambda}{1-a}$ from both sides gives
\[0 = \sup_{\tau > 0} \mathbb{E} \left[ \sum_{t=0}^{\tau-1} a^t r_i(x_i(t)) - (1 - a^{\tau}) \frac{\lambda}{1-a} \mathrel{\Big|} x_i(0) = x_i \right]\]
We note that $\sum_{t=0}^{\tau-1} a^t = \frac{1-a^\tau}{1-a}$. Therefore, $\lambda$ satisfies the following equation
\[0 = \sup_{\tau > 0} \mathbb{E} \left[ \sum_{t=0}^{\tau-1} (a^t r_i(x_i(t)) - a^t \lambda) \mathrel{\Big|} x_i(0) = x_i \right]\]
The right-hand side is the supremum of decreasing linear functions of $\lambda$. Hence, it is convex and decreasing in $\lambda$. The root exists and is unique. We obtain the following expression for $\lambda$
\[\lambda = \sup_{\tau > 0} \frac{\mathbb{E} \left[ \sum_{t=0}^{\tau-1} a^t r_i(x_i(t)) \mathrel{\Big|} x_i(0)=x_i \right]}{\mathbb{E} \left[ \sum_{t=0}^{\tau-1} a^t \mathrel{\Big|} x_i(0)=x_i \right]}\]
Another equivalent formulation
\[\sup\left\{\lambda : \sup_{\tau > 0} \mathbb{E} \left[ \sum_{t=0}^{\tau-1} a^t [r_i(x_i(t)) - \lambda] \mathrel{\Big|} x_i(0) = x_i \right] \geq 0\right\}\]
Reason: at this supremal $\lambda$, the inner supremum over $\tau$ equals zero, which is exactly the defining equation above.
Interpretation of $\lambda$
Observation
This $\lambda$ is the so-called Gittins index
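As a small worked example (with hypothetical numbers, not from the original notes): consider a deterministic bandit process whose initial state gives reward $0$ and moves to an absorbing state with reward $1$. Plugging into the ratio above (the expectations are trivial since the chain is deterministic),
\[\sup_{\tau > 0} \frac{\mathbb{E} \left[ \sum_{t=0}^{\tau-1} a^t r_i(x_i(t)) \right]}{\mathbb{E} \left[ \sum_{t=0}^{\tau-1} a^t \right]} = \frac{a + a^2 + \cdots}{1 + a + a^2 + \cdots} = \frac{a/(1-a)}{1/(1-a)} = a,\]
where the supremum is attained as $\tau \to \infty$ (stopping earlier only lowers the ratio). The index of the initial state is $a > 0$ even though its immediate reward is zero, while the index of the absorbing state is $1$; this is how the index blends immediate rewards with future benefits.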
We maximize the expected discounted reward obtained from a simple family of alternative bandit processes by always continuing the bandit with the greatest Gittins index. Namely, for
\[v(B_i, x_i) = \sup_{\tau > 0} \frac{\mathbb{E} \left[ \sum_{t=0}^{\tau-1} a^t r_i(x_i(t)) \mathrel{\Big|} x_i(0)=x_i \right]}{\mathbb{E} \left[ \sum_{t=0}^{\tau-1} a^t \mathrel{\Big|} x_i(0)=x_i \right]}\]
we always select the bandit process $i^* \in \operatorname{argmax}_i v(B_i,x_i)$.
Note that the Gittins index only depends on the corresponding bandit process (or more specifically, on $a$, $r_i$, $P_i$) but not on the other processes.
Prevailing charge $g_i(x_i)$: fixed charge that we pay to continue playing the bandit $i$ at state $x_i$
If $g_i(x_i)$ is too large, then we will quit playing this bandit
Fair charge $\lambda_i(x_i)$: the level of the prevailing charge at which, in state $x_i$, we are indifferent between stopping and continuing for at least one more step and then stopping optimally
This fair charge coincides with the Gittins index given by the ratio above.
Processes
Consider a simple family of $n$ alternative bandit processes $B_1, \ldots, B_n$. The gambler not only collects $r_{i_t}(x_{i_t}(t))$, but also pays the prevailing charge $g_{i_t,t}$ of the bandit $B_{i_t}$ chosen at time $t$. We have the following observations
The gambler cannot do better than break even (i.e., zero expected profit)
Reason: a strictly positive expected profit overall would require a strictly positive expected profit from at least one bandit. However, the prevailing charge is defined (i.e., it always equals the fair charge) so that no expected profit can be gained from any selected arm.
Implications
\[\mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} a^t \left( r_{i_t}(x_{i_t}(t)) - g_{i_t}(x_{i_t}(t)) \right) \middle| x(0) \right] \leq 0,\]
which implies that
\[\mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} a^t r_{i_t}(x_{i_t}(t)) \middle| x(0) \right] \leq \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} a^t g_{i_t}(x_{i_t}(t)) \middle| x(0) \right]\]
Implication: the Gittins index policy $\pi^*$ maximizes the right-hand side. Since $\pi^*$ also breaks even (the inequality above holds with equality under $\pi^*$), its expected discounted reward attains this maximum, which upper-bounds the expected discounted reward of every policy; hence $\pi^*$ is optimal.
Example
Setup
We have $M$ independent bandit processes. Each process $i$ is endowed with a state $s_i$ belonging to a set $\mathcal{S} = \{1, \ldots, N\}$ and a known state transition matrix $P_i \in \mathbb{R}^{N \times N}$. Each state $s_i$ has an associated reward $r_i(s_i) \in \mathbb{R}$.
At every time $k$, we select a bandit process and receive a reward. The state of the selected process transitions to a new value, whereas the states of the remaining processes remain fixed.
Our goal is to maximize the cumulative discounted rewards.
Codes
See the Colab link here
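The linked notebook is not reproduced here; the following is a minimal, self-contained sketch (function names and the example numbers below are illustrative placeholders, not taken from the notebook) of computing Gittins indices for this finite-state setup and selecting an arm. It uses the restart-in-state characterization: the index of state $x$ equals $(1-a)$ times the value at $x$ of an auxiliary MDP in which, at every step, one may either continue from the current state or restart from $x$.

```python
import numpy as np

def gittins_indices(P, r, a, tol=1e-10):
    """Gittins indices of all states of one bandit process.

    P: (N, N) transition matrix, r: (N,) state rewards, a: discount factor.
    Restart-in-state characterization: the index of state x is (1 - a) * V(x),
    where V solves an MDP that allows continuing or restarting from x.
    """
    N = len(r)
    idx = np.zeros(N)
    for x in range(N):
        V = np.zeros(N)
        while True:
            cont = r + a * P @ V           # value of continuing from each state
            restart = r[x] + a * P[x] @ V  # value of restarting from state x
            V_new = np.maximum(cont, restart)
            if np.max(np.abs(V_new - V)) < tol:
                break
            V = V_new
        idx[x] = (1 - a) * V_new[x]
    return idx

def index_policy_step(indices, states):
    """Pick the process whose current state has the largest Gittins index."""
    return int(np.argmax([indices[i][s] for i, s in enumerate(states)]))

# Hypothetical example: M = 2 processes, N = 2 states each.
a = 0.9
P1 = np.array([[0.5, 0.5], [0.2, 0.8]]); r1 = np.array([1.0, 3.0])
P2 = np.array([[0.9, 0.1], [0.3, 0.7]]); r2 = np.array([2.0, 0.0])
indices = [gittins_indices(P1, r1, a), gittins_indices(P2, r2, a)]
print(indices)                             # per-state indices of each process
print(index_policy_step(indices, [0, 0]))  # process to play when both start in state 0
```

Consistent with the earlier remark, each index depends only on that process's own $a$, $r_i$, and $P_i$, so the indices can be precomputed once and reused when simulating the index policy.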
Summary
The Gittins index theorem provides an elegant characterization of the optimal policy for the multi-armed bandit problem with Markovian reward sequences. Instead of dealing with a vast Markov decision process, we only need to calculate the Gittins indices of the states of each arm and then obtain the optimal policy.
Control
Setup: an unknown dynamical system and a set of candidate controllers
Challenges