Optimal learning with non-Gaussian rewards


Ding, Z. & Ryzhov, I.O., 2013. Optimal learning with non-Gaussian rewards. Proceedings of the 2013 Winter Simulation Conference , pp. 631-642.


We propose a theoretical and computational framework for approximating the optimal policy in multiarmed bandit problems where the reward distributions are non-Gaussian. We first construct a probabilistic interpolation of the sequence of discrete-time rewards in the form of a continuous-time conditional Lévy process. In the Gaussian setting, this approach allows an easy connection to Brownian motion and its convenient time-change properties. No such device is available for non-Gaussian rewards; however, we show how optimal stopping theory can be used to characterize the value of the optimal policy, using a free-boundary partial integro-differential equation, for exponential and Poisson rewards. We then solve this problem numerically to approximate the set of belief states possessing a given optimal index value, and provide illustrations showing that the solution behaves as expected.