PLOS ONE | DOI:10.1371/journal.pone.0157088 | June 15, 2016 | Benchmarking for Bayesian Reinforcement Learning

$$J^{\pi(p^0_M)}_{p_M} = \mathbb{E}_{M \sim p_M}\left[ J^{\pi(p^0_M)}_{M} \right],$$

where $\pi(p^0_M)$ is the algorithm $\pi$ trained offline on $p^0_M$. In our Bayesian RL setting, we want to find the algorithm $\pi^*$ which maximises $J^{\pi(p^0_M)}_{p_M}$ for the $\langle p^0_M, p_M \rangle$ experiment:

$$\pi^* \in \arg\max_{\pi} J^{\pi(p^0_M)}_{p_M}.$$

In addition to the performance criterion, we also measure the empirical computation time. In practice, all problems are subject to time constraints. Hence, it is important to take this parameter into account when comparing different algorithms.

3.2 The experimental protocol

In practice, we can only sample a finite number of trajectories, and must rely on estimators to compare algorithms. In this section our experimental protocol is described, which is based on our comparison criterion for BRL and provides a detailed computation time analysis.

An experiment is defined by (i) a prior distribution $p^0_M$ and (ii) a test distribution $p_M$. Given these, an agent $\pi$ is evaluated as follows:

1. Train $\pi$ offline on $p^0_M$.
2. Sample $N$ MDPs from the test distribution $p_M$.
3. For each sampled MDP $M$, compute an estimate $\bar{J}^{\pi(p^0_M)}_{M}$ of $J^{\pi(p^0_M)}_{M}$.
4. Use these values to compute an estimate $\bar{J}^{\pi(p^0_M)}_{p_M}$.

To estimate $J^{\pi(p^0_M)}_{M}$, the expected return of agent $\pi$ trained offline on $p^0_M$, one trajectory is sampled on the MDP $M$, and the cumulated return is computed:

$$\bar{J}^{\pi(p^0_M)}_{M} = R^{\pi(p^0_M)}_{M}(x_0).$$

To estimate this return, each trajectory is truncated after $T$ steps. Therefore, given an MDP $M$ and its initial state $x_0$, we observe $\bar{R}^{\pi(p^0_M)}_{M}(x_0)$, an approximation of $R^{\pi(p^0_M)}_{M}(x_0)$:

$$\bar{R}^{\pi(p^0_M)}_{M}(x_0) = \sum_{t=0}^{T} \gamma^t r_t.$$

If $R_{max}$ denotes the maximal instantaneous reward an agent can receive when interacting with an MDP drawn from $p_M$, then choosing $T$ as follows guarantees that the approximation error is bounded by $\epsilon > 0$:

$$T = \left\lfloor \frac{\log\left( \frac{\epsilon (1 - \gamma)}{R_{max}} \right)}{\log \gamma} \right\rfloor.$$

$\epsilon = 0.01$ is set for all experiments, as a compromise between measurement accuracy and computation time.
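As an illustration of the truncation rule, the following Python sketch computes the horizon T from the discount factor, the maximal reward and the tolerance, and evaluates a truncated discounted return. The function names `truncation_horizon` and `truncated_return` are hypothetical and not taken from the paper. The clamp to zero for degenerate inputs is also an assumption added here.

```python
import math
from typing import Sequence


def truncation_horizon(gamma: float, r_max: float, epsilon: float = 0.01) -> int:
    """Return T = floor(log(epsilon * (1 - gamma) / r_max) / log(gamma)).

    Hypothetical helper: the clamp to 0 is an assumption of this sketch,
    covering the case where the whole return is already below epsilon.
    """
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    if r_max <= 0.0 or epsilon <= 0.0:
        raise ValueError("r_max and epsilon must be positive")
    ratio = math.log(epsilon * (1.0 - gamma) / r_max) / math.log(gamma)
    return max(0, math.floor(ratio))


def truncated_return(rewards: Sequence[float], gamma: float, horizon: int) -> float:
    """Discounted sum of r_0, ..., r_T, i.e. at most horizon + 1 rewards."""
    return sum(gamma ** t * r for t, r in enumerate(rewards[: horizon + 1]))


# Example values (assumed for illustration, not taken from the paper).
T = truncation_horizon(gamma=0.95, r_max=10.0, epsilon=0.01)
# Discounted reward mass that the truncation can discard in the worst case.
worst_case_tail = 0.95 ** (T + 1) * 10.0 / (1.0 - 0.95)
```

With this choice of T, the worst-case discarded tail, `gamma ** (T + 1) * r_max / (1 - gamma)`, stays below the tolerance, which is the property the bound is meant to provide.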
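Finally, to estimate our comparison criterion $J^{\pi(p^0_M)}_{p_M}$, the empirical average of the algorithm performance is computed over $N$ different MDPs, sampled from $p_M$:

$$\bar{J}^{\pi(p^0_M)}_{p_M} = \frac{1}{N} \sum_{0 \le i < N} \bar{J}^{\pi(p^0_M)}_{M_i} = \frac{1}{N} \sum_{0 \le i < N} \bar{R}^{\pi(p^0_M)}_{M_i}(x_0)$$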
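The evaluation loop described above (sample N MDPs, play one truncated trajectory on each, average the returns) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: `estimate_score`, `sample_mdp` and `run_trajectory` are hypothetical names, the agent is assumed to be already trained offline and wrapped inside `run_trajectory`, and the toy MDP at the end exists only to make the example runnable.

```python
import random
from typing import Any, Callable, List, Sequence, Tuple


def estimate_score(
    sample_mdp: Callable[[random.Random], Any],
    run_trajectory: Callable[[Any, int, random.Random], Sequence[float]],
    n_mdps: int,
    gamma: float,
    horizon: int,
    seed: int = 0,
) -> Tuple[float, List[float]]:
    """Average the truncated discounted return over n_mdps sampled MDPs.

    sample_mdp(rng) is assumed to draw one MDP from the test distribution.
    run_trajectory(mdp, horizon, rng) is assumed to play the trained agent
    from the initial state and return the observed rewards r_0, ..., r_T.
    """
    if n_mdps <= 0:
        raise ValueError("n_mdps must be positive")
    rng = random.Random(seed)
    returns: List[float] = []
    for _ in range(n_mdps):
        mdp = sample_mdp(rng)
        rewards = run_trajectory(mdp, horizon, rng)
        # One trajectory per MDP, truncated after horizon + 1 rewards.
        returns.append(
            sum(gamma ** t * r for t, r in enumerate(rewards[: horizon + 1]))
        )
    return sum(returns) / n_mdps, returns


# Toy stand-ins (assumptions): an "MDP" is a single number in [0, 1) that
# the agent receives as reward at every step.
def toy_sample_mdp(rng: random.Random) -> float:
    return rng.random()


def toy_run_trajectory(mdp: float, horizon: int, rng: random.Random) -> List[float]:
    return [mdp] * (horizon + 1)


score, per_mdp_returns = estimate_score(
    toy_sample_mdp, toy_run_trajectory, n_mdps=500, gamma=0.95, horizon=193
)
```

Returning the per-MDP returns alongside the mean keeps the individual samples available, for instance to report the spread of the estimate across the N sampled MDPs.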