Active Learning in POMDPs
Robin JAULMES
Supervisors: Doina PRECUP and Joelle PINEAU
McGill University
rjaulm@cs.mcgill.ca

Outline
1) Partially Observable Markov Decision Processes (POMDPs)
2) Active Learning in POMDPs
3) The MEDUSA algorithm

Markov Decision Processes (MDPs)
A Markov Decision Process has:
- States S
- Actions A
- Probabilistic transitions P(s'|s,a)
- Immediate rewards R(s,a)
- A discount factor
- The current state is always perfectly observed.

Partially Observable Markov Decision Processes (POMDPs)
A POMDP has:
- States S
- Actions A
- Probabilistic transitions
- Immediate rewards
- A discount factor
- Observations Z
- Observation probabilities
- An initial belief b0
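As a concrete reference for the rest of the notes, here is a minimal sketch of how these components might be held in code. The container and its field names (T, O, R, gamma, b0) are illustrative choices, not part of the original slides.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class POMDP:
    """Minimal container for the POMDP components listed above (field names are illustrative)."""
    states: list            # S
    actions: list           # A
    observations: list      # Z
    T: np.ndarray            # T[s, a, s'] = P(s' | s, a)
    O: np.ndarray            # O[a, s', z] = P(z | s', a)
    R: np.ndarray            # R[s, a]    = immediate reward
    gamma: float             # discount factor
    b0: np.ndarray           # initial belief over states
```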
Applications of POMDPs
The ability to model environments in which the state is not fully observed allows applications in:
- Dialogue management
- Vision
- Robot navigation
- High-level control of robots
- Medical diagnosis
- Network maintenance
A POMDP example: The Tiger Problem

The Tiger Problem
Description:
- 2 states: Tiger_Left, Tiger_Right
- 3 actions: Listen, Open_Left, Open_Right
- 2 observations: Hear_Left, Hear_Right
Rewards are:
- -1 for the Listen action
- -100 for the Open_Left action in the Tiger_Left state
- +10 for the Open_Right action in the Tiger_Left state
The Tiger Problem
Furthermore:
- The Listen action does not change the state.
- An Open action puts the tiger behind either door with 50% probability.
- An Open action yields a useless observation (50% Hear_Left, 50% Hear_Right).
- The Listen action gives the correct information 85% of the time.
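To make these numbers concrete, here is the Tiger problem written with the hypothetical POMDP container from earlier. The rewards for the Tiger_Right cases are not on the slides and are filled in by symmetry; the discount factor is also an assumption.

```python
import numpy as np

# States: 0 = Tiger_Left, 1 = Tiger_Right
# Actions: 0 = Listen, 1 = Open_Left, 2 = Open_Right
# Observations: 0 = Hear_Left, 1 = Hear_Right

# Transitions T[s, a, s']: Listen keeps the state; opening a door resets to 50/50.
T = np.empty((2, 3, 2))
T[:, 0, :] = np.eye(2)          # Listen: state unchanged
T[:, 1, :] = 0.5                # Open_Left: tiger re-placed uniformly
T[:, 2, :] = 0.5                # Open_Right: tiger re-placed uniformly

# Observations O[a, s', z]: Listen is correct 85% of the time; opening is uninformative.
O = np.empty((3, 2, 2))
O[0] = [[0.85, 0.15],           # after Listen in Tiger_Left: mostly Hear_Left
        [0.15, 0.85]]           # after Listen in Tiger_Right: mostly Hear_Right
O[1] = 0.5                      # after Open_Left: 50/50 useless observation
O[2] = 0.5                      # after Open_Right: 50/50 useless observation

# Rewards R[s, a]: listening costs 1; wrong door -100, right door +10.
# The Tiger_Right row is assumed by symmetry (the slide only lists Tiger_Left).
R = np.array([[-1.0, -100.0,   10.0],    # Tiger_Left
              [-1.0,   10.0, -100.0]])   # Tiger_Right

b0 = np.array([0.5, 0.5])       # initial belief
gamma = 0.95                    # discount factor (value not given on the slides)

tiger = POMDP(states=["Tiger_Left", "Tiger_Right"],
              actions=["Listen", "Open_Left", "Open_Right"],
              observations=["Hear_Left", "Hear_Right"],
              T=T, O=O, R=R, gamma=gamma, b0=b0)
```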
Solving a POMDP
To solve a POMDP is to find, for any action/observation history, the action that maximizes the expected discounted reward E[ Σ_t γ^t R(s_t, a_t) ].

The belief state
- Instead of maintaining the complete action/observation history, we maintain a belief state b.
- The belief is a probability distribution over the states.
- Dim(b) = |S| - 1
The belief space
Here is a representation of the belief space when we have two states (s0, s1).

The belief space
Here is a representation of the belief space when we have three states (s0, s1, s2).

The belief space
Here is a representation of the belief space when we have four states (s0, s1, s2, s3).
The belief space
The belief space is continuous, but we only visit a countable number of belief points.

The Bayesian update
After taking action a in belief b and observing z, the new belief is
b'(s') = O(z|s',a) Σ_s T(s,a,s') b(s) / P(z|b,a).
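A small sketch of this update, reusing the hypothetical container above (this is the standard POMDP belief update, not code from the talk):

```python
import numpy as np

def belief_update(pomdp, b, a, z):
    """Bayesian belief update: b'(s') is proportional to O(z|s',a) * sum_s T(s,a,s') * b(s)."""
    predicted = b @ pomdp.T[:, a, :]        # sum_s b(s) T(s, a, s')
    unnormalized = pomdp.O[a, :, z] * predicted
    norm = unnormalized.sum()               # = P(z | b, a)
    if norm == 0.0:
        raise ValueError("observation has zero probability under this belief and action")
    return unnormalized / norm
```

For instance, belief_update(tiger, tiger.b0, a=0, z=0) (Listen, Hear_Left) returns [0.85, 0.15].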
Value Function in POMDPs
- We will compute the value function over the belief space.
- Hard: the belief space is continuous!
- But we can use a property of the optimal value function for a finite horizon: it is piecewise-linear and convex.
- We can represent any finite-horizon solution by a finite set of alpha-vectors:
  V(b) = max_α Σ_s α(s) b(s)

Alpha-Vectors
- They are a set of hyperplanes which define the value function.
- At each belief point, the value function is equal to the hyperplane with the highest value.
Value Iteration in POMDPs
Value iteration:
- Initialize the value function (horizon-1 value):
  V(b) = max_a Σ_s R(s,a) b(s)
  This produces one alpha-vector per action.
- Compute the value function at the next iteration using Bellman's equation:
  V_{n+1}(b) = max_a [ Σ_s R(s,a) b(s) + γ Σ_z max_{α in V_n} Σ_s Σ_s' b(s) T(s,a,s') O(z|s',a) α(s') ]
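The following sketch shows the horizon-1 initialization and how a set of alpha-vectors is evaluated at a belief point; it reuses the hypothetical container above and is not code from the talk.

```python
import numpy as np

def horizon_one_alphas(pomdp):
    """Horizon-1 value function: one alpha-vector per action, with alpha_a(s) = R(s, a)."""
    return [pomdp.R[:, a].astype(float) for a in range(len(pomdp.actions))]

def value(alphas, b):
    """Piecewise-linear value function: V(b) = max over alpha-vectors of alpha . b."""
    return max(float(np.dot(alpha, b)) for alpha in alphas)
```

On the Tiger problem, value(horizon_one_alphas(tiger), tiger.b0) is -1: from the uniform belief, listening is the best one-step action.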
PBVI: Point-Based Value Iteration
- Always keep a bounded number of alpha-vectors.
- Use value iteration starting from belief points on a grid to produce new sets of alpha-vectors.
- Stop after n steps (finite horizon).
- The solution is approximate but found in a reasonable amount of time and memory.
- Good tradeoff between computation time and quality. See Pineau et al., 2003.
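Below is a compact sketch of one point-based backup, the operation PBVI applies at every belief point in its grid. It is the standard backup written against the hypothetical container above, not the talk's implementation.

```python
import numpy as np

def point_based_backup(pomdp, b, alphas):
    """Return the best backed-up alpha-vector at belief point b, given the current set `alphas`."""
    best_alpha, best_value = None, -np.inf
    for a in range(len(pomdp.actions)):
        # Start from the immediate-reward vector of action a.
        g_a = pomdp.R[:, a].astype(float)
        for z in range(len(pomdp.observations)):
            # Project every alpha-vector back through (a, z):
            #   g_{a,z,alpha}(s) = sum_{s'} T(s,a,s') O(z|s',a) alpha(s')
            projections = [pomdp.T[:, a, :] @ (pomdp.O[a, :, z] * alpha) for alpha in alphas]
            # Keep only the projection that is best at this particular belief point.
            g_a = g_a + pomdp.gamma * max(projections, key=lambda g: float(np.dot(g, b)))
        if np.dot(g_a, b) > best_value:
            best_alpha, best_value = g_a, float(np.dot(g_a, b))
    return best_alpha
```

A full PBVI iteration would call this once per belief point on the grid and keep the resulting bounded set of alpha-vectors.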
Learning a POMDP
- What happens if we don't know the model of the POMDP for sure? We have to learn it.
- The two solutions in the literature are:
  - EM-based approaches (prone to local minima)
  - History-based approaches (require on the order of 1,000,000 samples for 2-state problems), Singh et al. 2003
Active Learning
- In an active learning problem, the learner has the ability to influence its training data.
- The learner asks for the data that is most useful given its current knowledge.
- Methods to find the most useful query have been shown by Cohn et al. (95).
Active Learning (Cohn et al. 95)
- Their method, used for function approximation tasks, is based on finding the query that will minimize the estimated variance of the learner.
- They showed how this could be done exactly:
  - for a mixture of Gaussians model
  - for locally weighted regression
Applying Active Learning to POMDPs
- We will suppose in this work that we have an oracle to determine the hidden state of the system on request.
- However, this action is costly and we want to use it as little as possible.
- In this setting, the active learning query will be to ask for the hidden state.
Applying Active Learning to POMDPs
We propose two solutions:
- Integrate the model uncertainty and the query possibility inside the POMDP framework, to take advantage of existing algorithms.
- The MEDUSA algorithm. It uses a Dirichlet distribution over possible models to determine which actions to take and which queries to ask.
Decision-Theoretic Model Learning
We want to integrate into the POMDP model the fact that:
- We have only a rough estimate of its parameters.
- The agent can query the hidden state.
- These queries should not be used too often, and only to learn.
Decision-Theoretic Model Learning
So we modify our POMDP:
- For each uncertain parameter we introduce an additional state feature, discretized into n levels (see the sketch below).
- At initialization the belief is uniformly distributed among these n groups of states, and we remain in the same group as transitions occur.
- We introduce a query action that returns the hidden state. This action is attached to a negative reward Rq.
- Then we solve this new POMDP using the usual methods.
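A rough illustration of this state augmentation and of how quickly it grows; the parameter names p_hear and p_reset are hypothetical, and the helper is only meant to show the counting.

```python
from itertools import product

def augmented_states(states, uncertain_params, n_levels):
    """Cross the original states with one discretized feature per uncertain parameter.
    Each augmented state is (s, level_1, ..., level_k), so there are |S| * n_levels**k
    of them: exponential in the number of uncertain parameters."""
    levels = range(n_levels)
    return [(s,) + combo
            for s in states
            for combo in product(levels, repeat=len(uncertain_params))]

# Two uncertain parameters discretized into 5 levels each already turn the
# 2-state Tiger problem into 2 * 5**2 = 50 states (before adding the query action).
print(len(augmented_states(["Tiger_Left", "Tiger_Right"], ["p_hear", "p_reset"], 5)))  # 50
```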
Decision-Theoretic Model Learning

D-T Planning: Results
DT-Planning: Conclusions
Theoretically sound, but:
- The results are very sensitive to the value of the query penalty, which is therefore very difficult to set.
- The number of states becomes exponential in the number of uncertain parameters! This greatly increases the complexity of the problem.
- With MEDUSA, we give up the theoretical guarantees of optimality to get a tractable algorithm.
MEDUSA: The main ideas
MEDUSA: Markovian Exploration with Decision based on the Use of Samples Algorithm
- Use Dirichlet distributions to represent the current knowledge about the parameters of the POMDP model.
- Sample models from the distribution.
- Use the sampled models to take actions that could be good.
- Use queries to improve the current knowledge (a schematic sketch of this loop follows).
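The sketch below spells out that loop, using the Dirichlet machinery described next. It is schematic: the environment interface (env.belief, env.step, env.query_hidden_states), the query schedule, and the way an action is chosen from the sampled models are all placeholders, not the exact rules used by MEDUSA.

```python
import numpy as np

def sample_model(alpha_T, alpha_O, rng):
    """Draw one candidate model: each transition row T[s,a,:] and observation row O[a,s',:]
    is sampled from the Dirichlet distribution given by its alpha counts."""
    nS, nA, _ = alpha_T.shape
    T = np.array([[rng.dirichlet(alpha_T[s, a]) for a in range(nA)] for s in range(nS)])
    O = np.array([[rng.dirichlet(alpha_O[a, s2]) for s2 in range(nS)] for a in range(nA)])
    return T, O

def medusa_sketch(env, choose_action, alpha_T, alpha_O,
                  n_models=20, n_steps=1000, query_prob=0.1, learn_rate=1.0):
    """Schematic MEDUSA-style loop (placeholder details, not the talk's exact algorithm)."""
    rng = np.random.default_rng(0)
    for _ in range(n_steps):
        # 1. Sample candidate models from the current Dirichlet distributions.
        models = [sample_model(alpha_T, alpha_O, rng) for _ in range(n_models)]
        # 2. Act using the sampled models; `choose_action` is a user-supplied policy.
        a = choose_action(models, env.belief())
        z = env.step(a)
        # 3. Occasionally query the hidden state and add the observed transition
        #    and observation to the Dirichlet counts.
        if rng.random() < query_prob:
            s_prev, s_new = env.query_hidden_states()
            alpha_T[s_prev, a, s_new] += learn_rate
            alpha_O[a, s_new, z] += learn_rate
```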
Dirichlet distributions
- Let X ∈ {1, 2, ..., N}. X is drawn from a multinomial distribution with parameters (θ1, ..., θN) iff p(X = i) = θi.
- The Dirichlet distribution is a distribution over multinomial distribution parameters, i.e. over (θ1, ..., θN) tuples such that θi ≥ 0 and Σi θi = 1.
Dirichlet distributions
- Dirichlet distributions have parameters αi such that αi > 0.
- We can sample from Dirichlet distributions by using Gamma distributions.
- The most likely parameters in a Dirichlet distribution are the following:
  θi = (αi - 1) / (Σj αj - N)   (for αi > 1)

Dirichlet distributions
- We can also compute the probability of multinomial distribution parameters according to the Dirichlet distribution.
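A brief sketch of the Gamma-based sampling mentioned above, together with the mode formula, using numpy for illustration:

```python
import numpy as np

def sample_dirichlet(alphas, rng=None):
    """Sample (theta_1, ..., theta_N) by drawing g_i ~ Gamma(alpha_i, 1) and normalizing."""
    rng = rng or np.random.default_rng()
    g = rng.gamma(shape=np.asarray(alphas, dtype=float), scale=1.0)
    return g / g.sum()

def dirichlet_mode(alphas):
    """Most likely parameter vector (the mode), defined when every alpha_i > 1."""
    a = np.asarray(alphas, dtype=float)
    return (a - 1.0) / (a.sum() - len(a))

print(sample_dirichlet([9.0, 3.0, 3.0]))  # one random draw, roughly [0.6, 0.2, 0.2]
print(dirichlet_mode([9.0, 3.0, 3.0]))    # [0.667, 0.167, 0.167]
```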
