
93 Fuzzy Q-learning and Dynamic Fuzzy Q-learning



Introduction

In the reinforcement learning paradigm, an agent receives from its environment a scalar reward value called the reinforcement. This feedback is rather poor: it can be boolean (true, false) or fuzzy (bad, fair, very good, ...), and, moreover, it may be delayed. A sequence of control actions is often executed before receiving any information on the quality of the whole sequence. Therefore, it is difficult to evaluate the contribution of an individual action.

Q-learning

Q-learning is a form of competitive learning which provides agents with the capability of learning to act optimally by evaluating the consequences of actions. Q-learning keeps a Q-function which attempts to estimate the discounted future reinforcement for taking actions from given states. A Q-function is a mapping from state-action pairs to predicted reinforcement. In order to explain the method, we adopt the implementation proposed by Bersini.
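As a rough illustration of this idea (not the implementation described below), a tabular Q-function can simply be stored as a mapping from state-action pairs to estimated discounted reinforcement. The states, actions and values in this sketch are made-up placeholders:

```python
# Minimal sketch of a tabular Q-function: a mapping from (state, action)
# pairs to the predicted discounted reinforcement. The entries here are
# illustrative, not the cells/agents of Bersini's scheme described below.
Q = {}

def q_value(state, action):
    # Unvisited pairs default to 0.0 (no prediction yet).
    return Q.get((state, action), 0.0)

def best_action(state, actions):
    # Greedy choice: the action with the highest predicted reinforcement.
    return max(actions, key=lambda a: q_value(state, a))

Q[("cell_3", "push_left")] = 0.42   # example entry
print(best_action("cell_3", ["push_left", "push_right"]))
```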

  1. The state space, \(U\subset R^{n}\), is partitioned into hypercubes or cells. Among these cells we can distinguish: (a) one particular cell, called the target cell, to which the quality value +1 is assigned, (b) a subset of cells, called the viability zone, which the process must not leave. The quality value for the viability zone is 0. This notion of viability zone comes from Aubin and eliminates strong constraints on a reference trajectory for the process. (c) the remaining cells, called the failure zone, with the quality value -1.
  2. In each cell, a set of \(J\) agents compete to control a process. With \(M\) cells, the agent \(j\), \(j \in \{1,\ldots, J\}\), acting in cell \(c\), \(c\in\{1,\ldots,M\}\), is characterized by its quality value \(Q[c,j]\). The probability that agent \(j\) in cell \(c\) will be selected is given by a Boltzmann distribution.
  3. The selected agent controls the process as long as the process stays in the cell. When the process leaves the cell \(c\) to get into a cell \(c'\) at time step \(t\), another agent is selected for cell \(c'\) and the Q-function of the previous agent is incremented by:
    \[\Delta Q[c,j] = \alpha \left\{ r(t) + \gamma \underset{k}{\max}\,Q[c',k] - Q[c,j] \right\} \]
    where \(\alpha\) is the learning rate \((\alpha <1)\), \(\gamma\) the discount rate \((\gamma <1)\) and \(r(t)\) the reinforcement:
    \[r(t) = \left\{\begin{matrix} +1 & {\rm if} \ c' \ {\rm is \ the\ target\ cell\ (reward) } \\ 0 & { \rm if} \ c' \ { \rm is \ in \ the \ viability \ zone}\\ -1 & {\rm if }\ c' \ {\rm is \ in \ the \ failure \ zone \ (punishment)} \end{matrix}\right.\]
    A short code sketch of this selection and update scheme is given after this list.
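The following is a minimal sketch of points 2 and 3, assuming a fixed number of cells and agents; the temperature parameter of the Boltzmann distribution, the constants, and the example transition are assumptions made for illustration, not values taken from the paper:

```python
import math
import random

# Assumed constants for the sketch: learning rate, discount rate,
# Boltzmann temperature, number of cells M and agents per cell J.
ALPHA, GAMMA, TEMPERATURE = 0.5, 0.9, 0.1
M, J = 16, 4
Q = [[0.0 for _ in range(J)] for _ in range(M)]   # Q[c][j]: quality of agent j in cell c

def select_agent(c):
    """Point 2: pick an agent for cell c with probability given by a Boltzmann distribution."""
    weights = [math.exp(Q[c][j] / TEMPERATURE) for j in range(J)]
    total = sum(weights)
    return random.choices(range(J), weights=[w / total for w in weights])[0]

def reinforcement(c_next, target_cell, viability_zone):
    """r(t): +1 for the target cell, 0 inside the viability zone, -1 otherwise (failure)."""
    if c_next == target_cell:
        return 1.0
    if c_next in viability_zone:
        return 0.0
    return -1.0

def update(c, j, c_next, r):
    """Point 3: increment the Q-value of the previous agent when the process leaves cell c for c'."""
    Q[c][j] += ALPHA * (r + GAMMA * max(Q[c_next]) - Q[c][j])

# Example transition (illustrative): the agent selected in cell 2 drives the process into cell 5.
target, viable = 7, set(range(M))   # assumed target cell and viability zone
c, c_next = 2, 5
j = select_agent(c)
update(c, j, c_next, reinforcement(c_next, target, viable))
```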

Original article: https://www.cnblogs.com/jtailong/p/11773689.html
