Reinforcement Learning Basics

Date: 2018-08-05 14:16:44

Concepts

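Reinforcement learning learns a mapping from environment states to agent actions, called the agent's policy, that maximizes the reinforcement return. The environment is usually defined as an MDP.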

Markov decision process: $MDP = \{ S, A, P, R \}$

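Reward function for state transitions: $R: S\times A\times S \to \mathbb{R}$
Transition probability: $P: S\times A\times S \to [0,1]$, with $\sum_{s'\in S}P(s'|s,a)=1$ for all $s\in S$, $a\in A$
Partially observable MDP: MDP + O + P(O|S), where O is the set of observations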

A stationary policy is a deterministic, time-independent function $\pi:S\to A$

$Q^\pi(s,a)=\sum_{s'\in S}P(s'|s,a)[R(s,a,s')+\gamma V^\pi(s')]$, where $\gamma$ is the discount factor

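$V^\pi(s)=Q^\pi(s,\pi(s))$. $V^\pi(s)$ is the expected return from state s under $\pi$, and $Q^\pi(s,a)$ is the expected return from taking action a in state s and following $\pi$ afterwards.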

Optimal policy $\pi^*$: in every state, choose the action with the largest expected return.

$V^*(s)=\max_aQ^*(s,a),\pi^*(s)=\arg\max_aQ^*(s,a)$
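To make the Q and greedy-policy definitions concrete, here is a minimal sketch on a made-up 2-state, 2-action MDP. The transition table `P`, the reward function `R`, and the helper names `q_from_v` and `greedy` are all assumptions for illustration; they do not come from the original post.

```python
# Hypothetical 2-state, 2-action MDP; every number here is invented for illustration.
ACTIONS = [0, 1]
GAMMA = 0.9  # discount factor

# P[s][a] is a list of (next_state, probability); each list sums to 1.
P = {
    0: {0: [(0, 1.0)], 1: [(1, 1.0)]},
    1: {0: [(0, 1.0)], 1: [(1, 1.0)]},
}


def R(s, a, s_next):
    """Reward 1 for arriving in state 1, otherwise 0."""
    return 1.0 if s_next == 1 else 0.0


def q_from_v(V, s, a):
    """Q(s,a) = sum_{s'} P(s'|s,a) * [R(s,a,s') + gamma * V(s')]."""
    return sum(p * (R(s, a, s2) + GAMMA * V[s2]) for s2, p in P[s][a])


def greedy(V, s):
    """pi(s) = argmax_a Q(s,a), with Q computed from the given V."""
    return max(ACTIONS, key=lambda a: q_from_v(V, s, a))
```

With `V = {0: 0.0, 1: 0.0}`, the greedy action in state 0 is action 1, because only that action collects the reward.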

Dynamic Programming

When P is known, reinforcement learning reduces to a deterministic dynamic programming algorithm.

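Value iteration: start from V=0, compute Q, maximize over actions to obtain $\pi$, and from that obtain the new value of V.
Policy iteration: start from a random policy $\pi$ and V=0, solve the V or Q equations to obtain new values of V and Q, then compute a new policy.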
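Value iteration as described above can be sketched in a few lines. The toy MDP below (table `P` with rewards folded into each transition) and the stopping tolerance `tol` are assumptions chosen only so the example has a known answer; this is an illustration, not a reference implementation.

```python
GAMMA = 0.9  # discount factor

# Hypothetical MDP: P[s][a] is a list of (next_state, probability, reward).
P = {
    0: {0: [(0, 1.0, 0.0)], 1: [(1, 1.0, 1.0)]},
    1: {0: [(0, 1.0, 0.0)], 1: [(1, 1.0, 1.0)]},
}


def q_value(V, s, a):
    """Q(s,a) computed from the current value estimate V."""
    return sum(p * (r + GAMMA * V[s2]) for s2, p, r in P[s][a])


def value_iteration(tol=1e-10):
    """Start from V=0 and repeat V(s) <- max_a Q(s,a) until V stops changing."""
    V = {s: 0.0 for s in P}
    while True:
        new_V = {s: max(q_value(V, s, a) for a in P[s]) for s in P}
        delta = max(abs(new_V[s] - V[s]) for s in P)
        V = new_V
        if delta < tol:
            break
    # Read the greedy policy off the converged values.
    policy = {s: max(P[s], key=lambda a: q_value(V, s, a)) for s in P}
    return V, policy
```

In this toy problem the best policy always takes action 1 and earns 1 per step, so both state values approach $1/(1-\gamma)=10$.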

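When P is unknown, the expectations taken over P can be estimated from samples with a stochastic algorithm. The two approximation formulas below are equivalent.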

Estimate formula: $A_k = \frac{1}{k}\sum_{i=1}^{k} v_i=A_{k-1} +\alpha_k(v_k-A_{k-1})$, $\alpha_k=\frac{1}{k}$; $TD= v_k-A_{k-1}$ is called the TD error.
Robbins-Monro stochastic approximation formula: $A_k =(1-\alpha_k)A_{k-1}+\alpha_kv_k$
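A quick numerical check that the two forms agree and both reproduce the sample mean when $\alpha_k = 1/k$. The function names are made up for this sketch.

```python
def running_average(values):
    """TD-error form: A_k = A_{k-1} + alpha_k * (v_k - A_{k-1}), alpha_k = 1/k."""
    A = 0.0
    for k, v in enumerate(values, start=1):
        alpha = 1.0 / k
        td_error = v - A
        A = A + alpha * td_error
    return A


def robbins_monro(values):
    """Robbins-Monro form: A_k = (1 - alpha_k) * A_{k-1} + alpha_k * v_k."""
    A = 0.0
    for k, v in enumerate(values, start=1):
        alpha = 1.0 / k
        A = (1.0 - alpha) * A + alpha * v
    return A
```

Both functions return the arithmetic mean of `values`; replacing `1/k` with a constant step size would give an exponentially weighted average instead.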

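$Q(\lambda=0)$ learning, where $\lambda$ controls how many steps of lookahead are used ($\lambda=0$ is the one-step case). Repeat the following steps: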

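Select and execute an action a. To keep a chance to explore, choose a non-greedy action with probability $\epsilon$ (and the greedy action with probability $1-\epsilon$).
Observe the reward r and the state s'
$Q(s,a)\leftarrow Q(s,a)+\alpha(r+\gamma\max_{a'}Q(s',a')-Q(s,a))$. The action a' selected by the max is not necessarily executed afterwards, which is why this is called off-policy
When state values are learned instead, the corresponding update is $TD(\lambda=0)$: $V(s_t)\leftarrow V(s_t)+\alpha(r_{t+1}+\gamma V(s_{t+1})-V(s_t))$
$s\leftarrow s'$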
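The loop above can be sketched as tabular Q-learning with an $\epsilon$-greedy behavior policy. The two-state environment in `step`, the hyperparameters, and the fixed random seed are assumptions made for this illustration only.

```python
import random

GAMMA = 0.9
ACTIONS = [0, 1]


def step(s, a):
    """Hypothetical deterministic environment: action a leads to state a;
    arriving in state 1 pays reward 1, arriving in state 0 pays 0."""
    s_next = a
    return (1.0 if s_next == 1 else 0.0), s_next


def q_learning(steps=20000, alpha=0.1, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in (0, 1) for a in ACTIONS}
    s = 0
    for _ in range(steps):
        # Epsilon-greedy: explore with probability epsilon, otherwise act greedily.
        if rng.random() < epsilon:
            a = rng.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda b: Q[(s, b)])
        r, s_next = step(s, a)
        # Off-policy target: uses max over a', whatever action is executed next.
        target = r + GAMMA * max(Q[(s_next, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s_next
    return Q
```

In this toy environment the learned table should prefer action 1 in both states, with $Q(1,1)$ approaching $1/(1-\gamma)=10$.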

$SARSA(\lambda=0)$ learning. Repeat the following steps:

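Execute an action a; observe the reward r and the state s'
Select a' using the policy derived from Q
$Q(s,a)\leftarrow Q(s,a)+\alpha(r+\gamma Q(s',a')-Q(s,a))$. The selected action a' is always executed next, which is why this is called on-policy
$s\leftarrow s', a\leftarrow a'$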
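For contrast with Q-learning, here is a sketch of the SARSA loop. The only difference is that the target uses the action a' that will actually be executed next. The toy environment, the $\epsilon$-greedy helper, and all hyperparameters are assumptions for illustration.

```python
import random

GAMMA = 0.9
ACTIONS = [0, 1]


def step(s, a):
    """Hypothetical deterministic environment: action a leads to state a;
    arriving in state 1 pays reward 1, arriving in state 0 pays 0."""
    s_next = a
    return (1.0 if s_next == 1 else 0.0), s_next


def epsilon_greedy(Q, s, epsilon, rng):
    """Policy derived from Q: random action with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda b: Q[(s, b)])


def sarsa(steps=50000, alpha=0.1, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in (0, 1) for a in ACTIONS}
    s = 0
    a = epsilon_greedy(Q, s, epsilon, rng)
    for _ in range(steps):
        r, s_next = step(s, a)
        a_next = epsilon_greedy(Q, s_next, epsilon, rng)
        # On-policy target: uses Q(s', a') for the a' that is executed next.
        Q[(s, a)] += alpha * (r + GAMMA * Q[(s_next, a_next)] - Q[(s, a)])
        s, a = s_next, a_next
    return Q
```

Because SARSA evaluates the exploring policy itself, its values settle somewhat below the Q-learning values, but the preferred action in each state is the same in this toy problem.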

Extended Models

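As the dimension of the state space grows and the action space becomes continuous, the computational cost grows exponentially, so a low-cost version of V/Q is needed. The usual solution is function approximation.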

Policy Gradient Methods

A continuous action space makes $\max_{a'}Q(s',a')$ impractical, so PG approximates $Q$ and $\pi$ with differentiable functions.

Make the policy stochastic and parameterized: $\pi(s,a,\theta)=P\{a_t=a|s_t=s,\theta\}$
Long-term return function: $\rho(\pi)=E[\sum_{t=1}^{\infty}\gamma^{t-1}r_t|s_0,\pi]=\sum_a\pi(s_0,a)Q^\pi(s_0,a)$
Gradient theorem: $\frac{\partial\rho}{\partial\theta}=\sum_sd^\pi(s)\sum_a\frac{\partial\pi(s,a)}{\partial\theta}Q^\pi(s,a)$, $d^\pi(s)=\sum_{t=0}^{\infty}\gamma^tP\{s_t=s|s_0,\pi\}$

Find a function that approximates $Q^\pi$: $f_w:S\times A\to \mathbb{R}$

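Then follow the gradient (gradient ascent, since the return is maximized) to an extremum of the long-term return function: $\lim_{k\to\infty}\frac{\partial\rho(\pi_k)}{\partial\theta}=0$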
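As a small worked instance of the gradient theorem, consider a one-step problem with a single state, so that $d^\pi$ is a constant factor and $Q^\pi(s,a)$ is a fixed reward per action. The softmax parameterization, the action values in `Q`, and the step size are all assumptions for this sketch. For a softmax policy, $\partial\pi_a/\partial\theta_j=\pi_a(1[a=j]-\pi_j)$, so the theorem reduces to $\partial\rho/\partial\theta_j=\pi_j(Q_j-\rho)$.

```python
import math

Q = [1.0, 2.0, 0.5]  # hypothetical fixed action values for the single state


def softmax(theta):
    """pi(a) proportional to exp(theta_a), computed stably."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def policy_gradient(theta):
    """d rho / d theta_j = sum_a (d pi_a / d theta_j) * Q_a = pi_j * (Q_j - rho)."""
    pi = softmax(theta)
    rho = sum(p * q for p, q in zip(pi, Q))
    return [p * (q - rho) for p, q in zip(pi, Q)]


def train(steps=2000, lr=0.5):
    """Gradient ascent on rho, starting from the uniform policy."""
    theta = [0.0, 0.0, 0.0]
    for _ in range(steps):
        grad = policy_gradient(theta)
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta
```

After training, `softmax(train())` puts almost all probability on the action with the largest value, and the gradient shrinks toward zero, matching the limit condition above.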

DQN

DQN approximates the Q function with a neural network that maps a state to one Q value per action, $f_w:S\to \mathbb{R}^{|A|}$

The loss function is: $L_i(\theta_i)=E_{s,a\sim\rho}[(y_i-Q(s,a;\theta_i))^2]$, $y_i=E_{s'\sim\mathcal{E}}[r+\gamma\max_{a'}Q(s',a';\theta_{i-1})]$, where $\mathcal{E}$ denotes the environment

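Key features: when Q is updated, a minibatch of transitions is sampled from a replay buffer and the targets are computed with a separate target network Q'; every C steps, Q' is reset to a copy of Q.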
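The two mechanisms, a replay buffer and a target network that is synchronized every C steps, can be illustrated without a neural network. In the sketch below a plain table `theta` stands in for the network parameters, so a gradient step on the squared loss becomes a simple move toward the target. The toy environment, `sync_every` (playing the role of C), and the other hyperparameters are assumptions for illustration; this is not the DQN implementation from the paper.

```python
import random
from collections import deque

GAMMA = 0.9
N_STATES, N_ACTIONS = 2, 2


def step(s, a):
    """Hypothetical deterministic environment: action a leads to state a;
    arriving in state 1 pays reward 1, arriving in state 0 pays 0."""
    s_next = a
    return (1.0 if s_next == 1 else 0.0), s_next


def train(steps=20000, batch_size=8, sync_every=50, lr=0.1, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    theta = [[0.0] * N_ACTIONS for _ in range(N_STATES)]  # online Q (table stand-in)
    theta_target = [row[:] for row in theta]              # target Q'
    buffer = deque(maxlen=1000)                           # replay buffer
    s = 0
    for t in range(1, steps + 1):
        if rng.random() < epsilon:
            a = rng.randrange(N_ACTIONS)
        else:
            a = max(range(N_ACTIONS), key=lambda b: theta[s][b])
        r, s_next = step(s, a)
        buffer.append((s, a, r, s_next))
        s = s_next
        if len(buffer) >= batch_size:
            # Sample a minibatch of stored transitions, not just the latest one.
            for bs, ba, br, bs2 in rng.sample(list(buffer), batch_size):
                y = br + GAMMA * max(theta_target[bs2])    # target uses Q'
                theta[bs][ba] += lr * (y - theta[bs][ba])  # step on (y - Q)^2
        if t % sync_every == 0:
            theta_target = [row[:] for row in theta]       # Q' <- Q every C steps
    return theta
```

Holding `theta_target` fixed between synchronizations keeps the regression targets stable while `theta` is being updated, which is the point of the separate network.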

AlphaGo

Technical features

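Supervised learning of the policy network, which provides the initial weights
Reinforcement learning of the policy network: only the final step yields a reward, which is then used to reinforce the policy at every step
A value network obtained by reinforcement learning on top of the policy network
Monte Carlo tree search is used for sampling.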

AlphaGo Zero

Drops supervised learning, uses a single network to estimate both policy and value, and samples with Monte Carlo tree search.

DDPG

Designed for continuous action spaces

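Return function: $J(\pi_\theta)=\int_S\rho^\pi(s)\int_A\pi_\theta(s,a)r(s,a)\,da\,ds=E_{s\sim\rho^\pi,a\sim\pi_\theta}[r(s,a)]$
DPG theorem: for a deterministic policy $\mu_\theta$ the objective is $J(\mu_\theta)=\int_S\rho^\mu(s)r(s,\mu_\theta(s))\,ds=E_{s\sim\rho^\mu}[r(s,\mu_\theta(s))]$, and its gradient is $\nabla_\theta J(\mu_\theta)=E_{s\sim\rho^\mu}[\nabla_\theta\mu_\theta(s)\nabla_aQ^\mu(s,a)|_{a=\mu_\theta(s)}]$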

Two optimizable components are used:

The actor function $\mu$ approximates $\pi$ and is optimized with the sampled gradient.
The critic network approximates Q, with loss function: $L=\frac{1}{N}\sum_i(y_i-Q(s_i,a_i|\theta^Q))^2$, $y_i=r_i+\gamma Q'(s_{i+1},\mu'(s_{i+1}|\theta^{\mu'})|\theta^{Q'})$
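The critic loss can be written as a small function that takes the critic, the target critic, and the target actor as plain callables. The function name `critic_loss` and the stand-in callables in the usage note are hypothetical; in a real setup these would be networks with parameters $\theta^Q$, $\theta^{Q'}$ and $\theta^{\mu'}$.

```python
def critic_loss(batch, q, q_target, mu_target, gamma=0.99):
    """Mean squared error between y_i and Q(s_i, a_i).

    batch     : list of transitions (s, a, r, s_next)
    q         : callable Q(s, a), the critic being trained
    q_target  : callable Q'(s, a), the target critic
    mu_target : callable mu'(s), the target actor
    """
    total = 0.0
    for s, a, r, s_next in batch:
        # y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}))
        y = r + gamma * q_target(s_next, mu_target(s_next))
        total += (y - q(s, a)) ** 2
    return total / len(batch)
```

For example, with the made-up stand-ins `q = q_target = lambda s, a: 2 * s + a` and `mu_target = lambda s: -s`, the single transition `(1.0, 0.5, 1.0, 2.0)` with `gamma=0.5` gives a target of 2.0 against a prediction of 2.5, so the loss is 0.25.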

DDPG is DPG trained with the techniques of DQN (a replay buffer and target networks).

References

Mozer S, M C, Hasselmo M. Reinforcement Learning: An Introduction[J]. IEEE Transactions on Neural Networks, 1992, 8(3-4):225-227.
Sutton R S. Policy Gradient Methods for Reinforcement Learning with Function Approximation[J]. Submitted to Advances in Neural Information Processing Systems, 1999, 12:1057-1063.
Simon Haykin, Neural Networks and Learning Machines (the 3rd edition), Pearson Education, Inc, 2009
David L. Poole and Alan K. Mackworth: Artificial Intelligence: Foundations of Computational Agents, Cambridge University Press, 2010
Mnih V, Kavukcuoglu K, Silver D, et al. Playing Atari with Deep Reinforcement Learning[J]. Computer Science, 2013.
Silver D, Lever G, Heess N, et al. Deterministic policy gradient algorithms[C]// International Conference on Machine Learning. JMLR.org, 2014:387-395.
Lillicrap T P, Hunt J J, Pritzel A, et al. Continuous control with deep reinforcement learning[J]. Computer Science, 2015, 8(6):A187.
Mnih V, Badia A P, Mirza M, et al. Asynchronous Methods for Deep Reinforcement Learning[J]. 2016.
Silver D, Huang A, Maddison C J, et al. Mastering the game of Go with deep neural networks and tree search.[J]. Nature, 2016, 529(7587):484-489.
Silver D, Schrittwieser J, Simonyan K, et al. Mastering the game of Go without human knowledge[J]. Nature, 2017, 550(7676):354-359.

Original article: https://www.cnblogs.com/liuyunfeng/p/9387648.html