(Repost) Attention

Index

References

Survey on Advanced Attention-based Models
Recurrent Models of Visual Attention (2014.06.24)
Recurrent Model of Visual Attention (blog)
https://github.com/Element-Research/rnn/blob/master/scripts/evaluate-rva.lua
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (2015.02.10)
Soft Attention Mechanism for Neural Machine Translation
DRAW: A Recurrent Neural Network For Image Generation (2015.05.20)
Teaching Machines to Read and Comprehend (2015.06.04)
Learning Wake-Sleep Recurrent Attention Models (2015.09.22)
Action Recognition using Visual Attention (2015.10.12)
Recurrent Convolutional Neural Network for Object Recognition (2015)
Understanding Deep Architectures using a Recursive Convolutional Network (2014.02.19)
Multiple Object Recognition with Visual Attention (2015.04.23)
Recursive Recurrent Nets with Attention Modeling for OCR in the Wild (2016.03.09)
https://github.com/Element-Research/rnn/blob/master/examples/recurrent-visual-attention.lua (code)

Attention

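Before attention was introduced, image recognition and machine translation models fed the complete image or sentence into the network as a single input and then produced an output.
Images were also often rescaled to a fixed size, which loses information.
A person looking at something, by contrast, moves their gaze across the regions of interest, sometimes staring closely at particular details, before reaching a conclusion.
Attention adds to the network a mechanism for moving, scaling, and rotating a region of focus, so that the input becomes a sequence of partial observations.
The movement, scaling, and rotation of the region of focus are learned with reinforcement learning.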
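To make the idea of a sequence of partial observations concrete, here is a minimal sketch in plain Python. The function name `extract_glimpse`, the toy 5x5 image, and the hand-picked locations are assumptions for illustration only; in a real model the locations would be chosen by the learned policy rather than fixed in advance.

```python
def extract_glimpse(image, center, size):
    """Return a size x size patch of `image` centered at `center` (row, col).

    Pixels that fall outside the image are filled with 0.
    """
    rows, cols = len(image), len(image[0])
    top = center[0] - size // 2
    left = center[1] - size // 2
    patch = []
    for r in range(top, top + size):
        row = []
        for c in range(left, left + size):
            inside = 0 <= r < rows and 0 <= c < cols
            row.append(image[r][c] if inside else 0)
        patch.append(row)
    return patch


# A 5x5 toy "image" whose pixel value encodes its position (10 * row + col).
image = [[10 * r + c for c in range(5)] for r in range(5)]

# Instead of feeding the whole image at once, feed a sequence of glimpses.
locations = [(1, 1), (2, 3), (4, 4)]  # hand-picked here, learned in practice
glimpses = [extract_glimpse(image, loc, 3) for loc in locations]
```

Each glimpse is a small window, so the network only ever sees part of the image at a time; the last location shows how a window that runs past the border is zero-padded.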

Recurrent models of visual attention

Reference: Recurrent Models of Visual Attention (2014.06.24)

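Model

The model is called the Recurrent Attention Model, or RAM for short.

[Figure: architecture of the RAM model]

A. Glimpse Sensor: [...]

At each iteration the model can also output scaling information and a stop flag.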

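Training

The network parameters are denoted by θ. Training maximizes the expected total reward

$$J(\theta) = \mathbb{E}_{p(s_{1:T};\theta)}\left[\sum_{t=1}^{T} r_t\right]$$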

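The goal of reinforcement learning is to increase J(θ). Its gradient can be estimated from sampled episodes:

$$\nabla_\theta J = \mathbb{E}_{p(s_{1:T};\theta)}\left[\sum_{t=1}^{T} \nabla_\theta \log \pi(u_t \mid s_{1:t};\theta)\, R\right] \approx \frac{1}{M}\sum_{i=1}^{M}\sum_{t=1}^{T} \nabla_\theta \log \pi(u_t^i \mid s_{1:t}^i;\theta)\, R^i$$

where $R = \sum_{t=1}^{T} r_t$ is the total reward of an episode and π is the policy that chooses the action $u_t$ from the interaction history $s_{1:t}$.

During training, the M episodes $s^i$ (i = 1, ..., M) are obtained by running the current policy.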

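The estimate above is unbiased but can have high variance, so the following estimate is used instead:

$$\frac{1}{M}\sum_{i=1}^{M}\sum_{t=1}^{T} \nabla_\theta \log \pi(u_t^i \mid s_{1:t}^i;\theta)\,\left(R_t^i - b_t\right)$$

where $R_t^i$ is the cumulative reward of episode i and $b_t$ is a baseline that does not depend on the action $u_t^i$.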
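The policy-gradient estimate can be sketched on a toy problem. The block below is only an illustration under strong assumptions: episodes have a single step (T = 1), the policy is a softmax over three actions with logits `theta`, the reward function is made up, and the baseline is a fixed constant. None of this is the RAM training code; it only shows the shape of a score-function estimator with a baseline subtracted from the reward.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_gradient(theta, reward_fn, num_episodes, baseline, rng):
    """Monte Carlo estimate of grad_theta J for a one-step softmax policy."""
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    for _ in range(num_episodes):
        action = rng.choices(range(len(theta)), weights=probs)[0]
        advantage = reward_fn(action) - baseline  # (R - b)
        for k in range(len(theta)):
            # d log pi(action) / d theta_k for a softmax policy
            dlogp = (1.0 if k == action else 0.0) - probs[k]
            grad[k] += dlogp * advantage / num_episodes
    return grad


rng = random.Random(0)
theta = [0.0, 0.0, 0.0]
reward_fn = lambda action: 1.0 if action == 2 else 0.0  # toy reward (assumption)

for _ in range(200):
    grad = reinforce_gradient(theta, reward_fn, num_episodes=20,
                              baseline=1 / 3, rng=rng)
    theta = [t + 0.5 * g for t, g in zip(theta, grad)]  # gradient ascent on J

final_probs = softmax(theta)
```

After the updates, most of the probability mass should sit on the rewarded action. Subtracting a baseline leaves the expected gradient unchanged, because the expected score is zero, and only affects the variance.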

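Results

[Figure: glimpse paths while recognizing MNIST digits]

The figure above, from the paper, shows how the glimpse moves while recognizing digits on an enlarged and cluttered version of the MNIST dataset.
The solid green dot marks the start and the hollow green dot marks the end.
The RAM model moves along the direction of interest.
Its recognition results are better than those of both the fully connected network and the CNN-based network.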

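Torch code structure

The training code from the blog post Recurrent Model of Visual Attention has the following structure:

[Figure: structure of the Torch training code]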

(TODO) Attention-based image generation

Auto-Encoding Variational Bayes (2014.05.01)
DRAW: A Recurrent Neural Network For Image Generation (2015.05.20)

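Attention-based image caption generation

Reference: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (2015.02.10)

[Figure: images with their generated captions]

As shown above, a caption is generated from the image.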

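Model

[Figure: overview of the captioning model]

As shown in the figure above, the model passes the image through a CNN to produce a feature map.
An LSTM-based RNN then runs the attention model over this feature map and produces the caption.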

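Encoder

The feature map is divided evenly into regions, denoted

$$a = \{a_1, \dots, a_L\},\quad a_i \in \mathbb{R}^{D}$$

L is the number of regions.
For example, if the region size is [...]

The output caption is

$$y = \{y_1, \dots, y_C\},\quad y_i \in \mathbb{R}^{K}$$

K is the number of words in the vocabulary and C is the caption length.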
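The two sets a and y can be pictured with a small sketch. The grid size, the feature values, and the five-word vocabulary below are made-up assumptions; only the shapes (L vectors of length D, C one-hot vectors of length K) follow the notation above.

```python
def to_annotation_vectors(feature_map):
    """Flatten an H x W x D feature map (nested lists) into L = H * W vectors."""
    return [cell for row in feature_map for cell in row]


def one_hot(index, vocab_size):
    vec = [0] * vocab_size
    vec[index] = 1
    return vec


# Toy sizes (assumptions): a 2 x 3 grid of regions with D = 4 features each.
H, W, D = 2, 3, 4
feature_map = [[[float(h * W + w)] * D for w in range(W)] for h in range(H)]
a = to_annotation_vectors(feature_map)  # L = 6 vectors in R^D

# Hypothetical vocabulary with K = 5 words and a caption of length C = 3.
vocab = ["<start>", "a", "cat", "sits", "<end>"]
caption = ["a", "cat", "sits"]
y = [one_hot(vocab.index(word), len(vocab)) for word in caption]
```

Here `len(a)` plays the role of L and `len(y)` the role of C.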

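Decoder

The LSTM used by the model is shown in the figure below.

[Figure: the LSTM cell]

With $T_{D+m+n,\,n}$ a learned affine transformation and $E$ the word embedding matrix, it computes

$$\begin{pmatrix} i_t \\ f_t \\ o_t \\ g_t \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} T_{D+m+n,\,n} \begin{pmatrix} E y_{t-1} \\ h_{t-1} \\ \hat{z}_t \end{pmatrix}$$

$$c_t = f_t \odot c_{t-1} + i_t \odot g_t$$

$$h_t = o_t \odot \tanh(c_t)$$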

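where $i_t$, $f_t$, $o_t$ and $g_t$ are the input gate, forget gate, output gate and input modulator, $c_t$ is the memory state and $h_t$ is the hidden state. The context vector $\hat{z}_t$ is computed from the annotation vectors:

$$e_{ti} = f_{att}(a_i, h_{t-1})$$

$$\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{k=1}^{L} \exp(e_{tk})}$$

$$\hat{z}_t = \phi(\{a_i\}, \{\alpha_{ti}\})$$

where $f_{att}$ is the attention model, a multilayer perceptron conditioned on the previous hidden state $h_{t-1}$, and $\phi$ is a function that returns a single vector given the annotation vectors and their weights.

[Figure]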
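The attention weights are a softmax over per-region scores. The sketch below assumes a deliberately tiny stand-in for the attention model: `f_att` here is a single tanh of a linear score with hand-picked weights `w_a` and `w_h`, not the multilayer perceptron used in the paper. It only illustrates how scores turn into weights that sum to one.

```python
import math


def f_att(a_i, h_prev, w_a, w_h):
    """Hypothetical stand-in for the attention model: tanh of a linear score."""
    score = sum(w * x for w, x in zip(w_a, a_i))
    score += sum(w * x for w, x in zip(w_h, h_prev))
    return math.tanh(score)


def attention_weights(annotations, h_prev, w_a, w_h):
    """Softmax over the scores e_ti of all L regions."""
    e = [f_att(a_i, h_prev, w_a, w_h) for a_i in annotations]
    m = max(e)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in e]
    total = sum(exps)
    return [v / total for v in exps]


annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # L = 3 regions, D = 2
h_prev = [0.5]                                      # previous hidden state
alpha = attention_weights(annotations, h_prev, w_a=[1.0, -1.0], w_h=[0.2])
```

With these made-up weights the first region gets the highest score and therefore the largest attention weight.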

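The initial values of the LSTM memory state and hidden state are predicted by two separate multilayer perceptrons from the average of all annotation vectors:

$$c_0 = f_{init,c}\left(\frac{1}{L}\sum_{i}^{L} a_i\right)$$

$$h_0 = f_{init,h}\left(\frac{1}{L}\sum_{i}^{L} a_i\right)$$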
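As a rough sketch of this initialization, the block below averages the annotation vectors and passes the mean through two separate small networks. The helper `mlp` is a single tanh layer with hand-picked weights, used here as an assumed stand-in for the two perceptrons; the real networks and their sizes are not specified in this text.

```python
import math


def mean_annotation(annotations):
    """Average the L annotation vectors component-wise."""
    L = len(annotations)
    return [sum(column) / L for column in zip(*annotations)]


def mlp(x, weights, biases):
    """Single tanh layer, a stand-in for f_init,c and f_init,h (assumption)."""
    return [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


annotations = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
a_mean = mean_annotation(annotations)

# Two different parameter sets: one network for c_0, another for h_0.
c0 = mlp(a_mean, weights=[[0.1, 0.0]], biases=[0.0])
h0 = mlp(a_mean, weights=[[0.0, 0.1]], biases=[0.0])
```

Both initial states depend on the whole image through the mean, but they are produced by different parameters.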

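The final word probability is computed by a deep output layer:

$$p(y_t \mid a, y_{t-1}) \propto \exp\left(L_o\left(E y_{t-1} + L_h h_t + L_z \hat{z}_t\right)\right)$$

where $L_o$, $L_h$, $L_z$ and $E$ are learned parameter matrices.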

Stochastic “Hard” Attention

$$p(s_{t,i} = 1 \mid a) = \alpha_{t,i}$$

$$\hat{z}_t = \sum_{i=1}^{L} s_{t,i}\, a_i$$
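The two equations above say that exactly one region is picked at random according to the attention weights and that the context vector is the annotation vector of that region. A minimal sketch, with made-up annotation vectors and weights:

```python
import random


def sample_hard_attention(annotations, alpha, rng):
    """Sample a one-hot location s_t from alpha and return (s_t, z_t)."""
    L = len(annotations)
    chosen = rng.choices(range(L), weights=alpha)[0]
    s = [1 if i == chosen else 0 for i in range(L)]
    D = len(annotations[0])
    # z_t = sum_i s_{t,i} * a_i, which selects exactly one annotation vector
    z = [sum(s[i] * annotations[i][d] for i in range(L)) for d in range(D)]
    return s, z


rng = random.Random(0)
annotations = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]  # toy values (assumption)
alpha = [0.2, 0.5, 0.3]
s, z = sample_hard_attention(annotations, alpha, rng)
```

Because the location is sampled, the same input can give a different context vector on each call, which is why this variant is trained with a sampling-based gradient estimate.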

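We define the objective $L_s$, a variational lower bound on the marginal log-likelihood $\log p(y \mid a)$:

$$L_s = \sum_{s} p(s \mid a) \log p(y \mid s, a) \leq \log \sum_{s} p(s \mid a)\, p(y \mid s, a) = \log p(y \mid a)$$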

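Differentiating with respect to the parameters W gives

$$\frac{\partial L_s}{\partial W} = \sum_{s} p(s \mid a)\left[\frac{\partial \log p(y \mid s, a)}{\partial W} + \log p(y \mid s, a)\,\frac{\partial \log p(s \mid a)}{\partial W}\right]$$

This gradient can be approximated by Monte Carlo sampling:

$$\tilde{s}_t \sim \mathrm{Multinoulli}_L(\{\alpha_i\})$$

$$\frac{\partial L_s}{\partial W} \approx \frac{1}{N}\sum_{n=1}^{N}\left[\frac{\partial \log p(y \mid \tilde{s}^n, a)}{\partial W} + \log p(y \mid \tilde{s}^n, a)\,\frac{\partial \log p(\tilde{s}^n \mid a)}{\partial W}\right]$$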

To reduce the variance of the estimator, a momentum-style moving-average baseline is used. At the k-th mini-batch:

$$b_k = 0.9 \times b_{k-1} + 0.1 \times \log p(y \mid \tilde{s}_k, a)$$
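The baseline update is an exponential moving average and fits in a few lines. The log-probabilities below are invented numbers and the starting value of 0 is an assumption; only the 0.9 / 0.1 weighting comes from the formula above.

```python
def update_baseline(b_prev, log_prob, decay=0.9):
    """b_k = decay * b_{k-1} + (1 - decay) * log p(y | s_k, a)."""
    return decay * b_prev + (1.0 - decay) * log_prob


log_probs = [-2.0, -1.5, -1.0]  # made-up per-mini-batch values
b = 0.0                         # assumed initial baseline
history = []
for lp in log_probs:
    b = update_baseline(b, lp)
    history.append(b)
```

The baseline follows the recent log-likelihoods slowly, so subtracting it centers the learning signal without reacting to every single mini-batch.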

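To reduce the variance further, the entropy $H[s]$ of the multinoulli distribution is added:

$$\frac{\partial L_s}{\partial W} \approx \frac{1}{N}\sum_{n=1}^{N}\left[\frac{\partial \log p(y \mid \tilde{s}^n, a)}{\partial W} + \lambda_r\left(\log p(y \mid \tilde{s}^n, a) - b\right)\frac{\partial \log p(\tilde{s}^n \mid a)}{\partial W} + \lambda_e\,\frac{\partial H[\tilde{s}^n]}{\partial W}\right]$$

where b is the baseline and $\lambda_r$, $\lambda_e$ are hyperparameters.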

Deterministic “Soft” Attention

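The stochastic model above has to sample the attention location $s_t$ at every step. Instead, one can take the expectation of the context vector directly:

$$\mathbb{E}_{p(s_t \mid a)}[\hat{z}_t] = \sum_{i=1}^{L} \alpha_{t,i}\, a_i$$

This is the deterministic “soft” attention model. Because the whole model is smooth and differentiable, it can be trained end to end with standard backpropagation.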
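The relation between the two variants can be checked numerically: the soft context vector is the average that hard sampling would converge to. The sketch below uses made-up annotation vectors and weights and compares the weighted sum with an empirical average over many sampled locations.

```python
import random


def soft_context(annotations, alpha):
    """Weighted sum of annotation vectors: sum_i alpha_i * a_i."""
    D = len(annotations[0])
    return [sum(w * a_i[d] for w, a_i in zip(alpha, annotations))
            for d in range(D)]


annotations = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]  # toy values (assumption)
alpha = [0.2, 0.5, 0.3]
z_soft = soft_context(annotations, alpha)

# Monte Carlo check: sample hard locations and average the selected vectors.
rng = random.Random(0)
N = 20000
counts = [0] * len(annotations)
for _ in range(N):
    counts[rng.choices(range(len(annotations)), weights=alpha)[0]] += 1
z_mc = soft_context(annotations, [c / N for c in counts])
```

The deterministic version replaces the sampling loop with a single weighted sum, which is what makes it differentiable.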

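When computing the weights α, the softmax already guarantees $\sum_i \alpha_{t,i} = 1$ at every time step. In addition, a doubly stochastic regularization is introduced that encourages

$$\sum_{t} \alpha_{t,i} \approx 1$$

Adding this regularizer makes the generated captions richer. In other words, the results are better!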

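In addition, a gating scalar $\beta_t$ is predicted from the previous hidden state $h_{t-1}$ at each step and scales the context vector:

$$\mathbb{E}_{p(s_t \mid a)}[\hat{z}_t] = \beta_t \sum_{i=1}^{L} \alpha_{t,i}\, a_i$$

$$\beta_t = \sigma(f_\beta(h_{t-1}))$$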

Finally, the end-to-end objective can be written as

$$L_d = -\log(P(y \mid x)) + \lambda \sum_{i}^{L}\left(1 - \sum_{t}^{C} \alpha_{t,i}\right)^2$$
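The penalty term is easy to evaluate by hand on a tiny example. In the sketch below, `alpha[t][i]` is the weight on region i at step t, the attention maps and word probabilities are invented, and P(y | x) is assumed to factor into a product of per-word probabilities.

```python
import math


def doubly_stochastic_penalty(alpha, lam):
    """lam * sum_i (1 - sum_t alpha[t][i]) ** 2."""
    L = len(alpha[0])
    return lam * sum((1.0 - sum(step[i] for step in alpha)) ** 2
                     for i in range(L))


def soft_attention_loss(word_probs, alpha, lam):
    """Negative log-likelihood of the caption plus the attention penalty."""
    nll = -sum(math.log(p) for p in word_probs)
    return nll + doubly_stochastic_penalty(alpha, lam)


# C = 2 steps, L = 2 regions.
focused = [[1.0, 0.0], [1.0, 0.0]]  # both steps look at region 0
spread = [[1.0, 0.0], [0.0, 1.0]]   # each region is attended once

word_probs = [0.5, 0.5]             # made-up probabilities of the true words
loss_focused = soft_attention_loss(word_probs, focused, lam=1.0)
loss_spread = soft_attention_loss(word_probs, spread, lam=1.0)
```

With identical word probabilities, the attention pattern that covers every region once has the lower loss, which is the behavior the regularizer is meant to encourage.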

Attention-based character recognition

Reference: Recursive Recurrent Nets with Attention Modeling for OCR in the Wild (2016.03.09)

Model

[Figure: overview of the OCR model]

Recursive / Recurrent CNN

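[Figure]

A CNN shares weights within a convolutional layer.
A recursive CNN stacks several layers inside one convolutional layer, and all of these layers share the same kernels:

$$h_{i,j,k}(t) = \begin{cases} \sigma\left((w_k^{hh})^{T} x_{i,j} + b_k\right) \\ \dots \end{cases}$$

A recurrent CNN also stacks several layers inside one convolutional layer, but the original input feeds into every layer; the kernels may or may not be shared:

$$h_{i,j,k}(t) = \sigma\left((w_k^{r})^{T} h_{i,j}(t-1) + \dots\right)$$

Both recursive and recurrent CNNs enlarge the receptive field and reduce the number of parameters.
The referenced paper reports that the recursive CNN works better than the recurrent CNN.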
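The claim about receptive field and parameter count can be illustrated in one dimension. The sketch below is an assumption-laden toy: it uses a 1D signal instead of an image, a ReLU instead of σ so that untouched positions stay exactly zero, and a hand-picked kernel of ones. It applies the same kernel repeatedly and measures how far a single input pixel spreads.

```python
def conv1d_same(signal, kernel, bias=0.0):
    """1D cross-correlation with zero padding, followed by ReLU."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = bias
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(max(0.0, acc))
    return out


def recursive_conv(signal, kernel, steps):
    """Apply the SAME kernel `steps` times (weights shared across steps)."""
    h = signal
    for _ in range(steps):
        h = conv1d_same(h, kernel)
    return h


impulse = [0.0] * 4 + [1.0] + [0.0] * 4  # a single active input position
kernel = [1.0, 1.0, 1.0]                 # 3 shared weights (assumption)

# Number of output positions influenced by the impulse after t steps.
spans = [sum(1 for v in recursive_conv(impulse, kernel, t) if v > 0)
         for t in (1, 2, 3)]
```

Each extra step widens the influenced region by two positions (kernel size minus one), while the number of weights stays at three because the kernel is reused; an unshared stack of the same depth would need three weights per step.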