DeepLearning (2): Understanding Autoencoders and Sparsity, with Practice

Date: 2015-05-08 14:53:35

[Original] Liu_LongPo. Please credit the source when reposting.
[CSDN] http://blog.csdn.net/llp1992

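In supervised learning, the training samples carry class labels. Now suppose we only have an unlabeled training set { $x^{(1)},x^{(2)},x^{(3)},...$ }, where $x^{(i)}\in \mathbb{R}^n$ .

An autoencoder neural network is an unsupervised learning algorithm that uses backpropagation and sets the target values equal to the inputs, i.e. $y^{(i)}=x^{(i)}$ . An example is shown in the figure below: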

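The autoencoder tries to learn a function $h_{W,b}(x)\approx x$ . At first this looks unrealistic: as we learned in information theory, processing a signal layer by layer can lose information but never add any, so in general the output of an autoencoder cannot exactly equal its input.

True, the input and output will generally not be exactly equal, but we force them to be as close as possible anyway (it does seem rather unwilling = =). That is: I ==> Sn ==> O, with O as close to I as possible. (I is the input, Sn is the output a of each intermediate hidden layer, and O is the network output.)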

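What is all this fuss for? Why make O equal I when we could just use I directly? Indeed, O itself is not important. But note that Sn is!
Why is Sn important? Look at the model above: the input I has 6 dimensions, the output O has 6 dimensions, and the hidden layer Sn has only 3. What does that remind you of? PCA? Whitening? Close, but not quite the same. It is a form of dimensionality reduction, but PCA extracts the principal (linear) components of the data, whereas Sn here is a learned, nonlinear representation that captures more of the underlying structure of the data. Why? Because we force the network to represent 6-dimensional data with only 3 dimensions, and to achieve that it has to discover structure present in the input.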
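To make the I ==> Sn ==> O picture concrete, here is a minimal sketch of a 6-3-6 forward pass in plain Python (standard library only, not the MATLAB used later in this post). The weights are random and untrained, and the helper names `layer`, `autoencoder_forward` and `reconstruction_error` are made up for illustration; the point is only the shapes and the quantity that training would minimize.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(W, b, x):
    """One fully connected sigmoid layer: sigmoid(W x + b)."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def autoencoder_forward(W1, b1, W2, b2, x):
    hidden = layer(W1, b1, x)        # Sn: the compressed code
    output = layer(W2, b2, hidden)   # O: the reconstruction of I
    return hidden, output

def reconstruction_error(x, output):
    """Squared error 0.5 * ||output - x||^2 for one sample."""
    return 0.5 * sum((o - xi) ** 2 for o, xi in zip(output, x))

random.seed(0)
n_in, n_hidden = 6, 3  # sizes taken from the example in the text
W1 = [[random.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
b1 = [0.0] * n_hidden
W2 = [[random.uniform(-0.1, 0.1) for _ in range(n_hidden)] for _ in range(n_in)]
b2 = [0.0] * n_in

x = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7]  # an arbitrary 6-dimensional input
hidden, output = autoencoder_forward(W1, b1, W2, b2, x)
print(len(hidden), len(output))
print(reconstruction_error(x, output))
```

Training would adjust W1, b1, W2, b2 to shrink the printed error over the whole training set; the 3 numbers in `hidden` are the features we actually care about.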

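So the 3-dimensional output Sn learned by the hidden layer is a more essential feature of the input data, as learned by the network. If we add more hidden layers, as in the figure below:

That is: I–>S1–>S2–>…–>Sn–>O, where the output $a_{n-1}$ of each hidden layer $S_{n-1}$ serves as the input of the next layer $S_{n}$ , so the features learned by each layer are further abstracted by the next. For example, if the input I is a face image, S1 might learn edges, S2 combinations of edges such as a nose, S3 a rough model of a face, and so on. The final features learned by a multi-layer network can therefore describe the input data much more thoroughly, which can considerably improve the accuracy of the classification or recognition stage that follows.

Oh, one remark on the essence of deep learning: the model is a tool, and its purpose is to learn features of the input data.

In other words, for the final classification or recognition task we still need to add a classifier or something similar on top.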

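Sparsity constraint

The discussion so far assumed a small number of hidden units. But even when the number of hidden units is large (possibly larger than the number of input pixels), we can still discover structure in the input by imposing other constraints on the autoencoder. Specifically, if we impose a sparsity constraint on the hidden units, the autoencoder can still discover interesting structure in the input even with many hidden units.
Sparsity can be explained informally as follows. We regard a neuron as active when its output is close to 1 and as inactive when its output is close to 0; the constraint that neurons be inactive most of the time is called the sparsity constraint. Here we assume a sigmoid activation function. If you use tanh instead, a neuron is considered inactive when its output is close to -1.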

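Let $a_j^{(2)}(x)$ denote the activation of hidden unit $j$ of the autoencoder when the input is $x$ . Then

$p'_j=\frac {1}{m} \sum_{i=1}^{m}\left[a_j^{(2)}(x^{(i)})\right]$

where $p'_j$ is the average activation of hidden unit $j$ . Note that the average is taken over the training set of $m$ samples.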
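As a small illustration of this average, the sketch below computes $p'_j$ for three hidden units over four samples in plain Python. The activation values are invented for the example, and `average_activation` is a hypothetical helper name, not something from the post's code.

```python
def average_activation(activations):
    """activations[i][j] is a_j^(2)(x^(i)); returns the list of p'_j."""
    m = len(activations)
    n_hidden = len(activations[0])
    return [sum(row[j] for row in activations) / m for j in range(n_hidden)]

# Rows are samples, columns are hidden units (made-up values).
activations = [
    [0.9, 0.1, 0.0],
    [0.1, 0.1, 0.0],
    [0.2, 0.1, 0.0],
    [0.0, 0.1, 0.2],
]
p_hat = average_activation(activations)
print(p_hat)
```

In the MATLAB listings later in the post the same quantity is obtained with `sum(a2,2) / numPatches`, where each column of `a2` is one sample.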

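Then we make life hard for it again by adding a constraint:

$p'_j = p$

where $p$ is the sparsity parameter, typically a small value close to 0, e.g. 0.05. To satisfy this constraint, the activation of each hidden unit must be close to 0 for most inputs.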

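Why give it such a hard time? Why should only a small fraction of the hidden units be active (output well above 0) while most of the others stay close to 0? Because we are trying to imitate the human brain. Neural networks were originally modeled on neurons in the brain, and so is deep learning. The brain has a huge number of neurons, but when a natural image enters through the visual system it stimulates only a small fraction of them, while most remain inhibited. Moreover, most natural images can be represented as a superposition of a small number of basic elements (such as surfaces or lines). Put differently, sparsity helps us extract more essential features of natural images with a small number of active neurons.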

To enforce this constraint, we add an extra penalty term to the optimization objective, which penalizes cases where $p'_j$ deviates significantly from $p$ . The penalty term is:

$\sum_{j=1}^{s_2}\left[p\log {\frac{p}{p'_j}} + (1-p) \log \frac {1-p}{1-p'_j}\right]$

where $s_2$ is the number of hidden units.

In terms of relative entropy (KL divergence), the penalty can also be written as:

$\sum_{j=1}^{s_2}KL(p||p'_j)$

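Taking $p=0.2$ , the figure below shows how a single term $KL(p||p'_j)$ varies with $p'_j$ :

From the figure, $KL(p||p'_j)$ is 0 when $p'_j = p$ and grows as $p'_j$ moves away from $p$ , tending to infinity as $p'_j$ approaches 0 or 1. So the role of this penalty term is clearly to push $p'_j$ as close to $p$ as possible, which enforces the sparsity constraint. For the detailed computation, see UFLDL.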
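The behaviour described above can be checked numerically. The following plain-Python sketch evaluates one KL term for $p=0.2$ at a few values of $p'_j$; the function names `kl_bernoulli` and `sparsity_penalty` are chosen here for illustration only.

```python
import math

def kl_bernoulli(p, p_hat):
    """KL(p || p_hat) between Bernoulli distributions with means p and p_hat."""
    return (p * math.log(p / p_hat)
            + (1 - p) * math.log((1 - p) / (1 - p_hat)))

def sparsity_penalty(p, p_hats):
    """Sum of KL(p || p'_j) over all hidden units j."""
    return sum(kl_bernoulli(p, q) for q in p_hats)

p = 0.2
for q in (0.01, 0.1, 0.2, 0.5, 0.9, 0.99):
    print(q, kl_bernoulli(p, q))
```

The printed value is zero at `q = 0.2` and increases on both sides, growing fastest near 0 and 1, which is the shape the plot in the original post shows.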

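MATLAB in practice

The input here is a small 8x8 image patch, reshaped into a 64x1 vector; 10000 such samples are used for training. The network learns features of the images, which turn out to be image edges. The result is shown below:

Code: sparseAutoencoderCost.m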

function [cost,grad] = sparseAutoencoderCost(theta, visibleSize, hiddenSize, ...
                                             lambda, sparsityParam, beta, data)
%lambda = 0;
%beta = 0;
% visibleSize: the number of input units (probably 64) 
% hiddenSize: the number of hidden units (probably 25) 
% lambda: weight decay parameter
% sparsityParam: The desired average activation for the hidden units (denoted in the lecture
%                           notes by the greek alphabet rho, which looks like a lower-case "p").
% beta: weight of sparsity penalty term
% data: Our 64x10000 matrix containing the training data.  So, data(:,i) is the i-th training example. 

% The input theta is a vector (because minFunc expects the parameters to be a vector). 

% We first convert theta to the (W1, W2, b1, b2) matrix/vector format, so that this 
% follows the notation convention of the lecture notes. 

% Learning rate (my own choice)
alpha = 0.01;

% Number of hidden units: 25 = hiddenSize

% Average activation of the hidden units (filled in below)
p = zeros(hiddenSize,1);

% 25x64
W1 = reshape(theta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);
% 64 X 25
W2 = reshape(theta(hiddenSize*visibleSize+1:2*hiddenSize*visibleSize), visibleSize, hiddenSize);
% 25 X1
b1 = theta(2*hiddenSize*visibleSize+1:2*hiddenSize*visibleSize+hiddenSize);
% 64 x 1
b2 = theta(2*hiddenSize*visibleSize+hiddenSize+1:end);

% Cost and gradient variables (your code needs to compute these values). 
% Here, we initialize them to zeros. 

% First term of the cost function
%{
J_sparse = 0;
W1grad = zeros(size(W1)); 
W2grad = zeros(size(W2));
b1grad = zeros(size(b1)); 
b2grad = zeros(size(b2));
%}
%% ---------- YOUR CODE HERE --------------------------------------
%  Instructions: Compute the cost/optimization objective J_sparse(W,b) for the Sparse Autoencoder,
%                and the corresponding gradients W1grad, W2grad, b1grad, b2grad.
%
% W1grad, W2grad, b1grad and b2grad should be computed using backpropagation.
% Note that W1grad has the same dimensions as W1, b1grad has the same dimensions
% as b1, etc.  Your code should set W1grad to be the partial derivative of J_sparse(W,b) with
% respect to W1.  I.e., W1grad(i,j) should be the partial derivative of J_sparse(W,b) 
% with respect to the input parameter W1(i,j).  Thus, W1grad should be equal to the term 
% [(1/m) \Delta W^{(1)} + \lambda W^{(1)}] in the last block of pseudo-code in Section 2.2 
% of the lecture notes (and similarly for W2grad, b1grad, b2grad).
% 
% Stated differently, if we were using batch gradient descent to optimize the parameters,
% the gradient descent update to W1 would be W1 := W1 - alpha * W1grad, and similarly for W2, b1, b2. 
% 


% One iteration of batch gradient descent; data is 64x10000
numPatches = size(data,2);
KLdist = 0;

% 25x10000
%a2 = zeros(size(W1,1),numPatches);
% 64x10000
%a3 = zeros(size(W2,1),numPatches);

%% Forward pass
% 25x10000 = 25x64 * 64x10000
a2 = sigmoid(W1*data+repmat(b1,[1,numPatches]));
p = sum(a2,2);
a3 = sigmoid(W2 * a2 + repmat(b2,[1,numPatches]));
J_sparse = 0.5 * sum(sum((a3-data).^2));

%{
for curPatch = 1:numPatches

    % Compute the activations
    % 25x1 activation of layer 2: 25x64 * 64x1
    a2(:,curPatch) = sigmoid(W1 * data(:,curPatch) + b1);
    % Accumulate the total activation of the hidden units
    p = p + a2(:,curPatch);
    % 64x1 activation of layer 3
    a3(:,curPatch) = sigmoid(W2 * a2(:,curPatch) + b2);
    % First term of the cost function
    J_sparse = J_sparse + 0.5 * (a3(:,curPatch)-data(:,curPatch))' * (a3(:,curPatch)-data(:,curPatch));
end
%}

%% Average activation of the hidden layer
p = p / numPatches;

%% Backward pass

% 64x10000
residual3 = -(data-a3).*a3.*(1-a3);
% 25x1: derivative of the sparsity penalty with respect to p
tmp = beta * ( - sparsityParam ./ p + (1-sparsityParam) ./ (1-p));
% 25x10000 = (25x64 * 64x10000 + 25x10000) .* 25x10000
residual2 = (W2' * residual3 + repmat(tmp,[1,numPatches])) .* a2.*(1-a2);
W2grad = residual3 * a2' / numPatches + lambda * W2;
W1grad = residual2 * data' / numPatches + lambda * W1;
b2grad = sum(residual3,2) / numPatches;
b1grad = sum(residual2,2) / numPatches;

    %{
for curPatch = 1:numPatches

    % Residual of the output layer, 64x1
   % residual3 = -( data(:,curPatch) - a3(:,curPatch)) .* (a3 - a3.^2);
    residual3 = -(data(:,curPatch) - a3(:,curPatch)).* (a3(:,curPatch) - (a3(:,curPatch).^2));
    % 25x1: (25x64 * 64x1 ==> 25x1) .* 25x1
    residual2 = (W2' * residual3 + beta * (- sparsityParam ./ p + (1-sparsityParam) ./ (1-p))) .* (a2(:,curPatch) - (a2(:,curPatch)).^2);
  %  residual2 = (W2' * residual3 ) .* (a2(:,curPatch) - (a2(:,curPatch)).^2);
    % Accumulate the partial derivatives
    % 64x25 = 64x1 * 1x25
    W2grad = W2grad + residual3 * a2(:,curPatch)';
    % 64x1 = 64x1
    b2grad = b2grad + residual3;
    % 25x64 = 25x1 * 1x64
    W1grad = W1grad + residual2 * data(:,curPatch)';
    % 25x1 = 25x1
    b1grad = b1grad + residual2;
    %J_sparse = J_sparse + (a3 - data(:,curPatch))' * (a3 - data(:,curPatch));

end

W2grad = W2grad / numPatches + lambda * W2;
W1grad = W1grad / numPatches + lambda * W1;
b2grad = b2grad / numPatches;
b1grad = b1grad / numPatches;

 %}

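%% Compute the KL divergence (relative entropy) penalty
for j = 1:hiddenSize
    KLdist = KLdist + sparsityParam *log( sparsityParam / p(j) )   +   (1 - sparsityParam) * log((1-sparsityParam) / (1 - p(j)));
end

%% Cost: reconstruction error + lambda weight decay + sparsity penalty.
% Evaluated before any weight update, so that cost and grad refer to the
% same W1, W2.
cost = J_sparse / numPatches + (sum(sum(W1.^2)) + sum(sum(W2.^2))) * lambda / 2  + beta * KLdist;

%cost = J_sparse / numPatches + (sum(sum(W1.^2)) + sum(sum(W2.^2))) * lambda / 2;

%% Manual gradient-descent step (weight decay is already included in the
% gradients). W1, W2, b1, b2 are local and not returned, so this step has
% no effect when minFunc drives the optimization.
W2 = W2 - alpha * ( W2grad  );
W1 = W1 - alpha * ( W1grad );

b2 = b2 - alpha * (b2grad );
b1 = b1 - alpha * (b1grad );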


%-------------------------------------------------------------------
% After computing the cost and gradient, we will convert the gradients back
% to a vector format (suitable for minFunc).  Specifically, we will unroll
% your gradient matrices into a vector.

grad = [W1grad(:) ; W2grad(:) ; b1grad(:) ; b2grad(:)];

end

%-------------------------------------------------------------------
% Here's an implementation of the sigmoid function, which you may find useful
% in your computation of the costs and the gradients.  This inputs a (row or
% column) vector (say (z1, z2, z3)) and returns (f(z1), f(z2), f(z3)). 

function sigm = sigmoid(x)
    sigm = 1 ./ (1 + exp(-x));
end

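The code contains both a vectorized implementation and a non-vectorized one (the commented-out loops). The vectorized version is shorter and also runs much faster. The comments in the code are detailed enough, so I will not go through it further.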
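One step in the listing that the comments do not spell out is where the vector `tmp` comes from. As a short worked derivation in this post's notation (the symbols $\delta^{(2)}$ and $\delta^{(3)}$ for the layer residuals are borrowed from the UFLDL notes, not defined in this post), differentiating a single penalty term with respect to $p'_j$ gives:

$$\frac{\partial}{\partial p'_j} KL(p||p'_j) = -\frac{p}{p'_j} + \frac{1-p}{1-p'_j}$$

Multiplied by $\beta$, this is exactly `tmp`. It is added to the error backpropagated from the output layer before multiplying by the sigmoid derivative $a(1-a)$, which is what the line computing `residual2` does:

$$\delta^{(2)}_i = \left(\sum_{j} W^{(2)}_{ji}\,\delta^{(3)}_j + \beta\left(-\frac{p}{p'_i} + \frac{1-p}{1-p'_i}\right)\right) a^{(2)}_i\,(1-a^{(2)}_i)$$

Here `W2' * residual3` plays the role of the sum over $j$, and `sparsityParam` and `p` in the code correspond to $p$ and $p'$ in the text.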

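Next comes the second example: we learn the features contained in the images below. The input images here are 28x28 and there are 196 hidden units. The algorithm uses vectorized code; otherwise it would take a very long time to run again.

The original images:

The learned features:

Code: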

function [cost,grad] = sparseAutoencoderCost(theta, visibleSize, hiddenSize, ...
                                             lambda, sparsityParam, beta, data)
%lambda = 0;
%beta = 0;
% visibleSize: the number of input units (784 here, for 28x28 images) 
% hiddenSize: the number of hidden units (196 here) 
% lambda: weight decay parameter
% sparsityParam: The desired average activation for the hidden units (denoted in the lecture
%                           notes by the greek alphabet rho, which looks like a lower-case "p").
% beta: weight of sparsity penalty term
% data: Our visibleSize x m matrix containing the training data (784 x m here).  So, data(:,i) is the i-th training example. 

% The input theta is a vector (because minFunc expects the parameters to be a vector). 

% We first convert theta to the (W1, W2, b1, b2) matrix/vector format, so that this 
% follows the notation convention of the lecture notes. 

% Learning rate (my own choice)
alpha = 0.03;

% Average activation of the hidden units (filled in below)
p = zeros(hiddenSize,1);

W1 = reshape(theta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);
W2 = reshape(theta(hiddenSize*visibleSize+1:2*hiddenSize*visibleSize), visibleSize, hiddenSize);
b1 = theta(2*hiddenSize*visibleSize+1:2*hiddenSize*visibleSize+hiddenSize);
b2 = theta(2*hiddenSize*visibleSize+hiddenSize+1:end);

% Cost and gradient variables (your code needs to compute these values). 
% Here, we initialize them to zeros. 

%% ---------- YOUR CODE HERE --------------------------------------
%  Instructions: Compute the cost/optimization objective J_sparse(W,b) for the Sparse Autoencoder,
%                and the corresponding gradients W1grad, W2grad, b1grad, b2grad.
%
% W1grad, W2grad, b1grad and b2grad should be computed using backpropagation.
% Note that W1grad has the same dimensions as W1, b1grad has the same dimensions
% as b1, etc.  Your code should set W1grad to be the partial derivative of J_sparse(W,b) with
% respect to W1.  I.e., W1grad(i,j) should be the partial derivative of J_sparse(W,b) 
% with respect to the input parameter W1(i,j).  Thus, W1grad should be equal to the term 
% [(1/m) \Delta W^{(1)} + \lambda W^{(1)}] in the last block of pseudo-code in Section 2.2 
% of the lecture notes (and similarly for W2grad, b1grad, b2grad).
% 
% Stated differently, if we were using batch gradient descent to optimize the parameters,
% the gradient descent update to W1 would be W1 := W1 - alpha * W1grad, and similarly for W2, b1, b2. 
% 
numPatches = size(data,2);
KLdist = 0;

%% Forward pass

a2 = sigmoid(W1*data+repmat(b1,[1,numPatches]));
p = sum(a2,2);
a3 = sigmoid(W2 * a2 + repmat(b2,[1,numPatches]));
J_sparse = 0.5 * sum(sum((a3-data).^2));

%% Average activation of the hidden layer
p = p / numPatches;

%% Backward pass

residual3 = -(data-a3).*a3.*(1-a3);
tmp = beta * ( - sparsityParam ./ p + (1-sparsityParam) ./ (1-p));
residual2 = (W2' * residual3 + repmat(tmp,[1,numPatches])) .* a2.*(1-a2);

W2grad = residual3 * a2' / numPatches + lambda * W2;
W1grad = residual2 * data' / numPatches + lambda * W1;
b2grad = sum(residual3,2) / numPatches;
b1grad = sum(residual2,2) / numPatches;

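%% Compute the KL divergence (relative entropy) penalty
for j = 1:hiddenSize
    KLdist = KLdist + sparsityParam *log( sparsityParam / p(j) )   +   (1 - sparsityParam) * log((1-sparsityParam) / (1 - p(j)));
end

%% Cost: reconstruction error + lambda weight decay + sparsity penalty.
% Evaluated before any weight update, so that cost and grad refer to the
% same W1, W2.
cost = J_sparse / numPatches + (sum(sum(W1.^2)) + sum(sum(W2.^2))) * lambda / 2  + beta * KLdist;

%% Manual gradient-descent step (weight decay is already included in the
% gradients). W1, W2, b1, b2 are local and not returned, so this step has
% no effect when minFunc drives the optimization.
W2 = W2 - alpha * ( W2grad  );
W1 = W1 - alpha * ( W1grad );

b2 = b2 - alpha * (b2grad );
b1 = b1 - alpha * (b1grad );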

%-------------------------------------------------------------------
% After computing the cost and gradient, we will convert the gradients back
% to a vector format (suitable for minFunc).  Specifically, we will unroll
% your gradient matrices into a vector.

grad = [W1grad(:) ; W2grad(:) ; b1grad(:) ; b2grad(:)];

end

%-------------------------------------------------------------------
% Here's an implementation of the sigmoid function, which you may find useful
% in your computation of the costs and the gradients.  This inputs a (row or
% column) vector (say (z1, z2, z3)) and returns (f(z1), f(z2), f(z3)). 

function sigm = sigmoid(x)

    sigm = 1 ./ (1 + exp(-x));
end
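Both listings start by slicing the flat vector `theta` into W1, W2, b1 and b2. The index arithmetic can be sketched in plain Python as below; `param_slices` is a hypothetical helper, the ranges are 0-based and half-open (MATLAB's are 1-based and inclusive), and the sketch only reports positions, it does not reproduce MATLAB's column-major `reshape`.

```python
def param_slices(visible_size, hidden_size):
    """Return 0-based half-open (start, stop) ranges of W1, W2, b1, b2 in theta."""
    n_w = hidden_size * visible_size  # number of entries in W1 (and in W2)
    return {
        "W1": (0, n_w),
        "W2": (n_w, 2 * n_w),
        "b1": (2 * n_w, 2 * n_w + hidden_size),
        "b2": (2 * n_w + hidden_size, 2 * n_w + hidden_size + visible_size),
    }

# The two configurations used in this post: 8x8 patches with 25 hidden
# units, and 28x28 images with 196 hidden units.
for visible, hidden in ((64, 25), (28 * 28, 196)):
    slices = param_slices(visible, hidden)
    total = slices["b2"][1]  # length theta must have
    print(visible, hidden, total)
```

This gives a quick way to check that `theta` has the expected length before calling the cost function with a new `visibleSize` and `hiddenSize`.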

Original article: http://blog.csdn.net/llp1992/article/details/45579615
