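In supervised learning, the training examples carry class labels. Now suppose we only have an unlabeled training set $\{x^{(1)}, x^{(2)}, x^{(3)}, \ldots\}$, where $x^{(i)} \in \mathbb{R}^n$.
An autoencoder neural network is an unsupervised learning algorithm that applies backpropagation with the target values set equal to the inputs, i.e. $y^{(i)} = x^{(i)}$.
The autoencoder tries to learn a function $h_{W,b}(x) \approx x$, that is, an approximation of the identity function, so that the output is close to the input.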
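Admittedly, an autoencoder's output cannot be exactly equal to its input, but we force the two to be equal, or at least as close as possible (it does seem rather reluctant = =). In other words: I ==> Sn ==> O, with O as close to I as possible. (I is the input, Sn is the output a of each intermediate hidden layer, and O is the network output.)
So what is all this trouble for? Why make O equal to I at all? If we wanted O, we could just use I directly. Indeed, O itself is not important. What matters is Sn!
Why is Sn important? Look at the model above: the input I has 6 dimensions, the output O has 6 dimensions, and the hidden layer Sn has only 3! What does that suggest? PCA? Whitening? Close, but not quite the same. It is a form of dimensionality reduction, but whereas PCA extracts the principal (linear) components of the data, Sn here comes from nonlinear units and can learn a more essential structure of the data. Why? Because we force the network to represent 6-dimensional data with only 3 dimensions, and to achieve that it has to find structure that exists in the input data.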
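To make the 6 -> 3 -> 6 bottleneck concrete, here is a minimal Python sketch of one forward pass (plain Python rather than the MATLAB used later in this post). The weight initialization, the sample input and the helper names `layer` and `init` are assumptions made for illustration; the weights are untrained, so the reconstruction error is simply whatever random weights give.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(W, b, x):
    """One fully connected sigmoid layer: sigmoid(W x + b)."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def init(rows, cols, rng):
    """Small random weights (hypothetical initialization)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
W1, b1 = init(3, 6, rng), [0.0] * 3   # encoder: I (6-d) -> Sn (3-d)
W2, b2 = init(6, 3, rng), [0.0] * 6   # decoder: Sn (3-d) -> O (6-d)

I = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7]    # made-up 6-d input
Sn = layer(W1, b1, I)                 # 3-d hidden code
O = layer(W2, b2, Sn)                 # 6-d reconstruction

# Squared reconstruction error that training would try to minimize
error = 0.5 * sum((o - i) ** 2 for o, i in zip(O, I))
print(len(Sn), len(O), error)
```

Training would adjust W1, b1, W2 and b2 to shrink `error`; the 3 numbers in `Sn` are then the compressed feature we actually care about.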
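So the 3-dimensional output Sn learned by the hidden layer is the more essential feature of the input data that the deep network has learned. We can also increase the number of hidden layers, as in the figure below:
That is: I -> S1 -> S2 -> ... -> Sn -> O, with each intermediate layer ... [rest of the sentence lost in the source]
Oh, one more thing while we are here, on the essence of deep learning: a deep learning model is a tool, and its purpose is to learn features of the input data.
That means that for the final classification or recognition task we still need to add a classifier or something similar on top.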
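The argument above assumed that the number of hidden units is small. But even when the number of hidden units is large (possibly larger than the number of input pixels), we can still discover structure in the input data by imposing other constraints on the autoencoder. Specifically, if we impose a sparsity constraint on the hidden units, the autoencoder can still discover interesting structure in the input data even with many hidden units.
Sparsity can be explained simply as follows. We consider a neuron active when its output is close to 1 and inhibited when its output is close to 0; the constraint that neurons be inhibited most of the time is called a sparsity constraint. Here we assume the activation function is the sigmoid. If you use tanh as the activation function instead, a neuron is considered inhibited when its output is close to -1.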
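Let $a^{(2)}_j(x)$ denote the activation of hidden unit $j$ when the autoencoder is given input $x$, and define the average activation of hidden unit $j$ over the training set as
$$\hat{\rho}_j = \frac{1}{m} \sum_{i=1}^{m} a^{(2)}_j\left(x^{(i)}\right)$$
where $m$ is the number of training examples.
Then we give it a hard time once more and add a constraint: $\hat{\rho}_j = \rho$
where $\rho$ is the sparsity parameter (`sparsityParam` in the code below), i.e. the desired average activation of the hidden units, usually a small value close to 0.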
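As a small illustration of the average activation defined above, the following Python sketch averages each hidden unit's activation over the samples and compares it with the target. The activation values, the choice `rho = 0.05` and the function name `mean_activation` are made up for the example.

```python
def mean_activation(activations):
    """activations[i][j] is the activation of hidden unit j on sample i.

    Returns the per-unit average over all samples (rho_hat_j).
    """
    m = len(activations)
    n_hidden = len(activations[0])
    return [sum(a[j] for a in activations) / m for j in range(n_hidden)]

# Hypothetical hidden activations: m = 4 samples, 3 hidden units
a2 = [[0.9, 0.1, 0.0],
      [0.1, 0.1, 0.0],
      [0.0, 0.1, 0.2],
      [0.2, 0.1, 0.0]]

rho = 0.05                      # assumed sparsity target
rho_hat = mean_activation(a2)   # roughly 0.3, 0.1 and 0.05

for j, r in enumerate(rho_hat):
    print(j, r, "gap to target:", r - rho)
```

In this toy data only the third unit is about as sparse as the target asks; the first unit is active far too often and is the one a sparsity penalty would push down hardest.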
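Why give it such a hard time? Why let only a small fraction of the hidden units be active (output greater than 0) while most of the others stay at 0? Because what we are trying to do is imitate the human brain. Neural networks were modeled on the brain's neurons in the first place, and so is deep learning. The brain has a huge number of neurons, but when a typical natural image enters the brain through vision it stimulates only a small fraction of them, while most neurons remain inhibited. Moreover, most natural images can be represented as a superposition of a small number of basic elements (surfaces or lines). Put another way, this helps us extract more essential features of natural images with a small number of neurons.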
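To enforce this constraint we add an extra penalty term to the optimization objective, one that penalizes $\hat{\rho}_j$ (the average activation of hidden unit $j$) for deviating significantly from the sparsity parameter $\rho$:
$$\sum_{j=1}^{s_2} \left[ \rho \log \frac{\rho}{\hat{\rho}_j} + (1-\rho) \log \frac{1-\rho}{1-\hat{\rho}_j} \right]$$
where $s_2$ is the number of hidden units and the index $j$ runs over them.
In terms of relative entropy (KL divergence), the penalty can also be written as:
$$\sum_{j=1}^{s_2} \mathrm{KL}\left(\rho \,\|\, \hat{\rho}_j\right), \qquad \mathrm{KL}\left(\rho \,\|\, \hat{\rho}_j\right) = \rho \log \frac{\rho}{\hat{\rho}_j} + (1-\rho) \log \frac{1-\rho}{1-\hat{\rho}_j}$$
Suppose $\rho$ is fixed and we plot $\mathrm{KL}(\rho \,\|\, \hat{\rho}_j)$ against $\hat{\rho}_j$. [The value of $\rho$ and the plot are not preserved in the source.]
As the plot shows, when $\hat{\rho}_j = \rho$ the KL divergence reaches its minimum of 0, and it grows without bound as $\hat{\rho}_j$ approaches 0 or 1.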
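Since the plot is missing, the behaviour of the penalty can be checked numerically instead. The Python sketch below evaluates the KL term for a few average activations; `rho = 0.05` and the sample values are assumptions chosen for illustration, and the names `kl` and `sparsity_penalty` are hypothetical.

```python
import math

def kl(rho, rho_hat):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat)."""
    return (rho * math.log(rho / rho_hat)
            + (1 - rho) * math.log((1 - rho) / (1 - rho_hat)))

def sparsity_penalty(rho, rho_hats):
    """Sum of the KL terms over all hidden units."""
    return sum(kl(rho, r) for r in rho_hats)

rho = 0.05  # assumed sparsity target
for r in (0.001, 0.05, 0.2, 0.5, 0.999):
    print(r, kl(rho, r))

print(sparsity_penalty(rho, [0.05, 0.2, 0.5]))
```

The term is zero only at `rho_hat == rho` and gets larger the further the average activation drifts from the target, on either side.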
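The input data here are small 8x8 image patches, each flattened into a 64x1 vector. A total of 10000 samples are used for training to learn the features in the images, and the result is essentially the edges of the images. The outcome is shown in the figure below:
The code is as follows (sparseAutoencoderCost.m):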
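For readers who want to see the flattening step spelled out, here is a small Python sketch that turns an 8x8 patch into a 64-element vector. It assumes column-major order, which is what MATLAB's `patch(:)` produces; the patch contents and the function name are made up for illustration.

```python
def flatten_column_major(patch):
    """Flatten a 2-D list column by column, like MATLAB's patch(:)."""
    rows, cols = len(patch), len(patch[0])
    return [patch[r][c] for c in range(cols) for r in range(rows)]

# Hypothetical 8x8 patch whose entry at (row r, column c) is r * 8 + c
patch = [[r * 8 + c for c in range(8)] for r in range(8)]
x = flatten_column_major(patch)

print(len(x))      # 64 entries, one training column of the data matrix
print(x[:3])       # the first entries walk down the first column
```

Stacking 10000 such vectors side by side gives the 64x10000 `data` matrix that the cost function receives.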
function [cost,grad] = sparseAutoencoderCost(theta, visibleSize, hiddenSize, ...
lambda, sparsityParam, beta, data)
%lambda = 0;
%beta = 0;
% visibleSize: the number of input units (probably 64)
% hiddenSize: the number of hidden units (probably 25)
% lambda: weight decay parameter
% sparsityParam: The desired average activation for the hidden units (denoted in the lecture
% notes by the greek alphabet rho, which looks like a lower-case "p").
% beta: weight of sparsity penalty term
% data: Our 64x10000 matrix containing the training data. So, data(:,i) is the i-th training example.
% The input theta is a vector (because minFunc expects the parameters to be a vector).
% We first convert theta to the (W1, W2, b1, b2) matrix/vector format, so that this
% follows the notation convention of the lecture notes.
% Learning rate (chosen by hand)
alpha = 0.01;
% Number of hidden units: hiddenSize = 25
% Average activation of the hidden units (filled in during the forward pass)
p = zeros(hiddenSize,1);
% 25x64
W1 = reshape(theta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);
% 64 X 25
W2 = reshape(theta(hiddenSize*visibleSize+1:2*hiddenSize*visibleSize), visibleSize, hiddenSize);
% 25 X1
b1 = theta(2*hiddenSize*visibleSize+1:2*hiddenSize*visibleSize+hiddenSize);
% 64 x 1
b2 = theta(2*hiddenSize*visibleSize+hiddenSize+1:end);
% Cost and gradient variables (your code needs to compute these values).
% Here, we initialize them to zeros.
% First term of the cost function
%{
J_sparse = 0;
W1grad = zeros(size(W1));
W2grad = zeros(size(W2));
b1grad = zeros(size(b1));
b2grad = zeros(size(b2));
%}
%% ---------- YOUR CODE HERE --------------------------------------
% Instructions: Compute the cost/optimization objective J_sparse(W,b) for the Sparse Autoencoder,
% and the corresponding gradients W1grad, W2grad, b1grad, b2grad.
%
% W1grad, W2grad, b1grad and b2grad should be computed using backpropagation.
% Note that W1grad has the same dimensions as W1, b1grad has the same dimensions
% as b1, etc. Your code should set W1grad to be the partial derivative of J_sparse(W,b) with
% respect to W1. I.e., W1grad(i,j) should be the partial derivative of J_sparse(W,b)
% with respect to the input parameter W1(i,j). Thus, W1grad should be equal to the term
% [(1/m) \Delta W^{(1)} + \lambda W^{(1)}] in the last block of pseudo-code in Section 2.2
% of the lecture notes (and similarly for W2grad, b1grad, b2grad).
%
% Stated differently, if we were using batch gradient descent to optimize the parameters,
% the gradient descent update to W1 would be W1 := W1 - alpha * W1grad, and similarly for W2, b1, b2.
%
% One iteration of batch gradient descent; data is 64x10000
numPatches = size(data,2);
KLdist = 0;
% 25x10000
%a2 = zeros(size(W1,1),numPatches);
% 64x10000
%a3 = zeros(size(W2,1),numPatches);
%% Forward pass
% 25x10000 25x64 64x10000
a2 = sigmoid(W1*data+repmat(b1,[1,numPatches]));
p = sum(a2,2);
a3 = sigmoid(W2 * a2 + repmat(b2,[1,numPatches]));
J_sparse = 0.5 * sum(sum((a3-data).^2));
%{
for curPatch = 1:numPatches
% Compute the activations
% 25x1 hidden-layer activation: (25x64) * (64x1)
a2(:,curPatch) = sigmoid(W1 * data(:,curPatch) + b1);
% Accumulate the total activation of the hidden units
p = p + a2(:,curPatch);
% 64x1 output-layer activation
a3(:,curPatch) = sigmoid(W2 * a2(:,curPatch) +b2);
% Accumulate the first term of the cost function
J_sparse = J_sparse + 0.5 * (a3(:,curPatch)-data(:,curPatch))' * (a3(:,curPatch)-data(:,curPatch)) ;
end
%}
%% Average activation of the hidden units
p = p / numPatches ;
%% Backward pass
% 64x10000
residual3 = -(data-a3).*a3.*(1-a3);
% 25x1: derivative of the sparsity penalty with respect to p
tmp = beta * ( - sparsityParam ./ p + (1-sparsityParam) ./ (1-p));
% 25x10000 = ((25x64) * (64x10000) + 25x10000) .* 25x10000
residual2 = (W2' * residual3 + repmat(tmp,[1,numPatches])) .* a2.*(1-a2);
W2grad = residual3 * a2' / numPatches + lambda * W2 ;
W1grad = residual2 * data' / numPatches + lambda * W1 ;
b2grad = sum(residual3,2) / numPatches;
b1grad = sum(residual2,2) / numPatches;
%{
for curPatch = 1:numPatches
% Residual of the output layer, 64x1
% residual3 = -( data(:,curPatch) - a3(:,curPatch)) .* (a3 - a3.^2);
residual3 = -(data(:,curPatch) - a3(:,curPatch)).* (a3(:,curPatch) - (a3(:,curPatch).^2));
% 25x1 = ((25x64) * (64x1) + 25x1) .* 25x1
residual2 = (W2' * residual3 + beta * (- sparsityParam ./ p + (1-sparsityParam) ./ (1-p))) .* (a2(:,curPatch) - (a2(:,curPatch)).^2);
% residual2 = (W2' * residual3 ) .* (a2(:,curPatch) - (a2(:,curPatch)).^2);
% Accumulate the partial derivatives
% 64x25 = (64x1) * (1x25)
W2grad = W2grad + residual3 * a2(:,curPatch)';
% 64x1
b2grad = b2grad + residual3;
% 25x64 = (25x1) * (1x64)
W1grad = W1grad + residual2 * data(:,curPatch)';
% 25x1
b1grad = b1grad + residual2;
%J_sparse = J_sparse + (a3 - data(:,curPatch))' * (a3 - data(:,curPatch));
end
W2grad = W2grad / numPatches + lambda * W2;
W1grad = W1grad / numPatches + lambda * W1;
b2grad = b2grad / numPatches;
b1grad = b1grad / numPatches;
%}
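%% No parameter update here: the optimizer (minFunc) applies the step using grad.
% Updating W1 and W2 at this point would make the weight-decay term of the cost
% below use different weights than the ones the gradient was computed for.
% W2 = W2 - alpha * ( W2grad );
% W1 = W1 - alpha * ( W1grad );
% b2 = b2 - alpha * (b2grad );
% b1 = b1 - alpha * (b1grad );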
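%% KL divergence (relative entropy) between sparsityParam and p, summed over the hidden units
for j = 1:hiddenSize
KLdist = KLdist + sparsityParam *log( sparsityParam / p(j) ) + (1 - sparsityParam) * log((1-sparsityParam) / (1 - p(j)));
end
%% Cost function: reconstruction error + weight decay (lambda) + sparsity penalty (beta)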
cost = J_sparse / numPatches + (sum(sum(W1.^2)) + sum(sum(W2.^2))) * lambda / 2 + beta * KLdist;
%cost = J_sparse / numPatches + (sum(sum(W1.^2)) + sum(sum(W2.^2))) * lambda / 2;
%-------------------------------------------------------------------
% After computing the cost and gradient, we will convert the gradients back
% to a vector format (suitable for minFunc). Specifically, we will unroll
% your gradient matrices into a vector.
grad = [W1grad(:) ; W2grad(:) ; b1grad(:) ; b2grad(:)];
end
%-------------------------------------------------------------------
% Here's an implementation of the sigmoid function, which you may find useful
% in your computation of the costs and the gradients. This inputs a (row or
% column) vector (say (z1, z2, z3)) and returns (f(z1), f(z2), f(z3)).
function sigm = sigmoid(x)
sigm = 1 ./ (1 + exp(-x));
end
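The listing contains both a vectorized implementation and a non-vectorized (loop) implementation, the latter inside the block comments. The vectorized version is shorter and also runs much faster. The comments in the code are detailed, so I won't go through it again.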
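One line of the listing that the comments do not explain is the sparsity term `tmp` in the backward pass: it is `beta` times the derivative of the KL penalty with respect to the average activation. The Python sketch below checks that derivative against a central finite difference; the test values `rho = 0.05` and `rho_hat = 0.2` are arbitrary choices, and the function names are hypothetical.

```python
import math

def kl(rho, rho_hat):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat)."""
    return (rho * math.log(rho / rho_hat)
            + (1 - rho) * math.log((1 - rho) / (1 - rho_hat)))

def kl_grad(rho, rho_hat):
    """Analytic derivative of kl with respect to rho_hat."""
    return -rho / rho_hat + (1 - rho) / (1 - rho_hat)

rho, rho_hat, eps = 0.05, 0.2, 1e-6   # arbitrary test point

# Central finite difference as an independent check
numeric = (kl(rho, rho_hat + eps) - kl(rho, rho_hat - eps)) / (2 * eps)
analytic = kl_grad(rho, rho_hat)

print(numeric, analytic, abs(numeric - analytic))
```

The same finite-difference idea, applied to every entry of theta, is a handy way to confirm that the `cost` and `grad` returned by a cost function agree with each other.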
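Next comes the second example: we learn the features contained in the images shown below. The input images are 28x28 and the hidden layer has 196 units. The algorithm is vectorized, otherwise it would take ages to run again, hehe.
The original images:
The learned features:
The code is as follows: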
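With 28x28 inputs and 196 hidden units, the parameter vector `theta` becomes much longer than in the first example. The Python sketch below reproduces how the cost function slices `theta` into W1, W2, b1 and b2; it uses 0-based, half-open ranges (MATLAB's are 1-based and inclusive), and the function name `theta_layout` is made up for illustration.

```python
def theta_layout(visible_size, hidden_size):
    """Return the (start, stop) slice of each parameter block inside theta.

    Ranges are 0-based and half-open, in the order W1, W2, b1, b2.
    """
    n_w = hidden_size * visible_size
    return {
        "W1": (0, n_w),                                   # hidden x visible
        "W2": (n_w, 2 * n_w),                             # visible x hidden
        "b1": (2 * n_w, 2 * n_w + hidden_size),           # hidden x 1
        "b2": (2 * n_w + hidden_size,
               2 * n_w + hidden_size + visible_size),     # visible x 1
    }

for visible, hidden in ((64, 25), (28 * 28, 196)):
    layout = theta_layout(visible, hidden)
    total = layout["b2"][1]
    print(visible, hidden, total, layout)
```

The first example has a few thousand parameters, the second a few hundred thousand, which is one reason the vectorized implementation matters here.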
function [cost,grad] = sparseAutoencoderCost(theta, visibleSize, hiddenSize, ...
lambda, sparsityParam, beta, data)
%lambda = 0;
%beta = 0;
% visibleSize: the number of input units (784 for the 28x28 images)
% hiddenSize: the number of hidden units (196 here)
% lambda: weight decay parameter
% sparsityParam: The desired average activation for the hidden units (denoted in the lecture
% notes by the greek alphabet rho, which looks like a lower-case "p").
% beta: weight of sparsity penalty term
% data: Our visibleSize x m matrix containing the training data (784 rows here). So, data(:,i) is the i-th training example.
% The input theta is a vector (because minFunc expects the parameters to be a vector).
% We first convert theta to the (W1, W2, b1, b2) matrix/vector format, so that this
% follows the notation convention of the lecture notes.
% Learning rate (chosen by hand)
alpha = 0.03;
% Average activation of the hidden units (filled in during the forward pass)
p = zeros(hiddenSize,1);
W1 = reshape(theta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);
W2 = reshape(theta(hiddenSize*visibleSize+1:2*hiddenSize*visibleSize), visibleSize, hiddenSize);
b1 = theta(2*hiddenSize*visibleSize+1:2*hiddenSize*visibleSize+hiddenSize);
b2 = theta(2*hiddenSize*visibleSize+hiddenSize+1:end);
% Cost and gradient variables (your code needs to compute these values).
% Here, we initialize them to zeros.
%% ---------- YOUR CODE HERE --------------------------------------
% Instructions: Compute the cost/optimization objective J_sparse(W,b) for the Sparse Autoencoder,
% and the corresponding gradients W1grad, W2grad, b1grad, b2grad.
%
% W1grad, W2grad, b1grad and b2grad should be computed using backpropagation.
% Note that W1grad has the same dimensions as W1, b1grad has the same dimensions
% as b1, etc. Your code should set W1grad to be the partial derivative of J_sparse(W,b) with
% respect to W1. I.e., W1grad(i,j) should be the partial derivative of J_sparse(W,b)
% with respect to the input parameter W1(i,j). Thus, W1grad should be equal to the term
% [(1/m) \Delta W^{(1)} + \lambda W^{(1)}] in the last block of pseudo-code in Section 2.2
% of the lecture notes (and similarly for W2grad, b1grad, b2grad).
%
% Stated differently, if we were using batch gradient descent to optimize the parameters,
% the gradient descent update to W1 would be W1 := W1 - alpha * W1grad, and similarly for W2, b1, b2.
%
numPatches = size(data,2);
KLdist = 0;
%% Forward pass
a2 = sigmoid(W1*data+repmat(b1,[1,numPatches]));
p = sum(a2,2);
a3 = sigmoid(W2 * a2 + repmat(b2,[1,numPatches]));
J_sparse = 0.5 * sum(sum((a3-data).^2));
%% Average activation of the hidden units
p = p / numPatches ;
%% Backward pass
residual3 = -(data-a3).*a3.*(1-a3);
% hiddenSize x 1: derivative of the sparsity penalty with respect to p
tmp = beta * ( - sparsityParam ./ p + (1-sparsityParam) ./ (1-p));
residual2 = (W2' * residual3 + repmat(tmp,[1,numPatches])) .* a2.*(1-a2);
W2grad = residual3 * a2' / numPatches + lambda * W2 ;
W1grad = residual2 * data' / numPatches + lambda * W1 ;
b2grad = sum(residual3,2) / numPatches;
b1grad = sum(residual2,2) / numPatches;
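%% No parameter update here: the optimizer (minFunc) applies the step using grad.
% Updating W1 and W2 at this point would make the weight-decay term of the cost
% below use different weights than the ones the gradient was computed for.
% W2 = W2 - alpha * ( W2grad );
% W1 = W1 - alpha * ( W1grad );
% b2 = b2 - alpha * (b2grad );
% b1 = b1 - alpha * (b1grad );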
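%% KL divergence (relative entropy) between sparsityParam and p, summed over the hidden units
for j = 1:hiddenSize
KLdist = KLdist + sparsityParam *log( sparsityParam / p(j) ) + (1 - sparsityParam) * log((1-sparsityParam) / (1 - p(j)));
end
%% Cost function: reconstruction error + weight decay (lambda) + sparsity penalty (beta)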
cost = J_sparse / numPatches + (sum(sum(W1.^2)) + sum(sum(W2.^2))) * lambda / 2 + beta * KLdist;
%-------------------------------------------------------------------
% After computing the cost and gradient, we will convert the gradients back
% to a vector format (suitable for minFunc). Specifically, we will unroll
% your gradient matrices into a vector.
grad = [W1grad(:) ; W2grad(:) ; b1grad(:) ; b2grad(:)];
end
%-------------------------------------------------------------------
% Here's an implementation of the sigmoid function, which you may find useful
% in your computation of the costs and the gradients. This inputs a (row or
% column) vector (say (z1, z2, z3)) and returns (f(z1), f(z2), f(z3)).
function sigm = sigmoid(x)
sigm = 1 ./ (1 + exp(-x));
end
DeepLearning (2): Understanding Autoencoders and Sparsity, with Hands-on Practice
Original article: http://blog.csdn.net/llp1992/article/details/45579615