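Notes on learning the stacked autoencoder algorithm from the UFLDL tutorial
A deep neural network is a neural network with multiple hidden layers. With a deep network we can compute much more complex features of the input. Because each hidden layer applies a non-linear transformation to the output of the previous layer, a deep network has greater representational power than a "shallow" one (for example, it can learn more complex functional relationships).
In fact, a three-layer network can already approximate any continuous function arbitrarily well, provided the number of hidden units is allowed to grow without bound. The main advantage of a deep network is that it can represent a much larger set of functions than a shallow network, and do so far more compactly. More formally, there are functions that a k-layer network can represent compactly (meaning the number of hidden units is only polynomial in the number of inputs), but that a (k-1)-layer network cannot represent unless its number of hidden units is exponential in the number of inputs.
As a simple example, suppose we want to build a Boolean network that computes the parity (the XOR) of n input bits. Assume each node in the network can compute either the logical OR of its inputs (or the OR of their negations, i.e. NAND) or the logical AND. If the network consists of only an input layer, one hidden layer and an output layer, the number of nodes needed for the parity function is exponential in the input size n. With a deeper network, the size of the network can be kept to a polynomial in n.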
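To make the size gap concrete, here is a minimal MATLAB/Octave sketch (variable names are illustrative and not from the tutorial). It computes the parity of n bits with a chain of n-1 two-input XOR gates, which is a deep circuit of linear size, and compares that with the canonical flat OR-of-ANDs formula, which needs one AND term for every input pattern with an odd number of ones, i.e. 2^(n-1) terms.

```matlab
% Parity of n bits: deep chain of XOR gates vs. a flat OR-of-ANDs formula.
n = 8;
bits = [1 0 1 1 0 0 1 0];          % example input, n bits

% Deep circuit: fold XOR over the inputs, one gate per step (n-1 gates).
parityDeep = bits(1);
for i = 2:n
    parityDeep = xor(parityDeep, bits(i));
end

% Shallow circuit: one AND term for every input pattern with odd weight.
patterns = dec2bin(0:2^n-1) - '0';                 % all 2^n input patterns
oddPatterns = patterns(mod(sum(patterns, 2), 2) == 1, :);
numAndTerms = size(oddPatterns, 1);                % equals 2^(n-1)

% The flat formula fires iff the input matches one of the odd patterns.
parityShallow = any(all(bsxfun(@eq, oddPatterns, bits), 2));

fprintf('deep gates: %d, shallow AND terms: %d\n', n - 1, numAndTerms);
```

For n = 8 the chain uses 7 gates while the flat formula needs 128 AND terms; each extra input bit adds one gate to the chain but doubles the flat formula.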
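In practice, however, deep networks are hard to train. Two of the main difficulties are the following.
Data availability
A deep network is a complex model with many parameters, so a large number of training samples is needed to learn them. Training on insufficient data leads to overfitting.
Local optima
Because a deep network is such a complex model, the search space of its optimization problem is riddled with bad local optima, so gradient descent (or methods such as conjugate gradient and L-BFGS) started from a random initialization often does not work well.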
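Greedy layer-wise training is one method that has had some success. Its main idea is to train the network one layer at a time: first train a network with a single hidden layer, and only once that is finished train a network with two hidden layers, and so on. At each step, the already-trained first k-1 layers are held fixed and a k-th layer is added (that is, the output of the trained k-1 layers becomes the input of the new layer). Each layer can be trained in a supervised way (for example, using the classification error at each step as the objective), but more commonly an unsupervised method is used (for example, an autoencoder, described below). The weights obtained by training the layers individually are then used to initialize the weights of the final, complete deep network, and the whole network is "fine-tuned" (all layers are trained together to optimize the training error on the labeled training set).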
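The control flow of greedy layer-wise training can be sketched as below. This is only a skeleton under stated assumptions: `trainLayer` is a hypothetical stand-in that returns random weights, where a real version would fit a sparse autoencoder on its input and return the encoder weights and biases; the data and layer sizes are toy values.

```matlab
% Greedy layer-wise pre-training loop (control flow only).
% trainLayer is a HYPOTHETICAL stand-in: a real version would fit a sparse
% autoencoder on X and return its encoder weights and biases.
trainLayer = @(X, h) struct('w', 0.01 * randn(h, size(X, 1)), 'b', zeros(h, 1));
sigmoid = @(z) 1 ./ (1 + exp(-z));

data = rand(20, 50);              % toy unlabeled data: 20 features, 50 samples
hiddenSizes = [10 6 4];           % one entry per hidden layer

stack = cell(numel(hiddenSizes), 1);
features = data;                  % input to the layer currently being trained
for k = 1:numel(hiddenSizes)
    stack{k} = trainLayer(features, hiddenSizes(k));   % train layer k only
    % Freeze layer k; its activations become the next layer's training input.
    features = sigmoid(bsxfun(@plus, stack{k}.w * features, stack{k}.b));
end
% stack now holds the initial weights for fine-tuning the whole network.
```

The point of the loop is that layer k never sees the raw data after the first step, only the frozen output of the layers below it.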
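The benefit of greedy layer-wise training is that, compared with random initialization, the initial weights of each layer sit in a better region of parameter space, and the weights can then be fine-tuned starting from there. Empirically, gradient descent started from these positions is more likely to converge to a good local optimum, because the unlabeled data has already supplied a large amount of prior information about the patterns present in the input.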
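A stacked autoencoder is a neural network made up of several layers of sparse autoencoders, in which the output of each autoencoder is the input of the next one. While the parameters of one layer are being trained, the parameters of all other layers are held fixed. So, to get better results, once this pre-training is finished, backpropagation can be used to adjust the parameters of all layers at the same time; this process is usually called "fine-tuning".
If you are only interested in fine-tuning for classification, the usual practice is to discard the "decoding" layers of the stacked autoencoder and feed the activations a^{(n)} of the last hidden layer directly into a softmax classifier as features. The gradient of the softmax classification error can then be backpropagated straight into the encoding layers.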
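Written out, and assuming sigmoid hidden units, this backpropagation takes the following form. The symbols are my own shorthand rather than the tutorial's: G is the K x m one-hot label matrix (groundTruth in the code), P is the K x m matrix of softmax probabilities, θ holds the softmax weights, a^{(l)} holds the activations of layer l with one column per sample (a^{(1)} is the input), and W^{(l)} maps a^{(l)} to a^{(l+1)}. Weight decay is applied to θ only, as in the exercise code.

```latex
\begin{align}
J(\theta, W, b) &= -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K} G_{ki}\,\log P_{ki}
                   + \frac{\lambda}{2}\,\lVert\theta\rVert_F^2 \\
\nabla_{\theta} J &= -\frac{1}{m}\,(G - P)\,\bigl(a^{(n)}\bigr)^{\top} + \lambda\,\theta \\
\delta^{(n)} &= -\bigl(\theta^{\top}(G - P)\bigr)\odot a^{(n)}\odot\bigl(1 - a^{(n)}\bigr) \\
\delta^{(l)} &= \bigl((W^{(l)})^{\top}\delta^{(l+1)}\bigr)\odot a^{(l)}\odot\bigl(1 - a^{(l)}\bigr) \\
\nabla_{W^{(l)}} J &= \frac{1}{m}\,\delta^{(l+1)}\bigl(a^{(l)}\bigr)^{\top}, \qquad
\nabla_{b^{(l)}} J = \frac{1}{m}\sum_{i=1}^{m}\delta^{(l+1)}_{:,i}
\end{align}
```

The factor a ⊙ (1 - a) is the derivative of the sigmoid, so only the first error term δ^{(n)} involves the softmax; every earlier layer uses the ordinary backpropagation recursion.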
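The task here is classifying handwritten digits with a four-layer network whose last layer is a softmax classifier: each digit image first passes through two hidden layers, which combine it into features, and the softmax classifier then classifies on those features.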
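The data flow through this four-layer network can be traced with random stand-in weights. The sketch below does no training; the layer sizes follow this exercise (784, 200, 200, 10), while the sample count m and all weight values are arbitrary assumptions.

```matlab
% Shape walk-through of the stacked network with random stand-in weights.
inputSize = 28 * 28; hiddenSizeL1 = 200; hiddenSizeL2 = 200; numClasses = 10;
m = 5;                                   % arbitrary number of fake samples
sigmoid = @(z) 1 ./ (1 + exp(-z));

x  = rand(inputSize, m);                 % stand-in for MNIST images (columns)
W1 = 0.01 * randn(hiddenSizeL1, inputSize);    b1 = zeros(hiddenSizeL1, 1);
W2 = 0.01 * randn(hiddenSizeL2, hiddenSizeL1); b2 = zeros(hiddenSizeL2, 1);
softmaxTheta = 0.005 * randn(numClasses, hiddenSizeL2);

a2 = sigmoid(bsxfun(@plus, W1 * x,  b1));    % layer-1 features, 200 x m
a3 = sigmoid(bsxfun(@plus, W2 * a2, b2));    % layer-2 features, 200 x m

scores = softmaxTheta * a3;                              % 10 x m
scores = bsxfun(@minus, scores, max(scores, [], 1));     % guard against overflow
prob = bsxfun(@rdivide, exp(scores), sum(exp(scores), 1));
[~, pred] = max(prob, [], 1);                % predicted class per sample
```

Each column of prob sums to one, and pred holds a label between 1 and 10, which is why the exercise remaps digit 0 to label 10.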
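Program flow:
- First train the weights of the first layer: feed the images into a sparse autoencoder to obtain the parameters W^{(1)} and W^{(2)}, and keep W^{(1)} (the encoder weights, together with the encoder bias).
- Use W^{(1)} to compute the output of the first hidden layer. Feed that output into a second sparse autoencoder to obtain the parameters of the second hidden layer in the same way.
- Feed the output of the second hidden layer into the softmax classifier as features, and train the classifier to obtain its parameters.
- To get better results, once the pre-training above is finished, use backpropagation to adjust the parameters of all layers at the same time; this is the "fine-tuning" step.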
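Keeping only the encoder half in the first step relies on how the UFLDL sparse autoencoder packs its parameters into one vector: theta = [W1(:); W2(:); b1; b2]. The sketch below uses toy sizes and made-up values to show how the encoder weights and bias are sliced back out, which is the same indexing the exercise script uses when it builds stack{1} and stack{2}.

```matlab
% Layout used by the UFLDL sparse autoencoder: theta = [W1(:); W2(:); b1; b2].
visibleSize = 4; hiddenSize = 3;                  % toy sizes for illustration
W1 = reshape(1:12, hiddenSize, visibleSize);      % encoder weights
W2 = -reshape(1:12, visibleSize, hiddenSize);     % decoder weights
b1 = [101; 102; 103];                             % encoder biases
b2 = [201; 202; 203; 204];                        % decoder biases
theta = [W1(:); W2(:); b1; b2];

% Keep only the encoder half, as the pre-training step does.
nW = hiddenSize * visibleSize;                    % entries in W1 (and in W2)
encW = reshape(theta(1:nW), hiddenSize, visibleSize);
encB = theta(2*nW + 1 : 2*nW + hiddenSize);       % skip W1 and W2, then read b1
```

The decoder parameters W2 and b2 are simply never read; they only exist so the autoencoder has a reconstruction target during pre-training.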
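P.S. The error of the third layer is backpropagated directly from the softmax classification error. During fine-tuning, i.e. when running backpropagation, no regularization (weight decay) term is applied to the hidden-layer weights; only the softmax weights are penalized (I do not know the exact reason for this).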
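For comparison, if one did want to regularize the hidden layers during fine-tuning, a minimal sketch would add an L2 penalty over every stack{d}.w to the cost and lambda * stack{d}.w to the matching gradient. This is an assumption for illustration, not what the exercise code does; the tiny stack and the zero placeholders for the data-dependent terms are hypothetical stand-ins so the snippet runs on its own.

```matlab
% Hypothetical: weight decay on the hidden layers during fine-tuning.
lambda = 1e-4;
stack = cell(2, 1);                       % toy stand-in for the real stack
stack{1}.w = [0.1 -0.2; 0.3 0.4];  stack{1}.b = [0; 0];
stack{2}.w = [0.5 -0.6; 0.7 0.8];  stack{2}.b = [0; 0];

dataCost = 0;                             % placeholder for the softmax data term
stackgrad = cell(size(stack));
decay = 0;
for d = 1:numel(stack)
    % Placeholder for the backpropagated gradients (delta * a' / m).
    stackgrad{d}.w = zeros(size(stack{d}.w));
    stackgrad{d}.b = zeros(size(stack{d}.b));

    % Weight decay: penalize the weights only, never the biases.
    decay = decay + sum(stack{d}.w(:) .^ 2);
    stackgrad{d}.w = stackgrad{d}.w + lambda * stack{d}.w;
end
cost = dataCost + lambda / 2 * decay;
```

Whether this helps is an empirical question; the snippet only shows where the extra terms would go.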
%% CS294A/CS294W Stacked Autoencoder Exercise
% Instructions
% ------------
%
% This file contains code that helps you get started on the
% stacked autoencoder exercise. You will need to complete code in
% stackedAECost.m
% You will also need to have implemented sparseAutoencoderCost.m and
% softmaxCost.m from previous exercises. You will need the initializeParameters.m
% loadMNISTImages.m, and loadMNISTLabels.m files from previous exercises.
%
% For the purpose of completing the assignment, you do not need to
% change the code in this file.
%
%%======================================================================
%% STEP 0: Here we provide the relevant parameters values that will
% allow your sparse autoencoder to get good filters; you do not need to
% change the parameters below.
inputSize = 28 * 28;
numClasses = 10;
hiddenSizeL1 = 200; % Layer 1 Hidden Size
hiddenSizeL2 = 200; % Layer 2 Hidden Size
sparsityParam = 0.1; % desired average activation of the hidden units.
% (This was denoted by the Greek alphabet rho, which looks like a lower-case "p",
% in the lecture notes).
lambda = 3e-3; % weight decay parameter
beta = 3; % weight of sparsity penalty term
%%======================================================================
%% STEP 1: Load data from the MNIST database
%
% This loads our training data from the MNIST database files.
% Load MNIST database files
trainData = loadMNISTImages('train-images-idx3-ubyte');
trainLabels = loadMNISTLabels('train-labels-idx1-ubyte');
trainLabels(trainLabels == 0) = 10; % Remap 0 to 10 since our labels need to start from 1
%%======================================================================
%% STEP 2: Train the first sparse autoencoder
% This trains the first sparse autoencoder on the unlabelled MNIST training
% images.
% If you've correctly implemented sparseAutoencoderCost.m, you don't need
% to change anything here.
% Randomly initialize the parameters
sae1Theta = initializeParameters(hiddenSizeL1, inputSize);
%% ---------------------- YOUR CODE HERE ---------------------------------
% Instructions: Train the first layer sparse autoencoder, this layer has
% an hidden size of "hiddenSizeL1"
% You should store the optimal parameters in sae1OptTheta
addpath minFunc/
options.Method = 'lbfgs'; % Here, we use L-BFGS to optimize our cost
% function. Generally, for minFunc to work, you
% need a function pointer with two outputs: the
% function value and the gradient. In our problem,
% sparseAutoencoderCost.m satisfies this.
options.maxIter = 400; % Maximum number of iterations of L-BFGS to run
options.display = 'on';
[sae1OptTheta, cost] = minFunc( @(p) sparseAutoencoderCost(p, ...
inputSize, hiddenSizeL1, ...
lambda, sparsityParam, ...
beta, trainData), ...
sae1Theta, options);
% -------------------------------------------------------------------------
%%======================================================================
%% STEP 2: Train the second sparse autoencoder
% This trains the second sparse autoencoder on the first autoencoder
% features.
% If you've correctly implemented sparseAutoencoderCost.m, you don't need
% to change anything here.
[sae1Features] = feedForwardAutoencoder(sae1OptTheta, hiddenSizeL1, ...
inputSize, trainData);
% Randomly initialize the parameters
sae2Theta = initializeParameters(hiddenSizeL2, hiddenSizeL1);
%% ---------------------- YOUR CODE HERE ---------------------------------
% Instructions: Train the second layer sparse autoencoder, this layer has
% an hidden size of "hiddenSizeL2" and an inputsize of
% "hiddenSizeL1"
%
% You should store the optimal parameters in sae2OptTheta
[sae2OptTheta, cost] = minFunc( @(p) sparseAutoencoderCost(p, ...
hiddenSizeL1,hiddenSizeL2 ,...
lambda, sparsityParam, ...
beta, sae1Features), ...
sae2Theta, options);
% -------------------------------------------------------------------------
%%======================================================================
%% STEP 3: Train the softmax classifier
% This trains the softmax classifier on the second autoencoder features.
% If you've correctly implemented softmaxCost.m, you don't need
% to change anything here.
[sae2Features] = feedForwardAutoencoder(sae2OptTheta, hiddenSizeL2, ...
hiddenSizeL1, sae1Features);
% Randomly initialize the parameters
saeSoftmaxTheta = 0.005 * randn(hiddenSizeL2 * numClasses, 1);
%% ---------------------- YOUR CODE HERE ---------------------------------
% Instructions: Train the softmax classifier, the classifier takes in
% input of dimension "hiddenSizeL2" corresponding to the
% hidden layer size of the 2nd layer.
%
% You should store the optimal parameters in saeSoftmaxOptTheta
%
% NOTE: If you used softmaxTrain to complete this part of the exercise,
% set saeSoftmaxOptTheta = softmaxModel.optTheta(:);
options.maxIter = 100;
lambda = 1e-4; % Weight decay parameter
softmaxModel = softmaxTrain(hiddenSizeL2, numClasses, lambda, ...
sae2Features, trainLabels, options);
saeSoftmaxOptTheta = softmaxModel.optTheta(:);
% -------------------------------------------------------------------------
%%======================================================================
%% STEP 5: Finetune softmax model
% Implement the stackedAECost to give the combined cost of the whole model
% then run this cell.
% Initialize the stack using the parameters learned
stack = cell(2,1);
stack{1}.w = reshape(sae1OptTheta(1:hiddenSizeL1*inputSize), ...
hiddenSizeL1, inputSize);
stack{1}.b = sae1OptTheta(2*hiddenSizeL1*inputSize+1:2*hiddenSizeL1*inputSize+hiddenSizeL1);
stack{2}.w = reshape(sae2OptTheta(1:hiddenSizeL2*hiddenSizeL1), ...
hiddenSizeL2, hiddenSizeL1);
stack{2}.b = sae2OptTheta(2*hiddenSizeL2*hiddenSizeL1+1:2*hiddenSizeL2*hiddenSizeL1+hiddenSizeL2);
% Initialize the parameters for the deep model
[stackparams, netconfig] = stack2params(stack);
stackedAETheta = [ saeSoftmaxOptTheta ; stackparams ];
%% ---------------------- YOUR CODE HERE ---------------------------------
% Instructions: Train the deep network, hidden size here refers to the
% dimension of the input to the classifier, which corresponds
% to "hiddenSizeL2".
%
%
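% Store the fine-tuned result in stackedAEOptTheta so that stackedAETheta
% still holds the pre-fine-tuning parameters for the comparison in STEP 6.
[stackedAEOptTheta, cost] = minFunc( @(p) stackedAECost(p, ...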
inputSize,hiddenSizeL2 ,...
numClasses, netconfig, ...
lambda, trainData, trainLabels), ...
stackedAETheta, options);
% -------------------------------------------------------------------------
%%======================================================================
%% STEP 6: Test
% Instructions: You will need to complete the code in stackedAEPredict.m
% before running this part of the code
%
% Get labelled test images
% Note that we apply the same kind of preprocessing as the training set
testData = loadMNISTImages('t10k-images-idx3-ubyte');
testLabels = loadMNISTLabels('t10k-labels-idx1-ubyte');
testLabels(testLabels == 0) = 10; % Remap 0 to 10
[pred] = stackedAEPredict(stackedAETheta, inputSize, hiddenSizeL2, ...
numClasses, netconfig, testData);
acc = mean(testLabels(:) == pred(:));
fprintf('Before Finetuning Test Accuracy: %0.3f%%\n', acc * 100);
[pred] = stackedAEPredict(stackedAEOptTheta, inputSize, hiddenSizeL2, ...
numClasses, netconfig, testData);
acc = mean(testLabels(:) == pred(:));
fprintf('After Finetuning Test Accuracy: %0.3f%%\n', acc * 100);
% Accuracy is the proportion of correctly classified images
% The results for our implementation were:
%
% Before Finetuning Test Accuracy: 87.7%
% After Finetuning Test Accuracy: 97.6%
%
% If your values are too low (accuracy less than 95%), you should check
% your code for errors, and make sure you are training on the
% entire data set of 60000 28x28 training images
% (unless you modified the loading code, this should be the case)
function [ cost, grad ] = stackedAECost(theta, inputSize, hiddenSize, ...
numClasses, netconfig, ...
lambda, data, labels)
% stackedAECost: Takes a trained softmaxTheta and a training data set with labels,
% and returns cost and gradient using a stacked autoencoder model. Used for
% finetuning.
% theta: trained weights from the autoencoder
% inputSize: the number of input units
% hiddenSize: the number of hidden units *at the 2nd layer*
% numClasses: the number of categories
% netconfig: the network configuration of the stack
% lambda: the weight regularization penalty
% data: Our matrix containing the training data as columns. So, data(:,i) is the i-th training example.
% labels: A vector containing labels, where labels(i) is the label for the
% i-th training example
%% Unroll softmaxTheta parameter
% We first extract the part which compute the softmax gradient
softmaxTheta = reshape(theta(1:hiddenSize*numClasses), numClasses, hiddenSize);
% Extract out the "stack"
stack = params2stack(theta(hiddenSize*numClasses+1:end), netconfig);
% You will need to compute the following gradients
softmaxThetaGrad = zeros(size(softmaxTheta));
stackgrad = cell(size(stack));
for d = 1:numel(stack)
stackgrad{d}.w = zeros(size(stack{d}.w));
stackgrad{d}.b = zeros(size(stack{d}.b));
end
cost = 0; % You need to compute this
% You might find these variables useful
M = size(data, 2);
groundTruth = full(sparse(labels, 1:M, 1));
%% --------------------------- YOUR CODE HERE -----------------------------
% Instructions: Compute the cost function and gradient vector for
% the stacked autoencoder.
%
% You are given a stack variable which is a cell-array of
% the weights and biases for every layer. In particular, you
% can refer to the weights of Layer d, using stack{d}.w and
% the biases using stack{d}.b . To get the total number of
% layers, you can use numel(stack).
%
% The last layer of the network is connected to the softmax
% classification layer, softmaxTheta.
%
% You should compute the gradients for the softmaxTheta,
% storing that in softmaxThetaGrad. Similarly, you should
% compute the gradients for each layer in the stack, storing
% the gradients in stackgrad{d}.w and stackgrad{d}.b
% Note that the size of the matrices in stackgrad should
% match exactly that of the size of the matrices in stack.
%
%% Forward pass through the two hidden layers
m=size(data,2); % number of samples
a2=sigmoid(stack{1}.w*data+repmat(stack{1}.b,1,m));
a3=sigmoid(stack{2}.w*a2+repmat(stack{2}.b,1,m));
%% Softmax classifier
M=softmaxTheta*a3; % large entries in M can overflow exp, so subtract each column's maximum
M = bsxfun(@minus, M, max(M, [], 1));
ExpM=exp(M);
ExpM_row=sum(ExpM);
Prob=ExpM./repmat(ExpM_row,size(M,1),1);
% cost=-1/m*sum(sum(groundTruth .*log(Prob)))+sum(sum(theta.*theta))*lambda/2;
% Cost: cross-entropy plus weight decay on the softmax weights only.
% Written with sum(...) so it does not rely on sumsqr being available.
cost=-(groundTruth(:)'*log(Prob(:)))/m+lambda/2*sum(softmaxTheta(:).^2);
softmaxThetaGrad=-(groundTruth-Prob)*a3'/m+lambda*softmaxTheta; % gradient of the softmax weights
%% Backpropagate the softmax error into the hidden layers (no weight decay here)
delta3=-(softmaxTheta'*(groundTruth-Prob)).*(a3.*(1-a3));
stackgrad{2}.w=delta3*a2'/m;
stackgrad{2}.b=sum(delta3,2)/m;
% W2grad=delta3*a2'/m+lambda*W2;
delta2=(stack{2}.w'*delta3).*(a2.*(1-a2));
stackgrad{1}.w=delta2*data'/m;
stackgrad{1}.b=sum(delta2,2)/m;
% -------------------------------------------------------------------------
%% Roll gradient vector
grad = [softmaxThetaGrad(:) ; stack2params(stackgrad)];
end
% You might find this useful
function sigm = sigmoid(x)
sigm = 1 ./ (1 + exp(-x));
end
function [pred] = stackedAEPredict(theta, inputSize, hiddenSize, numClasses, netconfig, data)
% stackedAEPredict: Takes a trained theta and a test data set,
% and returns the predicted labels for each example.
% theta: trained weights from the autoencoder
% inputSize: the number of input units
% hiddenSize: the number of hidden units *at the 2nd layer*
% numClasses: the number of categories
% data: Our matrix containing the training data as columns. So, data(:,i) is the i-th training example.
% Your code should produce the prediction matrix
% pred, where pred(i) is argmax_c P(y(c) | x(i)).
%% Unroll theta parameter
% We first extract the part which compute the softmax gradient
softmaxTheta = reshape(theta(1:hiddenSize*numClasses), numClasses, hiddenSize);
% Extract out the "stack"
stack = params2stack(theta(hiddenSize*numClasses+1:end), netconfig);
%% ---------- YOUR CODE HERE --------------------------------------
% Instructions: Compute pred using theta assuming that the labels start
% from 1.
m=size(data,2); % number of samples
a2=sigmoid(stack{1}.w*data+repmat(stack{1}.b,1,m));
a3=sigmoid(stack{2}.w*a2+repmat(stack{2}.b,1,m));
%% Softmax classifier
M=softmaxTheta*a3; % large entries in M can overflow exp, so subtract each column's maximum
M = bsxfun(@minus, M, max(M, [], 1));
ExpM=exp(M);
ExpM_row=sum(ExpM);
Prob=ExpM./repmat(ExpM_row,size(M,1),1);
[~,pred]=max(Prob);
% -----------------------------------------------------------
end
% You might find this useful
function sigm = sigmoid(x)
sigm = 1 ./ (1 + exp(-x));
end
Original article: http://blog.csdn.net/aisikaov5/article/details/51193137