Using neural networks to recognize handwritten digits [translated] (Part 2)

Posted: 2016-03-26 13:53:55

 

A simple network to classify handwritten digits

 

Having defined neural networks, let's return to handwriting recognition. We can split the problem of recognizing handwritten digits into two sub-problems. First, we'd like a way of breaking an image containing many digits into a sequence of separate images, each containing a single digit. For example, we'd like to break the image

 

[Figure: an image containing a line of six handwritten digits]

 

into six separate images,

 

[Figure: the same six digits, segmented into six separate images]

 

We humans solve this segmentation problem with ease, but it's challenging for a computer program to correctly break up the image. Once the image has been segmented, the program then needs to classify each individual digit. So, for instance, we'd like our program to recognize that the first digit above,

 

[Figure: the first handwritten digit]

 

is a 5.

We'll focus on writing a program to solve the second problem, that is, classifying individual digits. We do this because it turns out that the segmentation problem is not so difficult to solve, once you have a good way of classifying individual digits. There are many approaches to solving the segmentation problem. One approach is to trial many different ways of segmenting the image, using the individual digit classifier to score each trial segmentation. A trial segmentation gets a high score if the individual digit classifier is confident of its classification in all segments, and a low score if the classifier is having a lot of trouble in one or more segments. The idea is that if the classifier is having trouble somewhere, then it's probably having trouble because the segmentation has been chosen incorrectly. This idea and other variations can be used to solve the segmentation problem quite well. So instead of worrying about segmentation we'll concentrate on developing a neural network which can solve the more interesting and difficult problem, namely, recognizing individual handwritten digits.

To recognize individual digits we will use a three-layer neural network:

 

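[Figure: the three-layer network, with an input layer, a hidden layer, and an output layer of 10 neurons]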

 

The input layer of the network contains neurons encoding the values of the input pixels. As discussed in the next section, our training data for the network will consist of many 28 by 28 pixel images of scanned handwritten digits, and so the input layer contains $784 = 28 \times 28$ neurons. For simplicity I've omitted most of the 784 input neurons in the diagram above. The input pixels are greyscale, with a value of 0.0 representing white, a value of 1.0 representing black, and in-between values representing gradually darkening shades of grey.
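As a rough illustration of how a 28 by 28 image maps onto 784 input neurons, here is a minimal Python sketch. The helper name `flatten_image` and the row-by-row ordering of the pixels are assumptions made for the example, not something the text specifies.

```python
# Illustrative sketch: the names below are hypothetical, and row-by-row
# ordering of the pixels is an assumption.
ROWS, COLS = 28, 28

def flatten_image(image):
    """Flatten a 28 by 28 grid of grey values into a 784-entry list."""
    if len(image) != ROWS or any(len(row) != COLS for row in image):
        raise ValueError("expected a 28 by 28 image")
    return [pixel for row in image for pixel in row]

# An all-white image (0.0) with a single black pixel (1.0) at row 3, column 5.
image = [[0.0] * COLS for _ in range(ROWS)]
image[3][5] = 1.0

x = flatten_image(image)
print(len(x))           # 784
print(x[3 * COLS + 5])  # 1.0
```

Each entry of `x` would feed one input neuron.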

The second layer of the network is a hidden layer. We denote the number of neurons in this hidden layer by $n$, and we'll experiment with different values for $n$. The example shown illustrates a small hidden layer, containing just $n = 15$ neurons.

The output layer of the network contains 10 neurons. If the first neuron fires, i.e., has an output $\approx 1$, then that will indicate that the network thinks the digit is a 0. If the second neuron fires then that will indicate that the network thinks the digit is a 1. And so on. A little more precisely, we number the output neurons from 0 through 9, and figure out which neuron has the highest activation value. If that neuron is, say, neuron number 6, then our network will guess that the input digit was a 6. And so on for the other output neurons.
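The rule "pick the output neuron with the highest activation" can be sketched in a few lines of Python. The function name `predicted_digit` and the activation values are made up for illustration.

```python
def predicted_digit(output_activations):
    """Return the index of the output neuron with the highest activation."""
    return max(range(len(output_activations)),
               key=lambda j: output_activations[j])

# Hypothetical activations of the 10 output neurons, numbered 0 through 9.
activations = [0.02, 0.01, 0.05, 0.03, 0.01, 0.10, 0.91, 0.04, 0.02, 0.06]
print(predicted_digit(activations))  # 6
```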

You might wonder why we use 10 output neurons. After all, the goal of the network is to tell us which digit ($0, 1, 2, \ldots, 9$) corresponds to the input image. A seemingly natural way of doing that is to use just 4 output neurons, treating each neuron as taking on a binary value, depending on whether the neuron's output is closer to 0 or to 1. Four neurons are enough to encode the answer, since $2^4 = 16$ is more than the 10 possible values for the input digit. Why should our network use 10 neurons instead? Isn't that inefficient? The ultimate justification is empirical: we can try out both network designs, and it turns out that, for this particular problem, the network with 10 output neurons learns to recognize digits better than the network with 4 output neurons. But that leaves us wondering why using 10 output neurons works better. Is there some heuristic that would tell us in advance that we should use the 10-output encoding instead of the 4-output encoding?

To understand why we do this, it helps to think about what the neural network is doing from first principles. Consider first the case where we use 10 output neurons. Let's concentrate on the first output neuron, the one that's trying to decide whether or not the digit is a 0. It does this by weighing up evidence from the hidden layer of neurons. What are those hidden neurons doing? Well, just suppose for the sake of argument that the first neuron in the hidden layer detects whether or not an image like the following is present:

 

[Figure: a simple component shape of a handwritten digit]

 

It can do this by heavily weighting input pixels which overlap with the image, and only lightly weighting the other inputs. In a similar way, let's suppose for the sake of argument that the second, third, and fourth neurons in the hidden layer detect whether or not the following images are present:

 

[Figure: three further component shapes]

 

As you may have guessed, these four images together make up the 0 image that we saw in the line of digits shown earlier:

 

[Figure: the handwritten 0 from the line of digits]

 

So if all four of these hidden neurons are firing then we can conclude that the digit is a 0. Of course, that's not the only sort of evidence we can use to conclude that the image was a 0 - we could legitimately get a 0 in many other ways (say, through translations of the above images, or slight distortions). But it seems safe to say that at least in this case we'd conclude that the input was a 0.

 

 

 

Supposing the neural network functions in this way, we can give a plausible explanation for why it's better to have 10 outputs from the network, rather than 4. If we had 4 outputs, then the first output neuron would be trying to decide what the most significant bit of the digit was. And there's no easy way to relate that most significant bit to simple shapes like those shown above. It's hard to imagine that there's any good historical reason the component shapes of the digit will be closely related to (say) the most significant bit in the output.

Now, with all that said, this is all just a heuristic. Nothing says that the three-layer neural network has to operate in the way I described, with the hidden neurons detecting simple component shapes. Maybe a clever learning algorithm will find some assignment of weights that lets us use only 4 output neurons. But as a heuristic the way of thinking I've described works pretty well, and can save you a lot of time in designing good neural network architectures.

 

Exercise

  • There is a way of determining the bitwise representation of a digit by adding an extra layer to the three-layer network above. The extra layer converts the output from the previous layer into a binary representation, as illustrated in the figure below. Find a set of weights and biases for the new output layer. Assume that the first 3 layers of neurons are such that the correct output in the third layer (i.e., the old output layer) has activation at least 0.99, and incorrect outputs have activation less than 0.01.

 

 

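[Figure: the network with an extra output layer that converts the old output layer into a binary representation]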

 

 

 

 

 

Learning with gradient descent

 

 

Now that we have a design for our neural network, how can it learn to recognize digits? The first thing we'll need is a data set to learn from - a so-called training data set. We'll use the MNIST data set, which contains tens of thousands of scanned images of handwritten digits, together with their correct classifications. MNIST's name comes from the fact that it is a modified subset of two data sets collected by NIST, the United States' National Institute of Standards and Technology. Here are a few images from MNIST:

 

[Figure: a few sample images from MNIST]

 

As you can see, these digits are, in fact, the same as those shown at the beginning of this chapter as a challenge to recognize. Of course, when testing our network we'll ask it to recognize images which aren't in the training set!

The MNIST data comes in two parts. The first part contains 60,000 images to be used as training data. These images are scanned handwriting samples from 250 people, half of whom were US Census Bureau employees, and half of whom were high school students. The images are greyscale and 28 by 28 pixels in size. The second part of the MNIST data set is 10,000 images to be used as test data. Again, these are 28 by 28 greyscale images. We'll use the test data to evaluate how well our neural network has learned to recognize digits. To make this a good test of performance, the test data was taken from a different set of 250 people than the original training data (albeit still a group split between Census Bureau employees and high school students). This helps give us confidence that our system can recognize digits from people whose writing it didn't see during training.

We'll use the notation $x$ to denote a training input. It'll be convenient to regard each training input $x$ as a $28 \times 28 = 784$-dimensional vector. Each entry in the vector represents the grey value for a single pixel in the image. We'll denote the corresponding desired output by $y = y(x)$, where $y$ is a 10-dimensional vector. For example, if a particular training image, $x$, depicts a 6, then $y(x) = (0, 0, 0, 0, 0, 0, 1, 0, 0, 0)^T$ is the desired output from the network. Note that $T$ here is the transpose operation, turning a row vector into an ordinary (column) vector.
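To make the desired-output vector concrete, here is a small sketch that builds $y(x)$ for a given digit. The function name `desired_output` is hypothetical, and a plain Python list stands in for the column vector.

```python
def desired_output(digit):
    """Return the 10-dimensional desired output y(x) for a digit 0..9."""
    if not 0 <= digit <= 9:
        raise ValueError("digit must be between 0 and 9")
    y = [0.0] * 10
    y[digit] = 1.0
    return y

print(desired_output(6))
# [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
```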

What we'd like is an algorithm which lets us find weights and biases so that the output from the network approximates $y(x)$ for all training inputs $x$. To quantify how well we're achieving this goal we define a cost function (sometimes referred to as a loss or objective function; we use the term cost function throughout this book, but you should note the other terminology, since it's often used in research papers and other discussions of neural networks):

$$C(w,b) \equiv \frac{1}{2n} \sum_x \| y(x) - a \|^2. \tag{6}$$

Here, $w$ denotes the collection of all weights in the network, $b$ all the biases, $n$ is the total number of training inputs, $a$ is the vector of outputs from the network when $x$ is input, and the sum is over all training inputs, $x$. Of course, the output $a$ depends on $x$, $w$ and $b$, but to keep the notation simple I haven't explicitly indicated this dependence. The notation $\|v\|$ just denotes the usual length function for a vector $v$. We'll call $C$ the quadratic cost function; it's also sometimes known as the mean squared error or just MSE. Inspecting the form of the quadratic cost function, we see that $C(w,b)$ is non-negative, since every term in the sum is non-negative. Furthermore, the cost $C(w,b)$ becomes small, i.e., $C(w,b) \approx 0$, precisely when $y(x)$ is approximately equal to the output, $a$, for all training inputs, $x$. So our training algorithm has done a good job if it can find weights and biases so that $C(w,b) \approx 0$. By contrast, it's not doing so well when $C(w,b)$ is large - that would mean that $y(x)$ is not close to the output $a$ for a large number of inputs. So the aim of our training algorithm will be to minimize the cost $C(w,b)$ as a function of the weights and biases. In other words, we want to find a set of weights and biases which make the cost as small as possible. We'll do that using an algorithm known as gradient descent.
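As a sanity check on Equation (6), the following sketch computes the quadratic cost directly from lists of network outputs and desired outputs. The function name `quadratic_cost` and the tiny two-component vectors are assumptions chosen to keep the arithmetic easy to follow; real MNIST outputs would have 10 components.

```python
def quadratic_cost(outputs, desired):
    """C = 1/(2n) * sum over training inputs of ||y(x) - a||^2."""
    n = len(outputs)
    total = 0.0
    for a, y in zip(outputs, desired):
        # Squared length of the difference vector y(x) - a.
        total += sum((y_j - a_j) ** 2 for y_j, a_j in zip(y, a))
    return total / (2 * n)

desired = [[1.0, 0.0], [0.0, 1.0]]
perfect = [[1.0, 0.0], [0.0, 1.0]]   # outputs equal the desired outputs
poor = [[0.5, 0.5], [0.0, 1.0]]      # the first output is far from y(x)

print(quadratic_cost(perfect, desired))  # 0.0
print(quadratic_cost(poor, desired))     # 0.125
```

The cost is zero when every output matches its desired output, and grows as outputs drift away from them.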

 

 

 

 

 

 

 

 

 

 

 

Why introduce the quadratic cost? After all, aren't we primarily interested in the number of images correctly classified by the network? Why not try to maximize that number directly, rather than minimizing a proxy measure like the quadratic cost? The problem with that is that the number of images correctly classified is not a smooth function of the weights and biases in the network. For the most part, making small changes to the weights and biases won't cause any change at all in the number of training images classified correctly. That makes it difficult to figure out how to change the weights and biases to get improved performance. If we instead use a smooth cost function like the quadratic cost it turns out to be easy to figure out how to make small changes in the weights and biases so as to get an improvement in the cost. That's why we focus first on minimizing the quadratic cost, and only after that will we examine the classification accuracy.

 

Even given that we want to use a smooth cost function, you may still wonder why we choose the quadratic function used in Equation (6). Isn't this a rather ad hoc choice? Perhaps if we chose a different cost function we'd get a totally different set of minimizing weights and biases? This is a valid concern, and later we'll revisit the cost function, and make some modifications. However, the quadratic cost function of Equation (6) works perfectly well for understanding the basics of learning in neural networks, so we'll stick with it for now.

Recapping, our goal in training a neural network is to find weights and biases which minimize the quadratic cost function $C(w,b)$. This is a well-posed problem, but it's got a lot of distracting structure as currently posed - the interpretation of $w$ and $b$ as weights and biases, the $\sigma$ function lurking in the background, the choice of network architecture, MNIST, and so on. It turns out that we can understand a tremendous amount by ignoring most of that structure, and just concentrating on the minimization aspect. So for now we're going to forget all about the specific form of the cost function, the connection to neural networks, and so on. Instead, we're going to imagine that we've simply been given a function of many variables and we want to minimize that function. We're going to develop a technique called gradient descent which can be used to solve such minimization problems. Then we'll come back to the specific function we want to minimize for neural networks.

Okay, let's suppose we're trying to minimize some function, $C(v)$. This could be any real-valued function of many variables, $v = v_1, v_2, \ldots$. Note that I've replaced the $w$ and $b$ notation by $v$ to emphasize that this could be any function - we're not specifically thinking in the neural networks context any more. To minimize $C(v)$ it helps to imagine $C$ as a function of just two variables, which we'll call $v_1$ and $v_2$:

 

[Figure: plot of $C$ as a function of $v_1$ and $v_2$, shaped like a valley]

 

What we'd like is to find where $C$ achieves its global minimum. Now, of course, for the function plotted above, we can eyeball the graph and find the minimum. In that sense, I've perhaps shown slightly too simple a function! A general function, $C$, may be a complicated function of many variables, and it won't usually be possible to just eyeball the graph to find the minimum.

One way of attacking the problem is to use calculus to try to find the minimum analytically. We could compute derivatives and then try using them to find places where $C$ is an extremum. With some luck that might work when $C$ is a function of just one or a few variables. But it'll turn into a nightmare when we have many more variables. And for neural networks we'll often want far more variables - the biggest neural networks have cost functions which depend on billions of weights and biases in an extremely complicated way. Using calculus to minimize that just won't work!

(After asserting that we'll gain insight by imagining $C$ as a function of just two variables, I've turned around twice in two paragraphs and said, "hey, but what if it's a function of many more than two variables?" Sorry about that. Please believe me when I say that it really does help to imagine $C$ as a function of two variables. It just happens that sometimes that picture breaks down, and the last two paragraphs were dealing with such breakdowns. Good thinking about mathematics often involves juggling multiple intuitive pictures, learning when it's appropriate to use each picture, and when it's not.)

Okay, so calculus doesn't work. Fortunately, there is a beautiful analogy which suggests an algorithm which works pretty well. We start by thinking of our function as a kind of a valley. If you squint just a little at the plot above, that shouldn't be too hard. And we imagine a ball rolling down the slope of the valley. Our everyday experience tells us that the ball will eventually roll to the bottom of the valley. Perhaps we can use this idea as a way to find a minimum for the function? We'd randomly choose a starting point for an (imaginary) ball, and then simulate the motion of the ball as it rolled down to the bottom of the valley. We could do this simulation simply by computing derivatives (and perhaps some second derivatives) of $C$ - those derivatives would tell us everything we need to know about the local "shape" of the valley, and therefore how our ball should roll.

Based on what I've just written, you might suppose that we'll be trying to write down Newton's equations of motion for the ball, considering the effects of friction and gravity, and so on. Actually, we're not going to take the ball-rolling analogy quite that seriously - we're devising an algorithm to minimize $C$, not developing an accurate simulation of the laws of physics! The ball's-eye view is meant to stimulate our imagination, not constrain our thinking. So rather than get into all the messy details of physics, let's simply ask ourselves: if we were declared God for a day, and could make up our own laws of physics, dictating to the ball how it should roll, what law or laws of motion could we pick that would make it so the ball always rolled to the bottom of the valley?

To make this question more precise, let's think about what happens when we move the ball a small amount $\Delta v_1$ in the $v_1$ direction, and a small amount $\Delta v_2$ in the $v_2$ direction. Calculus tells us that $C$ changes as follows:

$$\Delta C \approx \frac{\partial C}{\partial v_1} \Delta v_1 + \frac{\partial C}{\partial v_2} \Delta v_2. \tag{7}$$

We're going to find a way of choosing $\Delta v_1$ and $\Delta v_2$ so as to make $\Delta C$ negative; i.e., we'll choose them so the ball is rolling down into the valley. To figure out how to make such a choice it helps to define $\Delta v$ to be the vector of changes in $v$, $\Delta v \equiv (\Delta v_1, \Delta v_2)^T$, where $T$ is again the transpose operation, turning row vectors into column vectors. We'll also define the gradient of $C$ to be the vector of partial derivatives, $\left( \frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2} \right)^T$. We denote the gradient vector by $\nabla C$, i.e.:

$$\nabla C \equiv \left( \frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2} \right)^T. \tag{8}$$

In a moment we'll rewrite the change $\Delta C$ in terms of $\Delta v$ and the gradient, $\nabla C$. Before getting to that, though, I want to clarify something that sometimes gets people hung up on the gradient. When meeting the $\nabla C$ notation for the first time, people sometimes wonder how they should think about the $\nabla$ symbol. What, exactly, does $\nabla$ mean? In fact, it's perfectly fine to think of $\nabla C$ as a single mathematical object - the vector defined above - which happens to be written using two symbols. In this point of view, $\nabla$ is just a piece of notational flag-waving, telling you "hey, $\nabla C$ is a gradient vector". There are more advanced points of view where $\nabla$ can be viewed as an independent mathematical entity in its own right (for example, as a differential operator), but we won't need such points of view.

 

With these definitions, the expression (7) for $\Delta C$ can be rewritten as

$$\Delta C \approx \nabla C \cdot \Delta v. \tag{9}$$

This equation helps explain why $\nabla C$ is called the gradient vector: $\nabla C$ relates changes in $v$ to changes in $C$, just as we'd expect something called a gradient to do. But what's really exciting about the equation is that it lets us see how to choose $\Delta v$ so as to make $\Delta C$ negative. In particular, suppose we choose

$$\Delta v = -\eta \nabla C, \tag{10}$$

where $\eta$ is a small, positive parameter (known as the learning rate). Then Equation (9) tells us that $\Delta C \approx -\eta \nabla C \cdot \nabla C = -\eta \|\nabla C\|^2$. Because $\|\nabla C\|^2 \geq 0$, this guarantees that $\Delta C \leq 0$, i.e., $C$ will always decrease, never increase, if we change $v$ according to the prescription in (10). (Within, of course, the limits of the approximation in Equation (9).) This is exactly the property we wanted! And so we'll take Equation (10) to define the "law of motion" for the ball in our gradient descent algorithm. That is, we'll use Equation (10) to compute a value for $\Delta v$, then move the ball's position $v$ by that amount:

$$v \rightarrow v' = v - \eta \nabla C. \tag{11}$$

Then we'll use this update rule again, to make another move. If we keep doing this, over and over, we'll keep decreasing $C$ until - we hope - we reach a global minimum.
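The update rule in Equation (11) can be tried out on a toy function. The sketch below assumes the example function $C(v) = v_1^2 + v_2^2$, whose gradient is $(2v_1, 2v_2)^T$; this function, the starting point, and the learning rate are illustrative choices, not taken from the text.

```python
def C(v):
    """Toy cost with its minimum at the origin (an assumed example)."""
    return v[0] ** 2 + v[1] ** 2

def grad_C(v):
    """Gradient of the toy cost: (dC/dv1, dC/dv2)."""
    return [2 * v[0], 2 * v[1]]

def gradient_descent(v, eta, steps):
    """Repeatedly apply the update v -> v' = v - eta * grad C."""
    for _ in range(steps):
        g = grad_C(v)
        v = [v_j - eta * g_j for v_j, g_j in zip(v, g)]
    return v

v_start = [2.0, -1.5]
v_end = gradient_descent(v_start, eta=0.1, steps=50)
print(C(v_start), C(v_end))  # the cost falls from 6.25 to nearly zero
```

For this particular function each step shrinks both coordinates by a factor of 0.8, so the "ball" moves steadily toward the bottom of the valley.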

 

Summing up, the way the gradient descent algorithm works is to repeatedly compute the gradient $\nabla C$, and then to move in the opposite direction, "falling down" the slope of the valley. We can visualize it like this:

 

[Figure: the valley plot with the ball moving in the direction opposite to the gradient]

 

Notice that with this rule gradient descent doesn't reproduce real physical motion. In real life a ball has momentum, and that momentum may allow it to roll across the slope, or even (momentarily) roll uphill. It's only after the effects of friction set in that the ball is guaranteed to roll down into the valley. By contrast, our rule for choosing $\Delta v$ just says "go down, right now". That's still a pretty good rule for finding the minimum!

To make gradient descent work correctly, we need to choose the learning rate $\eta$ to be small enough that Equation (9) is a good approximation. If we don't, we might end up with $\Delta C > 0$, which obviously would not be good! At the same time, we don't want $\eta$ to be too small, since that will make the changes $\Delta v$ tiny, and thus the gradient descent algorithm will work very slowly. In practical implementations, $\eta$ is often varied so that Equation (9) remains a good approximation, but the algorithm isn't too slow. We'll see later how this works.
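The effect of the learning rate can be seen on a toy example. The sketch below again assumes the illustrative function $C(v) = v_1^2 + v_2^2$ and takes a single gradient descent step with a small and a large value of $\eta$; both values are arbitrary choices for the demonstration.

```python
def C(v):
    """Toy cost (an assumed example), with gradient (2*v1, 2*v2)."""
    return v[0] ** 2 + v[1] ** 2

def step(v, eta):
    """One gradient descent update v -> v - eta * grad C."""
    grad = [2 * v[0], 2 * v[1]]
    return [v_j - eta * g_j for v_j, g_j in zip(v, grad)]

v = [1.0, 1.0]                      # C(v) = 2.0
cost_small = C(step(v, eta=0.1))    # about 1.28: the cost decreases
cost_large = C(step(v, eta=1.1))    # about 2.88: the step overshoots
print(C(v), cost_small, cost_large)
```

With the large learning rate the linear approximation no longer holds over the length of the step, and the cost goes up instead of down.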

I've explained gradient descent when $C$ is a function of just two variables. But, in fact, everything works just as well even when $C$ is a function of many more variables. Suppose in particular that $C$ is a function of $m$ variables, $v_1, \ldots, v_m$. Then the change $\Delta C$ in $C$ produced by a small change $\Delta v = (\Delta v_1, \ldots, \Delta v_m)^T$ is

$$\Delta C \approx \nabla C \cdot \Delta v, \tag{12}$$

where the gradient $\nabla C$ is the vector

$$\nabla C \equiv \left( \frac{\partial C}{\partial v_1}, \ldots, \frac{\partial C}{\partial v_m} \right)^T. \tag{13}$$

Just as for the two variable case, we can choose

$$\Delta v = -\eta \nabla C, \tag{14}$$

and we're guaranteed that our (approximate) expression (12) for $\Delta C$ will be negative. This gives us a way of following the gradient to a minimum, even when $C$ is a function of many variables, by repeatedly applying the update rule

$$v \rightarrow v' = v - \eta \nabla C. \tag{15}$$

You can think of this update rule as defining the gradient descent algorithm. It gives us a way of repeatedly changing the position $v$ in order to find a minimum of the function $C$. The rule doesn't always work - several things can go wrong and prevent gradient descent from finding the global minimum of $C$, a point we'll return to explore in later chapters. But, in practice gradient descent often works extremely well, and in neural networks we'll find that it's a powerful way of minimizing the cost function, and so helping the net learn.

 

 

 

Indeed, there's even a sense in which gradient descent is the optimal strategy for searching for a minimum. Let's suppose that we're trying to make a move $\Delta v$ in position so as to decrease $C$ as much as possible. This is equivalent to minimizing $\Delta C \approx \nabla C \cdot \Delta v$. We'll constrain the size of the move so that $\|\Delta v\| = \epsilon$ for some small fixed $\epsilon > 0$. In other words, we want a move that is a small step of a fixed size, and we're trying to find the movement direction which decreases $C$ as much as possible. It can be proved that the choice of $\Delta v$ which minimizes $\nabla C \cdot \Delta v$ is $\Delta v = -\eta \nabla C$, where $\eta = \epsilon / \|\nabla C\|$ is determined by the size constraint $\|\Delta v\| = \epsilon$. So gradient descent can be viewed as a way of taking small steps in the direction which does the most to immediately decrease $C$.
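This is not a proof, but the claim can be checked numerically. The sketch below assumes a two-dimensional example gradient $(3, -4)^T$ and a step size of 0.01 (both arbitrary), then compares the gradient descent step against many random steps of the same length.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    return math.sqrt(dot(v, v))

grad = [3.0, -4.0]   # assumed example gradient, with length 5
epsilon = 0.01       # fixed step length

# The gradient descent step: dv = -eta * grad with eta = epsilon / ||grad||.
eta = epsilon / norm(grad)
best = [-eta * g for g in grad]
best_change = dot(grad, best)    # equals -epsilon * ||grad||, about -0.05

# Try many random directions with the same step length.
random.seed(0)
random_changes = []
for _ in range(1000):
    theta = random.uniform(0.0, 2.0 * math.pi)
    dv = [epsilon * math.cos(theta), epsilon * math.sin(theta)]
    random_changes.append(dot(grad, dv))

print(best_change, min(random_changes))
```

None of the random steps should produce a change smaller than the one from the gradient descent step.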

 

Exercises

  • Prove the assertion of the last paragraph. Hint: If you're not already familiar with the Cauchy-Schwarz inequality, you may find it helpful to familiarize yourself with it.

     

     

  • I explained gradient descent when $C$ is a function of two variables, and when it's a function of more than two variables. What happens when $C$ is a function of just one variable? Can you provide a geometric interpretation of what gradient descent is doing in the one-dimensional case?

 

 

People have investigated many variations of gradient descent, including variations that more closely mimic a real physical ball. These ball-mimicking variations have some advantages, but also have a major disadvantage: it turns out to be necessary to compute second partial derivatives of $C$, and this can be quite costly. To see why it's costly, suppose we want to compute all the second partial derivatives $\partial^2 C / \partial v_j \partial v_k$. If there are a million such $v_j$ variables then we'd need to compute something like a trillion (i.e., a million squared) second partial derivatives! (Actually, more like half a trillion, since $\partial^2 C / \partial v_j \partial v_k = \partial^2 C / \partial v_k \partial v_j$. Still, you get the point.) That's going to be computationally costly. With that said, there are tricks for avoiding this kind of problem, and finding alternatives to gradient descent is an active area of investigation. But in this book we'll use gradient descent (and variations) as our main approach to learning in neural networks.
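The counting argument is easy to check with a few lines of arithmetic. The variable names below are illustrative only.

```python
n = 1_000_000  # number of variables v_j

# Every ordered pair (j, k) gives one second partial derivative.
all_pairs = n * n

# Since the mixed partials are symmetric, only pairs with j <= k are distinct.
distinct = n * (n + 1) // 2

print(all_pairs)  # 1000000000000
print(distinct)   # 500000500000
```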

How can we apply gradient descent to learn in a neural network? The idea is to use gradient descent to find the weights $w_k$ and biases $b_l$ which minimize the cost in Equation (6). To see how this works, let's restate the gradient descent update rule, with the weights and biases replacing the variables $v_j$. In other words, our "position" now has components $w_k$ and $b_l$, and the gradient vector $\nabla C$ has corresponding components $\partial C / \partial w_k$ and $\partial C / \partial b_l$. Writing out the gradient descent update rule in terms of components, we have

$$w_k \rightarrow w_k' = w_k - \eta \frac{\partial C}{\partial w_k} \tag{16}$$
$$b_l \rightarrow b_l' = b_l - \eta \frac{\partial C}{\partial b_l}. \tag{17}$$

By repeatedly applying this update rule we can "roll down the hill", and hopefully find a minimum of the cost function. In other words, this is a rule which can be used to learn in a neural network.

 

There are a number of challenges in applying the gradient descent rule. We'll look into those in depth in later chapters. But for now I just want to mention one problem. To understand what the problem is, let's look back at the quadratic cost in Equation (6). Notice that this cost function has the form $C = \frac{1}{n} \sum_x C_x$, that is, it's an average over costs $C_x \equiv \frac{\|y(x) - a\|^2}{2}$ for individual training examples. In practice, to compute the gradient $\nabla C$ we need to compute the gradients $\nabla C_x$ separately for each training input, $x$, and then average them, $\nabla C = \frac{1}{n} \sum_x \nabla C_x$. Unfortunately, when the number of training inputs is very large this can take a long time, and learning thus occurs slowly.

An idea called stochastic gradient descent can be used to speed up learning. The idea is to estimate the gradient $\nabla C$ by computing $\nabla C_x$ for a small sample of randomly chosen training inputs. By averaging over this small sample it turns out that we can quickly get a good estimate of the true gradient $\nabla C$, and this helps speed up gradient descent, and thus learning.

To make these ideas more precise, stochastic gradient descent works by picking out a small number $m$ of randomly chosen training inputs. We'll label those random training inputs $X_1, X_2, \ldots, X_m$, and refer to them as a mini-batch. Provided the sample size $m$ is large enough we expect that the average value of the $\nabla C_{X_j}$ will be roughly equal to the average over all $\nabla C_x$, that is,

$$\frac{\sum_{j=1}^m \nabla C_{X_j}}{m} \approx \frac{\sum_x \nabla C_x}{n} = \nabla C, \tag{18}$$

where the second sum is over the entire set of training data. Swapping sides we get

$$\nabla C \approx \frac{1}{m} \sum_{j=1}^m \nabla C_{X_j}, \tag{19}$$

confirming that we can estimate the overall gradient by computing gradients just for the randomly chosen mini-batch.
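To see the mini-batch estimate at work, here is a small sketch using a toy per-example cost $C_x(w) = (w - x)^2 / 2$ with a single scalar parameter $w$, so that $\nabla C_x = w - x$. The toy cost, the random "training inputs", and the sizes $n = 10{,}000$ and $m = 100$ are all assumptions made for the illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

def grad_Cx(w, x):
    """Gradient of the toy per-example cost C_x(w) = (w - x)^2 / 2."""
    return w - x

training_inputs = [random.random() for _ in range(10000)]
w = 2.0

# Full gradient: average of the per-example gradients over all n inputs.
full_gradient = sum(grad_Cx(w, x) for x in training_inputs) / len(training_inputs)

# Mini-batch estimate: average over m randomly chosen inputs.
m = 100
mini_batch = random.sample(training_inputs, m)
estimate = sum(grad_Cx(w, x) for x in mini_batch) / m

print(full_gradient, estimate)  # the two values should be close
```

The estimate uses one hundredth of the gradient computations, at the price of some statistical fluctuation.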

 

To connect this explicitly to learning in neural networks, suppose $w_k$ and $b_l$ denote the weights and biases in our neural network. Then stochastic gradient descent works by picking out a randomly chosen mini-batch of training inputs, and training with those,

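$$w_k \rightarrow w_k' = w_k - \frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial w_k} \tag{20}$$
$$b_l \rightarrow b_l' = b_l - \frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial b_l}, \tag{21}$$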

where the sums are over all the training examples $X_j$ in the current mini-batch. Then we pick out another randomly chosen mini-batch and train with those. And so on, until we've exhausted the training inputs, which is said to complete an epoch of training. At that point we start over with a new training epoch.
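The epoch and mini-batch structure can be sketched as a short loop. To keep the example self-contained it again uses a toy cost $C_x(w) = (w - x)^2 / 2$ with one scalar parameter, whose minimizer is the mean of the training inputs; the function name `sgd`, the shuffle-then-slice way of forming mini-batches, and all the parameter values are assumptions, not the program developed later in the text.

```python
import random

def sgd(training_inputs, w, eta, m, epochs):
    """Toy stochastic gradient descent for C_x(w) = (w - x)^2 / 2."""
    data = list(training_inputs)
    for _ in range(epochs):
        # Shuffle, then slice into consecutive mini-batches of size m, so
        # each epoch uses every training input exactly once.
        random.shuffle(data)
        mini_batches = [data[k:k + m] for k in range(0, len(data), m)]
        for batch in mini_batches:
            # Sum of per-example gradients (w - x) over the mini-batch.
            grad_sum = sum(w - x for x in batch)
            w = w - (eta / len(batch)) * grad_sum
    return w

random.seed(1)
training_inputs = [random.random() for _ in range(1000)]
w_final = sgd(training_inputs, w=5.0, eta=0.1, m=10, epochs=5)
mean = sum(training_inputs) / len(training_inputs)
print(w_final, mean)  # w_final should end up close to the mean
```

Because each update sees only a mini-batch, `w_final` fluctuates slightly around the true minimizer rather than landing on it exactly.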

 

Incidentally, it's worth noting that conventions vary about scaling of the cost function and of mini-batch updates to the weights and biases. In Equation (6) we scaled the overall cost function by a factor $\frac{1}{n}$. People sometimes omit the $\frac{1}{n}$, summing over the costs of individual training examples instead of averaging. This is particularly useful when the total number of training examples isn't known in advance. This can occur if more training data is being generated in real time, for instance. And, in a similar way, the mini-batch update rules (20) and (21) sometimes omit the $\frac{1}{m}$ term out the front of the sums. Conceptually this makes little difference, since it's equivalent to rescaling the learning rate $\eta$. But when doing detailed comparisons of different work it's worth watching out for.

We can think of stochastic gradient descent as being like political polling: it's much easier to sample a small mini-batch than it is to apply gradient descent to the full batch, just as carrying out a poll is easier than running a full election. For example, if we have a training set of size $n = 60{,}000$, as in MNIST, and choose a mini-batch size of (say) $m = 10$, this means we'll get a factor of 6,000 speedup in estimating the gradient! Of course, the estimate won't be perfect - there will be statistical fluctuations - but it doesn't need to be perfect: all we really care about is moving in a general direction that will help decrease $C$, and that means we don't need an exact computation of the gradient. In practice, stochastic gradient descent is a commonly used and powerful technique for learning in neural networks, and it's the basis for most of the learning techniques we'll develop in this book.

 

 

 

 

 

 

Exercise

  • An extreme version of gradient descent is to use a mini-batch size of just 1. That is, given a training input, $x$, we update our weights and biases according to the rules $w_k \rightarrow w_k' = w_k - \eta \, \partial C_x / \partial w_k$ and $b_l \rightarrow b_l' = b_l - \eta \, \partial C_x / \partial b_l$. Then we choose another training input, and update the weights and biases again. And so on, repeatedly. This procedure is known as online, on-line, or incremental learning. In online learning, a neural network learns from just one training input at a time (just as human beings do). Name one advantage and one disadvantage of online learning, compared to stochastic gradient descent with a mini-batch size of, say, 20.

 

Let me conclude this section by discussing a point that sometimes bugs people new to gradient descent. In neural networks the cost $C$ is, of course, a function of many variables - all the weights and biases - and so in some sense defines a surface in a very high-dimensional space. Some people get hung up thinking: "Hey, I have to be able to visualize all these extra dimensions". And they may start to worry: "I can't think in four dimensions, let alone five (or five million)". Is there some special ability they're missing, some ability that "real" supermathematicians have? Of course, the answer is no. Even most professional mathematicians can't visualize four dimensions especially well, if at all. The trick they use, instead, is to develop other ways of representing what's going on. That's exactly what we did above: we used an algebraic (rather than visual) representation of $\Delta C$ to figure out how to move so as to decrease $C$. People who are good at thinking in high dimensions have a mental library containing many different techniques along these lines; our algebraic trick is just one example. Those techniques may not have the simplicity we're accustomed to when visualizing three dimensions, but once you build up a library of such techniques, you can get pretty good at thinking in high dimensions. I won't go into more detail here, but if you're interested then you may enjoy reading this discussion of some of the techniques professional mathematicians use to think in high dimensions. While some of the techniques discussed are quite complex, much of the best content is intuitive and accessible, and could be mastered by anyone.

 

 

Implementing our network to classify digits

 

Alright, let's write a program that learns how to recognize handwritten digits, using stochastic gradient descent and the MNIST training data. We'll do this with a short Python (2.7) program, just 74 lines of code! The first thing we need is to get the MNIST data. If you're a git user then you can obtain the data by cloning the code repository for this book,

 

git clone https://github.com/mnielsen/neural-networks-and-deep-learning.git

 

If you don't use git then you can download the data and code here.

Incidentally, when I described the MNIST data earlier, I said it was split into 60,000 training images, and 10,000 test images. That's the official MNIST description. Actually, we're going to split the data a little differently. We'll leave the test images as is, but split the 60,000-image MNIST training set into two parts: a set of 50,000 images, which we'll use to train our neural network, and a separate 10,000 image validation set. We won't use the validation data in this chapter, but later in the book we'll find it useful in figuring out how to set certain hyper-parameters of the neural network - things like the learning rate, and so on, which aren't directly selected by our learning algorithm. Although the validation data isn't part of the original MNIST specification, many people use MNIST in this fashion, and the use of validation data is common in neural networks. When I refer to the "MNIST training data" from now on, I'll be referring to our 50,000 image data set, not the original 60,000 image data set. (As noted earlier, the MNIST data set is based on two data sets collected by NIST, the United States' National Institute of Standards and Technology. To construct MNIST the NIST data sets were stripped down and put into a more convenient format by Yann LeCun, Corinna Cortes, and Christopher J. C. Burges. See this link for more details. The data set in my repository is in a form that makes it easy to load and manipulate the MNIST data in Python. I obtained this particular form of the data from the LISA machine learning laboratory at the University of Montreal (link).)

 

Apart from the MNIST data we also need a Python library called Numpy, for doing fast linear algebra. If you don't already have Numpy installed, you can get it here.

Let me explain the core features of the neural networks code, before giving a full listing, below. The centerpiece is a Network class, which we use to represent a neural network. Here's the code we use to initialize a Network object:

 

class Network(object):

    def __init__(self, sizes):
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x) 
                        for x, y in zip(sizes[:-1], sizes[1:])]

 

In this code, the list sizes contains the number of neurons in the respective layers. So, for example, if we want to create a Network object with 2 neurons in the first layer, 3 neurons in the second layer, and 1 neuron in the final layer, we'd do this with the code:

net = Network([2, 3, 1])

The biases and weights in the Network object are all initialized randomly, using the Numpy np.random.randn function to generate Gaussian distributions with mean 0 and standard deviation 1. This random initialization gives our stochastic gradient descent algorithm a place to start from. In later chapters we'll find better ways of initializing the weights and biases, but this will do for now. Note that the Network initialization code assumes that the first layer of neurons is an input layer, and omits to set any biases for those neurons, since biases are only ever used in computing the outputs from later layers.
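To make the layout concrete, here is a minimal sketch that repeats the two list comprehensions from the initialization code for sizes = [2, 3, 1] and prints the resulting array shapes. It stands alone rather than using the Network class, and it is written to run under Python 3 as well as Python 2.7.

```python
import numpy as np

sizes = [2, 3, 1]  # 2 input neurons, 3 hidden neurons, 1 output neuron

# The same list comprehensions as in Network.__init__ above.
biases = [np.random.randn(y, 1) for y in sizes[1:]]
weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]

# One bias column vector per non-input layer.
print([b.shape for b in biases])
# One weight matrix per pair of adjacent layers, shaped (neurons out, neurons in).
print([w.shape for w in weights])
```

The input layer contributes no bias vector, which is why there are only two entries in each list for a three-layer network.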

 

Note also that the biases and weights are stored as lists of Numpy matrices. So, for example net.weights[1] is a Numpy matrix storing the weights connecting the second and third layers of neurons. (It's not the first and second layers, since Python's list indexing starts at 0.) Since net.weights[1] is rather verbose, let's just denote that matrix $w$. It's a matrix such that $w_{jk}$ is the weight for the connection between the $k^{\rm th}$ neuron in the second layer, and the $j^{\rm th}$ neuron in the third layer. This ordering of the $j$ and $k$ indices may seem strange - surely it'd make more sense to swap the $j$ and $k$ indices around? The big advantage of using this ordering is that it means that the vector of activations of the third layer of neurons is:

$$a' = \sigma(w a + b). \tag{22}$$

There's quite a bit going on in this equation, so let's unpack it piece by piece. $a$ is the vector of activations of the second layer of neurons. To obtain $a'$ we multiply $a$ by the weight matrix $w$, and add the vector $b$ of biases. We then apply the function $\sigma$ elementwise to every entry in the vector $w a + b$. (This is called vectorizing the function $\sigma$.) It's easy to verify that Equation (22) gives the same result as our earlier rule, Equation (4), for computing the output of a sigmoid neuron.
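If you'd like a numerical sanity check before doing the algebra, the following sketch compares the matrix form of Equation (22) with a neuron-by-neuron computation on small random arrays. The layer sizes (2 and 3) and the fixed seed are arbitrary choices made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)   # fixed seed so the check is repeatable
w = rng.randn(3, 2)              # weights from a 2-neuron layer to a 3-neuron layer
b = rng.randn(3, 1)              # biases of the 3-neuron layer
a = rng.randn(2, 1)              # activations of the 2-neuron layer

# Matrix form, as in Equation (22).
a_next = sigmoid(np.dot(w, a) + b)

# Neuron-by-neuron form: for each neuron j, a weighted sum over inputs k plus a bias.
a_loop = np.zeros((3, 1))
for j in range(3):
    z_j = sum(w[j, k] * a[k, 0] for k in range(2)) + b[j, 0]
    a_loop[j, 0] = sigmoid(z_j)

print(np.allclose(a_next, a_loop))
```

A numerical agreement like this is only a spot check on one random example; it doesn't replace writing the equation out in components.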

 

 

Exercise

  • Write out Equation (22) in component form, and verify that it gives the same result as the rule (4) for computing the output of a sigmoid neuron.

 

With all this in mind, it's easy to write code computing the output from a Network instance. We begin by defining the sigmoid function:

def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))

Note that when the input z is a vector or Numpy array, Numpy automatically applies the function sigmoid elementwise, that is, in vectorized form.

 

We then add a feedforward method to the Network class, which, given an input a for the network, returns the corresponding output. (It is assumed that the input a is an (n, 1) Numpy ndarray, not a (n,) vector. Here, n is the number of inputs to the network. If you try to use an (n,) vector as input you'll get strange results. Although using an (n,) vector appears the more natural choice, using an (n, 1) ndarray makes it particularly easy to modify the code to feedforward multiple inputs at once, and that is sometimes convenient.) All the method does is apply Equation (22) for each layer:

    def feedforward(self, a):
        """Return the output of the network if "a" is input."""
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a)+b)
        return a
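The "strange results" from an (n,) input come from Numpy's broadcasting rules. Here is a small sketch of one layer's weighted input computed both ways; the all-ones weights and zero biases are made-up values chosen only to keep the shapes easy to follow.

```python
import numpy as np

w = np.ones((3, 2))          # weights into a layer of 3 neurons from 2 inputs
b = np.zeros((3, 1))         # biases stored as a (3, 1) column, as in Network

a_column = np.ones((2, 1))   # input as an (n, 1) ndarray
a_flat = np.ones(2)          # input as an (n,) vector

z_column = np.dot(w, a_column) + b   # (3, 1) + (3, 1) stays (3, 1)
z_flat = np.dot(w, a_flat) + b       # (3,) + (3, 1) broadcasts to a 3x3 matrix

print(z_column.shape)
print(z_flat.shape)
```

No error is raised in the second case, which is what makes the mistake easy to miss: the layer silently produces a matrix where a column vector was expected.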

 

Of course, the main thing we want our Network objects to do is to learn. To that end we'll give them an SGD method which implements stochastic gradient descent. Here's the code. It's a little mysterious in a few places, but I'll break it down below, after the listing.

 

    def SGD(self, training_data, epochs, mini_batch_size, eta,
            test_data=None):
        """Train the neural network using mini-batch stochastic
        gradient descent.  The "training_data" is a list of tuples
        "(x, y)" representing the training inputs and the desired
        outputs.  The other non-optional parameters are
        self-explanatory.  If "test_data" is provided then the
        network will be evaluated against the test data after each
        epoch, and partial progress printed out.  This is useful for
        tracking progress, but slows things down substantially."""
        if test_data: n_test = len(test_data)
        n = len(training_data)
        for j in xrange(epochs):
            random.shuffle(training_data)
            mini_batches = [
                training_data[k:k+mini_batch_size]
                for k in xrange(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta)
            if test_data:
                print "Epoch {0}: {1} / {2}".format(
                    j, self.evaluate(test_data), n_test)
            else:
                print "Epoch {0} complete".format(j)
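The shuffle-and-slice step in the listing above can be tried in isolation. The sketch below is a standalone helper, not part of the Network class: the name make_mini_batches and the toy data are made up for illustration, and it uses range rather than xrange so that it also runs under Python 3.

```python
import random

def make_mini_batches(training_data, mini_batch_size):
    """Shuffle a copy of the data and slice it into consecutive mini-batches."""
    data = list(training_data)       # copy, so the caller's list is left untouched
    random.shuffle(data)
    return [data[k:k + mini_batch_size]
            for k in range(0, len(data), mini_batch_size)]

toy_data = [(x, x % 2) for x in range(25)]   # 25 made-up (input, label) pairs
mini_batches = make_mini_batches(toy_data, 10)
print([len(batch) for batch in mini_batches])
```

When the number of training examples isn't a multiple of the mini-batch size, the final slice is simply shorter. Every example still appears in exactly one mini-batch per epoch.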

 

The training_data is a list of tuples (x, y) representing the training inputs and corresponding desired outputs. The variables epochs and mini_batch_size are what you'd expect - the number of epochs to train for, and the size of the mini-batches to use when sampling. eta is the learning rate, $\eta$. If the optional argument test_data is supplied, then the program will evaluate the network after each epoch of training, and print out partial progress. This is useful for tracking progress, but slows things down substantially.

The code works as follows. In each epoch, it starts by randomly shuffling the training data, and then partitions it into mini-batches of the appropriate size. This is an easy way of sampling randomly from the training data. Then for each mini_batch we apply a single step of gradient descent. This is done by the code self.update_mini_batch(mini_batch, eta), which updates the network weights and biases according to a single iteration of gradient descent, using just the training data in mini_batch. Here's the code for the update_mini_batch method:

    def update_mini_batch(self, mini_batch, eta):
        """Update the network's weights and biases by applying
        gradient descent using backpropagation to a single mini batch.
        The "mini_batch" is a list of tuples "(x, y)", and "eta"
        is the learning rate."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw 
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb 
                       for b, nb in zip(self.biases, nabla_b)]

Most of the work is done by the line

            delta_nabla_b, delta_nabla_w = self.backprop(x, y)

This invokes something called the backpropagation algorithm, which is a fast way of computing the gradient of the cost function. So update_mini_batch works simply by computing these gradients for every training example in the mini_batch, and then updating self.weights and self.biases appropriately.
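To see the accumulate-then-average pattern without backpropagation, here is a sketch that swaps in a toy cost whose gradient can be written down directly: $C_x(w) = \frac{1}{2} \|w - x\|^2$, with gradient $w - x$. The function toy_gradient is a stand-in I've made up for self.backprop, and the single array w stands in for the lists of weights and biases.

```python
import numpy as np

def toy_gradient(w, x):
    """Gradient with respect to w of the toy cost C_x(w) = 0.5 * ||w - x||^2."""
    return w - x

def update_mini_batch(w, mini_batch, eta):
    """One gradient descent step using the gradient averaged over the mini-batch."""
    nabla_w = np.zeros(w.shape)
    for x in mini_batch:
        nabla_w = nabla_w + toy_gradient(w, x)   # accumulate per-example gradients
    return w - (eta / len(mini_batch)) * nabla_w

w = np.zeros((2, 1))
mini_batch = [np.array([[1.0], [2.0]]), np.array([[3.0], [6.0]])]
w_new = update_mini_batch(w, mini_batch, eta=1.0)
print(w_new.ravel())
```

For this particular cost, a step with eta = 1.0 lands exactly on the mean of the mini-batch, which makes it easy to check by hand that the gradients are being averaged rather than summed.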

 

I'm not going to show the code for self.backprop right now. We'll study how backpropagation works in the next chapter, including the code for self.backprop. For now, just assume that it behaves as claimed, returning the appropriate gradient for the cost associated to the training example x.

Let's look at the full program, including the documentation strings, which I omitted above. Apart from self.backprop the program is self-explanatory - all the heavy lifting is done in self.SGD and self.update_mini_batch, which we've already discussed. The self.backprop method makes use of a few extra functions to help in computing the gradient, namely sigmoid_prime, which computes the derivative of the $\sigma$ function, and self.cost_derivative, which I won't describe here. You can get the gist of these (and perhaps the details) just by looking at the code and documentation strings. We'll look at them in detail in the next chapter. Note that while the program appears lengthy, much of the code is documentation strings intended to make the code easy to understand. In fact, the program contains just 74 lines of non-whitespace, non-comment code. All the code may be found on GitHub here.

 

 

"""
network.py
~~~~~~~~~~

A module to implement the stochastic gradient descent learning
algorithm for a feedforward neural network.  Gradients are calculated
using backpropagation.  Note that I have focused on making the code
simple, easily readable, and easily modifiable.  It is not optimized,
and omits many desirable features.
"""

#### Libraries
# Standard library
import random

# Third-party libraries
import numpy as np

class Network(object):

    def __init__(self, sizes):
        """The list ``sizes`` contains the number of neurons in the
        respective layers of the network.  For example, if the list
        was [2, 3, 1] then it would be a three-layer network, with the
        first layer containing 2 neurons, the second layer 3 neurons,
        and the third layer 1 neuron.  The biases and weights for the
        network are initialized randomly, using a Gaussian
        distribution with mean 0, and variance 1.  Note that the first
        layer is assumed to be an input layer, and by convention we
        won't set any biases for those neurons, since biases are only
        ever used in computing the outputs from later layers."""
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]

    def feedforward(self, a):
        """Return the output of the network if ``a`` is input."""
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a)+b)
        return a

    def SGD(self, training_data, epochs, mini_batch_size, eta,
            test_data=None):
        """Train the neural network using mini-batch stochastic
        gradient descent.  The ``training_data`` is a list of tuples
        ``(x, y)`` representing the training inputs and the desired
        outputs.  The other non-optional parameters are
        self-explanatory.  If ``test_data`` is provided then the
        network will be evaluated against the test data after each
        epoch, and partial progress printed out.  This is useful for
        tracking progress, but slows things down substantially."""
        if test_data: n_test = len(test_data)
        n = len(training_data)
        for j in xrange(epochs):
            random.shuffle(training_data)
            mini_batches = [
                training_data[k:k+mini_batch_size]
                for k in xrange(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta)
            if test_data:
                print "Epoch {0}: {1} / {2}".format(
                    j, self.evaluate(test_data), n_test)
            else:
                print "Epoch {0} complete".format(j)

    def update_mini_batch(self, mini_batch, eta):
        """Update the network's weights and biases by applying
        gradient descent using backpropagation to a single mini batch.
        The ``mini_batch`` is a list of tuples ``(x, y)``, and ``eta``
        is the learning rate."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]

    def backprop(self, x, y):
        """Return a tuple ``(nabla_b, nabla_w)`` representing the
        gradient for the cost function C_x.  ``nabla_b`` and
        ``nabla_w`` are layer-by-layer lists of numpy arrays, similar
        to ``self.biases`` and ``self.weights``."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x] # list to store all the activations, layer by layer
        zs = [] # list to store all the z vectors, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation)+b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # backward pass
        delta = self.cost_derivative(activations[-1], y) * \
            sigmoid_prime(zs[-1])
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        # Note that the variable l in the loop below is used a little
        # differently to the notation in Chapter 2 of the book.  Here,
        # l = 1 means the last layer of neurons, l = 2 is the
        # second-last layer, and so on.  It's a renumbering of the
        # scheme in the book, used here to take advantage of the fact
        # that Python can use negative indices in lists.
        for l in xrange(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)

    def evaluate(self, test_data):
        """Return the number of test inputs for which the neural
        network outputs the correct result. Note that the neural
        network's output is assumed to be the index of whichever
        neuron in the final layer has the highest activation."""
        test_results = [(np.argmax(self.feedforward(x)), y)
                        for (x, y) in test_data]
        return sum(int(x == y) for (x, y) in test_results)

    def cost_derivative(self, output_activations, y):
        """Return the vector of partial derivatives \partial C_x /
        \partial a for the output activations."""
        return (output_activations-y)

#### Miscellaneous functions
def sigmoid(z):
    """The sigmoid function."""
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))
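The formula used in sigmoid_prime, $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, can be checked numerically against a finite-difference estimate. This is a sketch only: the sample points and step size h are arbitrary choices, and the two functions are copied from the listing above so that the block runs on its own.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))

z = np.array([-2.0, 0.0, 0.5, 3.0])
h = 1e-6
# Central finite difference as an independent estimate of the derivative.
numerical = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)

print(np.allclose(sigmoid_prime(z), numerical, atol=1e-8))
```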

 

How well does the program recognize handwritten digits? Well, let's start by loading in the MNIST data. I'll do this using a little helper program, mnist_loader.py, to be described below. We execute the following commands in a Python shell,

 

>>> import mnist_loader
>>> training_data, validation_data, test_data = mnist_loader.load_data_wrapper()

 

Of course, this could also be done in a separate Python program, but if you're following along it's probably easiest to do in a Python shell.

After loading the MNIST data, we'll set up a Network with 30 hidden neurons. We do this after importing the Python program listed above, which is named network,

 

>>> import network
>>> net = network.Network([784, 30, 10])

 

Finally, we'll use stochastic gradient descent to learn from the MNIST training_data over 30 epochs, with a mini-batch size of 10, and a learning rate of $\eta = 3.0$,

 

>>> net.SGD(training_data, 30, 10, 3.0, test_data=test_data)

 

Note that if you're running the code as you read along, it will take some time to execute - for a typical machine (as of 2015) it will likely take a few minutes to run. I suggest you set things running, continue to read, and periodically check the output from the code. If you're in a rush you can speed things up by decreasing the number of epochs, by decreasing the number of hidden neurons, or by using only part of the training data. Note that production code would be much, much faster: these Python scripts are intended to help you understand how neural nets work, not to be high-performance code! And, of course, once we've trained a network it can be run very quickly indeed, on almost any computing platform. For example, once we've learned a good set of weights and biases for a network, it can easily be ported to run in Javascript in a web browser, or as a native app on a mobile device. In any case, here is a partial transcript of the output of one training run of the neural network. The transcript shows the number of test images correctly recognized by the neural network after each epoch of training. As you can see, after just a single epoch this has reached 9,129 out of 10,000, and the number continues to grow,

 

Epoch 0: 9129 / 10000
Epoch 1: 9295 / 10000
Epoch 2: 9348 / 10000
...
Epoch 27: 9528 / 10000
Epoch 28: 9542 / 10000
Epoch 29: 9534 / 10000

 

That is, the trained network gives us a classification rate of about 95 percent - 95.42 percent at its peak ("Epoch 28")! That's quite encouraging as a first attempt. I should warn you, however, that if you run the code then your results are not necessarily going to be quite the same as mine, since we'll be initializing our network using (different) random weights and biases. To generate results in this chapter I've taken best-of-three runs.
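The counts in the transcript come from the evaluate method, which treats the index of the most active output neuron as the network's answer and compares it with the correct digit. Here is a sketch of that counting step on three made-up output vectors; the activations and labels are invented for illustration and are not real network output.

```python
import numpy as np

# Made-up output activations for three test images (each a 10x1 column).
outputs = [np.zeros((10, 1)) for _ in range(3)]
outputs[0][5] = 0.9   # the most active neuron is number 5
outputs[1][0] = 0.8   # the most active neuron is number 0
outputs[2][4] = 0.6   # the most active neuron is number 4
labels = [5, 0, 9]    # the third image is really a 9

# The prediction is the index of the largest activation, as in evaluate.
predictions = [int(np.argmax(out)) for out in outputs]
correct = sum(int(p == y) for p, y in zip(predictions, labels))
print("{0} / {1}".format(correct, len(labels)))
```

Dividing the count by the number of test images gives the classification rate, so 9542 / 10000 in the transcript corresponds to the 95.42 percent quoted above.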

Let's rerun the above experiment, changing the number of hidden neurons to 100. As was the case earlier, if you're running the code as you read along, you should be warned that it takes quite a while to execute (on my machine this experiment takes tens of seconds for each training epoch), so it's wise to continue reading in parallel while the code executes.

 

>>> net = network.Network([784, 100, 10])
>>> net.SGD(training_data, 30, 10, 3.0, test_data=test_data)

 

Sure enough, this improves the results to 96.59 percent. At least in this case, using more hidden neurons helps us get better results. (Reader feedback indicates quite some variation in results for this experiment, and some training runs give results quite a bit worse. Using the techniques introduced in chapter 3 will greatly reduce the variation in performance across different training runs for our networks.)

Of course, to obtain these accuracies I had to make specific choices for the number of epochs of training, the mini-batch size, and the learning rate, $\eta$. As I mentioned above, these are known as hyper-parameters for our neural network, in order to distinguish them from the parameters (weights and biases) learnt by our learning algorithm. If we choose our hyper-parameters poorly, we can get bad results. Suppose, for example, that we'd chosen the learning rate to be $\eta = 0.001$,

 

>>> net = network.Network([784, 100, 10])
>>> net.SGD(training_data, 30, 10, 0.001, test_data=test_data)

 

The results are much less encouraging,

Epoch 0: 1139 / 10000
Epoch 1: 1136 / 10000
Epoch 2: 1135 / 10000
...
Epoch 27: 2101 / 10000
Epoch 28: 2123 / 10000
Epoch 29: 2142 / 10000

However, you can see that the performance of the network is getting slowly better over time. That suggests increasing the learning rate, say to $\eta = 0.01$. If we do that, we get better results, which suggests increasing the learning rate again. (If making a change improves things, try doing more!) If we do that several times over, we'll end up with a learning rate of something like $\eta = 1.0$ (and perhaps fine tune to 3.0), which is close to our earlier experiments. So even though we initially made a poor choice of hyper-parameters, we at least got enough information to help us improve our choice of hyper-parameters.

 

In general, debugging a neural network can be challenging. This is especially true when the initial choice of hyper-parameters produces results no better than random noise. Suppose we try the successful 30 hidden neuron network architecture from earlier, but with the learning rate changed to $\eta = 100.0$:

>>> net = network.Network([784, 30, 10])
>>> net.SGD(training_data, 30, 10, 100.0, test_data=test_data)

At this point we've actually gone too far, and the learning rate is too high:

Epoch 0: 1009 / 10000
Epoch 1: 1009 / 10000
Epoch 2: 1009 / 10000
Epoch 3: 1009 / 10000
...
Epoch 27: 982 / 10000
Epoch 28: 982 / 10000
Epoch 29: 982 / 10000

Now imagine that we were coming to this problem for the first time. Of course, we know from our earlier experiments that the right thing to do is to decrease the learning rate. But if we were coming to this problem for the first time then there wouldn't be much in the output to guide us on what to do. We might worry not only about the learning rate, but about every other aspect of our neural network. We might wonder if we've initialized the weights and biases in a way that makes it hard for the network to learn? Or maybe we don't have enough training data to get meaningful learning? Perhaps we haven't run for enough epochs? Or maybe it's impossible for a neural network with this architecture to learn to recognize handwritten digits? Maybe the learning rate is too low? Or, maybe, the learning rate is too high? When you're coming to a problem for the first time, you're not always sure.

 

The lesson to take away from this is that debugging a neural network is not trivial, and, just as for ordinary programming, there is an art to it. You need to learn that art of debugging in order to get good results from neural networks. More generally, we need to develop heuristics for choosing good hyper-parameters and a good architecture. We'll discuss all these at length through the book, including how I chose the hyper-parameters above.
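The qualitative effect of a learning rate that is too low or too high can be seen on a problem far simpler than MNIST. The sketch below runs plain gradient descent on the one-variable toy cost $C(v) = v^2$ for 30 steps. This is an assumption-laden illustration, not a model of the network above: in particular, the large learning rate makes the toy problem blow up, whereas the network in the transcript simply gets stuck near 10 percent accuracy.

```python
def run_gradient_descent(eta, steps=30, v0=1.0):
    """Minimize the toy cost C(v) = v**2 by gradient descent; return the final v."""
    v = v0
    for _ in range(steps):
        grad = 2.0 * v        # dC/dv for C(v) = v**2
        v = v - eta * grad    # the usual update rule v -> v - eta * grad
    return v

# Distance from the minimum at v = 0 after 30 steps, for three learning rates.
for eta in [0.001, 0.1, 100.0]:
    print(eta, abs(run_gradient_descent(eta)))
```

With the smallest learning rate the distance from the minimum barely shrinks in 30 steps, with the middle one it shrinks quickly, and with the largest each step overshoots so badly that the distance grows.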

 

Exercise

 

 

  • Try creating a network with just two layers - an input and an output layer, no hidden layer - with 784 and 10 neurons, respectively. Train the network using stochastic gradient descent. What classification accuracy can you achieve?

 

 

Earlier, I skipped over the details of how the MNIST data is loaded. It's pretty straightforward. For completeness, here's the code. The data structures used to store the MNIST data are described in the documentation strings - it's straightforward stuff, tuples and lists of Numpy ndarray objects (think of them as vectors if you're not familiar with ndarrays):

 

"""
mnist_loader
~~~~~~~~~~~~

A library to load the MNIST image data.  For details of the data
structures that are returned, see the doc strings for ``load_data``
and ``load_data_wrapper``.  In practice, ``load_data_wrapper`` is the
function usually called by our neural network code.
"""

#### Libraries
# Standard library
import cPickle
import gzip

# Third-party libraries
import numpy as np

def load_data():
    """Return the MNIST data as a tuple containing the training data,
    the validation data, and the test data.

    The ``training_data`` is returned as a tuple with two entries.
    The first entry contains the actual training images.  This is a
    numpy ndarray with 50,000 entries.  Each entry is, in turn, a
    numpy ndarray with 784 values, representing the 28 * 28 = 784
    pixels in a single MNIST image.

    The second entry in the ``training_data`` tuple is a numpy ndarray
    containing 50,000 entries.  Those entries are just the digit
    values (0...9) for the corresponding images contained in the first
    entry of the tuple.

    The ``validation_data`` and ``test_data`` are similar, except
    each contains only 10,000 images.

    This is a nice data format, but for use in neural networks it's
    helpful to modify the format of the ``training_data`` a little.
    That's done in the wrapper function ``load_data_wrapper()``, see
    below.
    """
    f = gzip.open('../data/mnist.pkl.gz', 'rb')
    training_data, validation_data, test_data = cPickle.load(f)
    f.close()
    return (training_data, validation_data, test_data)

def load_data_wrapper():
    """Return a tuple containing ``(training_data, validation_data,
    test_data)``. Based on ``load_data``, but the format is more
    convenient for use in our implementation of neural networks.

    In particular, ``training_data`` is a list containing 50,000
    2-tuples ``(x, y)``.  ``x`` is a 784-dimensional numpy.ndarray
    containing the input image.  ``y`` is a 10-dimensional
    numpy.ndarray representing the unit vector corresponding to the
    correct digit for ``x``.

    ``validation_data`` and ``test_data`` are lists containing 10,000
    2-tuples ``(x, y)``.  In each case, ``x`` is a 784-dimensional
    numpy.ndarray containing the input image, and ``y`` is the
    corresponding classification, i.e., the digit values (integers)
    corresponding to ``x``.

    Obviously, this means we're using slightly different formats for
    the training data and the validation / test data.  These formats
    turn out to be the most convenient for use in our neural network
    code."""
    tr_d, va_d, te_d = load_data()
    training_inputs = [np.reshape(x, (784, 1)) for x in tr_d[0]]
    training_results = [vectorized_result(y) for y in tr_d[1]]
    training_data = zip(training_inputs, training_results)
    validation_inputs = [np.reshape(x, (784, 1)) for x in va_d[0]]
    validation_data = zip(validation_inputs, va_d[1])
    test_inputs = [np.reshape(x, (784, 1)) for x in te_d[0]]
    test_data = zip(test_inputs, te_d[1])
    return (training_data, validation_data, test_data)

def vectorized_result(j):
    """Return a 10-dimensional unit vector with a 1.0 in the jth
    position and zeroes elsewhere.  This is used to convert a digit
    (0...9) into a corresponding desired output from the neural
    network."""
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e
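As a quick illustration of the two label formats mentioned in the documentation strings, the sketch below converts a digit into the vector form used for training and then recovers the digit with np.argmax, which is how the integer labels in the validation and test data are compared against network output. The function is copied from the listing so the block runs on its own.

```python
import numpy as np

def vectorized_result(j):
    """Return a 10x1 column with 1.0 in position j and zeroes elsewhere."""
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e

y = vectorized_result(3)
print(y.ravel())            # the vector form used as a desired output in training
print(int(np.argmax(y)))    # back to the integer form used in the test data
```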

 

I said above that our program gets pretty good results. What does that mean? Good compared to what? It's informative to have some simple (non-neural-network) baseline tests to compare against, to understand what it means to perform well. The simplest baseline of all, of course, is to randomly guess the digit. That'll be right about ten percent of the time. We're doing much better than that!

What about a less trivial baseline? Let's try an extremely simple idea: we'll look at how dark an image is. For instance, an image of a 2 will typically be quite a bit darker than an image of a 1, just because more pixels are blackened out, as the following examples illustrate:

 

[Image: example images of a handwritten 2 and a handwritten 1]

 

This suggests using the training data to compute average darknesses for each digit, $0, 1, 2, \ldots, 9$. When presented with a new image, we compute how dark the image is, and then guess that it's whichever digit has the closest average darkness. This is a simple procedure, and is easy to code up, so I won't explicitly write out the code - if you're interested it's in the GitHub repository. But it's a big improvement over random guessing, getting 2,225 of the 10,000 test images correct, i.e., 22.25 percent accuracy.
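For readers who'd like to see the shape of the idea without opening the repository, here is a minimal sketch on tiny made-up data. It is not the repository's code. I'm assuming that "darkness" means the sum of an image's pixel values, with larger values being darker, and the helper names average_darknesses and guess_digit are my own.

```python
import numpy as np

def average_darknesses(images, labels):
    """Return a dict mapping each digit to the mean total darkness of its images."""
    totals, counts = {}, {}
    for image, digit in zip(images, labels):
        digit = int(digit)
        totals[digit] = totals.get(digit, 0.0) + float(np.sum(image))
        counts[digit] = counts.get(digit, 0) + 1
    return {d: totals[d] / counts[d] for d in totals}

def guess_digit(image, avgs):
    """Guess the digit whose average darkness is closest to this image's darkness."""
    darkness = float(np.sum(image))
    return min(avgs, key=lambda d: abs(avgs[d] - darkness))

# Tiny made-up "images" (4 pixels each), standing in for 784-pixel MNIST images.
train_images = [np.array([0.0, 0.1, 0.0, 0.1]),   # a light "1"
                np.array([0.0, 0.2, 0.0, 0.2]),   # another light "1"
                np.array([0.8, 0.9, 0.7, 0.8]),   # a dark "2"
                np.array([0.9, 0.9, 0.9, 0.9])]   # another dark "2"
train_labels = [1, 1, 2, 2]

avgs = average_darknesses(train_images, train_labels)
print(guess_digit(np.array([0.1, 0.1, 0.1, 0.0]), avgs))   # a light test image
print(guess_digit(np.array([0.7, 0.8, 0.9, 0.9]), avgs))   # a dark test image
```

On real MNIST data the same two functions would be given the 50,000 training images and their labels, and the guesses would then be compared with the test labels.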

It's not difficult to find other ideas which achieve accuracies in the 20 to 50 percent range. If you work a bit harder you can get up over 50 percent. But to get much higher accuracies it helps to use established machine learning algorithms. Let's try using one of the best known algorithms, the support vector machine or SVM. If you're not familiar with SVMs, not to worry, we're not going to need to understand the details of how SVMs work. Instead, we'll use a Python library called scikit-learn, which provides a simple Python interface to a fast C-based library for SVMs known as LIBSVM.

If we run scikit-learn's SVM classifier using the default settings, then it gets 9,435 of 10,000 test images correct. (The code is available here.) That's a big improvement over our naive approach of classifying an image based on how dark it is. Indeed, it means that the SVM is performing roughly as well as our neural networks, just a little worse. In later chapters we'll introduce new techniques that enable us to improve our neural networks so that they perform much better than the SVM.

That's not the end of the story, however. The 9,435 of 10,000 result is for scikit-learn's default settings for SVMs. SVMs have a number of tunable parameters, and it's possible to search for parameters which improve this out-of-the-box performance. I won't explicitly do this search, but instead refer you to this blog post by Andreas Mueller if you'd like to know more. Mueller shows that with some work optimizing the SVM's parameters it's possible to get the performance up above 98.5 percent accuracy. In other words, a well-tuned SVM only makes an error on about one digit in 70. That's pretty good! Can neural networks do better?

In fact, they can. At present, well-designed neural networks outperform every other technique for solving MNIST, including SVMs. The current (2013) record is classifying 9,979 of 10,000 images correctly. This was done by Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus. We'll see most of the techniques they used later in the book. At that level the performance is close to human-equivalent, and is arguably better, since quite a few of the MNIST images are difficult even for humans to recognize with confidence, for example:

 

[Image: MNIST digits that are difficult to classify]

 

I trust you'll agree that those are tough to classify! With images like these in the MNIST data set it's remarkable that neural networks can accurately classify all but 21 of the 10,000 test images. Usually, when programming we believe that solving a complicated problem like recognizing the MNIST digits requires a sophisticated algorithm. But even the neural networks in the Wan et al. paper just mentioned involve quite simple algorithms, variations on the algorithm we've seen in this chapter. All the complexity is learned, automatically, from the training data. In some sense, the moral of both our results and those in more sophisticated papers is that for some problems:

sophisticated algorithm ≤ simple learning algorithm + good training data.

 

 

Toward deep learning

 

While our neural network gives impressive performance, that performance is somewhat mysterious. The weights and biases in the network were discovered automatically. And that means we don't immediately have an explanation of how the network does what it does. Can we find some way to understand the principles by which our network is classifying handwritten digits? And, given such principles, can we do better?

To put these questions more starkly, suppose that a few decades hence neural networks lead to artificial intelligence (AI). Will we understand how such intelligent networks work? Perhaps the networks will be opaque to us, with weights and biases we don't understand, because they've been learned automatically. In the early days of AI research people hoped that the effort to build an AI would also help us understand the principles behind intelligence and, maybe, the functioning of the human brain. But perhaps the outcome will be that we end up understanding neither the brain nor how artificial intelligence works!

To address these questions, let's think back to the interpretation of artificial neurons that I gave at the start of the chapter, as a means of weighing evidence. Suppose we want to determine whether an image shows a human face or not:

 

[Images: three example pictures]

Credits: 1. Ester Inbar. 2. Unknown. 3. NASA, ESA, G. Illingworth, D. Magee, and P. Oesch (University of California, Santa Cruz), R. Bouwens (Leiden University), and the HUDF09 Team.

We could attack this problem the same way we attacked handwriting recognition - by using the pixels in the image as input to a neural network, with the output from the network a single neuron indicating either "Yes, it's a face" or "No, it's not a face".

Let's suppose we do this, but that we're not using a learning algorithm. Instead, we're going to try to design a network by hand, choosing appropriate weights and biases. How might we go about it? Forgetting neural networks entirely for the moment, a heuristic we could use is to decompose the problem into sub-problems: does the image have an eye in the top left? Does it have an eye in the top right? Does it have a nose in the middle? Does it have a mouth in the bottom middle? Is there hair on top? And so on.

If the answers to several of these questions are "yes", or even just "probably yes", then we'd conclude that the image is likely to be a face. Conversely, if the answers to most of the questions are "no", then the image probably isn't a face.
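This way of combining answers is just what a single artificial neuron does when it weighs evidence. The sketch below shows one sigmoid neuron whose inputs are the answers to the sub-problem questions, each expressed as a confidence between 0 ("no") and 1 ("yes"). The particular weights, the bias, and the names `FACE_WEIGHTS`, `FACE_BIAS` and `face_score` are all hypothetical, chosen by hand purely for illustration.

```python
import math


def sigmoid(z):
    """The sigmoid function, mapping any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical hand-chosen weights: how much each piece of evidence counts.
FACE_WEIGHTS = {
    "eye_top_left": 2.0,
    "eye_top_right": 2.0,
    "nose_middle": 1.5,
    "mouth_bottom_middle": 1.5,
    "hair_on_top": 0.5,  # weak evidence, since the person may be bald
}
# Hypothetical bias: negative, so that several "yes" answers are needed.
FACE_BIAS = -3.5


def face_score(answers):
    """Weigh the answers (confidences in [0, 1]) and return a value in (0, 1).

    Questions missing from `answers` are treated as a definite "no".
    """
    z = FACE_BIAS + sum(
        weight * answers.get(question, 0.0)
        for question, weight in FACE_WEIGHTS.items())
    return sigmoid(z)


if __name__ == "__main__":
    probably_yes = {question: 0.8 for question in FACE_WEIGHTS}
    print("Mostly 'probably yes':", face_score(probably_yes))
    print("Only hair on top:", face_score({"hair_on_top": 1.0}))
```

With these made-up numbers, an output above 0.5 can be read as "likely a face". Because hair carries only a small weight, a bald person with eyes, nose and mouth all present still scores well above 0.5.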

Of course, this is just a rough heuristic, and it suffers from many deficiencies. Maybe the person is bald, so they have no hair. Maybe we can only see part of the face, or the face is at an angle, so some of the facial features are obscured. Still, the heuristic suggests that if we can solve the sub-problems using neural networks, then perhaps we can build a neural network for face-detection, by combining the networks for the sub-problems. Here's a possible architecture, with rectangles denoting the sub-networks. Note that this isn't intended as a realistic approach to solving the face-detection problem; rather, it's to help us build intuition about how networks function:

 

[Image: a possible face-detection architecture, with rectangles denoting the sub-networks]

 

It's also plausible that the sub-networks can be decomposed. Suppose we're considering the question: "Is there an eye in the top left?" This can be decomposed into questions such as: "Is there an eyebrow?"; "Are there eyelashes?"; "Is there an iris?"; and so on. Of course, these questions should really include positional information, as well - "Is the eyebrow in the top left, and above the iris?", that kind of thing - but let's keep it simple. The network to answer the question "Is there an eye in the top left?" can now be decomposed:

 

[Image: the sub-network for "Is there an eye in the top left?" decomposed into further sub-networks]

 

Those questions too can be broken down, further and further through multiple layers. Ultimately, we'll be working with sub-networks that answer questions so simple they can easily be answered at the level of single pixels. Those questions might, for example, be about the presence or absence of very simple shapes at particular points in the image. Such questions can be answered by single neurons connected to the raw pixels in the image.

The end result is a network which breaks down a very complicated question - does this image show a face or not - into very simple questions answerable at the level of single pixels. It does this through a series of many layers, with early layers answering very simple and specific questions about the input image, and later layers building up a hierarchy of ever more complex and abstract concepts. Networks with this kind of many-layer structure - two or more hidden layers - are called deep neural networks.
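The many-layer structure can be sketched in a few lines of Python. The code below is an illustration only: the weights are random rather than learned or hand-designed, so the output is meaningless as a face detector, and the helper names (`make_network`, `feedforward`, `num_hidden_layers`, `is_deep`) are invented for this example. What it shows is the shape of the computation: each layer takes the previous layer's activations as input, and a network counts as deep when it has two or more hidden layers.

```python
import math
import random


def sigmoid(z):
    """Sigmoid function, written so that large |z| cannot overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_network(sizes, seed=0):
    """Random weights and biases for layer sizes such as [16, 8, 4, 1].

    sizes[0] is the number of inputs (pixels), sizes[-1] the number of outputs;
    everything in between is a hidden layer.
    """
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]
        biases = [rng.gauss(0, 1) for _ in range(n_out)]
        layers.append((weights, biases))
    return layers


def feedforward(layers, activations):
    """Pass the input through every layer in turn and return the final output."""
    for weights, biases in layers:
        # Each neuron weighs the answers produced by the previous layer.
        activations = [
            sigmoid(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations


def num_hidden_layers(layers):
    """Every layer of weights except the last one feeds a hidden layer."""
    return len(layers) - 1


def is_deep(layers):
    """Deep, in the sense used above: two or more hidden layers."""
    return num_hidden_layers(layers) >= 2


if __name__ == "__main__":
    net = make_network([16, 8, 4, 1])  # 16 "pixels", two hidden layers, 1 output
    print("Hidden layers:", num_hidden_layers(net), "deep:", is_deep(net))
    print("Output:", feedforward(net, [0.5] * 16))
```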

 

 

Of course, I haven't said how to do this recursive decomposition into sub-networks. It certainly isn't practical to hand-design the weights and biases in the network. Instead, we'd like to use learning algorithms so that the network can automatically learn the weights and biases - and thus, the hierarchy of concepts - from training data. Researchers in the 1980s and 1990s tried using stochastic gradient descent and backpropagation to train deep networks. Unfortunately, except for a few special architectures, they didn't have much luck. The networks would learn, but very slowly, and in practice often too slowly to be useful.

Since 2006, a set of new techniques has been developed that enable learning in deep neural networks. These deep learning techniques are based on stochastic gradient descent and backpropagation, but also introduce new ideas. These techniques have enabled much deeper (and larger) networks to be trained - people now routinely train networks with 5 to 10 hidden layers. And, it turns out that these perform far better on many problems than shallow neural networks, i.e., networks with just a single hidden layer. The reason, of course, is the ability of deep nets to build up a complex hierarchy of concepts. It's a bit like the way conventional programming languages use modular design and ideas about abstraction to enable the creation of complex computer programs. Comparing a deep network to a shallow network is a bit like comparing a programming language with the ability to make function calls to a stripped down language with no ability to make such calls. Abstraction takes a different form in neural networks than it does in conventional programming, but it's just as important.


Original article: http://www.cnblogs.com/pathrough/p/5322736.html
