标签:
In the last chapter we saw how neural networks can learn their weights and biases using the gradient descent algorithm. There was, however, a gap in our explanation: we didn‘t discuss how to compute the gradient of the cost function. That‘s quite a gap! In this chapter I‘ll explain a fast algorithm for computing such gradients, an algorithm known as backpropagation.
The backpropagation algorithm was originally introduced in the 1970s, but its importance wasn‘t fully appreciated until a famous 1986 paper by David Rumelhart, Geoffrey Hinton, and Ronald Williams. That paper describes several neural networks where backpropagation works far faster than earlier approaches to learning, making it possible to use neural nets to solve problems which had previously been insoluble. Today, the backpropagation algorithm is the workhorse of learning in neural networks.
This chapter is more mathematically involved than the rest of the book. If you‘re not crazy about mathematics you may be tempted to skip the chapter, and to treat backpropagation as a black box whose details you‘re willing to ignore. Why take the time to study those details?
The reason, of course, is understanding. At the heart of backpropagation is an expression for the partial derivative ∂C/∂w∂C/∂wof the cost function CC with respect to any weight ww (or bias bb) in the network. The expression tells us how quickly the cost changes when we change the weights and biases. And while the expression is somewhat complex, it also has a beauty to it, with each element having a natural, intuitive interpretation. And so backpropagation isn‘t just a fast algorithm for learning. It actually gives us detailed insights into how changing the weights and biases changes the overall behaviour of the network. That‘s well worth studying in detail.
With that said, if you want to skim the chapter, or jump straight to the next chapter, that‘s fine. I‘ve written the rest of the book to be accessible even if you treat backpropagation as a black box. There are, of course, points later in the book where I refer back to results from this chapter. But at those points you should still be able to understand the main conclusions, even if you don‘t follow all the reasoning.
Before discussing backpropagation, let‘s warm up with a fast matrix-based algorithm to compute the output from a neural network. We actually already briefly saw this algorithm near the end of the last chapter, but I described it quickly, so it‘s worth revisiting in detail. In particular, this is a good way of getting comfortable with the notation used in backpropagation, in a familiar context.
Let‘s begin with a notation which lets us refer to weights in the network in an unambiguous way. We‘ll use wljkwjkl to denote the weight for the connection from the kthkth neuron in the (l−1)th(l−1)th layer to the jthjth neuron in the lthlth layer. So, for example, the diagram below shows the weight on a connection from the fourth neuron in the second layer to the second neuron in the third layer of a network:
This notation is cumbersome at first, and it does take some work to master. But with a little effort you‘ll find the notation becomes easy and natural. One quirk of the notation is the ordering of the jj and kkindices. You might think that it makes more sense to use jj to refer to the input neuron, and kk to the output neuron, not vice versa, as is actually done. I‘ll explain the reason for this quirk below.
We use a similar notation for the network‘s biases and activations. Explicitly, we use bljbjl for the bias of the jthjth neuron in the lthlth layer. And we use aljajl for the activation of the jthjth neuron in the lthlth layer. The following diagram shows examples of these notations in use:
With these notations, the activation aljajl of the jthjth neuron in the lthlthlayer is related to the activations in the (l−1)th(l−1)th layer by the equation (compare Equation (4) and surrounding discussion in the last chapter)
where the sum is over all neurons kk in the (l−1)th(l−1)th layer. To rewrite this expression in a matrix form we define a weight matrix wlwl for each layer, ll. The entries of the weight matrix wlwl are just the weights connecting to the lthlth layer of neurons, that is, the entry in the jthjth row and kthkth column is wljkwjkl. Similarly, for each layer ll we define a bias vector, blbl. You can probably guess how this works - the components of the bias vector are just the values bljbjl, one component for each neuron in the lthlth layer. And finally, we define an activation vector alal whose components are the activations aljajl.
The last ingredient we need to rewrite (23) in a matrix form is the idea of vectorizing a function such as σσ. We met vectorization briefly in the last chapter, but to recap, the idea is that we want to apply a function such as σσ to every element in a vector vv. We use the obvious notation σ(v)σ(v) to denote this kind of elementwise application of a function. That is, the components of σ(v)σ(v) are just σ(v)j=σ(vj)σ(v)j=σ(vj). As an example, if we have the function f(x)=x2f(x)=x2then the vectorized form of ff has the effect
that is, the vectorized ff just squares every element of the vector.
With these notations in mind, Equation (23) can be rewritten in the beautiful and compact vectorized form
This expression gives us a much more global way of thinking about how the activations in one layer relate to activations in the previous layer: we just apply the weight matrix to the activations, then add the bias vector, and finally apply the σσ function**By the way, it‘s this expression that motivates the quirk in the wljkwjkl notation mentioned earlier. If we used jj to index the input neuron, and kk to index the output neuron, then we‘d need to replace the weight matrix in Equation (25) by the transpose of the weight matrix. That‘s a small change, but annoying, and we‘d lose the easy simplicity of saying (and thinking) "apply the weight matrix to the activations".. That global view is often easier and more succinct (and involves fewer indices!) than the neuron-by-neuron view we‘ve taken to now. Think of it as a way of escaping index hell, while remaining precise about what‘s going on. The expression is also useful in practice, because most matrix libraries provide fast ways of implementing matrix multiplication, vector addition, and vectorization. Indeed, the code in the last chapter made implicit use of this expression to compute the behaviour of the network.
When using Equation (25) to compute alal, we compute the intermediate quantity zl≡wlal−1+blzl≡wlal−1+bl along the way. This quantity turns out to be useful enough to be worth naming: we call zlzl theweighted input to the neurons in layer ll. We‘ll make considerable use of the weighted input zlzl later in the chapter. Equation (25) is sometimes written in terms of the weighted input, as al=σ(zl)al=σ(zl). It‘s also worth noting that zlzl has components zlj=∑kwljkal−1k+bljzjl=∑kwjklakl−1+bjl, that is, zljzjl is just the weighted input to the activation function for neuron jj in layer ll.
The goal of backpropagation is to compute the partial derivatives ∂C/∂w∂C/∂w and ∂C/∂b∂C/∂b of the cost function CC with respect to any weightww or bias bb in the network. For backpropagation to work we need to make two main assumptions about the form of the cost function. Before stating those assumptions, though, it‘s useful to have an example cost function in mind. We‘ll use the quadratic cost function from last chapter (c.f. Equation (6)). In the notation of the last section, the quadratic cost has the form
where: nn is the total number of training examples; the sum is over individual training examples, xx; y=y(x)y=y(x) is the corresponding desired output; LL denotes the number of layers in the network; and aL=aL(x)aL=aL(x) is the vector of activations output from the network when xx is input.
Okay, so what assumptions do we need to make about our cost function, CC, in order that backpropagation can be applied? The first assumption we need is that the cost function can be written as an average C=1n∑xCxC=1n∑xCx over cost functions CxCx for individual training examples, xx. This is the case for the quadratic cost function, where the cost for a single training example is Cx=12∥y−aL∥2Cx=12∥y−aL∥2. This assumption will also hold true for all the other cost functions we‘ll meet in this book.
The reason we need this assumption is because what backpropagation actually lets us do is compute the partial derivatives ∂Cx/∂w∂Cx/∂w and ∂Cx/∂b∂Cx/∂b for a single training example. We then recover ∂C/∂w∂C/∂w and ∂C/∂b∂C/∂b by averaging over training examples. In fact, with this assumption in mind, we‘ll suppose the training example xx has been fixed, and drop the xx subscript, writing the cost CxCx as CC. We‘ll eventually put the xx back in, but for now it‘s a notational nuisance that is better left implicit.
The second assumption we make about the cost is that it can be written as a function of the outputs from the neural network:
For example, the quadratic cost function satisfies this requirement, since the quadratic cost for a single training example xx may be written as
and thus is a function of the output activations. Of course, this cost function also depends on the desired output yy, and you may wonder why we‘re not regarding the cost also as a function of yy. Remember, though, that the input training example xx is fixed, and so the outputyy is also a fixed parameter. In particular, it‘s not something we can modify by changing the weights and biases in any way, i.e., it‘s not something which the neural network learns. And so it makes sense to regard CC as a function of the output activations aLaL alone, with yymerely a parameter that helps define that function.
The backpropagation algorithm is based on common linear algebraic operations - things like vector addition, multiplying a vector by a matrix, and so on. But one of the operations is a little less commonly used. In particular, suppose ss and tt are two vectors of the same dimension. Then we use s⊙ts⊙t to denote theelementwise product of the two vectors. Thus the components of s⊙ts⊙t are just (s⊙t)j=sjtj(s⊙t)j=sjtj. As an example,
This kind of elementwise multiplication is sometimes called theHadamard product or Schur product. We‘ll refer to it as the Hadamard product. Good matrix libraries usually provide fast implementations of the Hadamard product, and that comes in handy when implementing backpropagation.
Backpropagation is about understanding how changing the weights and biases in a network changes the cost function. Ultimately, this means computing the partial derivatives ∂C/∂wljk∂C/∂wjkl and ∂C/∂blj∂C/∂bjl. But to compute those, we first introduce an intermediate quantity, δljδjl, which we call the error in the jthjth neuron in the lthlth layer. Backpropagation will give us a procedure to compute the error δljδjl, and then will relate δljδjl to ∂C/∂wljk∂C/∂wjkl and ∂C/∂blj∂C/∂bjl.
To understand how the error is defined, imagine there is a demon in our neural network:
The demon sits at the jthjth neuron in layer ll. As the input to the neuron comes in, the demon messes with the neuron‘s operation. It adds a little change ΔzljΔzjl to the neuron‘s weighted input, so that instead of outputting σ(zlj)σ(zjl), the neuron instead outputs σ(zlj+Δzlj)σ(zjl+Δzjl). This change propagates through later layers in the network, finally causing the overall cost to change by an amount ∂C∂zljΔzlj∂C∂zjlΔzjl.
Now, this demon is a good demon, and is trying to help you improve the cost, i.e., they‘re trying to find a ΔzljΔzjl which makes the cost smaller. Suppose ∂C∂zlj∂C∂zjl has a large value (either positive or negative). Then the demon can lower the cost quite a bit by choosing ΔzljΔzjl to have the opposite sign to ∂C∂zlj∂C∂zjl. By contrast, if ∂C∂zlj∂C∂zjl is close to zero, then the demon can‘t improve the cost much at all by perturbing the weighted input zljzjl. So far as the demon can tell, the neuron is already pretty near optimal**This is only the case for small changes ΔzljΔzjl, of course. We‘ll assume that the demon is constrained to make such small changes.. And so there‘s a heuristic sense in which ∂C∂zlj∂C∂zjl is a measure of the error in the neuron.
Motivated by this story, we define the error δljδjl of neuron jj in layer llby
As per our usual conventions, we use δlδl to denote the vector of errors associated with layer ll. Backpropagation will give us a way of computing δlδl for every layer, and then relating those errors to the quantities of real interest, ∂C/∂wljk∂C/∂wjkl and ∂C/∂blj∂C/∂bjl.
You might wonder why the demon is changing the weighted input zljzjl. Surely it‘d be more natural to imagine the demon changing the output activation aljajl, with the result that we‘d be using ∂C∂alj∂C∂ajl as our measure of error. In fact, if you do this things work out quite similarly to the discussion below. But it turns out to make the presentation of backpropagation a little more algebraically complicated. So we‘ll stick with δlj=∂C∂zljδjl=∂C∂zjl as our measure of error**In classification problems like MNIST the term "error" is sometimes used to mean the classification failure rate. E.g., if the neural net correctly classifies 96.0 percent of the digits, then the error is 4.0 percent. Obviously, this has quite a different meaning from our δδ vectors. In practice, you shouldn‘t have trouble telling which meaning is intended in any given usage..
Plan of attack: Backpropagation is based around four fundamental equations. Together, those equations give us a way of computing both the error δlδl and the gradient of the cost function. I state the four equations below. Be warned, though: you shouldn‘t expect to instantaneously assimilate the equations. Such an expectation will lead to disappointment. In fact, the backpropagation equations are so rich that understanding them well requires considerable time and patience as you gradually delve deeper into the equations. The good news is that such patience is repaid many times over. And so the discussion in this section is merely a beginning, helping you on the way to a thorough understanding of the equations.
Here‘s a preview of the ways we‘ll delve more deeply into the equations later in the chapter: I‘ll give a short proof of the equations, which helps explain why they are true; we‘ll restate the equations in algorithmic form as pseudocode, and see how the pseudocode can be implemented as real, running Python code; and, in the final section of the chapter, we‘ll develop an intuitive picture of what the backpropagation equations mean, and how someone might discover them from scratch. Along the way we‘ll return repeatedly to the four fundamental equations, and as you deepen your understanding those equations will come to seem comfortable and, perhaps, even beautiful and natural.
An equation for the error in the output layer, δLδL: The components of δLδL are given by
This is a very natural expression. The first term on the right, ∂C/∂aLj∂C/∂ajL, just measures how fast the cost is changing as a function of the jthjth output activation. If, for example, CC doesn‘t depend much on a particular output neuron, jj, then δLjδjL will be small, which is what we‘d expect. The second term on the right, σ′(zLj)σ′(zjL), measures how fast the activation function σσ is changing at zLjzjL.
Notice that everything in (BP1) is easily computed. In particular, we compute zLjzjL while computing the behaviour of the network, and it‘s only a small additional overhead to compute σ′(zLj)σ′(zjL). The exact form of ∂C/∂aLj∂C/∂ajL will, of course, depend on the form of the cost function. However, provided the cost function is known there should be little trouble computing ∂C/∂aLj∂C/∂ajL. For example, if we‘re using the quadratic cost function then C=12∑j(yj−aj)2C=12∑j(yj−aj)2, and so ∂C/∂aLj=(aj−yj)∂C/∂ajL=(aj−yj), which obviously is easily computable.
Equation (BP1) is a componentwise expression for δLδL. It‘s a perfectly good expression, but not the matrix-based form we want for backpropagation. However, it‘s easy to rewrite the equation in a matrix-based form, as
Here, ∇aC∇aC is defined to be a vector whose components are the partial derivatives ∂C/∂aLj∂C/∂ajL. You can think of ∇aC∇aC as expressing the rate of change of CC with respect to the output activations. It‘s easy to see that Equations (BP1a) and (BP1) are equivalent, and for that reason from now on we‘ll use (BP1) interchangeably to refer to both equations. As an example, in the case of the quadratic cost we have ∇aC=(aL−y)∇aC=(aL−y), and so the fully matrix-based form of (BP1)becomes
As you can see, everything in this expression has a nice vector form, and is easily computed using a library such as Numpy.
An equation for the error δlδl in terms of the error in the next layer, δl+1δl+1: In particular
where (wl+1)T(wl+1)T is the transpose of the weight matrix wl+1wl+1 for the (l+1)th(l+1)th layer. This equation appears complicated, but each element has a nice interpretation. Suppose we know the error δl+1δl+1at the l+1thl+1th layer. When we apply the transpose weight matrix, (wl+1)T(wl+1)T, we can think intuitively of this as moving the errorbackward through the network, giving us some sort of measure of the error at the output of the lthlth layer. We then take the Hadamard product ⊙σ′(zl)⊙σ′(zl). This moves the error backward through the activation function in layer ll, giving us the error δlδl in the weighted input to layer ll.
By combining (BP2) with (BP1) we can compute the error δlδl for any layer in the network. We start by using (BP1) to compute δLδL, then apply Equation (BP2) to compute δL−1δL−1, then Equation (BP2) again to compute δL−2δL−2, and so on, all the way back through the network.
An equation for the rate of change of the cost with respect to any bias in the network: In particular:
That is, the error δl