link here

Today’s lecture will be entirely devoted to the backpropagation algorithm: the heart of all deep learning.

link here

This is where we left things last lecture. We introduced gradient descent as the training algorithm for neural networks. The gradient collects all derivatives of the loss with respect to every weight in the neural network.

For simple models, like a one-layer linear or logistic regression, we can work out the gradient in closed form. That means we get a function that tells us what the gradient is at every point in our parameter space.

For complex, multilayer functions, working out the gradient is not so easy. We need an algorithm to do it for us. That algorithm is called backpropagation, and it’s what powers deep learning.

click image for animation
link here

Because backpropagation is so important, we are going to spend an entire lecture on it, looking at it from different perspectives and slowly building up to the completely automated, highly parallelized version of the algorithm that we use today.

In the first part we describe backpropagation in a scalar setting. That is, we will treat each individual element of the neural network as a single number, and simply loop over all these numbers to do backpropagation over the whole network. This simplifies the derivation, but it is ultimately a slow algorithm with a complicated notation.

In the second part, we translate neural networks to operations on vectors, matrices and tensors. This allows us to simplify our notation, and more importantly, massively speed up the computation of neural networks. Backpropagation on tensors is a little more difficult to do than backpropagation on scalars, but it's well worth the effort.

In the third part, we will make the final leap from a manually worked out and implemented backpropagation system to full-fledged automatic differentiation: we will show you how to build a system that takes care of the gradient computation entirely by itself. This is the technology behind software like pytorch and tensorflow.

link here

In the previous video, we looked at what neural networks are, and we saw that to train them, we need to work out the derivatives of the loss with respect to the parameters of the neural network: collectively these derivatives are known as the gradient.

link here
link here


link here

Here is a diagram of the sort of network we’ll be encountering (this one is called the GoogLeNet). We can’t work out a complete gradient for this kind of architecture by hand. We need help. What we want is some sort of algorithm that lets the computer work out the gradient for us.

link here

Of course, working out derivatives is a pretty mechanical process. We could easily take all the rules we know, and put them into some algorithm. This is called symbolic differentiation, and it’s what systems like Mathematica and Wolfram Alpha do for us.

Unfortunately, as you can see here, the derivatives it returns get pretty horrendous the deeper the neural network gets. This approach becomes impractical very quickly.

Note that in symbolic differentiation we get a description of the derivative that is independent of the input. We get a function that we can then feed any input to.

link here

Another approach is to compute the gradient numerically. For instance by the method of finite differences: we take a small step ε and see how much the function changes. The amount of change divided by the step size is a good estimate for the gradient if ε is small enough.

Numeric approaches are sometimes used in deep learning, but it’s very expensive to make them accurate enough if you have a large number of parameters.

Note that in the numeric approach, you only get an answer for a particular input. If you want to compute the gradient at some other point in space, you have to compute another numeric approximation. Compare this to the symbolic approach (either with pen and paper or through Wolfram Alpha) where, once the differentiation is done, all we have to compute is the derivative that we've worked out.
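To make the finite-difference idea concrete, here is a minimal sketch (the function and step size here are placeholder examples, not anything from the slides):

```python
import numpy as np

def finite_difference_gradient(f, x, eps=1e-6):
    """Estimate the gradient of a scalar function f at the point x by
    perturbing each coordinate by a small step eps (forward differences)."""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x)) / eps
    return grad

# example: f(x) = x1^2 + 3*x2, whose true gradient at (1, 2) is (2, 3)
print(finite_difference_gradient(lambda x: x[0] ** 2 + 3 * x[1], np.array([1.0, 2.0])))
```

Note that the loop calls f one extra time per parameter: this is exactly why the approach becomes expensive for models with millions of weights, and why the estimate only holds for the particular input we chose.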

link here

Backpropagation is a kind of middle ground between symbolic and numeric approaches to working out the gradient. We break the computation into parts: we work out the derivatives of the parts symbolically, and then chain these together numerically.

The secret ingredient that allows us to make this work is the chain rule of differentiation.

link here

Here is the chain rule: if we want the derivative of a function which is the composition of two other functions, in this case f and g, we can take the derivative of f with respect to the output of g and multiply it by the derivative of g with respect to the input x.

Since we’ll be using the chain rule a lot, we’ll introduce a simple shorthand to make it a little easier to parse. We draw a little diagram of which function feeds into which. This means we know what the argument of each function is, so we can remove the arguments from our notation.

We call this diagram a computation graph. We'll stick with simple diagrams like this for now. At the end of the lecture, we will expand our notation a little bit to capture more detail of the computation.

click image for animation
link here

Since the chain rule is the heart of backpropagation, and backpropagation is the heart of deep learning, we should probably take some time to see why the chain rule is true at all.

If we imagine that f and g are linear functions, it’s pretty straightforward to show that this is true. They may not be, of course, but the nice thing about calculus is that locally, we can treat them as linear functions (if they are differentiable). In an infinitesimally small neighbourhood f and g are exactly linear.

If f and g are locally linear, we can describe their behavior with a slope s and an additive constant b. The slopes, sf and sg, are simply the derivatives. The additive constants we will show can be ignored.

In this linear view, what the chain rule says is this: if we approximate f(x) as a linear function, then its slope is the slope of f as a function of g, times the slope of g as a function of x. To prove that this is true, we just write down f(g(x)) as linear functions, and multiply out the brackets.

Note that this doesn’t quite count as a rigorous proof, but it’s hopefully enough to give you some intuition for why the chain rule holds.

click image for animation
link here

Since we’ll be looking at some pretty elaborate computation graphs, we’ll need to be able to deal with this situation as well: we have a computation graph, as before, but f depends on x through two different operations. How do we take the derivative of f over x?

The multivariate chain rule tells us that we can simply apply the chain rule along g, taking h as a constant, and sum it with the chain rule along h taking g as a constant.

click image for animation
link here

We can see why this holds in the same way as before. The short story: since all functions can be taken to be locally linear, their slopes distribute out into a sum.

click image for animation
link here

If we have more than two paths from the input to the output, we simply sum over all of them.

link here

With that, we are ready to show how backpropagation works. We'll start with a fairly arbitrary function to show the principle before we move on to more realistic neural networks.

link here

The first thing we do is to break up its functional form into a series of smaller operations. The entire function f is then just these small operations chained together. We can draw this in a computation graph as we did before.

Normally, we wouldn’t break a function up in such small operations. This is just a simple example to illustrate the principle.

click image for animation
link here

Now, to work out the derivative of f, we can iterate the chain rule. We apply it again and again, until the derivative of f over x is expressed as a long product of derivatives of operation outputs over their inputs.

click image for animation
link here

We call the larger derivative of f over x the global derivative. And we call the individual factors, the derivatives of the operation outputs with respect to their inputs, the local derivatives.

link here

This is how the backpropagation algorithm combines symbolic and numeric computation. We work out the local derivatives symbolically, and then work out the global derivative numerically.

link here

For each local derivative, we work out the symbolic derivative with pen and paper.

Note that we could fill in the a, b and c in the result, but we don’t. We simply leave them as is. For the symbolic part, we are only interested in the derivative of the output of each sub-operation with respect to its immediate input.

The rest of the algorithm is performed numerically.

click image for animation
link here

Since we are now computing things numerically, we need a specific input, in this case x = -4.499. We start by feeding this through the computation graph. For each sub-operation, we store the output value. At the end, we get the output of the function f. This is called a forward pass: a fancy term for computing the output of f for a given input.

Note that at this point, we are no longer computing solutions in general. We are computing our function for a specific input. We will be computing the gradient for this specific input as well.

click image for animation
link here

Keeping all intermediate values from the forward pass in memory, we go back to our symbolic expression of the derivative. Here, we fill in the intermediate values a, b and c. After we do this, we can finish the multiplication numerically, giving us a numeric value of the gradient of f at x = -4.499. In this case, the gradient happens to be 0.
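The function on the slide isn't reproduced here, but the procedure is the same for any chain of operations. As a sketch, take the stand-in function f(x) = exp(sin(x²)), broken into three sub-operations a, b and c:

```python
import math

def forward(x):
    # forward pass: compute and store every intermediate value
    a = x * x          # a = x^2
    b = math.sin(a)    # b = sin(a)
    c = math.exp(b)    # c = exp(b), the output f(x)
    return a, b, c

def backward(x, a, b, c):
    # local derivatives, worked out symbolically with pen and paper
    dc_db = math.exp(b)   # derivative of exp(b) over b
    db_da = math.cos(a)   # derivative of sin(a) over a
    da_dx = 2 * x         # derivative of x^2 over x
    # global derivative: chain the local derivatives together numerically
    return dc_db * db_da * da_dx

x = 1.0
a, b, c = forward(x)
print("f(x) =", c, " df/dx =", backward(x, a, b, c))
```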

click image for animation
link here

Before we try this on a neural network, here are the main things to remember about the backpropagation algorithm.

Note that backpropagation by itself does not train a neural net. It just provides a gradient. When people say that they trained a network by backpropagation, that's actually shorthand for training the network by gradient descent, with the gradients worked out by backpropagation.

link here

To explain how backpropagation works in a neural network, we extend our neural network diagram a little bit, to make it closer to the actual computation graph we’ll be using.

First, we separate the hidden node into the result of the linear operation ki and the application of the nonlinearity hi. Second, since we’re interested in the derivative of the loss rather than the output of the network, we extend the network with one more step: the computation of the loss (over one example to keep things simple). In this final step, the output y of the network is compared to the target value t from the data, producing a loss value.

The loss is the function for which we want to work out the gradient, so the computation graph is the one that computes first the model output, and then the loss based on this output (and the target).

Note that the model is now just a subgraph of the computation graph. You can think of t as another input node, like x1 and x2, (but one to which the model doesn’t have access).

click image for animation
link here

We want to work out the gradient of the loss. This is simply the collection of the derivative of the loss over each parameter.

We’ll pick two parameters, v2 in the second layer, and w12 in the first, and see how backpropagation operates. The rest of the parameters can be worked out in the same way to give us the rest of the gradient.

First, we have to break the computation of the loss into operations. If we take the graph on the left to be our computation graph, then we end up with the operations on the right.

To simplify things, we’ll compute the loss over only one instance. We’ll use a simple squared error loss.

click image for animation
link here

For the derivative with respect to v2, we’ll only need these two operations. Anything below doesn’t affect the result.

To work out the derivative we apply the chain rule, and work out the local derivatives symbolically.

click image for animation
link here

We then do a forward pass with some values. We get an output of 10.1, which should have been 12.1, so our loss is 4. We keep all intermediate values in memory.

We then take our product of local derivatives, fill in the numeric values from the forward pass, and compute the derivative over v2.

When we apply this derivative in a gradient descent update, v2 changes as shown below.

click image for animation
link here

Let’s try something a bit earlier in the network: the weight w12. We add two operations, apply the chain rule and work out the local derivatives.

click image for animation
link here

Note that when we’re computing the derivative for w12, we are also, along the way, computing the derivatives for y, h2 and k2.

This is useful when it comes to implementing backpropagation. We can walk backward down the computation graph and compute the derivative of the loss for every node. For the nodes below, we just multiply by the local derivatives. This means we can very efficiently compute any derivatives we need.

In fact, this is where the name backpropagation comes from: the derivative of the loss propagates down the network in the opposite direction to the forward pass. We will show this more precisely in the last part of this lecture.

click image for animation
link here

Here is a more abstract view of what is happening in backpropagation, which should apply to any kind of computation. We can think of the computations we do as modules with inputs and outputs, chained together to make a computation graph. The output of each module contributes ultimately to the result of the computation, which is the loss.

We want to know the derivative of the loss with respect to each of our input nodes. By the chain rule, this is the derivative of the loss with respect to the output times the derivative of the output with respect to the input.

If we take care to walk back down our computation graph in the correct order, then we can be sure that for every module we encounter, we will have already computed the first factor. We only need to compute the second, and multiply by the value we already have.

click image for animation
link here

link here

To finish up, let’s see if we can build a little intuition for what all these accumulated derivatives mean.

Here is a forward pass for some weights and some inputs. Backpropagation starts with the loss, and walks down the network, figuring out at each step how every value contributed to the result of the forward pass. Every value that contributed positively to a positive loss should be lowered, every value that contributed positively to a negative loss should be increased, and so on.

link here

We’ll start with the first value below the loss: y, the output of our model. Of course, this isn’t a parameter of the network, so we can’t just set it to any value we’d like. But let’s imagine for a moment that we could. What would the gradient descent update rule look like if we tried to update y?

If the output is 10, and it should have been 0, then gradient descent on y tells us to lower the output of the network. If the output is 0 and it should have been 10, GD tells us to increase the value of the output.

Even though we can’t change y directly, this is the effect we want to achieve: we want to change the values we can change so that we achieve this change in y. To figure out how to do that, we take this gradient for y, and propagate it back down the network.

Note that even though these scenarios have the same loss (because of the square), the derivative of the loss has a different sign for each, so we can tell whether the output is bigger than the target or the other way around. The loss only tells us how bad we've done, but the derivative of the loss tells us where to move to make it better.

click image for animation
link here

Of course, we cannot change y directly. Instead, we have to change the values that influenced y.

Here we see what that looks like for the weights of the second layer. First note that the output y in this example was too high. Since all the hidden nodes have positive values (because of the sigmoid), we end up subtracting some positive value from all the weights. This will lower the output, as expected.

Second, note that the change is proportional to the input. The first hidden node h1 only contributes a factor of 0.1 (times its weight) to the value of y, so it isn't changed as much as h3, which contributes much more to the erroneous value.

Note also that the current value of the weight doesn’t factor into the update: whether v1 is 1, 10 or 100, we subtract the same amount. Only how much influence the weight had on the value of y in the forward pass is taken into account. The higher the activation of the source node, the more the weight gets adjusted.

Finally, note how the sign of the derivative with respect to y is taken into account. Here, the model output was too high, so the more a weight contributed to the output, the more it gets "punished" by being lowered. If the output had been too low, the opposite would be true, and we would be adding something to the value of each weight.

click image for animation
link here

The sigmoid activation we’ve used so far allows only positive values to emerge from the hidden layer. If we switch to an activation that also allows negative activations (like a linear activation or a tanh activation), we see that backpropagation very naturally takes the sign into account.

In this case, we want to update in such a way that y decreases, but we note that the weight v2 is multiplied by a negative value. This means that (for this instance) v2 contributes negatively to the model output, and its value should be increased if we want to decrease the output.

link here

We use the same principle to work our way back down the network. If we could change the output of the second node h2 directly, this is how we’d do it.

Note that we now take the value of v2 to be a constant. We are working out partial derivatives, so when we are focusing on one parameter, all the others are taken as constant.

Remember, that we want to decrease the output of the network. Since v2 makes a negative contribution to the loss, we can achieve this by increasing the activation of the source node v2 is multiplied by.

link here

Moving down to k2, remember that the derivative of the sigmoid is the output of the sigmoid times 1 minus that output.

We see here, that in the extreme regimes, the sigmoid is resistant to change. The closer to 1 or 0 we get the smaller the weight update becomes.

This is actually a big downside of the sigmoid activation, and one of the big reasons it was eventually replaced by the ReLU as the default choice for hidden units. We’ll come back to this in a later lecture.

Nevertheless, this update rule tells us what the change is to k2 that we want to achieve by changing the values we can actually change (the weights of layer 1).

link here

Finally, we come to the weights of the first layer. As before, we want the output of the network to decrease. To achieve this, we want h2 to increase (because v2 is negative). However, the input x1 is negative, so we should decrease w12 to increase h2. This is all beautifully captured by the chain rule: the two negatives of x1 and v2 cancel out and we get a positive value which we subtract from w12.

link here

To finish up let's look at how you would implement this in code. Here is the forward pass: computing the model output and the loss, given the inputs and the target value.

Assume that k and y are initialized with 0s or random values. We'll talk about initialization strategies in the 4th lecture.
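The slide's pseudocode isn't repeated here, but a loop-based sketch of such a forward pass might look as follows (a two-input, three-hidden-unit network is assumed, and the biases are left out to keep the sketch short):

```python
import math

def forward(x, w, v, t):
    """x: list of 2 inputs, w: 2-by-3 nested list of first-layer weights,
    v: list of 3 second-layer weights, t: the target value."""
    k = [0.0, 0.0, 0.0]                  # pre-activations, initialized to zero
    for j in range(3):
        for i in range(2):
            k[j] += w[i][j] * x[i]       # linear operation of the first layer
    h = [1.0 / (1.0 + math.exp(-kj)) for kj in k]   # sigmoid activation
    y = 0.0
    for j in range(3):
        y += v[j] * h[j]                 # linear operation of the second layer
    loss = (y - t) ** 2                  # squared error over one example
    return k, h, y, loss                 # keep the intermediates for the backward pass
```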

link here

And here is the backward pass. We compute gradients for every node in the network, regardless of whether the node represents a parameter. When we do the gradient descent update, we'll use the gradients of the parameters, and ignore the rest.

Note that we don’t implement the derivations from slide 44 directly. Instead, we work backwards down the neural network: computing the derivative of each node as we go, by taking the derivative of the loss over the outputs and multiplying it by the local derivative.
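A matching sketch of such a backward pass, walking down the graph node by node (the same hypothetical names as in the forward sketch above; biases again left out):

```python
def backward(x, w, v, t, h, y):
    y_grad = 2 * (y - t)                        # derivative of the loss over the output y
    v_grad, h_grad, k_grad = [0.0] * 3, [0.0] * 3, [0.0] * 3
    w_grad = [[0.0] * 3 for _ in range(2)]
    for j in range(3):
        v_grad[j] = y_grad * h[j]               # dl/dv_j = dl/dy * h_j
        h_grad[j] = y_grad * v[j]               # dl/dh_j = dl/dy * v_j
        k_grad[j] = h_grad[j] * h[j] * (1 - h[j])   # sigmoid derivative: h_j (1 - h_j)
        for i in range(2):
            w_grad[i][j] = k_grad[j] * x[i]     # dl/dw_ij = dl/dk_j * x_i
    return v_grad, w_grad                       # gradients for the parameters
```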

link here

link here

link here
link here



We’ve seen what neural networks are, how to train them by gradient descent and how to compute that gradient by backpropagation.

In order to scale up to larger and more complex structures, we need to make our computations as efficient as possible, and we’ll also need to simplify our notation. There’s one insight that we are going to get a lot of mileage out of.

link here

When we look at the computation of a neural network, we can see that most operations can be expressed very naturally in the language of linear algebra.

The multiplication of weights by their inputs is a multiplication of a matrix of weights by a vector of inputs.

The addition of a bias is the addition of a vector of bias parameters.

The nonlinearities can be implemented as simple element-wise operations.

This perspective buys us two things. First…

click image for animation
link here

Our notation becomes extremely simple. We can describe the whole operation of a neural network with one simple equation. This expressiveness will be sorely needed when we move to more complicated networks.

link here

The second reason is that the biggest computational bottleneck in a neural network is the multiplication of the layer input by the layer weights: the matrix multiplication in the notation of the previous slide. This operation is more than quadratic, while everything else is linear. We can see that in our pseudocode, because we have one loop nested in another.

Matrix multiplication (and other tensor operations like it) can be parallelized and implemented efficiently but it’s a lot of work. Ideally, we’d like to let somebody else do all that work (like the implementers of numpy) and then just call their code to do the matrix multiplications.

This is especially important if you have access to a GPU. A matrix multiplication can easily be offloaded to the GPU for much faster processing, but for a loop over an array, there’s no benefit.

This is called vectorizing: taking a piece of code written in for loops, and getting rid of the loops by expressing the function as a sequence of linear algebra operations.

link here

Without vectorized implementations, deep learning would be painfully slow, and GPUs would be useless. Turning a naive, loop-based implementation into a vectorized one is a key skill for DL researchers and programmers.

link here

Here’s what the vectorized forward pass looks like as a computation graph, in symbolic notation and in pseudocode.

When you implement this in numpy it’ll look almost the same.
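For instance, a numpy sketch of this vectorized forward pass might look like this (names and shapes are our own choices, not the slide's exact code):

```python
import numpy as np

def forward(x, W, b, V, c, t):
    """x: input vector, W and b: first-layer weights and bias,
    V and c: second-layer weights and bias, t: target vector."""
    k = W @ x + b                   # first linear layer
    h = 1.0 / (1.0 + np.exp(-k))    # element-wise sigmoid
    y = V @ h + c                   # second linear layer
    loss = np.sum((y - t) ** 2)     # squared error loss: a single scalar
    return k, h, y, loss
```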

So far so good. The forward pass is easy enough to vectorize.

click image for animation
link here

Of course, we lose a lot of the benefit of vectorizing if the backward pass (the backpropagation algorithm) isn’t also expressed in terms of matrix multiplications. How do we vectorize the backward pass? That's the question we'll answer in the rest of this video.

On the left, we see the forward pass of our loss computation as a set of operations on vectors and matrices.

To generalize backpropagation to this view, we might ask if something similar to the chain rule exists for vectors and matrices. Firstly, can we define something analogous to the derivative of one matrix over another, and secondly, can we break this apart into a product of local derivatives, possibly giving us a sequence of matrix multiplications?

The answer is yes, there are many ways of applying calculus to vectors and matrices, and there are many chain rules available in these domains. However, things can quickly get a little hairy, so we need to tread carefully.

click image for animation
link here

The derivatives of high-dimensional objects are easily defined. We simply take the derivative of every number in the output over every number in the input, and we arrange all possibilities into some logical shape. For instance, if we have a vector-to-vector function, the natural analog of the derivative is a matrix with all the partial derivatives in it.

However, once we get to matrix/vector operations or matrix/matrix operations, the only way to logically arrange every input with every output is a tensor of higher dimension than 2.

NB: We don’t normally apply the differential symbol to non-scalars like this. We’ll introduce better notation later.

click image for animation
link here

As we see, even if we could come up with this kind of chain rule, one of the local derivatives is now a vector over a matrix. The result could only be represented in a 3-tensor. There are two problems with this:

If the operation has n inputs and n outputs, we are computing n³ derivatives, even though we were only ultimately interested in n² of them (the derivatives of W). In the scalar algorithm we only ever had two nested loops (an n² complexity), and we only ever stored one gradient for one node in the computation graph. Now we suddenly have n³ complexity and n³ memory use. We were supposed to be making things faster.

We can easily represent a 3-tensor, but there’s no obvious, default way to multiply a 3-tensor with a matrix, or with a vector (in fact there are many different ways). The multiplication of the chain rule becomes very complicated this way.

In short, we need a more careful approach.

link here

To work out how to do this, we make the following simplifying assumptions.

click image for animation
link here

Note that this doesn’t mean we can only ever train neural networks with a single scalar output. Our neural networks can have any number of outputs of any shape and size. We can make neural networks that generate images, natural language, raw audio, even video.

However, the loss we define over those outputs needs to map them all to a single scalar value. The computation graph is always the model, plus the computation of the loss.

link here

We call this derivative the gradient of W. This is a common term, but we will deviate from the standard approach in one respect.

Normally, the gradient is defined as a row vector, so that it can operate on a space of column vectors by matrix multiplication.

In our case, we are never interested in multiplying the gradient by anything. We only ever want to sum the gradient of l wrt W with the original matrix W in a gradient update step. For this reason we define the gradient as having the same shape as the tensor W with respect to which we are taking the derivative.

In the example shown, W is a 3-tensor (a kind of matrix with 3 dimensions instead of 2). The gradient of l wrt W has the same shape as W, and at element (i, j, k) it holds the scalar derivative of l wrt Wijk.

With these rules, we can use tensors of any shape and dimension and always have a well-defined gradient.

click image for animation
link here

The standard gradient notation isn’t very suitable for our purposes. It puts the loss front and center, but that will be the same in all cases. The object that we’re actually interested in is relegated to a subscript. Also, it isn’t very clear from the notation what the shape is of the tensor that we’re looking at.

For this reason, we’ll introduce a new notation. This isn’t standard, so don’t expect to see it anywhere else, but it will help to clarify our mathematics a lot as we go forward.

We’ve put the W front and center, so it’s clear that the result of taking the gradient is also a matrix, and we’ve removed the loss, since we’ve assumed that we are always taking the gradient of the loss.

The notation works the same for vectors and even for scalars. This is the gradient of l with respect to W. Since l never changes, we’ll refer to this as “the gradient for W”.

click image for animation
link here

When we refer to a single element of the gradient, we follow our convention for elements of tensors: we use the same letter as we use for the matrix, but non-bold.

Element (i, j) of the gradient for W is the same as the gradient for element (i, j) of W.

link here

This gives us a good way of thinking about the gradients, but we still don’t have a chain rule to base our backpropagation on.

The main trick we will use is to stick to scalar derivatives as much as possible.

Once we have worked out the derivatives in purely scalar terms (with pen and paper), we will then find a way to vectorize their computation.

click image for animation
link here

We start simple: what is the gradient for y? This is a vector, because y is a vector. Let’s first work out the derivative of the i-th element of y. This is purely a scalar derivative so we can simply use the rules we already know. We get 2(yi - ti) for that particular derivative.

Then, we just re-arrange all the derivatives for yi into a vector, which gives us the complete gradient for y.

The final step requires a little creativity: we need to figure out how to compute this vector using only basic linear algebra operations on the given vectors and matrices. In this case it’s not so complicated: we get the gradient for y by element-wise subtracting t from y and multiplying by 2.

We haven’t needed any chain rule yet, because our computation graph for this part has only one edge.


click image for animation
link here

Let’s move one step down and work out the gradient for V. We start with the scalar derivative for an arbitrary element of V: Vij.

Now, we need a chain rule, to backpropagate through y. However, since we are sticking to scalar derivatives, we can simply use the scalar multivariate chain rule. This tells us that however many intermediate values we have, we can work out the derivative for each, keeping the others constant, and sum the results.

This gives us the sum in the second equality. We've worked out the gradient for y already, so we can fill that in and focus on the derivative of yk over Vij. The value of yk is defined in terms of linear algebra operations like matrix multiplication, but with a little thinking we can always rewrite these as simple scalar sums.

In the end we find that the derivative of yk over Vij reduces to the value of hj.

This tells us the value of every element (i, j) of the gradient for V. All that's left is to figure out how to compute this in a vectorized way. In this case, we can compute the gradient for V as the outer product of the gradient for y, which we've computed already, and the vector h, which we can save during the forward pass.
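In numpy, that last vectorization step is a single call. A small sketch with toy values (y, t and h stand for the model output, the target and the hidden activations saved in the forward pass):

```python
import numpy as np

y, t = np.array([0.3, 1.2]), np.array([0.0, 1.0])  # toy model output and target
h = np.array([0.1, 0.7, 0.9])                      # toy hidden activations from the forward pass

y_grad = 2 * (y - t)            # the gradient for y, from the previous slide
V_grad = np.outer(y_grad, h)    # element (i, j) of the gradient for V is y_grad[i] * h[j]
```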

click image for animation
link here

link here

Since this is an important principle to grasp, let’s keep going until we get to the other set of parameters, W. We’ll leave the biases as an exercise.

For the gradient for h, most of the derivation is the same as before, until we get to the point where the scalar derivative is reduced to a matrix multiplication. Unlike the previous derivation, the sum over k doesn't disappear. We need to take it into the description of the gradient for h.

To vectorize this, we note that each element of this vector is a dot product of the gradient for y and a column of V. This means that if we multiply the transpose of the gradient for y by V, we get the required result as a row vector. Transposing that result (to make it a column vector) is the same as multiplying the gradient for y by the transpose of V from the left.

Note the symmetry of the forward and the backward operation.

click image for animation
link here

Now, at this point, when we analyze k, remember that we already have the gradient for h. This means that we no longer have to apply the chain rule to anything above h. We can draw this scalar computation graph, and work out the local gradient for k in terms of the gradient for h.

Given that, working out the gradient for k is relatively easy, since the operation from k to h is an element-wise one.

click image for animation
link here

Finally, the gradient for W. The situation here is exactly the same as we saw earlier for V (matrix in, vector out, matrix multiplication), so we should expect the derivation to have the same form (and indeed it does).

click image for animation
link here

Here’s the backward pass that we just derived.

We’ve left the derivatives of the bias parameters out. You’ll have to work these out to implement the first homework exercise.
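Put together, a numpy sketch of this vectorized backward pass might look like the following (the names match the hypothetical forward-pass sketch from earlier; the bias gradients are deliberately left out, since they're the homework exercise):

```python
import numpy as np

def backward(x, W, V, h, y, t):
    y_grad = 2 * (y - t)               # gradient for y
    V_grad = np.outer(y_grad, h)       # gradient for V
    h_grad = V.T @ y_grad              # gradient for h
    k_grad = h_grad * h * (1 - h)      # gradient for k: element-wise sigmoid derivative
    W_grad = np.outer(k_grad, x)       # gradient for W
    return V_grad, W_grad              # bias gradients left as an exercise
```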

link here

link here

link here
link here



We’ve simplified the computation of derivatives a lot. All we’ve had to do is work out local derivatives and chain them together. We can go one step further: if we let the computer keep track of our computation graph, and provide some backwards functions for basic operations, the computer can work out the whole backpropagation algorithm for us. This is called automatic differentiation (or sometimes autograd).

link here

link here

This is what we want to achieve. We work out on pen and paper the local derivatives of various modules, and then we chain these modules together in a computation graph in our code. The computer keeps the graph in memory, and can automatically work out the backpropagation.

link here

Whenever we work out a gradient, as we did in the previous video, we always look at the node above the one for which we're computing the gradient, and use the multivariate chain rule to split the gradient in two. In this case, we are working out the gradient for W and we apply the multivariate chain rule to break that gradient into the gradient for k and the derivatives of k with respect to W.

The key idea that powers automatic differentiation is that once we have the gradient for k, we no longer care about anything else that happens above k in our computation graph. All we need is the gradient for k, and from it we can work out the gradient for W.

click image for animation
link here

link here

This kind of algorithm is called automatic differentiation. What we’ve been doing so far is called backward, or reverse mode automatic differentiation. This is efficient if you have few output nodes. Note that if we had two output nodes, we’d need to do a separate backward pass for each.

If you have few inputs, it’s more efficient to start with the inputs and apply the chain rule working forward. This is called forward mode automatic differentiation.

Since we assume we only have one output node (where the gradient computation is concerned), we will always use reverse mode in deep learning.

click image for animation
link here

To make this idea precise, and applicable to as broad a range of functions as possible, we'll need to set up a few ingredients. Our values will be tensors: scalars, vectors, matrices, or their higher-dimensional analogues.

We then build computation graphs out of operations on these tensors.

link here

The basic data structure of our system will be the tensor. A tensor is a generic name for a family of data structures that includes scalars, vectors, matrices and so on.

There is no good way to visualize a 4-dimensional structure. We will occasionally use this form to indicate that there is a fourth dimension along which we can also index the tensor.

We will assume that whatever data we work with (images, text, sounds), will in some way be encoded into one or more tensors, before it is fed into our system. Let’s look at some examples.

click image for animation
link here

A simple dataset with numeric features can simply be represented as a matrix. For the labels, we usually create a separate, corresponding vector.

Any categorical features or labels should be converted to numeric features (normally by one-hot encoding).

click image for animation
link here

Images can be represented as 3-tensors. In an RGB image, the color of a single pixel is represented using three values between 0 and 1 (how red it is, how green it is and how blue it is). This means that an RGB image can be thought of as a stack of three color channels, represented by matrices.

This stack forms a 3-tensor.

link here

If we have a dataset of images, we can represent this as a 4-tensor, with dimensions indexing the instances, their width, their height and their color channels respectively. Below is a snippet of code showing that when you load the CIFAR10 image data in Keras, you do indeed get a 4-tensor like this.

There is no agreed standard ordering for the dimensions in a batch of images. Tensorflow and Keras use (batch, height, width, channels), whereas Pytorch uses (batch, channels, height, width).

(You can remember the latter with the mnemonic “bachelor chow”.)

link here

link here

We’ll need to be a little more precise in our notation. From now on we’ll draw computation graphs with both the operations and the values as nodes. The graph is always bipartite and directed. A tensor node is connected to the ops for which it is an input, and an operation node is connected to the tensor nodes that it produces as outputs.

link here

As an example, here is our MLP expressed in our new notation.

Just as before, it’s up to us how granular we make the computation. For instance, we wrap the whole computation of the loss in a single operation, but we could also separate out the subtraction and the squaring.

You may note that this graph is not directed or bipartite everywhere. This is just visual shorthand. Occasionally, we will omit the direction of an edge for clarity, or omit intermediate nodes when it's clear that a tensor node or op node should be there. The actual computation graph we are describing in such cases is always bipartite and entirely directed.

link here

We hold on to the same assumptions we had before. They are required for automatic differentiation to work efficiently.

link here

To store a computation graph in memory, we need three classes of objects. The first two are shown here.

A TensorNode object holds the tensor value at that node. It holds the gradient of the tensor at that node (to be filled in by the backpropagation algorithm) and it holds a pointer to the OpNode that produced it (which can be null for leaf nodes).

An OpNode object represents an instance of a particular operation being applied in the computation. It stores the particular op being used (multiplication, addition, etc.) and the inputs and the outputs. In the MLP example, we perform several matrix multiplications: each of these becomes a separate OpNode in the computation graph, all recording an instance of the matrix multiplication operation being used. Each of them refers to a single object representing this operation.
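As a rough sketch (the names follow the description above, not any particular library), the first two classes might look like this:

```python
class TensorNode:
    def __init__(self, value, source=None):
        self.value = value     # the tensor data held at this node
        self.grad = None       # gradient of the loss for this tensor, filled in by backpropagation
        self.source = source   # the OpNode that produced this node (None for leaf nodes)

class OpNode:
    def __init__(self, op, inputs, outputs):
        self.op = op            # the Op that was applied here (shared between many OpNodes)
        self.inputs = inputs    # list of input TensorNodes
        self.outputs = outputs  # list of output TensorNodes
        self.context = {}       # values the forward saves for the backward (see below)
```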

This is the third class: the Op.

link here

If the difference between an Op and an OpNode is unclear, consider that a single neural network may apply, say, a sigmoid operation many times. The Op object for the sigmoid defines how to compute the sigmoid operation (and its gradient). Then, for each place in the computation graph where the sigmoid is applied, we create a separate OpNode object which references the inputs and outputs specific to that application of the sigmoid, together with a pointer to the single sigmoid Op. This is where we actually define how to perform the computation.

An Op is defined by two functions.

The function forward computes the outputs given the inputs (just like any function in any programming language).

The function backward takes the gradient for the outputs (the gradient of the loss wrt to the outputs) and produces the gradient for the inputs.

Both functions are also given a context object. This is a data structure (a dictionary or a list) to which the forward can add any value it needs to save for the backward. We'll see an example of this later.
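A sketch of what this Op interface might look like, with element-wise multiplication as a made-up example of a concrete Op:

```python
class Op:
    @staticmethod
    def forward(context, *inputs):
        """Compute the outputs from the inputs, saving anything needed later in context."""
        raise NotImplementedError

    @staticmethod
    def backward(context, goutput):
        """Given the gradient for the output, return the gradients for the inputs."""
        raise NotImplementedError

class Multiply(Op):
    @staticmethod
    def forward(context, a, b):
        context['a'], context['b'] = a, b   # save the inputs for the backward pass
        return a * b

    @staticmethod
    def backward(context, goutput):
        a, b = context['a'], context['b']
        return goutput * b, goutput * a     # gradients for a and b respectively
```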

click image for animation
link here

We'll see how to implement ops later. First, let's see how all this works put together. Using Ops, OpNodes and TensorNodes, we can build a computation graph (again, more on how we do that later). The OpNodes reference the TensorNodes that are their inputs and outputs and they reference the Ops that contain the actual code for their forward and backward.

Once we have this computation graph, computing our model output and loss and performing backpropagation becomes almost trivially easy. We simply traverse the tree forwards from the inputs to the loss calling all the forward functions, and then backwards from the loss to the inputs, calling all the backwards functions.

The only thing we need to keep track of is the order: when we call forward() on an OpNode's Op, all the ancestors of that OpNode must already have been computed, so that we have all its inputs. And when we traverse the tree backward and call backward() on an OpNode's Op, all its descendants must already have been processed, so that we have the gradients for all of its outputs.

link here

All this assumes that the computation graph is already there. But how do we tell the system what our computation graph should be? It turns out that the nicest way to do this is to describe the computation itself in code. This makes sense: code is a language explicitly for describing computations, so it would be nice to simply describe a computation as a piece of Python code, and have the system build a computation graph for us.

This can be done through the magic of operator overloading. We simply change what operators like * and + mean when they are applied to TensorNode objects. There are two common strategies: lazy and eager execution.

link here

Here’s one example of how a lazy execution API might look.

Note that when we’re building the graph, we’re not telling it which values to put at each node. We’re just defining the shape of the computation, but not performing the computation itself.

When we create the model, we define which nodes are the input nodes, and which node is the loss node. We then provide the input values and perform the forward and backward passes.

click image for animation
link here

In lazy execution, which was the norm during the early years of deep learning, we build the computation graph, but we don’t yet specify the values on the tensor nodes. When the computation graph is finished, we define the data, and we feed it through.

This is fast since the deep learning system can optimize the graph structure during compilation, but it makes models hard to debug: if something goes wrong during training, you probably made a mistake while defining your graph, but you will only get an error while passing data through it. The resulting stack trace will never point to the part of your code where you made the mistake.

link here

In eager mode deep learning systems, we create a node in our computation graph (a TensorNode) by specifying what data it should contain. The result is a tensor object that stores both the data, and the gradient over that data (which will be filled later).

Here we create the variables a and b. If we now apply an operation to these, for instance to multiply their values, the result is another variable c. Languages like python allow us to overload the * operator, so it looks like we’re just computing a multiplication, but behind the scenes, we are creating a computation graph that records all the computations we’ve done.

We compute the data stored in c by running the computation immediately, but we also store references to the variables that were used to create c, and the operation that created it. We create the computation graph on the fly, as we compute the forward pass.

Using this graph, we can perform the backpropagation from a given node that we designate as the loss node (node c in this case). We work our way down the graph computing the derivative of each variable with respect to c. At the start the TensorNodes do not have their grads filled in, but at the end of the backward, all gradients have been computed.

Once the gradients have been computed, and the gradient descent step has been performed, we clear the computation graph. It's rebuilt from scratch for every forward pass.
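To make the mechanics concrete, here is a toy sketch of eager execution on scalar values (this is not vugrad or pytorch code, just an illustration of overloading * and walking the recorded graph backward):

```python
class Value:
    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents           # the Values this one was computed from
        self.local_grads = local_grads   # the local derivative for each parent

    def __mul__(self, other):
        # overloaded *: compute the result and record how it was made
        return Value(self.data * other.data,
                     parents=(self, other),
                     local_grads=(other.data, self.data))

    def backward(self, upstream=1.0):
        # accumulate the gradient and push it further down the recorded graph
        self.grad += upstream
        for parent, local in zip(self.parents, self.local_grads):
            parent.backward(upstream * local)

a, b = Value(2.0), Value(3.0)
c = a * b       # builds the computation graph on the fly
c.backward()    # fills in a.grad == 3.0 and b.grad == 2.0
```

A real system traverses the graph in a careful order and supports many more operations, but the principle is the same.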

click image for animation
link here

In eager execution, we simply execute all operations immediately on the data, and collect the computation graph on the fly. We then execute the backward and ditch the graph we’ve collected.

This makes debugging much easier, and allows for more flexibility in what we can do in the forward pass. It can, however be a little difficult to wrap your head around. Since we’ll be using pytorch later in the course, we’ll show you how eager execution works step by step.

link here

link here

The final ingredient we need is a large collection of operations with backward functions worked out. We’ll show how to do this for a few examples.

First, an operation that sums two matrices element-wise.

The recipe is the same as we saw in the last part:

Write out the scalar derivative for a single element.

Use the multivariate chain rule to sum over all outputs.

Vectorize the result.

Note that when we draw the computation graph, we can think of everything that happens between S and l as a single module: we are already given the gradient of l over S, so it doesn’t matter if it’s one operation or a million.

Again, we can draw the computation graph at any granularity we like: very fine individual operations like summing and multiplying or very coarse-grained operations like entire NNs.

For this operation, the context object is not needed, we can perform the backward pass without remembering anything about the forward pass.
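Written against the Op interface sketched earlier, this operation might look as follows (a sketch, assuming the inputs are numpy arrays of the same shape):

```python
class MatrixSum(Op):
    @staticmethod
    def forward(context, A, B):
        return A + B               # element-wise sum; nothing needs to be remembered

    @staticmethod
    def backward(context, S_grad):
        return S_grad, S_grad      # the output gradient passes through unchanged to both inputs
```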

click image for animation
link here

To finish up, we’ll show you the implementation of some more operations. You’ll be asked to do a few of these in the second homework exercise.

link here

This is a pretty simple derivation, but it shows two things:

We can easily do a backward over functions that output high dimensional tensors, but we should sum over all dimensions of the output when we apply the multivariate chain rule.

The backward function illustrates the utility of the context object. To work out the backward, we need to know the original input. We could just have stored that for every operation, but as we see here, we’ve done part of the computation already in the forward (and in other backwards, like the sum we saw earlier, the inputs aren’t necessary at all). If we give the operation control over what information it wants to store for the backward pass, we are free to optimize for efficiency.
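The slide's exact operation isn't shown here, but the sigmoid makes the same point nicely: instead of storing the input, the forward can store its own output, which is all the backward needs. A sketch in the same Op style:

```python
import numpy as np

class Sigmoid(Op):
    @staticmethod
    def forward(context, X):
        out = 1.0 / (1.0 + np.exp(-X))
        context['out'] = out            # save the output, not the input
        return out

    @staticmethod
    def backward(context, goutput):
        out = context['out']
        return goutput * out * (1.0 - out)   # the sigmoid derivative reuses the saved output
```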

click image for animation
link here

Note that the output is a vector, because we’ve summed out one dimension. Therefore, when we apply the multivariate chain rule, we sum over only one dimension.

This is a sum like the earlier example, so we don’t need to save any tensor values from the forward pass. However, we do need to remember the size of the dimension we summed out. Therefore, we use the context to store this information. The None keyword in the slice in the last line adds a singleton dimension: it turns the shape from (n) into (n, 1).
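A sketch of such a row-sum Op (summing a matrix over its second dimension), following the description above:

```python
import numpy as np

class RowSum(Op):
    @staticmethod
    def forward(context, X):
        context['width'] = X.shape[1]    # remember the size of the summed-out dimension
        return X.sum(axis=1)             # shape (n, m) -> shape (n,)

    @staticmethod
    def backward(context, goutput):
        # every element in row i contributed with weight 1, so the gradient for X
        # is the output gradient repeated along the summed-out dimension
        return goutput[:, None] * np.ones((1, context['width']))
```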

click image for animation
link here

Our final example includes a constant argument. We take a scalar, and expand it to a matrix of the given size, filled with the value x.

In this case, we do not care about the gradient for size (and even if we did, integer values don’t yield meaningful gradients without some extra work). We consider this a constant.

This can be done in different ways. In pytorch, you return None for the gradient for a constant (as we’ve shown here).

In our toy system (vugrad), we consider all keyword arguments constants.
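A sketch of such an expand Op, following the pytorch-style convention of returning None for the constant argument:

```python
import numpy as np

class Expand(Op):
    @staticmethod
    def forward(context, x, size):
        return np.full(size, x)          # a matrix of the given size, filled with the scalar x

    @staticmethod
    def backward(context, goutput):
        # x contributed to every element of the output, so its gradient is the sum of
        # all output gradients; the size argument is a constant, so it gets None
        return goutput.sum(), None
```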

click image for animation
link here

link here

Most deep learning frameworks also have a way of combining model parameters and computation into a single unit, often called a module or a layer.

In this case a Linear module (as it is called in Pytorch) takes care of implementing the computation of a single layer of a neural network (sans activation) and of remembering the weights and the bias.

Modules have a forward function which performs the computation of the layer, but they don’t have a backward function. They just chain together operations, and the backpropagation calls the backwards on the operations. In other words, it doesn’t matter to the backpropagation whether we use a module or apply all the operations by hand.
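A sketch of what such a module might look like (loosely modeled on pytorch's Linear, but not its actual code):

```python
import numpy as np

class Linear:
    def __init__(self, in_features, out_features):
        # the module owns its parameters: a weight matrix and a bias vector
        self.W = 0.1 * np.random.randn(out_features, in_features)
        self.b = np.zeros(out_features)

    def forward(self, x):
        # just chains together existing operations; backpropagation calls the
        # backwards of those operations, so the module needs no backward of its own
        return self.W @ x + self.b
```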

link here

This completes the picture we wanted to create: a system where all we have to do is perform some computations on some tensors, just like we would do in numpy, and all the building of computation graphs and all the backpropagating is handled for us automatically.

So long as the forward and backward have been defined for all the operations we want to compute, either by us, or as part of a standard library, the system will do the rest.

link here

So

link here

link here


These slides were cut to lighten the amount of material covered in this lecture. They're not part of the exam material, but the information is still useful if you're working with DL systems. It’s good to skim them quickly, so you know to come back here when you encounter any of the issues we describe here.

link here

It’s important to realize that even though we think of tensors as multidimensional arrays, in memory, they are necessarily laid out as a single line of numbers. The Tensor object knows the shape that these numbers should take, and uses that to compute whatever we need it to compute.

We can do this in two ways: scanning along the rows first and then the columns is called row-major ordering. This is what numpy and pytorch both do. The other option is (unsurprisingly) called column major ordering. In higher dimensional tensors, the dimensions further to the right are always scanned before the dimensions to their left.

Imagine looping over all elements in a tensor with shape (a, b, c, d) in four nested loops. If you want to loop over the elements in the order they are in memory, then the loop over d would be the innermost loop, then c, then b and then a.
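Concretely, in row-major order the element at index (i, j, k, l) of a tensor with shape (a, b, c, d) sits at the following position in the flat array (a small sketch to make the ordering explicit):

```python
def flat_index(i, j, k, l, shape):
    a, b, c, d = shape
    # the rightmost dimension varies fastest in row-major ordering
    return ((i * b + j) * c + k) * d + l

print(flat_index(0, 0, 0, 1, (2, 3, 4, 5)))  # 1: neighbours along d are neighbours in memory
print(flat_index(1, 0, 0, 0, (2, 3, 4, 5)))  # 60: stepping along a jumps b*c*d positions
```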

This allows us to perform some operations very cheaply, but we have to be careful.

click image for animation
link here

Here is one such cheap operation: a reshape. This changes the shape of the tensor, but not the data.

To do this cheaply, we can create a new tensor object with the same data as the old one, but with a different shape.

link here

link here

Imagine that we wanted to scale a dataset of images in such a way that over the whole dataset, each color channel has a maximal value of 1 (independent of the other channels).

To do this, we need a 3-vector of the maxima per color channel. Pytorch only allows us to take the maximum in one direction, but we can reshape the array so that all the directions we’re not interested in are collapsed into one dimension.

Afterwards, we can reverse the reshape to get our image dataset back.

We have to be careful: the last three lines in this slide form a perfectly valid reshape, but the c dimension in the result does not contain our color values.

In general, you’re fine if you collapse dimensions that are next to each other, and uncollapse them in the same order.
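A sketch of this per-channel scaling in pytorch, assuming a channels-last layout (batch, height, width, channels) so that the dimensions we collapse are next to each other:

```python
import torch

images = torch.rand(100, 32, 32, 3)              # assumed layout: (batch, height, width, channels)
flat = images.reshape(-1, 3)                     # collapse batch, height and width into one dimension
maxima = flat.max(dim=0).values                  # one maximum per color channel, shape (3,)
scaled = (flat / maxima).reshape(images.shape)   # scale, then reverse the reshape
```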

click image for animation
link here

A transpose operation can also be achieved cheaply. In this case, we just keep a reference to the old tensor, and whenever a user requests the element (i, j) we return (j, i) from the original tensor.

NB: This isn’t precisely how pytorch does this, but the effect is the same: we get a transposed tensor in constant time, by viewing the same data in memory in a different way.

click image for animation
link here

Even slicing can be accomplished by referring to the original data.

click image for animation
link here

For some operations, however, the data needs to be contiguous. That is, the tensor data in memory needs to be one uninterrupted string of data in row major ordering with no gaps. If this isn’t the case, pytorch will throw an exception like this.

You can fix this by calling .contiguous() on the tensor. The price you pay is the linear time and space complexity of copying the data.
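For instance (a sketch; the exact error message differs between versions):

```python
import torch

t = torch.randn(3, 4)
try:
    t.t().view(12)              # the transpose is a non-contiguous view, so this raises
except RuntimeError as e:
    print(e)

t.t().contiguous().view(12)     # works, at the cost of copying the data
```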