YouTube Video — Transcript

An in-depth explanation of the backpropagation algorithm using computation graphs to compute gradients in neural networks.

Key Takeaways

  • Backpropagation automates gradient computation, avoiding tedious manual differentiation.
  • Computation graphs visually and structurally represent functions and their gradients.
  • Basic gradient rules for sum, product, max, and logistic functions form building blocks.
  • Chain rule is efficiently applied by multiplying gradients along graph edges.
  • Understanding computation graphs demystifies gradient flow in neural network training.

Summary

  • Introduction to the backpropagation algorithm for automatic gradient computation.
  • Explanation of neural network structure and loss function in regression tasks.
  • Motivation for using computation graphs to simplify gradient calculations.
  • Definition and role of computation graphs as directed acyclic graphs representing mathematical expressions.
  • Examples of basic operations (addition, multiplication, max, logistic) and their gradients in graph form.
  • Demonstration of the chain rule applied through computation graphs for composite functions.
  • Detailed walkthrough of computing gradients for a hinge loss function using computation graphs.
  • Insight into how deep learning frameworks like TensorFlow and PyTorch use backpropagation internally.
  • Emphasis on modular structure and clarity gained by representing gradients with computation graphs.
  • Use of simple building blocks to compose complex gradient computations.

Full Transcript

00:05
Speaker A
Hi, in this module I'm going to talk about the backpropagation algorithm for computing gradients automatically. It's generally associated with training neural networks, but it's actually a far more general algorithm.
00:18
Speaker A
So let's begin with our motivating example, which is suppose we're doing regression with a four-layer neural network.
00:25
Speaker A
So remember that we compute the loss with respect to a particular example, and that this loss depends on the weights of the network.
00:43
Speaker A
And remember the form of the neural network: you start with a feature vector, multiply it by a weight matrix to get a vector, and send that through the activation function. You repeatedly apply a matrix and then an activation function, and finally you take the dot product of the final vector with the final weight vector, and that gives you your score.
02:04
Speaker A
So this is your prediction, subtract the target value, and you square it, that gives you your loss. So now if you wanted to train this neural network using stochastic gradient descent, you would need to compute the gradient of this loss function with respect to each of the parameters.
02:30
Speaker A
So for example, you would compute the gradient of the loss with respect to V1, that gives you a gradient update which you can then use to update V1.
02:40
Speaker A
Same with V2, V3, and W.
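[Editor's sketch, not from the video: the forward computation just described can be written down directly. The shapes, the tanh activation, and the function name are my own assumptions.]

```python
import numpy as np

def forward_loss(phi, Vs, w, y, activation=np.tanh):
    # repeatedly apply a weight matrix, then the activation function
    h = phi
    for V in Vs:
        h = activation(V @ h)
    score = w @ h              # dot product with the final weight vector
    return (score - y) ** 2    # squared loss for regression

# hypothetical tiny example: three hidden layers plus the final weight vector
rng = np.random.default_rng(0)
phi = rng.normal(size=3)
Vs = [rng.normal(size=(3, 3)) for _ in range(3)]
w = rng.normal(size=3)
loss = forward_loss(phi, Vs, w, y=1.0)
```

The gradient of this loss with respect to each `V` and `w` is exactly what backpropagation will compute for us.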
02:49
Speaker A
So now you can sit down with this lovely expression and you can just grind through the math and get the expressions, it's straightforward, but it's rather tedious. Another question is, how can you get the gradients without doing all this manual work?
03:29
Speaker A
So the answer to that is computation graphs.
03:34
Speaker A
So here is our loss function again, and what we're going to do is write down the computation graph for this expression. A computation graph is a directed acyclic graph whose root node represents the final mathematical expression and whose other nodes represent intermediate subexpressions.
03:58
Speaker A
Now what this computation graph is going to allow us to do is apply the backpropagation algorithm to it and automatically get gradients out.
04:09
Speaker A
So there are actually two purposes here. The first is computing the gradients automatically, and this is how deep learning packages like TensorFlow and PyTorch work under the hood.
05:04
Speaker A
And second, we're going to use this as a tool to gain insight into the modular structure of the gradients and try to demystify them, because taking gradients by hand, you can get into situations where you just have a lot of symbols; using a graph, we can start to see the structure.
05:27
Speaker A
Okay, so our starting point is to think about functions as boxes. So imagine you have this expression A + B, and that gives rise to some variable C, so I'm going to represent this as a very simple computation graph where you have A and B.
05:56
Speaker A
And these arrows point into this box that does plus, and the result will be labeled as C here. Okay.
06:39
Speaker A
So now the question is, if I change A or B by a small amount, how much does C change?
06:50
Speaker A
So this is just the notion of a gradient, so informally, we can look at this as you have A plus B equals C, so now if I go and fiddle with A a little bit, I add epsilon, so what happens to the right hand side?
07:12
Speaker A
Well, on the right hand side, I just get plus one epsilon.
07:21
Speaker A
And what I'm going to do, so the gradient of C with respect to A is one, I'm just going to write it on the edge.
07:34
Speaker A
So this can be interpreted as kind of amplification or a gain.
07:47
Speaker A
If I move A by a little bit, this is the multiplicative factor by which that change gets passed on to C.
07:59
Speaker A
So let's do the other side: A plus B, and you add a bit of noise to B, and again the output changes by one times epsilon, so the gradient of C with respect to B is one as well.
08:29
Speaker A
Here's another example: C equals A times B. So as a computation graph, A and B go into this box, which takes the product, and you get C.
08:43
Speaker A
So what happens when you add epsilon noise to A? A plus epsilon, times B, is equal to C plus B epsilon, so now you have a B epsilon term coming out.
08:58
Speaker A
So therefore, the gradient of C with respect to A is now B.
09:03
Speaker A
And analogously, we add noise to B, and we see that the contribution to the output C is A times epsilon.
09:20
Speaker A
Therefore, the gradient over here is A.
09:23
Speaker A
So this all should be kind of familiar.
09:30
Speaker A
I've just cast the sum and product rules for differentiation in graphical form.
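[Editor's sketch, not from the video: these two rules are easy to sanity-check numerically with a finite difference.]

```python
def numgrad(f, x, eps=1e-6):
    # symmetric finite-difference approximation of df/dx
    return (f(x + eps) - f(x - eps)) / (2 * eps)

a, b = 3.0, 4.0
grad_sum_a = numgrad(lambda x: x + b, a)   # sum rule: gradient should be 1
grad_prod_a = numgrad(lambda x: x * b, a)  # product rule: gradient should be b
```

This "fiddle with the input by epsilon" check is exactly the perturbation argument from the lecture, done in code.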
09:41
Speaker A
So let's do a few more kind of small examples.
09:50
Speaker A
These small examples are going to be the building blocks.
10:00
Speaker A
It turns out that you can take these building blocks and compose them to build all sorts of more complicated functions.
10:09
Speaker A
So here's the example we saw before, A plus B, and the gradients are one and one.
10:17
Speaker A
A minus B, the gradients are one and minus one, because if you add epsilon to B, then this difference is going to go down by epsilon.
10:30
Speaker A
Here we saw this example, A times B, and the gradients are B and A.
10:40
Speaker A
Um, if you look at the squared function, A squared, the gradient with respect to the input is 2A.
10:56
Speaker A
So just the kind of the power rule.
11:00
Speaker A
Um, let's consider A and B where you take the max.
11:08
Speaker A
Okay.
11:10
Speaker A
So this one, uh, let's think about this.
11:20
Speaker A
So if I add a little bit of, add epsilon to A, how's that going to change the max?
11:40
Speaker A
Well, if A is greater than B, then it's just going to change the max by epsilon.
11:54
Speaker A
But if A is less than B, then it's going to be zero, because B is going to be the max.
12:10
Speaker A
Um, so the gradient of this max of A and B with respect to A is the indicator function of whether A is greater than B or not.
12:21
Speaker A
And symmetrically, um, the gradient with respect to B is whether B is greater than A.
12:29
Speaker A
Okay.
12:30
Speaker A
And finally, here is the logistic function: you take A and send it through the logistic function.
12:40
Speaker A
And a little bit of algebra, which I'll spare you, produces this actually quite elegant expression, which is sigma times one minus sigma.
13:00
Speaker A
And you can kind of check that as A goes to infinity or minus infinity, remember, the sigmoid is going to saturate at one or zero, which means this gradient is actually going to go to zero.
13:20
Speaker A
So that's just a simple sanity check.
13:23
Speaker A
Okay, so these are the basic building blocks.
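[Editor's sketch, not from the video: the cheat sheet of building blocks can be written as a table of local gradients, one tuple entry per input. The dictionary and its keys are my own naming.]

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# local gradient of each building block with respect to each of its inputs
LOCAL_GRADS = {
    "add":      lambda a, b: (1.0, 1.0),
    "subtract": lambda a, b: (1.0, -1.0),
    "multiply": lambda a, b: (b, a),
    "square":   lambda a: (2.0 * a,),
    "max":      lambda a, b: (float(a > b), float(b > a)),
    "logistic": lambda a: (sigmoid(a) * (1.0 - sigmoid(a)),),
}
```

Everything that follows in the module is composition of these entries.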
13:30
Speaker A
And that's really all the brute-force differentiation we're going to do; the rest is just composition.
14:00
Speaker A
So now we take these building blocks and we put them together.
14:05
Speaker A
So here's a simple example, so suppose you take A, you square it, you get B.
14:13
Speaker A
And then you take B and you square it and you get C.
14:19
Speaker A
So by the building blocks from the previous slide, we know that the gradient on this edge is going to be two times the input here, which is B.
14:30
Speaker A
And the gradient along this edge is going to be two times A.
14:38
Speaker A
Okay, so now using these two, we can apply the chain rule from calculus to compute the gradient of C with respect to A.
14:50
Speaker A
And this is going to be nothing more than just the product of those two quantities.
15:00
Speaker A
So in this case, we get 2B times 2A, and remember that B is equal to A squared.
15:11
Speaker A
Substitute that in, then you get 4A cubed, and remember C is A to the fourth.
15:19
Speaker A
So we can verify that this is indeed consistent with using the power rule.
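[Editor's sketch, not from the video: the product-along-edges computation for this example, checked against the power rule.]

```python
a = 1.5
b = a ** 2                    # first square node
c = b ** 2                    # second square node; c = a**4
edge_cb = 2 * b               # gradient of c with respect to b
edge_ba = 2 * a               # gradient of b with respect to a
grad_ca = edge_cb * edge_ba   # chain rule: product of gradients along the path
```

Substituting b = a² gives grad_ca = 4a³, matching the power rule applied to c = a⁴.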
15:26
Speaker A
Okay.
15:32
Speaker A
So in general, you can compute these gradients by simply taking the product of the gradients along the edges.
15:40
Speaker A
And that's going to be really useful on this slide.
15:44
Speaker A
Okay, so now let's turn to our first example.
15:50
Speaker A
Um, the hinge loss for linear classification.
15:54
Speaker A
We actually did this one before, but I just want to do it again through the lens of computation graphs.
16:02
Speaker A
So here's the loss function.
16:04
Speaker A
And given this loss function, I'm going to construct the computation graph and then compute the gradient of the loss with respect to W.
16:14
Speaker A
So working bottom up, we have the weight vector.
16:22
Speaker A
And we have the feature vector, and we take the dot product, that gives us a score.
16:30
Speaker A
We take the score, we take Y, and multiply them together, that gives us the margin.
16:37
Speaker A
One minus the margin, and you take the max of that and zero, and you get the loss.
16:44
Speaker A
So another nice thing about the computation graph is that it allows you to annotate these subexpressions and see how the computation proceeds and what the pieces are.
16:53
Speaker A
Okay, so now let us compute the gradient of the loss with respect to W.
17:00
Speaker A
And all I need to do here is compute the gradients along all these edges from the loss down to W.
17:06
Speaker A
So let's begin at the top here.
17:08
Speaker A
So the gradient here is, oh, wait.
17:10
Speaker A
Here is our cheat sheet.
17:12
Speaker A
Don't forget the cheat sheet.
17:14
Speaker A
So we just now pattern match, here's a max over two things.
17:20
Speaker A
Well, what's on this edge?
17:23
Speaker A
It's first thing greater than second thing.
17:26
Speaker A
Okay, so the gradient here is going to be first thing, which is one minus margin, greater than the second thing, which is zero.
17:32
Speaker A
And now what about this edge?
17:35
Speaker A
So here is minus one.
17:38
Speaker A
So we just have minus one.
17:40
Speaker A
What about this times?
17:42
Speaker A
So times is the second input.
17:45
Speaker A
And the second input here is Y.
17:47
Speaker A
And here's another times.
17:50
Speaker A
The second input is phi of X.
17:52
Speaker A
Okay, so this allows us to think about the gradients one piece at a time.
18:00
Speaker A
And all the little edges are just invocations of this cheat sheet here.
18:05
Speaker A
Okay.
18:07
Speaker A
Now we're ready to read off the gradient of the loss with respect to W.
18:13
Speaker A
And this is just going to be product of all the edges.
18:20
Speaker A
Okay, so we have first, we have one minus margin greater than zero.
18:25
Speaker A
So that's, um, I'm going to rewrite that as margin less than one.
18:30
Speaker A
Verify that's the same thing.
18:32
Speaker A
We have a minus sign here.
18:35
Speaker A
That's a minus sign here.
18:37
Speaker A
And then we have Y.
18:39
Speaker A
And then we have phi of X.
18:41
Speaker A
You multiply them all together, and that's the expression.
18:46
Speaker A
And you can verify that this is indeed the gradient of the loss function.
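[Editor's sketch, not from the video: the product of the four edge gradients just read off the graph, as code. The function name and shapes are my own.]

```python
import numpy as np

def hinge_loss_grad_w(w, phi, y):
    # gradient of max(1 - margin, 0) with respect to w, read off the graph:
    # 1[margin < 1] * (-1) * y * phi
    margin = y * (w @ phi)
    return -float(margin < 1) * y * phi

# at w = 0 the margin is 0 < 1, so the gradient is -y * phi
grad = hinge_loss_grad_w(np.zeros(2), np.array([1.0, 2.0]), 1.0)
```

When the margin is at least one, the indicator is zero and the whole gradient vanishes, just as the max node dictates.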
18:50
Speaker A
Okay.
18:52
Speaker A
And just as another note: for the gradient with respect to W, remember to really think about perturbations.
19:04
Speaker A
If you change W by a little bit, how much is the loss going to change?
19:08
Speaker A
And the change is going to be the product of all these kind of amplifications evaluated at a particular point.
19:16
Speaker A
All right.
19:18
Speaker A
So now let's do neural networks now.
19:20
Speaker A
Um, so this is not going to be really anything new.
19:26
Speaker A
It's just going to be a different example.
19:29
Speaker A
So I'm going to do a two-layer neural network.
19:34
Speaker A
And I'm going to again build this computation graph up.
19:39
Speaker A
So you have the feature vector.
19:42
Speaker A
You have the first layer weight matrix V, you take the product.
19:49
Speaker A
And then you stick this through the activation function.
19:55
Speaker A
And we're going to label that H, which is the hidden feature vector.
20:00
Speaker A
And now we're going to take the dot product of W and H, that gives you the score.
20:07
Speaker A
Um, and then the score minus Y is the residual.
20:12
Speaker A
And the residual squared is the loss.
20:15
Speaker A
Okay, another aside is that the computation graph really allows you to see visually this modularity.
20:22
Speaker A
So the part up here is just the squared loss.
20:30
Speaker A
And the part down here is anything, any way of computing a score.
20:40
Speaker A
So before we had a linear classifier, a linear predictor, now we have a two-layer neural network.
20:50
Speaker A
It could be a four-layer neural network.
20:53
Speaker A
Whose computation graph would just be a bit bigger.
20:57
Speaker A
Okay, so that's the computation graph.
21:00
Speaker A
Now, to perform stochastic gradient descent, we need to compute the gradient with respect to both W and V.
21:06
Speaker A
Okay, so let's compute the gradient with respect to W of the loss.
21:10
Speaker A
Um, and what I'm going to do is look at the edges and compute the gradients.
21:17
Speaker A
Okay, so here's our cheat sheet.
21:19
Speaker A
Um, so, okay, what goes on this edge?
21:23
Speaker A
What's the gradient of the squared?
21:25
Speaker A
This is just two times the input.
21:28
Speaker A
Which is in this case two times the residual.
21:30
Speaker A
Um, what about this edge?
21:32
Speaker A
So minus.
21:34
Speaker A
So this should just be a one on here.
21:38
Speaker A
And then what about this edge?
21:40
Speaker A
Uh, this is just going to be the second input.
21:43
Speaker A
Which you can find here.
21:45
Speaker A
So that is an H.
21:47
Speaker A
Okay, so now multiply all these things together.
21:51
Speaker A
And you get the gradient of the loss function with respect to W.
21:57
Speaker A
Okay.
22:00
Speaker A
All right, um, so one thing you can kind of double check is that we did do the gradient of the squared loss for linear predictors.
22:09
Speaker A
And it was also two times the residual times the feature vector; now we just have H instead, which is kind of a stand-in for the feature vector as far as W is concerned.
22:24
Speaker A
So that's kind of a nice sanity check.
22:26
Speaker A
All right.
22:28
Speaker A
So now let's do this more complicated one.
22:30
Speaker A
So this we want to compute the gradient with respect to V of loss of all the arguments.
22:36
Speaker A
And this equals.
22:38
Speaker A
Um, let's fill in all the edges.
22:40
Speaker A
Um, so first of all, notice that these two edges are actually in common with this path.
22:45
Speaker A
So we can go ahead and write them down.
22:51
Speaker A
One cool thing about computation graphs is that they let you see the shared structure: the gradients themselves also have common subexpressions.
22:58
Speaker A
Okay.
23:00
Speaker A
So now we need to do more work here.
23:02
Speaker A
So the gradient on this edge is going to be the other input, which is W.
23:07
Speaker A
Um, this is a sigma, so the gradient is going to be sigma of the input times one minus sigma of the input.
23:13
Speaker A
So this is going to just be H times one minus H.
23:18
Speaker A
Um, this hollow circle here represents the element-wise product of vectors.
23:24
Speaker A
So you just take two vectors and you multiply the elements together.
23:27
Speaker A
Um, and this is because this function is applied element-wise.
23:31
Speaker A
And then what about this final edge?
23:34
Speaker A
This is just going to be phi of X, which is just the other input.
23:37
Speaker A
And now we can just multiply the rest of these things together.
23:40
Speaker A
So we have W times H times one minus H times phi of X transpose.
23:56
Speaker A
So there's a slight kind of annoyance here.
24:00
Speaker A
Because here we have V times phi of X, whereas before there was no transpose, because we just had W dot something.
24:12
Speaker A
And W dot is the same as W transpose.
24:15
Speaker A
Okay, so.
24:17
Speaker A
Um, but the high level is that.
24:20
Speaker A
The product of all of these green pieces yields the gradient of the loss with respect to V.
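[Editor's sketch, not from the video: both gradients for this two-layer network, written out and checked against a finite difference. The logistic activation and all names are my own assumptions.]

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(V, w, phi, y):
    h = sigmoid(V @ phi)       # hidden vector
    return (w @ h - y) ** 2    # squared loss

def grads(V, w, phi, y):
    h = sigmoid(V @ phi)
    residual = w @ h - y
    grad_w = 2 * residual * h                               # edges: square, minus, dot
    grad_V = np.outer(2 * residual * w * h * (1 - h), phi)  # plus sigma (element-wise) and phi
    return grad_V, grad_w

rng = np.random.default_rng(0)
V = rng.normal(size=(2, 3))
w = rng.normal(size=2)
phi = rng.normal(size=3)
gV, gw = grads(V, w, phi, 0.5)
```

Note how the `2 * residual` factor is shared between the two gradients, which is exactly the common-subexpression structure visible in the graph.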
24:27
Speaker A
Okay.
24:29
Speaker A
All right, so that finishes up this example.
24:32
Speaker A
So far, we have mainly used this graphical representation to visualize the computation of function values and gradients.
24:35
Speaker A
Um, but, you know, the promise of backpropagation is that we didn't have to do any of that at all.
24:40
Speaker A
I just did that to kind of illustrate the inner workings of, um, gradient computations on the computation graph.
24:46
Speaker A
But now we're going to introduce the backpropagation algorithm, which is a general procedure for computing these gradients, so we never have to worry about it.
24:54
Speaker A
I'm going to do this backpropagation for a simple example.
25:00
Speaker A
Which is just the squared loss and linear regression.
25:03
Speaker A
And one note is that previously we've worked with, uh, you know, symbolic expressions.
25:10
Speaker A
Um, but the actual algorithm is going to operate on numbers usually.
25:14
Speaker A
So what I'm going to do is work with a concrete example and walk through the backpropagation algorithm with this example.
25:20
Speaker A
So the backpropagation algorithm includes two steps, a forward step and a backward step.
25:26
Speaker A
So in the forward step, what we're going to do is we're going to compute a bunch of forward values from the leaves to the root.
25:33
Speaker A
And each forward value is simply the value of the subexpression rooted at node i.
25:39
Speaker A
The value could be a scalar, a vector, or a matrix.
25:42
Speaker A
So let's walk through this example here.
25:44
Speaker A
Okay, so at the leaves, we have, um, W, which is 3, 1.
25:50
Speaker A
And we have the feature vector 1, 2.
25:53
Speaker A
So now if you take, uh, these two quantities and you multiply, take the dot product, you get 3 + 2, which is 5.
26:00
Speaker A
Um, and now you take the score 5, and you take Y.
26:04
Speaker A
You subtract them.
26:06
Speaker A
And you get the residual, which is 3.
26:10
Speaker A
Notice that the forward value of this node is 5, and the forward value of this node is 3.
26:15
Speaker A
And now finally, you, uh, square this.
26:20
Speaker A
And the value of the squared is, uh, 3 squared, which is 9.
26:24
Speaker A
Forward value at this node is 9.
26:26
Speaker A
Okay, so now we're done with the forward phase.
26:30
Speaker A
All we've done is evaluated the loss.
26:34
Speaker A
But importantly, we have also remembered all the values along the way, which will come in handy.
26:40
Speaker A
So now, in the backward step, we're going to compute a backward value g_i for every node i.
26:46
Speaker A
And this is going to be the gradient of the loss with respect to the value at that node.
26:53
Speaker A
If that node changes value, how does the loss change?
26:56
Speaker A
So let's do this first example.
26:59
Speaker A
So the base case, the gradient of the loss with respect to loss is one.
27:03
Speaker A
Um, and now we, uh, look at the gradient along this edge.
27:05
Speaker A
We did this before, it's just two times the residual.
27:08
Speaker A
Okay, so now.
27:10
Speaker A
Um, we need to compute the backward value of this node.
27:14
Speaker A
To do that, we're going to take the backward value of the parent.
27:20
Speaker A
And multiply whatever's on this edge.
27:24
Speaker A
What's on this edge is two times the residual.
27:29
Speaker A
The residual is three.
27:31
Speaker A
So it's two times three, which is six.
27:34
Speaker A
And so one times six is six.
27:37
Speaker A
Notice that in computing this backward value, I'm using the intermediate computations from the forward pass.
27:42
Speaker A
Okay, so let's continue.
27:44
Speaker A
So the gradient on this edge is one.
27:47
Speaker A
So backward value here is six, which is the parent backward value, times what's on this edge, which is one.
27:52
Speaker A
That gives us six.
27:54
Speaker A
And then the backward value of this node is six times what's on this edge, which is this other input 1, 2.
28:00
Speaker A
And that gives us 6, 12.
28:02
Speaker A
So to conclude, the backpropagation algorithm takes these concrete values, this expression, and produces the gradient of the loss with respect to W evaluated at these concrete values.
28:10
Speaker A
And that's 6, 12.
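[Editor's sketch, not from the video: the forward and backward passes on this concrete example, line by line. The target y = 2 is inferred from the stated score of 5 and residual of 3.]

```python
import numpy as np

w = np.array([3.0, 1.0])
phi = np.array([1.0, 2.0])
y = 2.0

# forward pass: compute and remember every intermediate value
score = w @ phi          # 3*1 + 1*2 = 5
residual = score - y     # 5 - 2 = 3
loss = residual ** 2     # 9

# backward pass: each backward value is the parent's backward value
# times the gradient on the connecting edge
g_loss = 1.0                        # base case: dloss/dloss = 1
g_residual = g_loss * 2 * residual  # square edge: 2 * residual = 6
g_score = g_residual * 1.0          # subtraction edge: 1
g_w = g_score * phi                 # dot-product edge: the other input, phi
```

The backward pass reuses `residual` and `phi` remembered from the forward pass, and ends with the gradient [6, 12] with respect to W.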
28:12
Speaker A
Okay.
28:14
Speaker A
And the backpropagation algorithm, remember, works for any computation graph.
28:20
Speaker A
Four-layer neural networks, uh, much more complicated models.
28:24
Speaker A
But this is just a simple example to show you the dynamics of forward pass and backward pass.
28:34
Speaker A
Okay, so now we have the backpropagation algorithm.
28:38
Speaker A
We compute gradients, we stick these gradients into stochastic gradient descent.
28:43
Speaker A
And then we just run stochastic gradient descent, and then we get some weights.
28:49
Speaker A
Um, so now one question is, uh, what do we get?
28:53
Speaker A
Um, so we wanted to optimize the training loss using stochastic gradient descent.
28:59
Speaker A
Um, but does running stochastic gradient descent actually minimize the training loss?
29:06
Speaker A
This is a little bit of a delicate question here.
29:10
Speaker A
So for linear predictors, it turns out that the training loss for a convex loss is going to be a convex function.
29:20
Speaker A
Which means that it is going to have a single global minimum.
29:26
Speaker A
Which means that if you start at some point and then you just follow your nose by running gradient descent with the appropriate step size, it's going to converge to the global optimum.
29:34
Speaker A
But for neural networks.
29:36
Speaker A
The training loss is non-convex.
29:40
Speaker A
And which means that, uh, there's no guarantees at all that you're going to converge to the global minimum.
29:47
Speaker A
If you're lucky, you can converge to a local minimum.
29:50
Speaker A
Um, so optimization of neural networks is in principle hard.
29:54
Speaker A
But of course, people do it anyway, and you, uh, actually get some good results.
30:00
Speaker A
So there's a gap between theory and practice, which is, uh, not quite understood yet.
30:06
Speaker A
But in practice.
30:08
Speaker A
Um, getting neural networks to train properly is a little bit of an art.
30:13
Speaker A
I think of it as kind of like driving stick.
30:15
Speaker A
There's just a lot of, uh, degrees of freedom, you can stall and get stuck.
30:20
Speaker A
But if you know what you're doing, you can actually get a lot of good results.
30:25
Speaker A
Okay.
30:27
Speaker A
So here are some examples.
30:30
Speaker A
Just to give you a flavor of what needs to be done.
30:33
Speaker A
Okay, so here is a two-layer neural network.
30:36
Speaker A
And here is the loss function.
30:38
Speaker A
Um, the first is initialization matters.
30:42
Speaker A
So if you have a convex function, wherever you initialize, you run it for long enough, you converge to the global optimum.
30:51
Speaker A
Um, for a non-convex function, if you initialize here, you might get stuck up here.
30:59
Speaker A
If you initialize over here, you'll get stuck here and so on.
31:02
Speaker A
So generally, you have to be a little bit careful about how you initialize.
31:08
Speaker A
Um, you can't initialize at zero.
31:14
Speaker A
It turns out that all the rows of your weight matrix are going to be identical, which is not very useful.
31:23
Speaker A
Um, so you typically initialize around zero with some amount of random noise.
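[Editor's sketch, not from the video: the usual recipe in code. The 0.01 scale and the dimensions are arbitrary illustrative choices.]

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden, n_input = 16, 8
# small random values around zero break the symmetry between hidden units;
# initializing V to exactly zero would make all rows behave identically
V = 0.01 * rng.standard_normal((n_hidden, n_input))
```

The random noise is what keeps the rows of the weight matrix from staying identical during training.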
31:30
Speaker A
Or you can use pre-training to initialize your network as well.
31:35
Speaker A
Um, which we won't cover, uh, right now.
31:38
Speaker A
Um, another thing that people do is called overparameterization.
31:43
Speaker A
So here, this corresponds to adding more hidden units than you kind of really need.
31:50
Speaker A
This corresponds to having a lot of rows of this, uh, of this matrix.
31:56
Speaker A
And the idea here is that the more hidden units you have, the more chances you have of the network learning something reasonable from your data.
32:05
Speaker A
So some of the units might die off and not be very useful.
32:09
Speaker A
But, you know, maybe like some fraction of them will actually be useful.
32:14
Speaker A
And the final thing that people do is using adaptive step sizes.
32:20
Speaker A
Which is generally an extension of stochastic gradient descent.
32:26
Speaker A
Remember, in stochastic gradient descent you have a single step size, eta, which controls how fast you move.
32:31
Speaker A
With methods like AdaGrad or Adam, you actually get a per-feature, or per-parameter, step size.
32:38
Speaker A
So for every weight, you get a number which dictates how fast you should be moving in that direction.
32:44
Speaker A
And this generally leads to better results.
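[Editor's sketch, not from the video: a simplified AdaGrad-style update, to show what "per-parameter step size" means. Real implementations differ in details.]

```python
import numpy as np

def adagrad_update(w, grad, cache, eta=0.1, eps=1e-8):
    # each parameter accumulates its own history of squared gradients,
    # which shrinks the effective step size in frequently-updated directions
    cache = cache + grad ** 2
    w = w - eta * grad / (np.sqrt(cache) + eps)
    return w, cache

w, cache = adagrad_update(np.array([1.0]), np.array([2.0]), np.zeros(1))
```

Each entry of `cache` divides its own coordinate of the gradient, so every weight moves at its own speed.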
32:47
Speaker A
Okay, so one maybe high-level thing to keep in mind is don't let your gradients vanish or explode.
32:53
Speaker A
So if I explain this, it will become kind of clear.
32:56
Speaker A
So when you run gradient descent or stochastic gradient descent, if your gradients vanish, which means they become too small or close to zero, then you'll get stuck and won't make any progress.
33:04
Speaker A
But if your gradients become too large, then they'll just explode, and you will oscillate and might diverge.
33:10
Speaker A
So this involves careful initialization and setting of the step sizes, and even the design of the neural network architecture.
33:20
Speaker A
All of this is about making sure that your gradients don't vanish or explode.
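[Editor's sketch, not from the video: since the gradient along a deep chain is a product of edge gradients, factors below one shrink exponentially and factors above one blow up. The 0.25 is the logistic function's maximum gradient.]

```python
def chain_gradient(edge_grad, depth):
    # multiply the same edge gradient along a chain of `depth` nodes
    g = 1.0
    for _ in range(depth):
        g *= edge_grad
    return g

vanishing = chain_gradient(0.25, 20)   # sigmoid-like edges: shrinks toward zero
exploding = chain_gradient(3.0, 20)    # large edges: blows up
```

Twenty layers is enough to turn a modest per-edge factor into a gradient that is either numerically negligible or enormous.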
33:26
Speaker A
Okay, so that's all the guidance I'll provide you.
33:29
Speaker A
There's a lot more to be said on this topic, we're just kind of giving you a high-level overview.
33:35
Speaker A
Okay, so let's summarize now.
33:37
Speaker A
So the most important topic of this module is that of a computation graph.
33:42
Speaker A
This allows you to represent arbitrary mathematical expressions, and these expressions are built out of these simple building blocks.
33:49
Speaker A
And I hope that the idea of computation graphs will allow you to get a better visual understanding of what your mathematical expressions are doing.
33:56
Speaker A
And also what gradient computations are about.
33:58
Speaker A
And then we had the backpropagation algorithm, which is a general-purpose algorithm that leverages the computation graph to compute the gradients.
34:00
Speaker A
So notice that, uh, we've done this in the context of neural networks, but I stress that computation graphs and backpropagation are fully general.
34:09
Speaker A
It allows you to handle many, many functions, and this generality is one of the reasons it lets you iterate very quickly on new types of models and loss functions.
34:18
Speaker A
And it opens up a new paradigm for model development, differentiable programming, which we'll talk about in a future module.
34:24
Speaker A
All right, that's it.
34:25
Speaker A
Thanks.
Topics: backpropagation, computation graphs, neural networks, gradient computation, chain rule, stochastic gradient descent, hinge loss, deep learning, TensorFlow, PyTorch

Frequently Asked Questions

What is the purpose of using computation graphs in backpropagation?

Computation graphs represent mathematical expressions as directed acyclic graphs, allowing automatic and modular gradient computation through backpropagation, simplifying and clarifying the differentiation process.

How does backpropagation relate to training neural networks?

Backpropagation computes the gradients of the loss function with respect to network parameters, enabling optimization algorithms like stochastic gradient descent to update weights and train neural networks effectively.

What are some basic operations and their gradients used in computation graphs?

Basic operations include addition (gradients of 1), multiplication (gradients depend on the other operand), max (gradient depends on which input is larger), and logistic functions (gradient is sigma times one minus sigma), which serve as building blocks for complex functions.
