YouTube Video — Transcript

Introduction to neural networks, problem decomposition, and activation functions like ReLU for non-linear prediction.

Key Takeaways

  • Neural networks enable non-linear prediction by decomposing complex problems into simpler subproblems.
  • Linear classifiers cannot solve problems like XOR, but neural networks can by combining multiple linear tests.
  • Activation functions like ReLU are essential to avoid vanishing gradients and enable effective training.
  • Matrix and vector operations provide a compact way to represent neural network computations.
  • Replacing step functions with smooth or piecewise linear activations facilitates gradient-based optimization.

Summary

  • The video introduces neural networks as a method for constructing non-linear predictors through problem decomposition.
  • It contrasts linear predictors with non-linear predictors and explains the limitations of linear classifiers using the XOR problem.
  • A motivating example involving predicting car collisions based on their positions is used to illustrate the concept.
  • The problem is decomposed into subproblems, each tested with linear functions, and combined to form the final prediction.
  • Vector and matrix notation is introduced to represent the hypothesis class and combine subproblems.
  • The video discusses the challenge of optimizing zero-one loss due to zero gradients in step functions.
  • Activation functions are introduced to solve the gradient problem, starting with the logistic function.
  • The ReLU activation function is presented as a superior alternative due to its non-vanishing gradients and simplicity.
  • The concept of replacing threshold functions with activation functions to enable gradient-based learning is emphasized.
  • The video prepares to define two-layer neural networks using the introduced concepts.

Full Transcript

00:05
Speaker A
Hi, in this module I'm going to talk about neural networks, a way to construct non-linear predictors via problem decomposition.
00:14
Speaker A
So, when we started, we talked about linear predictors, and they were linear in two ways: first, the feature vector was a linear function of x, and second, the way that the feature vector interacted with the prediction was also linear. This gave rise to lines.
00:31
Speaker A
Next, we talked about non-linear predictors, keeping the same linear machinery but just playing around with the feature vector, and by adding terms like x squared, you could get quadratic predictors and so on.
00:47
Speaker A
So now what we're going to do is define neural networks, where we leave phi of x, the feature vector, alone, and play with the way that the feature vector results in the prediction, and that allows us to get all sorts of fancy stuff.
01:45
Speaker A
So, let me begin with a motivating example.
01:49
Speaker A
So, suppose you're trying to predict whether two cars are going to collide or not. The input is the positions of two oncoming cars, x = (x1, x2), so x1 is the position of car one and x2 is the position of car two, and what you'd like to output is whether y equals one, meaning it's safe, or y equals minus one, meaning they collide.
02:13
Speaker A
And what is unknown to the learner is the true function: we're going to say that the cars are safe if they're sufficiently far apart, so if the distance between them is at least one, then we're safe.
02:26
Speaker A
We can visualize this true predictor as follows.
02:34
Speaker A
So, here is X1 and X2.
02:40
Speaker A
And what is going to happen is you're going to draw these two lines here, and any point that is over here, or over here, is going to be labeled as plus, which is safe, and anything that's in between is going to be labeled as minus, meaning they'll collide.
03:40
Speaker A
Okay, so let's do some examples here. Suppose we have the point (0, 2), which is this point here; this is safe.
03:51
Speaker A
So y equals one. (2, 0) is also safe, and (0, 0) is here, which is not safe, so y equals minus one, and (2, 2) is also not safe.
04:08
Speaker A
Okay, so as an aside, this configuration of points is what was historically known as the XOR problem, and it was shown that pure linear classifiers could not solve this problem: you couldn't draw a line to separate the blue and the orange points. But nonetheless, we're going to show how neural networks can be used to solve this.
05:11
Speaker A
Okay, so the key intuition is the idea of problem decomposition.
05:20
Speaker A
So, instead of solving the problem all at once, we're going to decompose it into two subproblems. First, we're going to test if car one is to the far right of car two, and in the picture here, that corresponds to simply this region over here, which we're going to call h1.
05:40
Speaker A
So, h1 is whether x1 minus x2 is greater than or equal to one.
05:53
Speaker A
And then we're going to find another subproblem, testing whether car two is to the far right of car one, which is called H2.
06:10
Speaker A
That corresponds to this region over here, and then we're going to predict safe if at least one of them is true.
06:30
Speaker A
So, we just add the two here, each of which is either one or zero, and if at least one of them is one, then we're going to return plus one, and by convention, we're going to assume that the sign of zero is minus one.
06:55
Speaker A
Okay, so here are some examples. Suppose we have (0, 2) again, so this point: h1 says, nope, that's not on my side.
07:12
Speaker A
H2 says, yep, that's on my side, and at least one is enough to make the prediction plus one.
07:22
Speaker A
If you take (2, 0), that's this point: h1 says, yep, h2 says, nope, and the prediction is plus one.
07:35
Speaker A
Because all it takes is one. (0, 0) is this point.
07:44
Speaker A
Both of them say no, and it's minus one.
07:53
Speaker A
And same with two two, both of them say no.
08:00
Speaker A
It's minus one.
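The decomposed predictor walked through above can be sketched in Python (the function and variable names here are my own, not from the video):

```python
def h1(x):
    # Subproblem 1: is car one to the far right of car two (x1 - x2 >= 1)?
    return 1 if x[0] - x[1] >= 1 else 0

def h2(x):
    # Subproblem 2: is car two to the far right of car one (x2 - x1 >= 1)?
    return 1 if x[1] - x[0] >= 1 else 0

def f(x):
    # Predict safe (+1) if at least one subproblem fires;
    # by convention, sign(0) is -1 (collide).
    return 1 if h1(x) + h2(x) > 0 else -1

# The four example points from the video:
for x in [(0, 2), (2, 0), (0, 0), (2, 2)]:
    print(x, f(x))
```

Running this reproduces the four labels worked out by hand: the two far-apart configurations come out plus one, and the two close configurations come out minus one.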
08:05
Speaker A
Okay, so far we've just defined the true function f; of course, we don't know f.
08:24
Speaker A
So, what we're going to do is try to move gradually to defining a hypothesis class.
08:35
Speaker A
And the first next step is to rewrite F using vector notation.
08:50
Speaker A
So, here are the two intermediate subproblems.
08:57
Speaker A
And the predictor is f of x equals the sign of their sum.
09:01
Speaker A
And what we're going to do is to write this in terms of a dot product between a weight vector and a feature vector.
09:20
Speaker A
So, here's the feature vector: (1, x1, x2).
09:30
Speaker A
And then we're going to define a weight vector, which is (minus one, one, minus one).
09:40
Speaker A
And if you look at the dot product, it's going to be minus one, plus x1, minus x2.
09:55
Speaker A
And if that quantity is greater than zero.
10:00
Speaker A
Then we're going to return one.
10:04
Speaker A
Otherwise, return zero.
10:06
Speaker A
And you can verify that this is exactly a rewrite of this expression.
10:10
Speaker A
And similarly, if you reverse the roles of x1 and x2.
10:15
Speaker A
Then you can rewrite h2 in vector notation as well.
10:20
Speaker A
And now what we're going to do is we're going to just combine H1 and H2 by stacking them.
10:29
Speaker A
So, we're going to define this matrix, which is just the two weight vectors here stacked up.
10:40
Speaker A
So, we have two rows here.
10:45
Speaker A
And we're going to multiply this matrix by the feature vector.
10:55
Speaker A
So, remember, left multiplication by a matrix is just taking the dot product with each of the rows of that matrix.
11:05
Speaker A
And now this produces a two-dimensional vector.
11:13
Speaker A
And we're going to test whether each component is greater than or equal to zero.
11:20
Speaker A
So, in the end, H of X is going to be a two-dimensional vector.
11:30
Speaker A
Okay, and now given that, we can rewrite the predictor as simply the sign of the dot product between (1, 1) and h of x, which is simply the sum of the two components.
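This vector-and-matrix rewrite can be sketched with numpy, filling in the weight rows and the sign(0) = -1 convention from the discussion above (names are my own):

```python
import numpy as np

def phi(x):
    # Feature vector phi(x) = (1, x1, x2).
    return np.array([1.0, x[0], x[1]])

# Each row of V is one subproblem's weight vector:
# row 1 tests -1 + x1 - x2 >= 0, row 2 tests -1 - x1 + x2 >= 0.
V = np.array([[-1.0,  1.0, -1.0],
              [-1.0, -1.0,  1.0]])

def f(x):
    # Left-multiplying by V dots phi(x) with each row, giving a
    # two-dimensional vector; thresholding gives h(x) in {0, 1}^2.
    h = (V @ phi(x) >= 0).astype(float)
    # Dot with (1, 1) to sum the components; sign(0) = -1 by convention.
    score = np.dot(np.array([1.0, 1.0]), h)
    return 1 if score > 0 else -1

print([f(x) for x in [(0, 2), (2, 0), (0, 0), (2, 2)]])
```

This agrees with the earlier hand-worked examples: plus one for the two safe points and minus one for the two colliding ones.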
11:47
Speaker A
So, now we've written f of x, the true function, in terms of a bunch of matrix and vector multiplies.
12:00
Speaker A
Now, everything in red here are just numbers, and so far we've specified what they are.
12:11
Speaker A
But in general, we're not going to know them.
12:17
Speaker A
And we're going to have to learn them from data.
12:22
Speaker A
But before we do that, we're going to preemptively see one problem that's going to come up.
12:29
Speaker A
And this problem we saw before when we tried to optimize the zero one loss.
12:34
Speaker A
So, let's look at the gradient of H1 of X with respect to V1.
12:40
Speaker A
Um, we can plot this as follows.
12:43
Speaker A
So, here is, um, the score Z.
12:50
Speaker A
Um, which is the dot product.
12:54
Speaker A
And, um, this is H1.
13:00
Speaker A
And this is just a step function.
13:06
Speaker A
So, the step function or threshold function is just whether Z is greater than zero, it's one over here and zero over here.
13:22
Speaker A
Okay, so now if you try to gradient descend on this.
13:30
Speaker A
Uh, you're just going to get stuck because the gradients are going to be zero basically everywhere.
13:39
Speaker A
So, the solution is to replace this threshold function with a more general activation function, sigma, which has more friendly gradients.
13:52
Speaker A
So, classically, and by classic, I mean like in the 80s and 90s, people used the logistic function.
14:03
Speaker A
Uh, as activation function.
14:08
Speaker A
Which looks like this.
14:14
Speaker A
And this is just a kind of a smoothed out version of the threshold function.
14:26
Speaker A
And in particular, its gradient is zero nowhere.
14:36
Speaker A
So, that's great.
14:40
Speaker A
So, with the gradient, you can always move forward.
14:43
Speaker A
There is a caveat here, which is that.
14:50
Speaker A
If you look out here, this function is pretty flat, which means that the gradient is actually approaching zero, which means that if you're out here, then you can get stuck, or at least make very slow progress.
15:15
Speaker A
So, in 2012, the ReLU activation was invented.
15:23
Speaker A
Which just takes the max of z and zero.
15:27
Speaker A
So, that looks like this.
15:32
Speaker A
So, if the input to the ReLU is less than zero, I'm just going to clip it to zero, and otherwise, I'm going to leave it alone.
15:49
Speaker A
So, now this function actually has, um, nice gradients over here.
15:59
Speaker A
So, on this side the gradient never vanishes; it's positive and bounded away from zero.
16:08
Speaker A
Um, although over here it is zero.
16:13
Speaker A
So, it turns out empirically, the ReLU activation function works really well.
16:24
Speaker A
It's simpler in a lot of ways.
16:30
Speaker A
So, it's kind of become the activation function of choice here.
16:38
Speaker A
So, the solution here is to replace the threshold step function with an activation function, choose your favorite (I would choose the ReLU), and now you have something that has non-vanishing gradients.
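The three activations discussed here, and the gradient behavior that motivates ReLU, can be sketched in numpy (the formulas are standard; the sample points are my own):

```python
import numpy as np

def step(z):
    # Threshold function: gradient is zero almost everywhere.
    return (z >= 0).astype(float)

def logistic(z):
    # Smooth step: never exactly flat, but saturates in the tails.
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # max(z, 0): gradient is exactly 1 for z > 0 and 0 for z < 0.
    return np.maximum(z, 0.0)

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])

# The logistic's gradient is sigma(z) * (1 - sigma(z)): nonzero
# everywhere, but vanishingly small out in the flat regions.
logistic_grad = logistic(z) * (1.0 - logistic(z))

# The ReLU's gradient stays at 1 on the whole positive side.
relu_grad = (z > 0).astype(float)

print(logistic_grad.round(5))
print(relu_grad)
```

Printing the two gradients shows the contrast: the logistic gradient peaks at 0.25 near zero and nearly vanishes at |z| = 10, while the ReLU gradient is a constant 1 for all positive inputs.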
17:01
Speaker A
So, let's now define two-layer neural networks using the machinery that we've set up so far.
17:09
Speaker A
Okay, so we're going to define some intermediate subproblems, so we start with a feature vector, phi of x.
17:27
Speaker A
Now, I'm going to represent vectors and matrices using these dots, so this is a six-dimensional feature vector, but in general, it's D-dimensional.
17:39
Speaker A
Next, I'm going to multiply by this weight matrix, which is going to be three by six, but in general a K by D matrix, and that generates a three-dimensional, or K-dimensional, vector. I'm going to send it through a non-linearity, an activation function like the ReLU or the logistic, and I'm going to get a vector, which I'm going to call h of x.
18:36
Speaker A
Okay, so now given this h of x, I can do prediction by taking h of x and simply dot-producting it with a weight vector w, and if I take the sign, that gives me the prediction of that neural network.
18:55
Speaker A
So, one thing that's kind of interesting here is that if you look at this equation, it pretty much looks like the equation for a linear classifier.
19:07
Speaker A
The only difference is now we have H of X.
19:10
Speaker A
Instead of phi of x.
19:12
Speaker A
So, one way to interpret what neural networks are doing is that instead of using the original feature vector, we've learned a smarter representation, and at the end of the day, we're still doing a linear classification on top of that feature representation, so people often think about neural networks as doing feature learning for precisely this reason.
20:11
Speaker A
And finally, now we can define the hypothesis class F as the set of all predictors, parameterized by a weight matrix V and a weight vector w, defined up here.
20:28
Speaker A
And we can let the weight matrix be any arbitrary K by D matrix, and we let w be any D-dimensional vector — sorry, that D should actually be a K; I will fix that.
20:46
Speaker A
Okay, so we have defined the hypothesis class that corresponds to two layer neural networks for classification.
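A minimal sketch of one such two-layer predictor, assuming a ReLU activation and reusing the hand-set car-collision weights purely for illustration (in practice V and w would be learned, as the video notes):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def two_layer_predict(x, V, w, phi):
    # Hidden representation: h(x) = sigma(V phi(x)), a K-dimensional vector.
    h = relu(V @ phi(x))
    # Final prediction: a linear classifier on top of the learned
    # feature representation h(x); sign(0) = -1 by convention.
    score = w @ h
    return 1 if score > 0 else -1

# Hand-set weights, reusing the car example: phi(x) = (1, x1, x2).
phi = lambda x: np.array([1.0, x[0], x[1]])
V = np.array([[-1.0,  1.0, -1.0],
              [-1.0, -1.0,  1.0]])
w = np.array([1.0, 1.0])

print([two_layer_predict(x, V, w, phi)
       for x in [(0, 2), (2, 0), (0, 0), (2, 2)]])
```

With the ReLU in place of the hard threshold, the same four example points still come out plus, plus, minus, minus, but now the predictor has usable gradients with respect to V and w.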
20:55
Speaker A
Now, we can kind of push this farther.
21:00
Speaker A
We can go and talk about deep neural networks.
21:03
Speaker A
So, remember, going back to single layer neural networks, aka linear predictors, we see that we take the feature vector.
21:19
Speaker A
We take the dot product with a weight vector, and you get the score, which can be used to drive prediction directly in regression.
21:30
Speaker A
Or take the sign to get classification predictions.
21:37
Speaker A
Um, for two layer neural networks.
21:41
Speaker A
We take phi of x, and we multiply by layer one's weight matrix.
21:50
Speaker A
Take an element-wise activation function, and then take the dot product with a weight vector.
21:58
Speaker A
You get the score.
22:02
Speaker A
And now the key thing is that this piece, apply V and then apply sigma, you can just iterate over and over again.
22:27
Speaker A
So, here's a three layer neural network.
22:30
Speaker A
Take phi of x, which is the feature vector, and multiply by some matrix V1.
22:40
Speaker A
Take a non-linearity.
22:45
Speaker A
Multiply by another matrix.
22:52
Speaker A
Take a non-linearity.
22:57
Speaker A
And then finally, you get some, uh, vector that you take the dot product.
23:06
Speaker A
Um, with W, and you get the score.
23:11
Speaker A
Which can be used to power your predictions.
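The iterated multiply-then-activate pattern just described can be sketched as follows (the shapes and random weights here are purely illustrative; in practice every V and w would be learned from data, and bias terms are omitted as in the video):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def deep_score(x, Vs, w):
    # Repeat the "multiply by V, apply sigma" step once per layer,
    # then take a final dot product with w to get the score.
    h = np.asarray(x, dtype=float)
    for V in Vs:
        h = relu(V @ h)
    return w @ h

# A tiny three-layer example with made-up shapes.
rng = np.random.default_rng(0)
Vs = [rng.standard_normal((4, 3)),   # layer 1: 3 -> 4
      rng.standard_normal((4, 4))]   # layer 2: 4 -> 4
w = rng.standard_normal(4)           # final weight vector

score = deep_score([1.0, 2.0, 3.0], Vs, w)
print(score)  # a scalar score; its sign gives the classification
```

Adding more layers is just appending more matrices to `Vs`, which is exactly the iteration the transcript describes.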
23:17
Speaker A
So, one small note is that I've left out all the bias terms, uh, for notational simplicity.
23:26
Speaker A
In practice, you would have, uh, you know, bias terms.
23:32
Speaker A
Okay, and you can imagine just iterating this over and over again.
23:39
Speaker A
But, you know, what is this doing? It kind of looks like a little bit of abstract nonsense.
23:46
Speaker A
You just multiply by matrices and send them through non-linearities.
23:51
Speaker A
And you hope something good happens.
23:54
Speaker A
And, you know, that's not completely false, but there are some intuitions which we can derive.
24:05
Speaker A
So, one intuition is thinking about layers as representing multiple levels of abstraction.
24:12
Speaker A
So, in computer vision, let's say the input is, uh, an image.
24:17
Speaker A
So, you can think about the first layer as computing some notion of edges; the second layer, when you multiply by a matrix and take a non-linearity, computes some notion of object parts; and then at the third layer, you multiply by a matrix and apply some non-linearity.
25:03
Speaker A
You get some notion of objects.
25:10
Speaker A
Now, this is just a story, and we haven't talked at all about learning, so this is definitely not true for all neural networks. But it turns out that when you actually fit a network to data and you visualize what the weights are, you do get some interpretable results, which is kind of interesting and somewhat surprising.
25:31
Speaker A
So, now there's a question of, uh, depth.
25:36
Speaker A
So, the fact that you take a feature vector and you apply, um, some sort of transformation again and again and again to get a score.
25:56
Speaker A
So, why, why do we do this?
25:59
Speaker A
So, one intuition that we talked about already is that this represents different levels of abstraction, from low-level pixels to high-level object parts and objects. Another way to think about this is that it performs multiple steps of computation.
26:49
Speaker A
Just like in a classic program, if you get more steps of computation, it gives you more expressive power, you can do more things.
27:01
Speaker A
You can think about each of these operations as simply doing some compute.
27:13
Speaker A
Now, it's maybe a kind of foreign type of compute, because you're multiplying by a crazy unknown matrix, but the way we can think about this is that you set up this computation.
27:26
Speaker A
And the learning algorithm is going to figure out what kind of computation makes sense for making the best prediction.
27:36
Speaker A
Um, another piece of intuition is that empirically, it just happens to work really well.
27:45
Speaker A
Which is not to be understated.
27:49
Speaker A
Um, if you're actually looking for a more theoretical reason.
27:56
Speaker A
The jury's kind of still out on this. You can have intuitions about how deeper logical circuits can capture more than shallower ones, but then the relationship between circuits and neural networks requires a little bit of massaging, so this is still a pretty active area of research.
28:43
Speaker A
So, to summarize, we started out with a very toy problem, the XOR problem, testing whether two cars are going to collide or not.
28:58
Speaker A
And we used it to motivate problem decomposition and eventually define neural networks. We saw that intuitively, neural networks allow you to define non-linear predictors, but in a particular way.
29:19
Speaker A
And the way is to decompose the original problem into intermediate subproblems.
29:30
Speaker A
Testing if the car is to the far right or the far left.
29:35
Speaker A
And then combining them, you know, over time.
29:39
Speaker A
And you can take this idea further and iterate on this decomposition multiple times, giving rise to multiple levels of abstraction and multiple steps of computation. The hypothesis class is now larger: it contains all predictors where the weights of all the layers can vary freely.
30:12
Speaker A
Next up, we're going to show you how to actually learn the weights of a neural network. That is the end.
Topics: neural networks, non-linear predictors, problem decomposition, XOR problem, activation functions, ReLU, logistic function, gradient descent, machine learning, two-layer neural network

Frequently Asked Questions

Why can't pure linear classifiers solve the XOR problem?

Pure linear classifiers cannot solve the XOR problem because the classes are not linearly separable; no single line can separate the two classes in the input space.

What is the role of activation functions in neural networks?

Activation functions replace the step function to provide non-zero gradients, enabling gradient-based optimization methods like gradient descent to effectively train neural networks.

Why is ReLU preferred over the logistic function as an activation function?

ReLU is preferred because it has non-vanishing gradients over a large input range, is simpler to compute, and helps avoid the slow learning issues caused by the flat regions of the logistic function.
