Machines Learning Matt

*Folksy voice

Sometimes I can’t tell if I’m teaching the machines or the machines are teaching me.

Handwriting a neural net, part 8 - Backpropagation, the Learning Part

Now we’re finally ready to start using all of our calculus tools to adjust our weights by the error.

But let’s ask ourselves: what exactly will we be doing?

Partial Derivatives to Help Us Down the Slope

We should note that when we take y = x² and find the derivative, the correct terminology is to say that we are getting the derivative with respect to x.

Why do we say with respect to x? Well, there could be another variable in there. Let’s say we had the variable z in a different function where y = x² + 2z.

In that case we may want to decide whether we want to find the derivative with respect to x or with respect to z. If you find the derivative with respect to one variable, you treat all the other variables like they were constants. Meaning you don’t do anything special to them, you just kind of pretend like they are a single number, like 2 or 5.

So the derivative of y = x² + 2z with respect to x is 2x, because if we pretend that 2z is just a constant number, it goes away, because numbers by themselves without variables go away.

And the derivative of y with respect to z is just 2, because now x² goes away, since there is no z attached to it.

If we have a different function, say x² * 3z + 2x *z²

With respect to x the derivative is 2x*3z + 2*z²

With respect to z the derivative is x² * 3 + 2x* 2z.
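
If you'd like to double check partial derivatives like these without trusting the algebra, here is a small sketch in plain Python. It compares the two formulas above against a numerical estimate, nudging one variable a tiny bit while holding the other fixed. The helper name numeric_partial and the test point (x = 2, z = 3) are made up for illustration.

```python
def f(x, z):
    """The example function x² * 3z + 2x * z²."""
    return x**2 * 3 * z + 2 * x * z**2

def df_dx(x, z):
    """Partial derivative with respect to x (z is treated as a constant)."""
    return 2 * x * 3 * z + 2 * z**2

def df_dz(x, z):
    """Partial derivative with respect to z (x is treated as a constant)."""
    return x**2 * 3 + 2 * x * 2 * z

def numeric_partial(func, x, z, wrt, h=1e-6):
    """Estimate a partial derivative by nudging only one variable."""
    if wrt == "x":
        return (func(x + h, z) - func(x - h, z)) / (2 * h)
    return (func(x, z + h) - func(x, z - h)) / (2 * h)

x, z = 2.0, 3.0  # arbitrary test point
print("d/dx:", df_dx(x, z), "numeric estimate:", numeric_partial(f, x, z, "x"))
print("d/dz:", df_dz(x, z), "numeric estimate:", numeric_partial(f, x, z, "z"))
```

The formula and the estimate should agree to several decimal places for each variable.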

So those are both called partial derivatives because we’re solving them in part.

If we wanted to solve the total derivative, we generally need a common element at the end, like saying that both z and x are functions themselves with maybe a shared variable t. In a lot of total derivative examples, t is time, and you can take the derivative of two equations together involving time. That way you solve for things like water draining out of a bowl at a certain rate.

However, we’re not very interested in the total derivative. We want the partial derivative.

Because remember how we have multiple weights in the first neuron? How we have weights for the importance of relationships, money, sandwiches and just being there? Those are all different variables in our function for happiness when we do the matrix multiplication.

That means we’re going to find the partial derivative for each of those weights. And when the partial derivatives are lined up inside of their own matrix, that gives us a big grid of coordinates. It gives us a gradient. Think of it like the gradient of a road, the way it slopes or goes up and down. Basically all of those weights become coordinates in a system and point up or down in certain directions.

And again, what we want to do is move in the downward direction of our gradient so that we get the least possible error. We want to do what’s called gradient descent.
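
To get a feel for walking down a slope, here is a toy sketch in plain Python. It is not our neural net: the bowl-shaped error function, the starting weights, the step size (learning_rate) and the number of steps are all assumptions picked for illustration. Each step takes the partial derivatives and moves each weight a little bit against its slope.

```python
def error(w1, w2):
    # Hypothetical bowl-shaped error with its lowest point at w1 = 3, w2 = -1.
    return (w1 - 3) ** 2 + (w2 + 1) ** 2

def gradient(w1, w2):
    # Partial derivatives of the error with respect to w1 and w2.
    return 2 * (w1 - 3), 2 * (w2 + 1)

w1, w2 = 0.0, 0.0      # arbitrary starting weights
learning_rate = 0.1    # assumed step size

for step in range(100):
    d_w1, d_w2 = gradient(w1, w2)
    # Move each weight a small step in the downhill direction.
    w1 -= learning_rate * d_w1
    w2 -= learning_rate * d_w2

print("weights:", w1, w2)
print("error:", error(w1, w2))
```

After enough steps the weights should settle very close to the bottom of the bowl, where the error is smallest.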

Putting It Together

Cool, so let’s look again at the formula we used for our forward pass.

neuron1output = sigmoid(features dot W1)

neuron2output = sigmoid(neuron1output dot W2)

y = neuron2output

So our first step is to apply our sum of squares formula for our cost function, which is (y* - output)²

So let’s start using the chain rule. As you may remember, the chain rule says you keep multiplying the outside derivative by the inside derivatives until you chain it all the way down. It’s sort of a derivative Russian nesting doll situation.

Let’s describe our overall goal.

What we want is the partial derivative of the loss function with respect to the weights of the last layer because we’re about to try and update W2. That’s the end goal for this first set of derivatives.

So we want the CostFunctionDerivative with respect to W2.

To chain it all the way down to W2, we’ll need to do this:

(CostFunctionDerivative with respect to the output) * (the output function derivative with respect to our matrix multiplication) * (the matrix multiplication derivative with respect to W2)

So let’s take it a step at a time.

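1. (CostFunctionDerivative with respect to the output) = derivative of (y* - output)² = 2(y* - output). Strictly speaking, the chain rule also hands us a minus sign here, giving -2(y* - output), because output is being subtracted inside the parentheses. We’ll leave the minus sign off and make up for it at the end by adding our result to the weight instead of subtracting it.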

2. (Output function derivative with respect to our matrix multiplication) = derivative of sigmoid(dotproduct) = sigmoidDerivative(dotproduct). Sigmoid is 1.0/(1 + e^-x). And when we work out the sigmoid derivative, it is sigmoid(x) * (1.0 - sigmoid(x)).

3. (The matrix multiplication derivative with respect to W2) = derivative of (neuron1output dot W2). This is like taking the derivative of 5x with respect to x, which is just 5. So in our case, if we consider x as W2, it’s just neuron1output.
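
Step 2 leans on the sigmoid derivative formula, so here is a small sketch in plain Python that checks it. The function names sigmoid and sigmoid_derivative and the test point x = 0.5 are just illustrative choices. It compares the formula against a numerical estimate of the slope.

```python
import math

def sigmoid(x):
    """Sigmoid: 1.0 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    """Slope of sigmoid at x: sigmoid(x) * (1.0 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

x = 0.5    # arbitrary test point
h = 1e-6   # size of the tiny nudge used for the estimate
numeric_slope = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)

print("formula: ", sigmoid_derivative(x))
print("estimate:", numeric_slope)
```

The two printed numbers should match to many decimal places, which is a nice sanity check before we start chaining things together.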

So then we put that all together to find the derivative of the loss with respect to the weights.

2(y* - output) * (sigmoid(output) * (1 - sigmoid(output))) dot neuron1output

And from our forward pass, we have all of these values!

y* = [

[1.0],

[0.85],

[0.4]

]

output = [

[0.59698483],

[0.58668634],

[0.59037826]

]

neuron1output = [

[0.78583498],

[0.70056714],

[0.73105858]

]

We have the 3 values we need for the first part!
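
As a quick sanity check, here is a sketch assuming NumPy that rebuilds output from neuron1output using the second line of our forward pass. It assumes the second-layer weight W2 starts out as [[0.5]], the value that gets updated later in this walkthrough.

```python
import numpy as np

def sigmoid(x):
    """Sigmoid applied to every number in the array."""
    return 1.0 / (1.0 + np.exp(-x))

neuron1output = np.array([[0.78583498],
                          [0.70056714],
                          [0.73105858]])
W2 = np.array([[0.5]])  # second-layer weight before any update

# neuron2output = sigmoid(neuron1output dot W2)
output = sigmoid(neuron1output @ W2)
print(output)
```

The printed column should line up with the output values listed above, give or take rounding in the last digit.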

In order to make that math work we’ll need to transpose neuron1output, but otherwise, we can just do everything inside of our new derivativeWeights function which is:

neuron1output.T dot (2(y* - output) * (sigmoid(output) * (1 - sigmoid(output))))

So let’s get down to the nitty gritty.

Putting In the Work

So we’ll do 2(y* - output) first.

y* - output =

[

[1.0 - 0.59698483],

[0.85 - 0.58668634],

[0.4 - 0.59037826]

]

=

[

[ 0.40301517],

[ 0.26331366],

[-0.19037826]

]

So 2(y* - output) = 2 *

[

[ 0.40301517],

[ 0.26331366],

[-0.19037826]

]

Which in this case is scalar multiplication, so you just multiply each number by 2. It’s not a dot product.

=

[

[ 0.80603034],

[ 0.52662731],

[-0.38075651]
]
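
Here is the same step as a small sketch, assuming NumPy. Multiplying an array by a plain number like 2 scales every entry, which is the scalar multiplication described above. The name error_term is just a label picked for this sketch.

```python
import numpy as np

y_star = np.array([[1.0], [0.85], [0.4]])
output = np.array([[0.59698483], [0.58668634], [0.59037826]])

# Subtract entry by entry, then scale every entry by 2.
error_term = 2 * (y_star - output)
print(error_term)
```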

Next let’s take (sigmoid(output) * (1 - sigmoid(output)))

We’ll do the sigmoid calculations for you. They’re just like last time.

So sigmoid(output) =

[

[0.64496618],

[0.64260448],

[0.64345193]

]

and (1 - sigmoid(output)) =

[

[0.35503382],

[0.35739552],

[0.35654807]

]

So sigmoid(output) * (1 - sigmoid(output)) =

[

[0.64496618],

[0.64260448],

[0.64345193]

]

*

[

[0.35503382],

[0.35739552],

[0.35654807]

]

And you notice again that it’s a * instead of a dot product. That means we just multiply each of the corresponding numbers together. Top times top, middle times middle, bottom times bottom. All of that equals

=

[

[0.22898481],

[0.22966396],

[0.22942154]

]
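
The sigmoid numbers are easy to reproduce with a short sketch, assuming NumPy. It follows the arithmetic of this walkthrough, which passes the output values through sigmoid once more. One thing worth knowing: since output is already sigmoid(dotproduct), the sigmoid derivative evaluated at the dot product itself would be output * (1 - output), which gives slightly different numbers. The sketch sticks with the walkthrough's version so the values match.

```python
import numpy as np

def sigmoid(x):
    """Sigmoid applied to every number in the array."""
    return 1.0 / (1.0 + np.exp(-x))

output = np.array([[0.59698483], [0.58668634], [0.59037826]])

sig = sigmoid(output)
# The * here multiplies matching entries: top times top, and so on.
sigmoid_factor = sig * (1 - sig)

print(sig)
print(sigmoid_factor)
```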

So now we multiply our two matrices together, again with * element-by-element multiplication instead of a dot product:

2(y* - output) * (sigmoid(output) * (1 - sigmoid(output))) =

[

[ 0.80603034],

[ 0.52662731],

[-0.38075651]
]

*

[

[0.22898481],

[0.22966396],

[0.22942154]

]

=

[

[ 0.1845687 ],

[ 0.12094732],

[-0.08735375]

]
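
To see why the * matters here, this sketch (assuming NumPy) multiplies the two columns entry by entry and then tries a dot product on the same two columns. In NumPy, * is element-by-element and @ is the matrix product. The names error_term, sigmoid_factor and delta are labels picked for this sketch.

```python
import numpy as np

error_term = np.array([[0.80603034], [0.52662731], [-0.38075651]])
sigmoid_factor = np.array([[0.22898481], [0.22966396], [0.22942154]])

# Entry-by-entry multiplication keeps the 3x1 shape.
delta = error_term * sigmoid_factor
print(delta.shape)
print(delta)

# A dot product of a 3x1 with a 3x1 does not line up, so NumPy refuses.
try:
    error_term @ sigmoid_factor
    dot_worked = True
except ValueError as exc:
    dot_worked = False
    print("dot product failed:", exc)
```

That shape mismatch is the reason the dot product waits until one of the columns has been transposed.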

It’s only in the last step that you want to do the dot. That’s because you want to get the result back into the shape of the original W2, which in our case is a 1x1 matrix. We’ll also need to transpose neuron1output to make the shape work. That’s what the .T in neuron1output.T means.

So we do neuron1output.T dot 2(y* - output) * (sigmoid(output) * (1 - sigmoid(output))) =

[

[0.78583498, 0.70056714, 0.73105858]

]

dot

[

[ 0.1845687 ],

[ 0.12094732],

[-0.08735375]

]

=

[

[(0.78583498 * 0.1845687) + (0.70056714 * 0.12094732) + (0.73105858 * -0.08735375)]

]

=

[

[0.16591155]

]
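
Here is that last dot product as a sketch, assuming NumPy. The column called delta holds the result of the element-by-element multiplication from the previous step, and .T flips neuron1output from a 3x1 column into a 1x3 row so the shapes line up. The names delta and gradient_W2 are labels picked for this sketch.

```python
import numpy as np

neuron1output = np.array([[0.78583498], [0.70056714], [0.73105858]])
delta = np.array([[0.1845687], [0.12094732], [-0.08735375]])

print(neuron1output.T.shape)  # a 1x3 row after the transpose

# (1x3) dot (3x1) collapses to a 1x1 matrix, the same shape as W2.
gradient_W2 = neuron1output.T @ delta
print(gradient_W2.shape)
print(gradient_W2)
```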

Finally!

After all that, we now know that we need to add
[

[0.16591155]

]

to W2, the weight applied to the first neuron’s output

[

[0.5]

]

=

[

[0.66591155]

]
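
Putting the whole W2 update in one place, here is a sketch of what a derivativeWeights function might look like, assuming NumPy. The Python name derivative_weights and its parameter names are guesses at a reasonable shape for it, not a definitive implementation. The update is added directly with no step size, matching the arithmetic above.

```python
import numpy as np

def sigmoid(x):
    """Sigmoid applied to every number in the array."""
    return 1.0 / (1.0 + np.exp(-x))

def derivative_weights(layer_input, y_star, output):
    """layer_input.T dot (2(y* - output) * sigmoid(output) * (1 - sigmoid(output)))."""
    error_term = 2 * (y_star - output)           # scalar multiplication
    sig = sigmoid(output)
    delta = error_term * (sig * (1 - sig))       # element-by-element
    return layer_input.T @ delta                 # dot product, back to W2's shape

y_star = np.array([[1.0], [0.85], [0.4]])
output = np.array([[0.59698483], [0.58668634], [0.59037826]])
neuron1output = np.array([[0.78583498], [0.70056714], [0.73105858]])
W2 = np.array([[0.5]])

# Add the result to the old weight to get the new one.
new_W2 = W2 + derivative_weights(neuron1output, y_star, output)
print(new_W2)
```

The printed weight should come out very close to the 0.66591155 we worked out by hand.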

All of that was to update W2, the weights for the second neuron layer. Now we need to do more chain ruling for the weights of our first neural layer.

You once again have to start with the chain rule, and you have to start from the end and go backwards, but this time you have to chain rule back even further! And instead of the derivative with respect to W2, we need to get the derivative with respect to W1.

Just to see what it looks like, when you work it out:

Input.T dot (((2(y* - output) * (sigmoid(output) * (1 - sigmoid(output)))) dot W2.T) * (sigmoid(neuron1output) * (1 - sigmoid(neuron1output))))

That is a lot of math.

Since we did the first part already, and it’s a question of doing the same scalar and dot product math again, and applying sigmoid again, we’ll skip to the solution. (But if you want the full, handwritten neural net experience, dive right in!)
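
If you'd rather let the computer carry the chain back to W1, here is a sketch of both layers together, assuming NumPy. The features and the starting W1 below are hypothetical placeholders (3 examples with 4 features each), not the inputs used in this series, so the numbers it prints will not match ours. The point is the order of operations and the shapes: the dot with W2.T happens first, then the element-by-element sigmoid term, and only then the dot with the transposed input.

```python
import numpy as np

def sigmoid(x):
    """Sigmoid applied to every number in the array."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_factor(values):
    """The sigmoid(v) * (1 - sigmoid(v)) term, applied entry by entry."""
    s = sigmoid(values)
    return s * (1 - s)

# Hypothetical data, made up for this sketch.
features = np.array([[1.0, 0.2, 0.5, 0.3],
                     [0.4, 0.9, 0.1, 0.6],
                     [0.7, 0.3, 0.8, 0.2]])
y_star = np.array([[1.0], [0.85], [0.4]])
W1 = np.full((4, 1), 0.5)   # hypothetical starting weights for layer 1
W2 = np.array([[0.5]])

# Forward pass
neuron1output = sigmoid(features @ W1)
output = sigmoid(neuron1output @ W2)

# Backward pass, starting from the end and chaining backwards
delta2 = 2 * (y_star - output) * sigmoid_factor(output)       # 3x1
delta1 = (delta2 @ W2.T) * sigmoid_factor(neuron1output)      # 3x1
gradient_W2 = neuron1output.T @ delta2                        # 1x1, like W2
gradient_W1 = features.T @ delta1                             # 4x1, like W1

print(gradient_W1.shape, gradient_W2.shape)
```

Each gradient comes out in the same shape as the weights it belongs to, which is what lets us add it straight onto them.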

The new first layer of weights becomes

[

[0.2684694 ],

[1.01472558],

[0.25851005],

[1.00614301]

]

And our second layer of weights is

[

[0.66591155]

]

So look at that! Our neural network just learned!

Or did it?

Image by Tumisu from Pixabay

Matthew Waller