Machines Learning Matt

*Folksy voice

Sometimes I can’t tell if I’m teaching the machines or the machines are teaching me.

Handwriting a neural net, part 4 - Pass It Forward

Let’s go over what we’ve got. We’ve got a bunch of features from Alejandra, Bob and Alice. These are their scores in terms of relationships, money, sandwiches and baseline happiness. And we’ve got some weights that add up well when we multiply them through for Bob and Alejandra, but we’re not sure about how they’ll work with Alice’s numbers. Rather than do a bunch of guessing and trial and error, we’re going to work out what those weights should be with machine learning.

We also learned that we can do all that multiplying of features (the scores for relationships, money, sandwiches and the baseline) by the weights (numbers that represent how important each of those scores is) using something called a dot product. That’s a type of matrix multiplication where we multiply rows by columns and add the results together.

When we’ve done the dot product inside a single neuron layer, the layer runs the result through something called an activation function. An activation function decides how much of the result to move forward.

Let’s go ahead and work this out with two layers of neurons.

The ultimate thing we’re looking for when we’re training a neural net is something that we call the ground truth, sometimes called y* or y prime. In the end, we want our answers to be close to the ground truth.

Let’s say 0.0 is least happy and 1.0 is maximum happy. Now let’s say the right amount of happiness for Alejandra is 1.0, Bob is 0.85, and Alice is at 0.4.

Now let’s get our features:

[           Money   Relationships   Sandwiches   Baseline

Alejandra   [40,    80,             40,          30],

Bob         [100,   20,             40,          30],

Alice       [30,    40,             50,          40]

]

And since these are scores, let’s normalize them a bit. Normalizing means rescaling the numbers so they all sit on the same scale. Since our happiness values run from 0.0 to 1.0, let’s divide each score by 100 (that is, move the decimal point two places to the left) to make sure everyone is on the same page.

[           Money   Relationships   Sandwiches   Baseline

Alejandra   [0.4,   0.8,            0.4,         0.3],

Bob         [1.0,   0.2,            0.4,         0.3],

Alice       [0.3,   0.4,            0.5,         0.4]

]
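If you’d rather let the computer move the decimals, here is a minimal sketch in plain Python. The names raw_features and normalize, and the scale of 100, are my own assumptions for illustration, not something fixed by the series.

```python
# Raw scores, in the column order Money, Relationships, Sandwiches, Baseline.
raw_features = [
    [40, 80, 40, 30],   # Alejandra
    [100, 20, 40, 30],  # Bob
    [30, 40, 50, 40],   # Alice
]


def normalize(rows, scale=100):
    """Divide every score by `scale` (assumed here to be 100)."""
    return [[value / scale for value in row] for row in rows]


features = normalize(raw_features)

for row in features:
    print(row)
```

Dividing by 100 works here only because the scores happen to run from 0 to 100. Other data would need a different scale.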

And then we have our weights for the first neuron, called W1, for Weights First Neuron:

[

Money Importance: 0.25,

Relationship Importance: 1.0,

Sandwiches Importance: 0.25,

Baseline Importance: 1.0

]

We’ll deal with the weights for our second layer in a minute.

Now let’s look at all of that without the labels, to get more accustomed to the mathematics.

Our y*: [

[1.0],

[0.85],

[0.4]

]

Our features: [

[0.4,        0.8,           0.4,         0.3],

[1.0,       0.2,           0.4,         0.3],

[0.3, 0.4, 0.5, 0.4]

]

W1 = [0.25, 1.0, 0.25, 1.0]

W2 = ?

We actually don’t know what our second set of initial weights will be. As a quick look ahead, given the matrix shapes we’re starting with and the matrix shapes we want to end up with, we know we’ll want a 1x1 shaped matrix. That means a matrix of one row and one column. We’ll just pick a random number to put in that 1x1 matrix. That number should change to the right one by the time we’re all done!

W2 = [

[0.5]

]
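To see why a 1x1 fits, here is a small sketch that only tracks shapes, not values. The helper result_shape is a made-up name, and it assumes W1 gets used as a column of 4 numbers (a 4x1).

```python
def result_shape(left, right):
    """Return the shape of `left dot right`; each shape is (rows, columns)."""
    left_rows, left_cols = left
    right_rows, right_cols = right
    if left_cols != right_rows:
        raise ValueError(f"inner numbers differ: {left_cols} vs {right_rows}")
    return (left_rows, right_cols)


# 3 people x 4 features, dotted with W1 as a 4x1 column (an assumption).
layer1_shape = result_shape((3, 4), (4, 1))

# The first layer hands over one number per person, so W2 needs 1 row.
layer2_shape = result_shape(layer1_shape, (1, 1))

print(layer1_shape)  # (3, 1)
print(layer2_shape)  # (3, 1)
```

Both layers end up with one number per person, which is the same shape as our y*.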

And that’s all we need to do the forward pass!

The Forward Pass

So now with everything in place, let’s do the forward pass. This is where our neural network takes its first guess. This is before it has “learned” anything.

So here we go.

The overall equation, in a sort of coded manner, will look like this:

neuronLayer1 = sigmoid(features dot W1)

This means we’re going to apply the sigmoid function to the values that come out of the features and W1 dot product.

neuronLayer2 = sigmoid(neuronLayer1 dot W2)

This means we’re going to apply the sigmoid function to the values that come out of the dot product of the first layer’s output and W2.

output = neuronLayer2

Don’t worry if that looks a little confusing. We’ll work out exactly what that looks like now.

So first let’s do features dot W1:

[

[0.4,        0.8,           0.4,         0.3],

[1.0,       0.2,           0.4,         0.3],

[0.3, 0.4, 0.5, 0.4]

]

dot

[0.25, 1.0, 0.25, 1.0]

Ah, but you’ll notice that we’re dotting a 3x4 with a 1x4. Notice that the inner numbers, the 4 (columns in the first matrix) and the 1 (rows in the second), are not the same. So this matrix multiplication won’t work. No problem! Transpose to the rescue. Let’s put a transpose on the second matrix.

Now [0.25, 1.0, 0.25, 1.0] becomes:

[

[0.25],

[1.0],

[0.25],

[1.0]

]
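The transpose above can be sketched in a few lines of plain Python. One assumption to flag: W1 is wrapped in an outer list here so that it really is a 1x4 (one row, four columns) rather than a flat list. The function name transpose is my own.

```python
def transpose(matrix):
    """Swap the rows and columns of a matrix stored as a list of rows."""
    return [list(column) for column in zip(*matrix)]


# Assumption: W1 written as one row inside an outer list, so it is 1x4.
W1 = [[0.25, 1.0, 0.25, 1.0]]

W1_column = transpose(W1)

print(W1_column)                          # [[0.25], [1.0], [0.25], [1.0]]
print(len(W1_column), len(W1_column[0]))  # 4 1
```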

So now we have 4 rows and 1 column. It’s a 4x1 now. So now we have 3x4 dot 4x1. Now the inner numbers are both 4! So let’s try again.

[

[0.4,        0.8,           0.4,         0.3],

[1.0,       0.2,           0.4,         0.3],

[0.3, 0.4, 0.5, 0.4]

]

dot

[

[0.25],

[1.0],

[0.25],

[1.0]

]

This is the same as:

[

[(0.4 * 0.25) + (0.8 * 1.0) + (0.4 * 0.25) + (0.3 * 1.0)],

[(1.0 * 0.25) + (0.2 * 1.0) + (0.4 * 0.25) + (0.3 * 1.0)],

[(0.3 * 0.25) + (0.4 * 1.0) + (0.5 * 0.25) + (0.4 * 1.0)]

]

So again, adding up rows times columns. And that gives us:

[

[1.3],

[0.85],

[1.0]

]
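If you want to check that arithmetic without a pencil, here is a minimal sketch of the same rows-times-columns multiplication in plain Python. The function name dot and the variable names are assumptions for illustration. Floating-point math can leave tiny rounding crumbs, so the sketch rounds before printing.

```python
def dot(left, right):
    """Multiply each row of `left` by each column of `right` and add up."""
    if len(left[0]) != len(right):
        raise ValueError("inner numbers must match")
    return [
        [sum(l * r for l, r in zip(row, column)) for column in zip(*right)]
        for row in left
    ]


features = [
    [0.4, 0.8, 0.4, 0.3],  # Alejandra
    [1.0, 0.2, 0.4, 0.3],  # Bob
    [0.3, 0.4, 0.5, 0.4],  # Alice
]

# W1 after the transpose: 4 rows, 1 column.
W1_column = [[0.25], [1.0], [0.25], [1.0]]

layer1_sums = dot(features, W1_column)

# Should land on approximately 1.3, 0.85 and 1.0.
print([[round(value, 4) for value in row] for row in layer1_sums])
```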

Great! We got the dot product. Now let’s put it in our activation function. Our activation function is sigmoid, which again is 1 / (1 + e^-x). The e^-x means e raised to the power of -x. Like how x² means x to the power of 2.

(As a side note, when we say x to the power of a negative number, it means 1 divided by x raised to the power of the positive number. So x^-2 is equal to 1 / x^2).

(Another side note, to do the e calculations, you can use a calculator. It’s kind of like working with the number pi.)
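Both side notes are easy to check for yourself. This is just a sketch using Python’s built-in math module in place of a calculator. The value x = 2.0 is an arbitrary pick of mine.

```python
import math

x = 2.0  # arbitrary example value

# A negative power is 1 divided by the positive power.
negative_power = x ** -2
one_over_power = 1 / x ** 2
print(negative_power, one_over_power)  # 0.25 0.25

# e is a fixed number, a bit like pi.
print(math.e)

# math.exp(-1.3) is e raised to the power of -1.3.
e_to_minus_1_3 = math.exp(-1.3)
print(e_to_minus_1_3)
```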

When we apply the activation function, the values become:

[

[1 / (1 + e^-1.3)],

[1 / (1 + e^-0.85)],

[1 / (1 + e^-1.0)]

]

And that equals

[

[0.78583498],

[0.70056714],

[0.73105858]

]
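Here is a minimal sketch of that activation step, assuming we start from the three sums we just worked out by hand. The names sigmoid, layer1_sums and layer1_output are my own.

```python
import math


def sigmoid(x):
    """The sigmoid activation: 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))


# The dot product results from above, one row per person.
layer1_sums = [[1.3], [0.85], [1.0]]

# Apply sigmoid to every value, keeping the 3x1 shape.
layer1_output = [[sigmoid(value) for value in row] for row in layer1_sums]

for row in layer1_output:
    print(f"{row[0]:.8f}")
```

Printing to eight decimal places should line up with the hand-worked values above.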

So these values become the features for our next layer.

So now we’ll do

[

[0.78583498],

[0.70056714],

[0.73105858]

]

dot

W2 = [

[0.5]

]

Multiplying rows times columns again, that works out to:

[

[0.78583498 * 0.5],

[0.70056714 * 0.5],

[0.73105858 * 0.5]

]

Which gives us:

[

[0.39291749],

[0.35028357],

[0.36552929]

]

Finally, we put those values inside of sigmoid yet again to activate it:

[

[1 / (1 + e^-0.39291749)],

[1 / (1 + e^-0.35028357)],

[1 / (1 + e^-0.36552929)]

]

And that equals:

[

[0.59698483],

[0.58668634],

[0.59037826]

]
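Putting the whole forward pass together, here is one way it could look in plain Python. This is a sketch under a few assumptions: W1 is stored already transposed as a 4x1 column, and the helper names dot, sigmoid and activate are mine. It is not the only way to write it.

```python
import math


def sigmoid(x):
    """The sigmoid activation: 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))


def dot(left, right):
    """Multiply each row of `left` by each column of `right` and add up."""
    if len(left[0]) != len(right):
        raise ValueError("inner numbers must match")
    return [
        [sum(l * r for l, r in zip(row, column)) for column in zip(*right)]
        for row in left
    ]


def activate(matrix):
    """Apply sigmoid to every value in the matrix."""
    return [[sigmoid(value) for value in row] for row in matrix]


features = [
    [0.4, 0.8, 0.4, 0.3],  # Alejandra
    [1.0, 0.2, 0.4, 0.3],  # Bob
    [0.3, 0.4, 0.5, 0.4],  # Alice
]
W1 = [[0.25], [1.0], [0.25], [1.0]]  # assumption: stored as a 4x1 column
W2 = [[0.5]]                         # our random 1x1 starting weight
y_star = [[1.0], [0.85], [0.4]]      # the ground truth

neuron_layer1 = activate(dot(features, W1))
neuron_layer2 = activate(dot(neuron_layer1, W2))
output = neuron_layer2

for guess, truth in zip(output, y_star):
    print(f"guess {guess[0]:.8f}   truth {truth[0]}")
```

The printed guesses should match the hand-worked numbers to about eight decimal places.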

So there we have it!

What we said the truth was (our y*) is

[

[1.0],

[0.85],

[0.4]

]

What we got was

[

[0.59698483],

[0.58668634],

[0.59037826]

]

So our neural network’s weights are not correct on the first try. Otherwise those values would be the same, or very close.

Next we’ll decide how to measure how well we did.

Matthew Waller