CS231n basic backpropagation derivation

This post covers the backpropagation derivation for an affine layer in a basic fully-connected neural network, as part of work on the second assignment for the Winter 2016 iteration of the Stanford class CS231n: Convolutional Neural Networks for Visual Recognition.

Consider the loss function for the net evaluated at the $i$th observation

$$\mathcal{L}_i = -\log\left(p_{y_i}\right),$$

where $p_{y_i}$ is the element of the vector of normalised probabilities corresponding to the correct class $y_i$,

$$p_k = \frac{e^{f_k}}{\sum_{j=1}^{K} e^{f_j}},$$

$f_j = \boldsymbol{w_j}^\top \boldsymbol{x_i} + b_j$ is the score of the $i$th observation corresponding to the $j$th class, $\boldsymbol{x_i}$ is the $D$-dimensional column vector corresponding to the $i$th observation, and $\boldsymbol{w_j}$ is the column vector corresponding to the weights for the $j$th class, for $i=1,\cdots,N$ and $j=1,\cdots,K$.
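
As a concrete point of reference, the forward pass that produces these scores can be sketched in a few lines of numpy. The array names `X`, `W`, and `b` are my own illustrative choices, not the assignment's actual variables:

```python
import numpy as np

def affine_forward(X, W, b):
    """Scores for all observations at once: F[i, j] = w_j^T x_i + b_j.

    X: (N, D) matrix whose rows are the observations x_i
    W: (D, K) matrix whose columns are the weight vectors w_j
    b: (K,)  vector of biases b_j
    """
    return X.dot(W) + b  # broadcasting adds b to every row of XW
```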

Given the upstream derivative of the loss with respect to the scores

$$\frac{\partial \mathcal{L}_i}{\partial f_k} = p_k - \mathbb{1}\left(k = y_i\right),$$

we’re interested in deriving the derivatives of the loss with respect to the weights and biases. Starting with the weights, we can use the chain rule to get to the desired derivative via the derivatives of the loss with respect to the scores

$$\frac{\partial \mathcal{L}_i}{\partial \boldsymbol{w_j}} = \sum_{k=1}^{K} \frac{\partial \mathcal{L}_i}{\partial f_k} \frac{\partial f_k}{\partial \boldsymbol{w_j}} = \frac{\partial \mathcal{L}_i}{\partial f_j} \boldsymbol{x_i},$$

which follows from the derivative $\frac{\partial f_k}{\partial \boldsymbol{w_j}} = \boldsymbol{x_i}\mathbb{1}\left(j=k\right)$. The $j$th score is the only score in which the vector $\boldsymbol{w_j}$ appears, and so the derivatives of the other scores with respect to $\boldsymbol{w_j}$ are all zero. The total loss $\mathcal{L}$ is obtained as the summation of the losses across all $N$ observations, and so

$$\frac{\partial \mathcal{L}}{\partial \boldsymbol{w_j}} = \sum_{i=1}^{N} \frac{\partial \mathcal{L}_i}{\partial f_{ij}} \boldsymbol{x_i},$$

where the subscript $f_{ij}$ makes explicit that the score depends on the observation $i$ as well as the class $j$.

This matrix of the derivatives of the loss with respect to the weights is simply obtained as the matrix product of the transposed observation matrix and the matrix of the derivatives of the loss with respect to the scores

$$\frac{\partial \mathcal{L}}{\partial \boldsymbol{W}} = \boldsymbol{X}^\top \frac{\partial \mathcal{L}}{\partial \boldsymbol{F}},$$

where $\boldsymbol{X}$ is the $N \times D$ matrix whose rows are the observations, $\boldsymbol{F}$ is the $N \times K$ matrix of scores, and $\boldsymbol{W}$ is the $D \times K$ matrix whose columns are the $\boldsymbol{w_j}$.

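In numpy this is a one-liner; `dscores` below is an assumed name for the $N \times K$ array of upstream derivatives $\frac{\partial \mathcal{L}}{\partial \boldsymbol{F}}$:

```python
def weight_grad(X, dscores):
    """dL/dW = X^T (dL/dF): X is (N, D), dscores is (N, K) -> result is (D, K)."""
    return X.T.dot(dscores)
```
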
The derivative of the loss with respect to the biases is obtained similarly. Again using the chain rule,

$$\frac{\partial \mathcal{L}_i}{\partial b_j} = \sum_{k=1}^{K} \frac{\partial \mathcal{L}_i}{\partial f_k} \frac{\partial f_k}{\partial b_j} = \frac{\partial \mathcal{L}_i}{\partial f_j},$$

which follows from the derivative $\frac{\partial f_k}{\partial b_j} = \mathbb{1}\left(j=k\right)$, since the scalar bias $b_j$ appears only in the $j$th score. The derivative of the loss across all observations is again obtained via summation

$$\frac{\partial \mathcal{L}}{\partial b_j} = \sum_{i=1}^{N} \frac{\partial \mathcal{L}_i}{\partial f_{ij}},$$

and therefore

$$\frac{\partial \mathcal{L}}{\partial \boldsymbol{b}} = \boldsymbol{1}_N^\top \frac{\partial \mathcal{L}}{\partial \boldsymbol{F}},$$

the column-wise sum of the matrix of the derivatives of the loss with respect to the scores.

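As a numpy sketch, again with `dscores` as an assumed name for $\frac{\partial \mathcal{L}}{\partial \boldsymbol{F}}$:

```python
def bias_grad(dscores):
    """dL/db: sum the (N, K) score derivatives over the N observations -> (K,)."""
    return dscores.sum(axis=0)
```
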
A great walkthrough of the implementation in Python is available here.
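
To make the derivation concrete end to end, here is a minimal self-contained numpy sketch under the definitions above; the function name and signature are illustrative rather than the assignment's actual API:

```python
import numpy as np

def softmax_affine_grads(X, y, W, b):
    """Softmax loss and affine-layer gradients, following the derivation above.

    X: (N, D) observations, y: (N,) integer class labels in 0..K-1,
    W: (D, K) weights, b: (K,) biases.
    """
    N = X.shape[0]
    F = X.dot(W) + b                              # scores f_ij = w_j^T x_i + b_j
    P = np.exp(F - F.max(axis=1, keepdims=True))  # shift for numerical stability
    P /= P.sum(axis=1, keepdims=True)             # normalised probabilities p
    loss = -np.log(P[np.arange(N), y]).sum()      # sum of L_i = -log(p_{y_i})
    dscores = P.copy()                            # dL_i/df_k = p_k - 1(k = y_i)
    dscores[np.arange(N), y] -= 1.0
    dW = X.T.dot(dscores)                         # dL/dW = X^T dL/dF
    db = dscores.sum(axis=0)                      # dL/db: column sums of dL/dF
    return loss, dW, db
```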

Written on April 7, 2017