CS231n basic backpropagation derivation
This post covers the backpropagation derivation for an affine layer in a basic fully-connected neural network, as part of work on the second assignment for the Winter 2016 iteration of the Stanford class CS231n: Convolutional Neural Networks for Visual Recognition.
Consider the loss function for the net evaluated at the $i$th observation

$$L_i = -\log\left(p_{y_i}\right),$$

where $p_{y_i}$ is the entry for the correct class $y_i$ of the vector of normalised probabilities

$$p_k = \frac{e^{f_k}}{\sum_{j=1}^{K} e^{f_j}},$$

$f_j = \boldsymbol{w_j}^\top \boldsymbol{x_i} + b_j$ is the score of the $i$th observation corresponding to the $j$th class, $\boldsymbol{x_i}$ is the $D$-dimensional column vector corresponding to the $i$th observation, and $\boldsymbol{w_j}$ is the $D$-dimensional column vector of weights for the $j$th class, for $i=1,\ldots,N$ and $j=1,\ldots,K$.
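As a minimal sketch of the quantities defined above, the snippet below computes the scores, the normalised probabilities and the loss for a single observation. It assumes NumPy, the softmax cross-entropy form $L_i = -\log p_{y_i}$ of the loss, and weights stored as a $D \times K$ array whose columns are the $\boldsymbol{w_j}$. The function name `softmax_loss_single` and the random data are illustrative and not taken from the assignment code.

```python
import numpy as np

def softmax_loss_single(x_i, y_i, W, b):
    """Loss L_i = -log(p_{y_i}) for one observation (illustrative helper).

    x_i: (D,) observation, y_i: integer class label,
    W: (D, K) weights with columns w_j, b: (K,) biases.
    """
    f = W.T @ x_i + b                 # scores f_j = w_j^T x_i + b_j
    f = f - f.max()                   # shift for numerical stability; p is unchanged
    p = np.exp(f) / np.exp(f).sum()   # normalised probabilities
    return -np.log(p[y_i]), p

# Hypothetical random data, only to exercise the function.
rng = np.random.default_rng(0)
D, K = 4, 3
x_i = rng.normal(size=D)
W = rng.normal(size=(D, K))
b = rng.normal(size=K)

loss_i, p = softmax_loss_single(x_i, 1, W, b)
```

Subtracting the maximum score before exponentiating leaves the probabilities unchanged, because the common factor cancels between numerator and denominator, and it avoids overflow for large scores.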
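Given the upstream derivative of the loss with respect to the scores

$$\frac{\partial L_i}{\partial f_k} = p_k - \mathbb{1}\left(y_i = k\right),$$

we’re interested in deriving the derivatives of the loss with respect to the weights and biases. Starting with the weights, we can use the chain rule to reach the desired derivative via the derivatives of the loss with respect to the scores

$$\frac{\partial L_i}{\partial \boldsymbol{w_j}} = \sum_{k=1}^{K} \frac{\partial L_i}{\partial f_k} \frac{\partial f_k}{\partial \boldsymbol{w_j}} = \frac{\partial L_i}{\partial f_j} \boldsymbol{x_i},$$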
which follows from the derivative $\frac{\partial f_k}{\partial \boldsymbol{w_j}} = \boldsymbol{x_i}\mathbb{1}\left(j=k\right)$. The $j$th score is the only score in which the vector $\boldsymbol{w_j}$ appears, and so the derivatives of the other scores with respect to $\boldsymbol{w_j}$ are all zero. The total loss $\mathcal{L}$ is obtained as the summation of the losses across all $N$ observations, and so

$$\frac{\partial \mathcal{L}}{\partial \boldsymbol{w_j}} = \sum_{i=1}^{N} \frac{\partial L_i}{\partial f_j} \boldsymbol{x_i}.$$
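The summation over observations can be sketched as an explicit loop. For each observation, the derivative with respect to the whole weight array is the outer product of $\boldsymbol{x_i}$ with that observation's vector of score derivatives. This is an illustration only: the array `dF` below is filled with random numbers that stand in for the upstream derivatives, and the names `X`, `dF` and `dW` are assumptions rather than names from the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 5, 4, 3
X = rng.normal(size=(N, D))    # row i holds the observation x_i
dF = rng.normal(size=(N, K))   # dF[i, j] stands in for dL_i / df_j (random placeholder)

dW = np.zeros((D, K))
for i in range(N):
    # Column j of the outer product is (dL_i / df_j) * x_i,
    # the contribution of observation i to dL / dw_j.
    dW += np.outer(X[i], dF[i])
```

Column `j` of `dW` then holds the sum over observations of the $j$th score derivative times $\boldsymbol{x_i}$.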
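Stacking the derivatives with respect to each $\boldsymbol{w_j}$ as the columns of a $D \times K$ matrix gives the matrix of derivatives of the loss with respect to the weights, which is simply the matrix product of the transposed observation matrix and the matrix of the derivatives of the loss with respect to the scores

$$\frac{\partial \mathcal{L}}{\partial \boldsymbol{W}} = \boldsymbol{X}^\top \frac{\partial \mathcal{L}}{\partial \boldsymbol{F}},$$

where $\boldsymbol{X}$ is the $N \times D$ observation matrix whose $i$th row is $\boldsymbol{x_i}^\top$, $\boldsymbol{W}$ is the $D \times K$ matrix whose $j$th column is $\boldsymbol{w_j}$, and $\frac{\partial \mathcal{L}}{\partial \boldsymbol{F}}$ is the $N \times K$ matrix whose $(i, j)$ entry is $\frac{\partial L_i}{\partial f_j}$.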
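One way to sanity-check the matrix form is to compare it with a centred finite-difference estimate. The sketch below assumes the softmax cross-entropy loss summed (not averaged) over observations, an $N \times D$ observation array `X` and a $D \times K$ weight array `W`. The helper `total_loss` and all the data are illustrative.

```python
import numpy as np

def total_loss(X, y, W, b):
    """Summed softmax cross-entropy loss and the (N, K) probabilities."""
    F = X @ W + b                              # (N, K) scores
    F = F - F.max(axis=1, keepdims=True)       # stabilise the exponentials
    P = np.exp(F) / np.exp(F).sum(axis=1, keepdims=True)
    loss = -np.log(P[np.arange(len(y)), y]).sum()
    return loss, P

# Hypothetical random problem instance.
rng = np.random.default_rng(0)
N, D, K = 6, 4, 3
X = rng.normal(size=(N, D))
y = rng.integers(0, K, size=N)
W = rng.normal(size=(D, K))
b = rng.normal(size=K)

# Upstream derivative of the loss with respect to the scores: p_k - 1(y_i = k).
_, P = total_loss(X, y, W, b)
dF = P.copy()
dF[np.arange(N), y] -= 1.0

# Analytic gradient with respect to the weights.
dW = X.T @ dF

# Centred finite differences, one weight at a time.
h = 1e-6
dW_num = np.zeros_like(W)
for d in range(D):
    for k in range(K):
        E = np.zeros_like(W)
        E[d, k] = h
        loss_plus = total_loss(X, y, W + E, b)[0]
        loss_minus = total_loss(X, y, W - E, b)[0]
        dW_num[d, k] = (loss_plus - loss_minus) / (2 * h)

max_err = np.abs(dW - dW_num).max()
```

If the derivation is right, `max_err` should be small, limited only by the truncation and rounding error of the finite-difference estimate.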
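The derivative of the loss with respect to the biases is obtained similarly. Again using the chain rule,

$$\frac{\partial L_i}{\partial b_j} = \sum_{k=1}^{K} \frac{\partial L_i}{\partial f_k} \frac{\partial f_k}{\partial b_j} = \frac{\partial L_i}{\partial f_j},$$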
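which follows from the derivative $\frac{\partial f_k}{\partial b_j} = \mathbb{1}\left(j=k\right)$. The derivative of the total loss across all observations is again obtained via summation

$$\frac{\partial \mathcal{L}}{\partial b_j} = \sum_{i=1}^{N} \frac{\partial L_i}{\partial f_j},$$

and therefore

$$\frac{\partial \mathcal{L}}{\partial \boldsymbol{b}} = \left(\frac{\partial \mathcal{L}}{\partial \boldsymbol{F}}\right)^\top \boldsymbol{1}_N,$$

where $\boldsymbol{1}_N$ is the $N$-dimensional column vector of ones, so that the gradient with respect to the biases is the vector of column sums of the $N \times K$ matrix of derivatives of the loss with respect to the scores.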
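In NumPy terms, the bias gradient can be sketched as a sum over the observation axis of the score derivatives. As before, `dF` is a random placeholder for the upstream derivatives and the variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 5, 3
dF = rng.normal(size=(N, K))      # dF[i, j] stands in for dL_i / df_j (random placeholder)

# Sum over observations: one entry per class bias b_j.
db = dF.sum(axis=0)

# The same quantity written as a product with a vector of ones.
db_matrix = dF.T @ np.ones(N)
```

Both expressions give a vector of length $K$; the axis sum is the more common way to write it in code.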
A great walkthrough of the implementation in Python is available here.