published: 2019-09-01
started: 2019-09-01
last change: 2021-05-20

# Implementing an LSTM from Scratch

---- DRAFT ----
This article is not finished. It gives a good introduction to get started, but it lacks roughly the second half of what is necessary to implement an LSTM.

### Calculation Graph Layout

The following diagram is inspired by Christopher Olah's article Understanding LSTM Networks. There are more details in the diagrams on this page, e.g. the naming of interim results r, a, and q. They are helpful for doing the maths of the backpropagation.  - fully connected neural network layer with sigmoid activation function - element-wise vector operation, here: multiplication - vector operation, either element-wise or slightly more complex like softmax - stacking (concatenation) of multiple vectors into one

All of the variables in the diagram represent vectors:

s: long term state at the beginning of the current step
h: short term state at the beginnign of the current step
x: input for the current step
z: concatenation of h and x into one vector
f: forget factors
r: remaining state after forget gate
m: memorize factors
p: prospective additional state data to be memorized
a: data which passed the memorize gate
o: output factors
q: prospective output
s': long term state at the end of the current step
h': short term state at the end of the current step
ŷ: output of the current step

If you're wondering where the weight matrices are: They are used inside of the neural network layers, like , but they are not named here explicitly. See below for more details.

### Variables, Types and Vector Sizes

In the implementation when evaluating in the forward pass, you can use exactly the variables in the above list, plus the 4 matrices for the 4 neural network layers. Keep all of them for later when you do the gradient calculation of the backpropagation pass. They come in very handy by then.

The first thing you have to decide is what data type the variables shall use and the sizes of the vectors, i.e. how many elements they shall have.

For all the numbers in neural networks it's typical to use a 4 byte floating point type. That's a fair tradeoff between required space and speed on one side and precision on the other side.

All sizes of vectors, i.e. number of elements in the vectors, and sizes of matrices depend on the size of the input vectors that you want to feed into the LSTM (x) and the size of the desired output vectors (h). Like with all neural networks, these sizes depend on the particular problem for which you want to train the LSTM. To find them, look at the problem you want to solve and think about what data you have there in vectors - these are your input vectors - and what you want to predict or what are the categories into which you want to classify the data - these are your output vectors.

After you decided these two sizes, you can tell the sizes of the other vectors and matrices. In the diagram the variables to the top and to the right of the or layers are all connected with size-preserving vector operations ( , , ...). Vectors coming in or out of such an operation must all be of equal size. We know one of them, h. It's the output vector. Thus we know its size. Consequently, all of the other vectors which are connected to it via element-wise operations are of the same size. These are s, r, s', f, m, a, p, o, and q. h is the output vector from the previous step, thus it's also of the same size.

As it turns out, only x and z can be of different size. x is the input vector, thus its size depends on the problem for which you want to train the network. And z is the concatenation [h,x], thus its size is the sum of the sizes of h and x.

### Matrices for Neural Layers

There are 4 neural network layers in an LSTM cell. These weights are values which determine how much each neuron activity from the incoming layer affects each neuron in the next layer. The incoming layer is alway z and the next layers are f, m, p, and o.
The weights of a fully connected layer are typically represented by a matrix. We have 4 layers in one cell, so we need 4 matrices in total. These weights, i.e. the values in the matrices, are what will be adjusted during training of the cell. The 4 matrices are for calculating f, m, p and o, so they are called Wf, Wm, Wp and Wo.
For calculating what the neural layers do, there are 4 equations:

``` f = sigmoid(Wf*z + bf) m = sigmoid(Wm*z + bm) p = sigmoid(Wp*z + bp) o = sigmoid(Wo*z + bo) ```

For calculating the forward pass, you have the input values s, h, and x. From these, you need to follow through all the operations in the diagram up to s and h. The first step is the easiest, stacking h and x into one vector z.

To be continued...

further topics to cover:
- activation functions and loss functions - impossible to use sig*tanh directly with cross entropy because it produces log(negative)=nan
- why h is not the output for classification etc
- derivatives of sig, tanh, softmax
- chain rule
- getting all the derivatives of the variables
- cross entropy works well with softmax
- how to handle training on sequence with memory etc, s_prev, h_prev, s, h, y...
- h0 alos belongs to the initial state, thus it must also be learned
- s and h are s0 and h0 for cell0, thus learnables for cell0, but later they are calculated and not learnables.