published: 2019-09-01
started: 2019-09-01
last change: 2022-10-27

Implementing an LSTM from Scratch

---- DRAFT ----
This article is not finished. It gives a good introduction to get started, but it lacks roughly the second half of what is necessary to implement an LSTM.

Calculation Graph Layout

The following diagrams are inspired by Christopher Olah's article Understanding LSTM Networks.

Here on this page there are additional details in the diagrams, most notably meaningful names of the interim result vectors for all the operations inside of the LSTM cell. They are helpful for understanding each operation in more detail and for doing the maths of the operations and their backpropagations.

	- fully connected neural network layer with sigmoid activation function
	- element-wise vector operation, here: multiplication
	- vector operation, either element-wise or slightly more complex like softmax
	- stacking (concatenation) of multiple vectors into one

All of the variables in the diagram represent vectors:

s: long term state at the beginning of the current step
h: short term state at the beginnign of the current step
x: input for the current step
z: concatenation of h and x into one vector
f: forget factors
r: remaining state after forget gate
m: memorize factors
p: prospective additional state data to be memorized
a: data which passed the memorize gate
o: output factors
q: prospective output
s': long term state at the end of the current step
h': short term state at the end of the current step
ŷ: output of the current step

If you're wondering where the weight matrices are: They are used inside of the neural network layers, like

, but they are not named here explicitly. See below for more details.

Variables, Types and Vector Sizes

In the implementation when evaluating in the forward pass, you can use exactly the variables in the above list, plus the 4 matrices for the 4 neural network layers. Keep all of them for later when you do the gradient calculation of the backpropagation pass. They come in very handy by then.

The first thing you have to decide is what data type the variables shall use and the sizes of the vectors, i.e. how many elements they shall have.

For all the numbers in neural networks it's typical to use a 4 byte floating point type. That's a fair tradeoff between required space and speed on one side and precision on the other side.

All sizes of vectors, i.e. number of elements in the vectors, and sizes of matrices depend on the size of the input vector x that you want to feed into the LSTM and the size of the desired output vector ŷ. Like with all neural networks, these sizes depend on the particular problem for which you want to train the network. To find them, look at the problem you want to solve and think about what data you have there in vectors - these are your input vectors - and what you want to predict or how many categories there are into which you want to classify the data - these are your output vectors.

After you decided these two sizes of the vectors x and ŷ, you can tell the sizes of the other vectors and matrices. In the diagram almost all of the variables are connected with size-preserving vector operations (, , ...). Vectors going in and vectors coming out of such an operation are of the same size.

We know one of them, ŷ. It's the output vector. Thus we know its size. Consequently, all of the other vectors which are connected to it via size-preserving operations are of the same size. Going backwars from ŷ in the diagram we find that the connected vectors are h', o, q, s', a, p, m, r, f, and s. h is the output vector of the previous step, i.e. the h' of the previous step, and therefor it's also of the same size.

As it turns out, only x and z can be of different size than ŷ. x is the input vector, thus its size depends on the problem for which you want to train the network. And z is the concatenation [h,x], thus its size is the sum of the sizes of h and x.

Matrices for Neural Layers

There are 4 neural network layers in an LSTM cell. In each layer, there are weights with values that determine how much each neuron activity from the incoming layer affects each neuron in the next layer. The incoming layer is alway z and the next layers are f, m, p, and o.
The weights of a fully connected layer are typically represented by a matrix. We have 4 layers in one cell, so we need 4 matrices in total. These weights, i.e. the values in the matrices, are the main part that will be adjusted when training the cell. The 4 matrices are for calculating f, m, p, and o, so they are called W_f, W_m, W_p, and W_o.
For calculating what the neural layers do, there are 4 equations:



  f = sigmoid(W_f*z + b_f)
  m = sigmoid(W_m*z + b_m)
  p =    tanh(W_p*z + b_p)
  o = sigmoid(W_o*z + b_o)

For calculating the forward pass, you take the input values of this step, which are s, h, and x. From these, you need to follow through all of the operations in the diagram up to s', h', and ŷ. The first step is the easiest, stacking h and x into one vector z.

To be continued...

further topics to cover:
- activation functions and loss functions - impossible to use sig*tanh directly with cross entropy because it produces log(negative)=nan
- cross entropy works well with softmax
- why h' is not yet the output ŷ
- possible necessity of an additional sigmoid layer between h' and ŷ
- derivatives of sig, tanh, softmax
- chain rules
- getting all the derivatives of the variables
- how to handle training on sequences with memory etc; s, h, s', h', ŷ ...
- s and h of step 0 are s₀ and h₀, i.e. the initial state. s₀ and h₀ must also be learned.