One of the most severe issues with recurrent neural networks (RNNs) is that of vanishing and exploding gradients. Whilst there are many methods to combat this, such as gradient clipping for exploding gradients and more complicated architectures including the LSTM and GRU for vanishing gradients, orthogonal initialization is an interesting yet simple approach.
To help solidify our understanding of why orthogonal initialization works, we'll explore it in terms of basic linear algebra and stability theory.
A fundamental operation in deep learning is repeated matrix multiplication. These repeated matrix multiplications mean the entries of the resulting matrix can grow or shrink exponentially in the number of layers of the neural network. This can cause serious numerical stability issues, with the result exploding or vanishing quickly. One of the worst culprits here is the RNN, which repeatedly updates an internal state using a single weight matrix, sometimes over hundreds or thousands of timesteps.
Stability theory asks what the dynamics of a system are after a long period of time and can be used to gain insights here. While stability theory covers many methods, we're particularly interested in how eigenvalues can be used to quickly compute repeated matrix multiplication.
Let's take a step away from RNNs for a second to establish some fundamentals.
Imagine you wanted to compute the Fibonacci sequence using matrices. It turns out there exists a matrix \(F\) which, under repeated matrix multiplication, iteratively computes the Fibonacci sequence.
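Concretely, the standard choice of \(F\) (and its defining property) is:

\[
F = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix},
\qquad
F^n = \begin{bmatrix} F_{n+1} & F_n \\ F_n & F_{n-1} \end{bmatrix},
\]

where \(F_n\) denotes the nth Fibonacci number (\(F_0 = 0\), \(F_1 = 1\)).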
As a special note, a matrix to the power of 0 gives you the identity matrix, which places zeros in the \(F_n\) entries of the matrix and gives the correct solution of \(F_0 = 0\).
Let's compute the 4th entry in the Fibonacci sequence using this method, starting at the identity and then performing a matrix multiplication with \(F\) at each step. Notice that the off-diagonal entries of the intermediate matrices are exactly the Fibonacci sequence: 0, 1, 1, 2, 3.
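As a minimal sketch of that walkthrough in NumPy (reading the Fibonacci number off the off-diagonal entry at each step):

import numpy as np

F = np.array([[1, 1],
              [1, 0]])

# Start from the identity (F^0) and repeatedly multiply by F.
# The off-diagonal entry of F^n is the nth Fibonacci number.
result = np.eye(2, dtype=np.int64)
for n in range(5):
    print(n, result[0, 1])  # prints 0, 1, 1, 2, 3 as the Fibonacci values
    result = result @ F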
Expanded out like this, it's also clear we could calculate \(F^4\) by computing \((F^2)^2\) rather than multiplying by \(F\) four separate times. This allows us to compute the nth entry in the Fibonacci sequence in \(O(\log n)\) time.
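A minimal sketch of that idea, exponentiation by squaring (mat_pow is a name of my own choosing; NumPy's built-in np.linalg.matrix_power computes matrix powers for you):

import numpy as np

def mat_pow(M, n):
    # Exponentiation by squaring: O(log n) matrix multiplications.
    result = np.eye(M.shape[0], dtype=M.dtype)
    while n > 0:
        if n % 2 == 1:
            result = result @ M
        M = M @ M
        n //= 2
    return result

F = np.array([[1, 1], [1, 0]], dtype=np.int64)
print(mat_pow(F, 10)[0, 1])  # 55, the 10th Fibonacci number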
We can go one step better however, computing the nth entry in the Fibonacci sequence in \(O(1)\) time (assuming constant time pow() function) whilst also gaining some intuition as to what happens as it grows.
Time to dust off some elementary linear algebra. An eigendecomposition of a matrix represents the matrix in terms of its eigenvalues and eigenvectors. The matrix \(F\) above can be factorized as \(F = Q \Lambda Q^{-1}\), where \(\Lambda\) is a diagonal matrix with the eigenvalues placed on the diagonals and \(Q\) is a matrix composed of the corresponding eigenvectors of \(F\).
An immediate and sane question to ask is why bother representing the matrix this way? Let's use this new matrix to compute the next Fibonacci value.
Other than the two constant matrix multiplications, the only extra work you need to do is raising the \(\Lambda\) matrix to the nth power! This holds true for any power you might want to apply, such that \(F^n = Q \Lambda ^ n Q^{-1}\).
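We can verify this identity numerically; a quick sketch using np.linalg.eig, whose second return value holds the eigenvectors as columns:

import numpy as np

F = np.array([[1., 1.],
              [1., 0.]])

eigvals, Q = np.linalg.eig(F)   # F = Q diag(eigvals) Q^-1

n = 10
# Only the diagonal of eigenvalues is raised to the nth power.
Fn = Q @ np.diag(eigvals ** n) @ np.linalg.inv(Q)
print(np.round(Fn))             # matches np.linalg.matrix_power(F, 10)
print(eigvals)                  # roughly 1.618 and -0.618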
Put simply, the eigenvalues dictate whether the result grows or decays as we perform repeated matrix multiplication.
Using the Fibonacci matrix above, notice that one eigenvalue explodes whilst the other eigenvalue vanishes as we get to larger and larger \(n\) for \(\Lambda^n\). Even at \(n = 10\), we find that one of the eigenvalues is almost irrelevant.
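For example, a tiny check with NumPy:

import numpy as np

eigvals = np.linalg.eigvals(np.array([[1., 1.], [1., 0.]]))
# One entry is roughly 122.99 (exploding), the other roughly 0.0081 (vanishing).
print(eigvals ** 10)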
Indeed, if you just multiply a reasonably large Fibonacci number by the largest eigenvalue, you get essentially the next Fibonacci value. It's no surprise that the largest eigenvalue happens to be the golden ratio, \(\varphi\), the limit of the ratio of consecutive Fibonacci numbers.
>>> [np.round(13 * 1.618 ** i) for i in range(5)]
[13.0, 21.0, 34.0, 55.0, 89.0]
The summary is that, as \(n\) gets asymptotically large, eigenvalues with absolute value less than one vanish towards zero, eigenvalues with absolute value greater than one explode towards infinity, and only eigenvalues with absolute value exactly one keep the result at a stable scale.
Orthogonal matrices have many interesting properties but the most important for us is that all the eigenvalues of an orthogonal matrix have absolute value 1. This means that, no matter how many times we perform repeated matrix multiplication, the resulting matrix doesn't explode or vanish.
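To see this numerically, here's a sketch where a random orthogonal matrix is drawn via the QR decomposition of a Gaussian matrix (one of several ways to construct one):

import numpy as np

rng = np.random.RandomState(42)
# The Q factor of a QR decomposition of a random Gaussian matrix is orthogonal.
Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))

print(np.allclose(Q.T @ Q, np.eye(64)))   # True: Q is orthogonal
mags = np.abs(np.linalg.eigvals(Q))
print(mags.min(), mags.max())             # both 1.0 up to floating point error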
It's worth noting what the constraint that every eigenvalue must have absolute value 1 actually allows. If we restrict ourselves to real numbers, the eigenvalues must all be either +1 or -1, resulting in very boring matrices. If we extend to complex eigenvalues, however, far more interesting results emerge under repeated multiplication.
Even if the eigenvalues are complex, the matrix itself can still be entirely real. The simplest example of this is the 2x2 rotation matrix. Whilst the rotation matrix \(R\) below (which performs a 90 degree rotation) is a real matrix, its eigenvalues and eigenvectors are complex.
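As a quick check, with the 90 degree rotation matrix written out explicitly:

import numpy as np

# 90 degree (counter-clockwise) rotation matrix: every entry is real.
R = np.array([[0., -1.],
              [1.,  0.]])

eigvals, eigvecs = np.linalg.eig(R)
print(eigvals)                       # the purely imaginary pair +1j and -1j
print(np.abs(eigvals))               # [1., 1.]: absolute value 1, so no growth or decay
print(np.linalg.matrix_power(R, 4))  # four 90 degree rotations return to the identity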
For this illustration, we're only looking at a simplified RNN model: no inputs, no bias, an identity activation function \(f\), and an initial hidden state \(h_0\) equal to the identity matrix.
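Under those assumptions the recurrence collapses to repeated multiplication by the recurrent weight matrix, so the hidden state after \(t\) steps is just \(W^t\). A minimal sketch (the names W and run_simplified_rnn are mine, not from any particular framework):

import numpy as np

def run_simplified_rnn(W, timesteps):
    # h_t = f(W h_{t-1}) with f = identity, no input, no bias, and h_0 = I,
    # so after t steps the hidden state is simply W^t.
    hidden = np.eye(W.shape[0])
    for _ in range(timesteps):
        hidden = W @ hidden
    return hidden

W = np.eye(4) * 0.9                 # a toy weight matrix, chosen arbitrarily for illustration
print(run_simplified_rnn(W, 64))    # equals np.linalg.matrix_power(W, 64)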
When we perform repeated matrix multiplication within an RNN and the result explodes or vanishes, we also land in the realm of vanishing or exploding gradients. If the gradients vanish, training is at a near standstill as no information is backpropagated. If the gradients explode, training may never converge as the gradient update wildly fluctuates. Both these are hugely problematic and can occur in a very small number of timesteps.
When initializing the weight matrix in RNNs, it's not uncommon to use random uniform or random normal initialization. This gives no guarantee as to the eigenvalues of the weight matrix, making exploding or vanishing results likely.
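As a quick illustration (a sketch; the 0.1 scale and 64 unit hidden size are arbitrary choices of mine):

import numpy as np

rng = np.random.RandomState(0)
W = rng.normal(scale=0.1, size=(64, 64))   # a typical "small random normal" initialization

# The eigenvalue magnitudes are spread over a wide range rather than sitting at 1,
# so repeated multiplication shrinks or amplifies different directions at very different rates.
mags = np.abs(np.linalg.eigvals(W))
print(mags.min(), mags.max())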
If we instead use an orthogonal matrix to initialize the weights, the resulting matrix neither explodes nor vanishes. This allows for gradients to backpropagate more effectively.
To construct an orthogonal matrix, we can use singular value decomposition, as seen in the Lasagne init.py.
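A sketch in the spirit of that approach (not the verbatim Lasagne code): take the SVD of a random Gaussian matrix and keep whichever orthogonal factor matches the desired shape.

import numpy as np

def orthogonal_init(shape, gain=1.0, rng=np.random):
    # u has orthonormal columns and v has orthonormal rows;
    # for a square weight matrix both are orthogonal.
    flat_shape = (shape[0], int(np.prod(shape[1:])))
    a = rng.normal(0.0, 1.0, flat_shape)
    u, _, v = np.linalg.svd(a, full_matrices=False)
    q = u if u.shape == flat_shape else v
    return gain * q.reshape(shape)

W = orthogonal_init((64, 64))
print(np.allclose(W.T @ W, np.eye(64)))   # True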
Let's see what happens to the resulting matrix when performing repeated matrix multiplication over 64 timesteps. Remember that it's also not uncommon for RNNs to run for hundreds or even thousands of timesteps. Neural networks also usually have additional complications such as a non-linear activation function and varying input.
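A sketch of that experiment (the Gaussian scale of 0.1 is an arbitrary choice of mine, and the orthogonal matrix is drawn via QR, though the SVD-based initializer above would work just as well):

import numpy as np

rng = np.random.RandomState(42)
size, timesteps = 64, 64

W_gauss = rng.normal(scale=0.1, size=(size, size))
W_ortho, _ = np.linalg.qr(rng.normal(size=(size, size)))

for name, W in [('gaussian', W_gauss), ('orthogonal', W_ortho)]:
    hidden = np.eye(size)
    for _ in range(timesteps):
        hidden = W @ hidden
    # The Gaussian case drifts exponentially towards zero or infinity,
    # while the orthogonal case keeps the hidden state at the same scale throughout.
    print(name, np.linalg.norm(hidden))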
Orthogonal initialization is a simple yet relatively effective way of combating exploding and vanishing gradients, especially when paired with other methods such as gradient clipping and more advanced architectures. While we haven't gone into the mathematical derivation of the gradients in detail, visualizing orthogonal initialization makes it clear why it can be so effective. The story becomes far more complicated once we add in real activation functions and inputs, however.
If you're interested in more detail, I highly recommend the papers below which all explore various aspects of this.