
Explaining and illustrating orthogonal initialization for recurrent neural networks

One of the most extreme issues with recurrent neural networks (RNNs) is vanishing and exploding gradients. Whilst there are many methods to combat this, such as gradient clipping for exploding gradients and more complicated architectures including the LSTM and GRU for vanishing gradients, orthogonal initialization is an interesting yet simple approach.

To help solidify our understanding of why orthogonal initialization works, we'll explore it in terms of basic linear algebra and stability theory.

Eigenvalues, stability theory, and repeated matrix multiplication

A fundamental operation in deep learning is repeated matrix multiplication. Each additional layer adds another matrix to the product, so the result behaves like a matrix raised to the power of the network's depth. This can cause serious numerical stability issues, with the result exploding or vanishing quickly. One of the worst culprits here is the RNN, which repeatedly updates an internal state using a single weight matrix, sometimes over hundreds or thousands of timesteps.

Stability theory asks what the dynamics of a system are after a long period of time and can be used to gain insights here. While stability theory covers many methods, we're particularly interested in how eigenvalues can be used to quickly compute repeated matrix multiplication.

Let's take a step away from RNNs for a second to establish some fundamentals.

Repeated matrix multiplication and Fibonacci

Imagine you wanted to compute the Fibonacci sequence using matrices. It turns out there exists a matrix \(F\) such that repeated matrix multiplication by \(F\) iteratively computes the Fibonacci sequence.
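
The standard choice is the 2x2 matrix below, whose nth power contains consecutive Fibonacci numbers \(F_n\) (with \(F_0 = 0\) and \(F_1 = 1\)):

\[
F = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix},
\qquad
F^n = \begin{bmatrix} F_{n+1} & F_n \\ F_n & F_{n-1} \end{bmatrix}
\]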

Special note: a matrix to the power of zero gives the identity matrix, which places zeros in the \(F_n\) entries of the matrix, giving the correct value \(F_0 = 0\).

Let's compute the 4th entry in the Fibonacci sequence using this method, starting from the identity and performing a matrix multiplication with \(F\) at each step. Notice that the off-diagonal entries of the intermediate matrices trace out the Fibonacci sequence: 0, 1, 1, 2, 3.
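
As a quick numpy sketch of that walk-through (the variable names are purely illustrative):

import numpy as np

F = np.array([[1, 1],
              [1, 0]])

# Start from the identity (F^0) and multiply by F once per step
M = np.eye(2, dtype=int)
for n in range(1, 5):
    M = M @ F
    # The off-diagonal entry of F^n is the nth Fibonacci number
    print(n, M[0, 1])   # prints 1 1, then 2 1, 3 2, 4 3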

Expanded out like this, it's also clear we could calculate \(F^4\) as \((F^2)^2\) rather than multiplying by \(F\) one step at a time. Repeated squaring like this allows us to compute the nth entry in the Fibonacci sequence in \(O(\log n)\) time.
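
Below is a minimal sketch of that square-and-multiply idea (mat_pow is just an illustrative helper; numpy's built-in np.linalg.matrix_power uses the same trick):

import numpy as np

def mat_pow(A, n):
    # Compute A to the nth power using O(log n) matrix multiplications
    result = np.eye(A.shape[0], dtype=A.dtype)
    while n > 0:
        if n % 2 == 1:        # odd: fold the current power of A into the result
            result = result @ A
        A = A @ A             # square
        n //= 2
    return result

F = np.array([[1, 1], [1, 0]])
print(mat_pow(F, 4))          # [[5 3] [3 2]] -> the 4th Fibonacci number is 3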

We can go one step better, however, and compute the nth entry in the Fibonacci sequence in \(O(1)\) time (assuming a constant-time pow() function), whilst also gaining some intuition as to what happens as \(n\) grows.

Eigenvalues

Time to dust off some elementary linear algebra. An eigendecomposition of a matrix represents the matrix in terms of its eigenvalues and eigenvectors. The matrix \(F\) above can be factorized as \(F = Q \Lambda Q^{-1}\), where \(\Lambda\) is a diagonal matrix with the eigenvalues placed on the diagonals and \(Q\) is a matrix composed of the corresponding eigenvectors of \(F\).

An immediate and sane question to ask is why bother representing the matrix this way? Let's use this new matrix to compute the next Fibonacci value.

Other than the two constant matrix multiplications, the only extra work you need to do is raising the \(\Lambda\) matrix to the nth power! This holds true for any power you might want to apply, such that \(F^n = Q \Lambda ^ n Q^{-1}\).
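
Here's a rough numpy sketch of that identity, using the Fibonacci matrix from before:

import numpy as np

F = np.array([[1., 1.],
              [1., 0.]])

# F = Q diag(eigvals) Q^-1
eigvals, Q = np.linalg.eig(F)

n = 10
# Only the diagonal matrix of eigenvalues needs to be raised to the nth power
F_n = Q @ np.diag(eigvals ** n) @ np.linalg.inv(Q)
print(np.round(F_n))   # [[89. 55.] [55. 34.]] -> F^10, containing the 10th Fibonacci number 55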

What this simply means is that the eigenvalues dictate whether the result grows or decays as we perform repeated matrix multiplication.

Using the Fibonacci matrix above, notice that one eigenvalue explodes whilst the other eigenvalue vanishes as we get to larger and larger \(n\) for \(\Lambda^n\). Even at \(n = 10\), we find that one of the eigenvalues is almost irrelevant.
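
Plugging in the rounded eigenvalues of \(F\) (roughly 1.618 and -0.618) makes this concrete:

print(1.618 ** 10)      # ~123: this eigenvalue explodes
print((-0.618) ** 10)   # ~0.008: this eigenvalue vanishes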

Indeed, if you just multiply a reasonably large Fibonacci number by the largest eigenvalue, you get essentially the next Fibonacci value. It's no surprise that the largest eigenvalue happens to be the Golden ratio, \(\varphi\), or the converged ratio of consecutive Fibonacci numbers.

>>> import numpy as np
>>> [np.round(13 * 1.618 ** i) for i in range(5)]
[13.0, 21.0, 34.0, 55.0, 89.0]

The summary is that, as \(n\) gets asymptotically large:

  • The matrix \(F^n\) vanishes if the absolute values of all eigenvalues are smaller than 1
  • The matrix \(F^n\) maintains a constant norm if the absolute values of all eigenvalues are exactly 1
  • The matrix \(F^n\) explodes if the absolute value of any eigenvalue is larger than 1

Orthogonal Initialization

Orthogonal matrices review

Orthogonal matrices have many interesting properties but the most important for us is that all the eigenvalues of an orthogonal matrix have absolute value 1. This means that, no matter how many times we perform repeated matrix multiplication, the resulting matrix doesn't explode or vanish.

It's interesting to consider what the constraint that every eigenvalue must have absolute value 1 actually means. If we restrict ourselves to real numbers, the eigenvalues must all be either +1 or -1, resulting in very boring matrices. If we allow complex eigenvalues, however, we get far more interesting results when repeatedly multiplying.

Even if the eigenvalues are complex, they can still result in matrices that are all real. The simplest example of this is the 2x2 rotation matrix. Whilst the rotation matrix \(R\) below (which performs a 90 degree rotation) is a real matrix, the eigenvalues and eigenvectors are complex.
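
A small numpy sketch of that rotation matrix and its eigenvalues:

import numpy as np

# 2x2 rotation matrix for a 90 degree rotation
R = np.array([[0., -1.],
              [1.,  0.]])

eigvals, eigvecs = np.linalg.eig(R)
print(eigvals)           # the complex pair +1j and -1j
print(np.abs(eigvals))   # [1. 1.] -> both have absolute value 1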

Orthogonal init for RNNs

For this illustration, we'll look at a simplified RNN model: no inputs, no bias, an identity activation function \(f\), and an initial hidden state \(h_0\) equal to the identity matrix.
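
Writing \(W\) for the recurrent hidden-to-hidden weight matrix, these assumptions collapse the recurrence to pure repeated matrix multiplication:

\[
h_t = f(W h_{t-1}) = W h_{t-1}
\quad\Rightarrow\quad
h_n = W^n h_0 = W^n
\]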

When we perform repeated matrix multiplication within an RNN and the result explodes or vanishes, we also land in the realm of vanishing or exploding gradients. If the gradients vanish, training is at a near standstill as no information is backpropagated. If the gradients explode, training may never converge as the gradient update wildly fluctuates. Both these are hugely problematic and can occur in a very small number of timesteps.

When initializing the weight matrix in RNNs, it's not uncommon to use random uniform or random normal initialization. These methods give no guarantee about the eigenvalues of the weight matrix and are likely to lead to exploding or vanishing results.

If we instead use an orthogonal matrix to initialize the weights, the resulting matrix neither explodes nor vanishes. This allows for gradients to backpropagate more effectively.

To construct an orthogonal matrix, we can use singular value decomposition, as seen in the Lasagne init.py.
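
A minimal sketch of that construction, modelled on the Lasagne Orthogonal initializer (shown here for a square hidden-to-hidden weight matrix):

import numpy as np

def orthogonal(shape, gain=1.0):
    # Take the SVD of a random normal matrix and keep an orthogonal factor
    flat_shape = (shape[0], int(np.prod(shape[1:])))
    a = np.random.normal(0.0, 1.0, flat_shape)
    u, _, v = np.linalg.svd(a, full_matrices=False)
    q = u if u.shape == flat_shape else v
    return gain * q.reshape(shape)

W = orthogonal((64, 64))
print(np.abs(np.linalg.eigvals(W)))   # all equal to 1 (up to floating point error)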

Visualizing different inits

Let's see what happens to the resulting matrix when performing repeated matrix multiplication over 64 timesteps. Remember that it's also not uncommon for RNNs to run for hundreds or even thousands of timesteps. Neural networks also usually have additional complications such as a non-linear activation function and varying input.
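
Here's a rough numpy sketch of that experiment, printing the norm of the resulting matrix rather than rendering an image (the scales 0.05 and 0.3 are simply illustrative choices that give small and large eigenvalues respectively):

import numpy as np

np.random.seed(0)
size, timesteps = 64, 64

W_small = np.random.normal(0.0, 0.05, (size, size))   # eigenvalues well inside the unit circle
W_large = np.random.normal(0.0, 0.30, (size, size))   # some eigenvalues outside the unit circle
u, _, v = np.linalg.svd(np.random.normal(0.0, 1.0, (size, size)))
W_ortho = u                                            # orthogonal: all eigenvalues on the unit circle

for name, W in [('small random', W_small), ('large random', W_large), ('orthogonal', W_ortho)]:
    result = np.eye(size)
    for _ in range(timesteps):
        result = W @ result
    print(name, np.linalg.norm(result))   # vanishes, explodes, and stays constant respectively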

Vanishing matrix due to small eigenvalues

Exploding matrix due to an eigenvalue > 1

Orthogonal matrix

Conclusion

Orthogonal initialization is a simple yet relatively effective way of combatting exploding and vanishing gradients, especially when paired with other methods such as gradient clipping and more advanced architectures. While we haven't gone into the mathematical derivation of the gradients in detail, visualizing orthogonal initialization makes it clear why it can be so effective. The story becomes far more complicated when we add in real activation functions and input, however.

If you're interested in more detail, I highly recommend the papers below which all explore various aspects of this.

What to look at next

  • Exact solutions to the nonlinear dynamics of learning in deep linear neural networks - "... Despite the linearity of their input-output map, such networks have nonlinear gradient descent dynamics on weights that change with the addition of each new hidden layer. ... We further exhibit a new class of random orthogonal initial conditions on weights that, like unsupervised pre-training, enjoys depth independent learning times. We further show that these initial conditions also lead to faithful propagation of gradients even in deep nonlinear networks, as long as they operate in a special regime known as the edge of chaos."
  • On the difficulty of training Recurrent Neural Networks - "There are two widely known issues with properly training Recurrent Neural Networks, the vanishing and the exploding gradient problems detailed in Bengio et al. (1994). ... Our analysis is used to justify a simple yet effective solution. We propose a gradient norm clipping strategy to deal with exploding gradients and a soft constraint for the vanishing gradients problem."
  • Regularizing RNNs by Stabilizing Activations - "In the absence of inputs and nonlinearities, a constant norm would imply orthogonality of the hidden-to-hidden transition matrix for simple RNNs (SRNNs). However, in the case of an orthogonal transition matrix, inputs and nonlinearities can still change the norm of the hidden state, resulting in instability. This makes targeting the hidden activations directly a more attractive option for achieving norm stability. Stability becomes especially important when we seek to generalize to longer sequences at test time than those seen during training (the 'training horizon')."
  • Unitary Evolution Recurrent Neural Networks - "When the eigenvalues of the hidden to hidden weight matrix deviate from absolute value 1, optimization becomes difficult due to the well studied issue of vanishing and exploding gradients, especially when trying to learn long-term dependencies. To circumvent this problem, we propose a new architecture that learns a unitary weight matrix, with eigenvalues of absolute value exactly 1."
  • Recurrent Orthogonal Networks and Long-Memory Tasks - "In this work we analyze these “long-memory” tasks, and construct explicit RNN solutions for them. The solutions illuminate both the tasks, and provide a theoretical justification for the success of recent approaches using orthogonal initializations of, or unitary constraints on, the transition matrix of the RNN. ... We verify experimentally that initializing correctly (i.e. random orthogonal or identity) is critical for success on these tasks."
  • Strongly-Typed Recurrent Neural Networks