Neural Networks
Instructor: Jaskirat Singh
October 7, 2024
The vanishing and exploding gradient problems are significant challenges in training
Recurrent Neural Networks (RNNs), particularly when working with deep networks or with
long sequences of data. These issues arise primarily from the way backpropagation is carried out
through time, and they significantly affect the learning process of RNNs. Let’s explore each of
these problems in detail.
1 Backpropagation Through Time (BPTT) in RNNs
To understand the vanishing and exploding gradient problems, it’s helpful to briefly review
Backpropagation Through Time (BPTT), the algorithm used to train RNNs.
During BPTT, the RNN is unfolded over time, effectively forming a very deep network in which each
layer represents the RNN’s state at a different time step.
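To make the unrolling picture concrete, here is a small illustrative sketch in NumPy (not taken from the lecture; the sequence length, dimensions, weight scales, and variable names such as W_xh and W_hh are arbitrary toy choices). It computes the hidden state h_t = tanh(W_xh x_t + W_hh h_{t-1}) step by step, reusing the same weight matrices at every time step, so the unrolled computation behaves like a T-layer-deep feedforward network:

import numpy as np

rng = np.random.default_rng(0)

T, input_dim, hidden_dim = 20, 4, 8                           # assumed toy sizes
W_xh = rng.normal(scale=0.3, size=(hidden_dim, input_dim))    # input-to-hidden weights
W_hh = rng.normal(scale=0.3, size=(hidden_dim, hidden_dim))   # hidden-to-hidden weights, shared across all time steps
xs = rng.normal(size=(T, input_dim))                          # a random input sequence

h = np.zeros(hidden_dim)                                      # initial hidden state
hs = []                                                       # one hidden state per "layer" of the unrolled network
for t in range(T):
    h = np.tanh(W_xh @ xs[t] + W_hh @ h)                      # h_t = tanh(W_xh x_t + W_hh h_{t-1})
    hs.append(h)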
In BPTT, the gradients of the loss with respect to the weights are computed by propagating
errors back through each time step. Crucially, this process involves repeated multiplication by the
derivative of the activation function and by the recurrent weight matrix, which can cause gradients to either
shrink exponentially to near zero or grow exponentially to very large values. This phenomenon
is at the root of both the vanishing and exploding gradient problems.
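Continuing the sketch above, the following rough illustration of the backward sweep (again an assumption for illustration, using a toy loss L = 0.5 * ||h_T||^2 on the final hidden state only) pushes the gradient with respect to the hidden state back one step at a time. Each step multiplies by the tanh derivative (1 - h_t^2) and by the transpose of the shared recurrent weight matrix, so the printed gradient norms shrink toward zero with the small weight scale used here; increasing the scale of W_hh produces the opposite, exploding behaviour:

grad_h = hs[-1].copy()                                        # dL/dh_T for the toy loss L = 0.5 * ||h_T||^2
for t in reversed(range(T)):
    # one BPTT step: dL/dh_{t-1} = W_hh^T @ diag(1 - h_t^2) @ dL/dh_t
    grad_h = W_hh.T @ ((1.0 - hs[t] ** 2) * grad_h)
    print(f"{T - t:2d} steps back: gradient norm = {np.linalg.norm(grad_h):.3e}")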
2 The Vanishing Gradient Problem
• Definition: The vanishing gradient problem occurs when gradients become exceedingly
small as they are propagated back through time. As a result, the parameters associated with early layers (or time steps)
receive little to no update during training. This makes it extremely difficult for the network to
learn dependencies that lie far back in the sequence, hindering the model’s ability to capture long-term
relationships.
• Mathematical Perspective: During backpropagation, each partial derivative contains a
factor that depends on the recurrent weights and the activation function’s derivative. Activation
functions such as sigmoid or tanh have derivatives bounded by 1 (at most 0.25 for the sigmoid), meaning that