1. Recurrent Neural Network (RNN)
Answer An RNN models sequential interactions through a hidden state, or memory. It can
take up to N inputs and produce up to N outputs. For example, an input sequence may be
a sentence with the outputs being the part-of-speech tag for each word (N-to-N). An input
could be a sentence, and the output a sentiment classification of the sentence (N-to-1). An
input could be a single image, and the output could be a sequence of words corresponding
to the description of an image (1-to-N). At each time step, an RNN calculates a new hidden
state ("memory") based on the current input and the previous hidden state. The term
"recurrent" stems from the fact that at each step the same parameters are used and the
network performs the same calculations on different inputs.
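The recurrence described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the tanh nonlinearity, the names `rnn_step`, `rnn_forward`, `W_xh`, `W_hh`, `b_h`, and the toy dimensions are all assumptions chosen for the example.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One vanilla RNN step: new hidden state from current input and previous state."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

def rnn_forward(xs, h0, W_xh, W_hh, b_h):
    """Apply the same step, with the same parameters, to every element of the sequence."""
    h = h0
    hidden_states = []
    for x_t in xs:
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
        hidden_states.append(h)
    return np.stack(hidden_states)

# Toy sizes (assumed for illustration only).
rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 3, 4, 5
W_xh = rng.normal(scale=0.1, size=(input_dim, hidden_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

xs = rng.normal(size=(seq_len, input_dim))
hs = rnn_forward(xs, np.zeros(hidden_dim), W_xh, W_hh, b_h)
print(hs.shape)  # (5, 4): one hidden state per time step
```

Keeping every hidden state corresponds to the N-to-N case; an N-to-1 model would use only the last row of `hs`.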
2. LSTM (Long Short-Term Memory)
Answer The LSTM was invented to address the vanishing gradient problem in Recurrent
Neural Networks by using a memory gating mechanism. Using LSTM units to calculate the
hidden state in an RNN helps the network propagate gradients efficiently and learn
long-range dependencies.
3. How do RNN and LSTM update rules differ?
Answer In an LSTM, the cell state is updated in an additive way: something is added
to its previous value C_{t-1}. This differs from the multiplicative update rule of a
plain RNN.
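To make the contrast concrete, here is a hedged sketch of both update rules side by side. It assumes the commonly used LSTM formulation with forget, input, and output gates packed into one weight matrix; the function names, the gate ordering, and the parameter layout are illustrative choices, not something the guide specifies.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_update(x_t, h_prev, W_xh, W_hh):
    """Plain RNN: the previous state is multiplied by W_hh and squashed at every step."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh)

def lstm_update(x_t, h_prev, c_prev, W, b):
    """LSTM step. W has shape (input_dim + hidden_dim, 4 * hidden_dim) (assumed layout)."""
    z = np.concatenate([x_t, h_prev]) @ W + b
    f, i, o, g = np.split(z, 4)            # forget, input, output gates and candidate
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)
    g = np.tanh(g)
    c_t = f * c_prev + i * g               # additive update of the cell state
    h_t = o * np.tanh(c_t)
    return h_t, c_t

# Toy usage with assumed sizes.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
W = rng.normal(scale=0.1, size=(input_dim + hidden_dim, 4 * hidden_dim))
b = np.zeros(4 * hidden_dim)
h_t, c_t = lstm_update(rng.normal(size=input_dim), np.zeros(hidden_dim),
                       np.ones(hidden_dim), W, b)
print(h_t.shape, c_t.shape)
```

Note that in this formulation the previous cell value is scaled elementwise by the forget gate before the new term is added; when that gate is close to 1, the cell state passes through nearly unchanged.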
4. Gradients for RNNs
Answer The gradient computation involves repeated multiplication by the recurrent weight
matrix W. Multiplying by W at each time step has a bad effect. Think of it this way: if
you take a scalar (number)
, CS-7643 EXAM STUDY GUIDE
and multiply the gradient by it over and over again, say 100 times, the gradient explodes
if that number is > 1 and vanishes towards 0 if it is < 1.
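The scalar intuition above is easy to check numerically. The snippet below is only a toy illustration: the function name `repeated_scale` and the factors 1.1 and 0.9 are arbitrary choices for the example, and a real RNN multiplies by a matrix (and by activation derivatives) rather than a single number.

```python
def repeated_scale(w, steps=100, grad=1.0):
    """Multiply an initial gradient value by the same scalar w, `steps` times."""
    for _ in range(steps):
        grad *= w
    return grad

print(repeated_scale(1.1))  # factor > 1: grows to the order of 1e4
print(repeated_scale(0.9))  # factor < 1: shrinks to the order of 1e-5
print(repeated_scale(1.0))  # factor exactly 1: unchanged
```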
5. What does t-SNE stand for?
Answer t-Distributed Stochastic Neighbor Embedding
6. What is t-SNE?
Answer t-SNE is an unsupervised, non-linear technique primarily used for data exploration
and visualization of high-dimensional data. In short, it gives you a feel or intuition for
how the data is arranged in high-dimensional space.
7. How does t-SNE conceptually work?
Answer The algorithm calculates a similarity measure between pairs of instances in the
high dimensional space and in the low dimensional space. It then tries to optimize these
two similarity measures using a cost function. Let's break that down into 3 basic steps:
1.) Center a Gaussian distribution on each data point in the high-dimensional space.
- measure the density of all other points under that Gaussian
- renormalize over all points
- this gives a set of probabilities (Pij) proportional to the similarities
Note: the width of the Gaussian is controlled by the perplexity
2.) Do the same in the low-dimensional space, but use a Student t-distribution with one
degree of freedom, which is also known as the Cauchy distribution. This gives us a second
set of probabilities (Qij) in the low-dimensional space.
3.) We want this set of probabilities from the low-dimensional space (Qij) to reflect those
of the high-dimensional space (Pij) as closely as possible. We want the two map structures
to be similar. We measure the difference between the probability distributions of the
two spaces using the Kullback-Leibler divergence (KL).
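The three steps can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the full algorithm: it uses one fixed Gaussian width `sigma` for every point (real t-SNE tunes a width per point to match the chosen perplexity), it only evaluates the cost for a random low-dimensional layout instead of optimizing it, and all function names are made up for the example.

```python
import numpy as np

def pairwise_sq_dists(X):
    """Squared Euclidean distance between every pair of rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def high_dim_affinities(X, sigma=1.0):
    """Step 1: Gaussian similarities P_ij (fixed sigma is a simplifying assumption)."""
    logits = -pairwise_sq_dists(X) / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)            # a point is not its own neighbor
    cond = np.exp(logits)
    cond /= cond.sum(axis=1, keepdims=True)      # renormalize per point: p_{j|i}
    return (cond + cond.T) / (2.0 * X.shape[0])  # symmetrize so all P_ij sum to 1

def low_dim_affinities(Y):
    """Step 2: Student t (one degree of freedom, i.e. Cauchy) similarities Q_ij."""
    kernel = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(kernel, 0.0)
    return kernel / kernel.sum()

def kl_divergence(P, Q, eps=1e-12):
    """Step 3: KL(P || Q), the cost that t-SNE minimizes over the layout Y."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 10))   # 20 toy points in 10 dimensions
Y = rng.normal(size=(20, 2))    # a random 2D layout, not yet optimized
P = high_dim_affinities(X)
Q = low_dim_affinities(Y)
print(kl_divergence(P, Q))      # positive; optimizing Y would push this down
```

In the full method, the layout `Y` is updated by gradient descent on this KL cost until the low-dimensional similarities match the high-dimensional ones as well as possible.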