Questions and 100% Verified Correct Answers
Collobert & Weston Vectors - CORRECT ANSWER: a word together with its context is a
positive example; a negative example is made by placing a random word in the context of
the original word. Similar to an SVM, the algorithm uses a margin loss to increase the
margin between the scores of positive and negative examples.
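A minimal sketch of the margin idea, assuming the network has already produced one score for the true window and one for the corrupted window. The function name, the margin of 1.0, and the scores are illustrative, not taken from the original algorithm's code.

```python
def margin_ranking_loss(pos_score, neg_score, margin=1.0):
    """Hinge-style loss: zero once the positive example outscores
    the negative example by at least `margin`."""
    return max(0.0, margin - pos_score + neg_score)

# Hypothetical scores from a scoring network (illustrative values only).
loss_violated = margin_ranking_loss(pos_score=0.5, neg_score=0.25)   # margin not met
loss_satisfied = margin_ranking_loss(pos_score=2.0, neg_score=0.5)   # margin met
```

The loss is positive only while the positive example fails to beat the negative one by the margin, so training pushes the two scores apart.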
Conditional Language Modeling - CORRECT ANSWER: allows you to determine the
probability of a sequence of words conditioned on context c.
Connection between Negative Sampling and Collobert & Weston Algorithm? -
CORRECT ANSWER: - NS is similar to Collobert & Weston's algorithm: both train a
network to distinguish "good" word-context pairs from "bad" ones
- Collobert & Weston uses a margin loss, whereas NS uses a probabilistic objective
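For contrast with a margin loss, here is a rough sketch of the probabilistic objective usually written for negative sampling: the negative log-sigmoid of the true pair's inner product, plus the negative log-sigmoid of the negated inner product for each sampled word. The vectors and helper names are made up for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def negative_sampling_loss(center, context, negatives):
    """-log sigmoid(center . context) - sum over negatives of log sigmoid(-center . neg)."""
    loss = -math.log(sigmoid(dot(center, context)))
    for neg in negatives:
        loss -= math.log(sigmoid(-dot(center, neg)))
    return loss

# Toy 2-d vectors (assumed values): the true context points the same way as the center word.
center = [1.0, 0.0]
loss_good = negative_sampling_loss(center, [2.0, 0.0], [[-2.0, 0.0]])
loss_bad = negative_sampling_loss(center, [-2.0, 0.0], [[2.0, 0.0]])
```

The loss is low when the true pair has a large inner product and the sampled pairs have small ones, which is the "good pair vs. bad pair" distinction expressed as probabilities rather than as a margin.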
Distributional Semantics - CORRECT ANSWER: a word's meaning is given by the words that
frequently appear around it.
Extrinsic Evaluation for word embeddings? - CORRECT ANSWER: e.g. Text
Classification
1.) Evaluation on a real task
2.) Can take a long time to compute
3.) Unclear whether the subsystem is the problem or its interaction with other subsystems
4.) If replacing one subsystem with another improves performance, that's a win
Gradients for RNNs - CORRECT ANSWER: The gradient computation involves repeated
multiplication by the recurrent weight matrix W. Multiplying by W at every cell has a bad
effect. Think of it like this: if you have a scalar (number) and you multiply the gradient
by it over and over again, say 100 times, then if the number's magnitude is > 1 the
gradient explodes, and if it is < 1 the gradient vanishes towards 0.
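The scalar intuition can be checked directly. This is only a toy, assuming a single number stands in for W and the gradient starts at 1.0; the function name is made up for illustration.

```python
def repeated_scale(w, steps=100, grad=1.0):
    """Multiply a starting gradient by the same scalar w, `steps` times."""
    for _ in range(steps):
        grad *= w
    return grad

exploding = repeated_scale(1.1)   # factor slightly above 1: grows very large
vanishing = repeated_scale(0.9)   # factor slightly below 1: shrinks toward 0
unchanged = repeated_scale(1.0)   # factor of exactly 1: gradient is preserved
```

Even factors close to 1 drift far from the starting value after 100 steps, which is why long sequences are hard for a plain RNN.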
Graph Embeddings - CORRECT ANSWER: Generalization of word embedding
How do RNN and LSTM update rules differ? - CORRECT ANSWER: In an LSTM, the cell state
is updated additively, by adding something to its previous value C_t-1. This differs from
the multiplicative update rule of an RNN.
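A scalar sketch of the two update rules, assuming the gate values are simply passed in as numbers (in a real LSTM they are computed from the input and the previous hidden state). The function names and default weights are illustrative.

```python
import math

def rnn_step(h_prev, x, w=0.5, u=1.0):
    """Vanilla RNN: the previous state is multiplied by w, then squashed by tanh."""
    return math.tanh(w * h_prev + u * x)

def lstm_cell_step(c_prev, forget_gate, input_gate, candidate):
    """LSTM cell state: gated previous value plus a gated new candidate (additive)."""
    return forget_gate * c_prev + input_gate * candidate

# With the forget gate fully open and the input gate closed, the cell state is carried over as-is.
carried = lstm_cell_step(c_prev=2.0, forget_gate=1.0, input_gate=0.0, candidate=5.0)
# With both gates open, the candidate is added on top of the previous value.
added = lstm_cell_step(c_prev=2.0, forget_gate=1.0, input_gate=1.0, candidate=0.5)
```

Because the previous cell state can pass through unchanged, the LSTM has a path along which information is not repeatedly rescaled by the same weight.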
How does t-SNE conceptually work? - CORRECT ANSWER: The algorithm calculates a
similarity measure between pairs of instances in the high dimensional space and in the
low dimensional space. It then tries to make these two sets of similarities match by
minimizing a cost function. Let's break that down into 3 basic steps:
1.) center a Gaussian distribution on each data point in the high dimensional space
(to visualize it, picture points scattered on a 2D plane).
- measure the density of all other points under that Gaussian
- renormalize over all points
- this gives probabilities (Pij) proportional to the similarities
Note: the width of the Gaussian is controlled by the perplexity
2.) do the same in the low dimensional space, but use a Student t-distribution with one
degree of freedom, which is also known as the Cauchy distribution. This gives us a second
set of probabilities (Qij) in the low dimensional space
3.) we want the set of probabilities from the low dimensional space (Qij) to reflect
those of the high dimensional space (Pij) as closely as possible, so that the two map
structures are similar. We measure the difference between the probability distributions
of the two spaces using the Kullback-Leibler (KL) divergence.
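The three steps can be sketched on a toy example. This is a simplified illustration, not the full algorithm: it assumes one shared Gaussian width instead of a per-point width tuned by perplexity, it normalizes over all pairs at once, and it only evaluates the KL cost for two fixed candidate layouts instead of optimizing the layout. All names and data points are made up.

```python
import math

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def pair_probs(points, kernel):
    """Turn pairwise squared distances into probabilities that sum to 1 over all pairs i != j."""
    n = len(points)
    weights = {(i, j): kernel(sq_dist(points[i], points[j]))
               for i in range(n) for j in range(n) if i != j}
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}

def gaussian_kernel(d2, sigma=1.0):
    # Step 1: Gaussian similarity in the high dimensional space (sigma is assumed fixed here).
    return math.exp(-d2 / (2.0 * sigma ** 2))

def student_t_kernel(d2):
    # Step 2: Student t with one degree of freedom (Cauchy) in the low dimensional space.
    return 1.0 / (1.0 + d2)

def kl_divergence(P, Q):
    # Step 3: KL(P || Q), the cost that measures the mismatch between the two distributions.
    return sum(p * math.log(p / Q[pair]) for pair, p in P.items() if p > 0.0)

# Toy data: points 0 and 1 are close in 3-d, point 2 is far away.
high = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 5.0, 5.0)]
low_good = [(0.0, 0.0), (1.0, 0.0), (6.0, 6.0)]   # keeps 0 and 1 close
low_bad = [(0.0, 0.0), (6.0, 6.0), (1.0, 0.0)]    # pulls 0 and 1 apart

P = pair_probs(high, gaussian_kernel)
cost_good = kl_divergence(P, pair_probs(low_good, student_t_kernel))
cost_bad = kl_divergence(P, pair_probs(low_bad, student_t_kernel))
```

The layout that keeps neighbors together gets the lower KL cost, which is the quantity the real algorithm reduces by moving the low dimensional points.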
How to calculate probability of context word with respect to center word? - CORRECT ANSWER:
- take the inner product of U and V to measure how likely center word w is to appear with
context word o.
- U is the vector used when w is a center word
- V is the vector used when o is a context word
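A small sketch of turning the inner products into a probability. The card only mentions the inner product; the softmax normalization over all candidate context words is the standard skip-gram formulation and is added here as an assumption. The vectors and function names are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def context_probs(u_center, context_vectors):
    """Softmax over the inner products of one center vector with every context vector."""
    scores = [dot(u_center, v) for v in context_vectors]
    top = max(scores)                              # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vectors (assumed values): u_w for the center word, one V row per candidate context word.
u_w = [1.0, 0.0]
V = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]
probs = context_probs(u_w, V)
```

The context word whose vector has the largest inner product with the center vector receives the highest probability.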