Embedding
- correct answer A learned map from entities to vectors that encodes similarity
Graph Embedding
- correct answer Optimize an objective under which connected nodes get more similar embeddings than unconnected nodes (a minimal sketch follows this card).
Task: convert nodes to vectors
- effectively unsupervised learning where nearest neighbors are similar
- these learned vectors are useful for downstream tasks
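A minimal sketch of that objective in plain numpy (the edge list, sizes, and learning rate below are illustrative assumptions, not from the source): each edge pulls its endpoints' vectors together through a logistic loss, while a randomly sampled non-neighbor is pushed away, so nearby vectors end up corresponding to connected nodes.

```python
import numpy as np

# Hypothetical toy graph: two triangles of nodes (illustrative only).
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3)]
n_nodes, dim, lr = 6, 8, 0.1
rng = np.random.default_rng(0)
Z = rng.normal(scale=0.1, size=(n_nodes, dim))    # one embedding vector per node

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for epoch in range(200):
    for u, v in edges:
        w = int(rng.integers(n_nodes))            # random negative sample
        if w in (u, v):
            continue
        zu, zv, zw = Z[u].copy(), Z[v].copy(), Z[w].copy()
        s_pos = sigmoid(zu @ zv)                  # want near 1 (edge)
        s_neg = sigmoid(zu @ zw)                  # want near 0 (non-edge)
        # SGD step on the loss  -log s_pos - log(1 - s_neg)
        Z[u] += lr * ((1 - s_pos) * zv - s_neg * zw)
        Z[v] += lr * (1 - s_pos) * zu
        Z[w] -= lr * s_neg * zu

print(Z[0] @ Z[1], Z[0] @ Z[3])   # connected pair scores higher than unconnected
```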
Multi-layer Perceptron (MLP) pain points for NLP
- correct answer - Cannot easily support variable-sized sequences as inputs or outputs
- No inherent temporal structure
- No practical way of holding state
- The size of the network grows with the maximum allowed size of the input or output sequences
Truncated Backpropagation through time
- correct answer - Only backpropagate an RNN through the last T time steps, detaching the hidden state beyond that (see the sketch below)
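A sketch of one way to do this, assuming PyTorch; the model, random data, and T = 20 below are made up for illustration. The long sequence is processed in chunks of T steps, and the hidden state is detached between chunks so gradients flow back at most T steps.

```python
import torch
import torch.nn as nn

# Hypothetical setup: sizes and data are illustrative, not from the source.
rnn = nn.RNN(input_size=4, hidden_size=8, batch_first=True)
head = nn.Linear(8, 1)
opt = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.01)

seq = torch.randn(1, 100, 4)       # one long sequence of 100 steps
target = torch.randn(1, 100, 1)
T = 20                             # truncation length

h = None
for start in range(0, seq.size(1), T):
    chunk = seq[:, start:start + T]
    out, h = rnn(chunk, h)
    loss = nn.functional.mse_loss(head(out), target[:, start:start + T])
    opt.zero_grad()
    loss.backward()                # gradients flow at most T steps back
    opt.step()
    h = h.detach()                 # cut the graph before the next chunk
```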
Recurrent Neural Networks (RNN)
- correct answer h(t) = activation(U*x(t) + V*h(t-1) + bias)
y(t) = activation(W*h(t) + bias)
- activation is typically the logistic function or tanh
- outputs can also simply be h(t)
- family of NN architectures for modeling sequences (forward pass sketched below)
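A minimal numpy forward pass for those equations (the dimensions and random weights below are illustrative, not from the source):

```python
import numpy as np

def rnn_forward(xs, U, V, W, b_h, b_y):
    """Vanilla RNN following the card's equations:
    h(t) = tanh(U @ x(t) + V @ h(t-1) + b_h)
    y(t) = tanh(W @ h(t) + b_y)
    """
    h = np.zeros(V.shape[0])
    ys = []
    for x in xs:
        h = np.tanh(U @ x + V @ h + b_h)      # hidden state carries history
        ys.append(np.tanh(W @ h + b_y))       # per-step output
    return ys, h

# Illustrative sizes: 3-dim inputs, 5-dim hidden state, 2-dim outputs.
rng = np.random.default_rng(0)
U, V, W = rng.normal(size=(5, 3)), rng.normal(size=(5, 5)), rng.normal(size=(2, 5))
ys, h = rnn_forward(rng.normal(size=(10, 3)), U, V, W, np.zeros(5), np.zeros(2))
```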
Training Vanilla RNNs difficulties
- correct answer - Vanishing and exploding gradients
- in the scalar case, each backward step multiplies the gradient by the recurrent weight, so dh(t)/dh(0) ≈ w^t (ignoring the nonlinearity)
- if |w| > 1: exploding gradients
- if |w| < 1: vanishing gradients (numeric demo below)
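A scalar demo of that w^t factor (the weights 0.9 and 1.1 and the 50-step horizon are arbitrary choices for illustration):

```python
# Repeated multiplication by the recurrent weight shrinks or blows up
# the backpropagated gradient.
for w in (0.9, 1.1):
    grad = 1.0
    for t in range(50):
        grad *= w                  # one step of dh(t)/dh(t-1) = w
    print(f"w={w}: gradient factor after 50 steps = {grad:.3g}")
# w=0.9 -> ~0.005 (vanishing); w=1.1 -> ~117 (exploding)
```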
Long Short-Term Memory Network Gates and States
- correct answer - f(t) = forget gate
- i(t) = input gate
- u(t) = candidate update
- o(t) = output gate
- c(t) = cell state
- c(t) = f(t) * c(t - 1) + i(t) * u(t), with * elementwise
- h(t) = hidden state
- h(t) = o(t) * tanh(c(t)) (one step sketched below)
Perplexity(s)
- correct answer = product( 1 / P(w(i) | w(i-1), ...) ) ^ (1 / N)
= b ^ ( -(1/N) * sum( log_b P(w(i) | w(i-1), ...) ) )
- note: the exponent of b is the per-word cross-entropy loss (worked example below)
- perplexity of a discrete uniform distribution over k events is k
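A worked example checking that both forms agree (the per-word probabilities are made up for illustration):

```python
import math

# Hypothetical per-word probabilities P(w(i) | history) for a 4-word sentence.
probs = [0.2, 0.1, 0.5, 0.25]
N = len(probs)

# Product form: ( prod 1/p ) ** (1/N)
pp_product = math.prod(1 / p for p in probs) ** (1 / N)

# Exponentiated cross-entropy form with base b = 2:
# b ** ( -(1/N) * sum(log_b p) ); the exponent is the per-word CE loss.
ce = -sum(math.log2(p) for p in probs) / N
pp_ce = 2 ** ce

print(pp_product, pp_ce)          # both ~= 4.47

# Sanity check: a uniform distribution over k = 4 events has perplexity 4.
assert abs(math.prod(1 / 0.25 for _ in range(8)) ** (1 / 8) - 4) < 1e-9
```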
Language Model Goal
- correct answer - estimate the probability of word sequences via the chain rule, P(w(1), ..., w(N)) = product( P(w(i) | w(i-1), ..., w(1)) ) (toy example below)
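A toy bigram sketch of that goal (the probability table is invented for illustration): the chain rule turns the sequence probability into a product of per-word conditionals, here conditioned only on the previous word.

```python
# Hypothetical bigram probabilities P(word | prev); values are illustrative.
p = {("<s>", "the"): 0.4, ("the", "cat"): 0.1, ("cat", "sat"): 0.3}
sentence = ["<s>", "the", "cat", "sat"]

prob = 1.0
for prev, word in zip(sentence, sentence[1:]):
    prob *= p[(prev, word)]       # multiply in P(word | prev)
print(prob)                       # 0.4 * 0.1 * 0.3 = 0.012
```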