CS7643 QUIZ 5 QUESTIONS WITH DETAILED VERIFIED
ANSWERS (100% CORRECT ANSWERS) /ALREADY
GRADED A +
Neural Attention - ANSWER-- A weighting or probability distribution over inputs that depends on the
computational state and the inputs
- How is it computed?
1. "Hard" - where samples are drawn from the distribution over the input
2. "Soft" - where the distribution is used directly as a weighted average
- Allows information to propagate between distant computational nodes while making minimal structural
assumptions
- The most standard form of attention is softmax attention
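The difference between soft and hard attention can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the dot-product scoring, the helper names (`attention_weights`, `soft_attention`, `hard_attention`) and the toy vectors are assumptions made for the example.

```python
import math
import random

def softmax(scores):
    """Turn raw scores into non-negative weights that sum to one."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(q, U):
    """Score each input u in U against the state q (dot product, assumed), then normalize."""
    scores = [sum(qi * ui for qi, ui in zip(q, u)) for u in U]
    return softmax(scores)

def soft_attention(q, U):
    """Soft attention: use the distribution directly as a weighted average of the inputs."""
    w = attention_weights(q, U)
    dim = len(U[0])
    return [sum(w[j] * U[j][d] for j in range(len(U))) for d in range(dim)]

def hard_attention(q, U, rng):
    """Hard attention: draw one input as a sample from the distribution."""
    w = attention_weights(q, U)
    index = rng.choices(range(len(U)), weights=w, k=1)[0]
    return U[index]

q = [1.0, 0.0]                               # toy computational state
U = [[2.0, 0.0], [0.0, 2.0], [-1.0, 1.0]]    # toy inputs
soft = soft_attention(q, U)                  # a blend of all three inputs
hard = hard_attention(q, U, random.Random(0))  # exactly one of the inputs
```

The soft result is a blend of every input and is differentiable in the weights; the hard result is always one of the original inputs, chosen at random.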
Softmax Properties - ANSWER-- Outputs are non-negative and sum to one (gives a valid probability
distribution for any input)
- Performed on sets: permuting the inputs permutes the weights the same way, so the weighted sum is unchanged (permutation invariant)
- Not linear
- Doubling the inputs puts more mass on the largest input (the distribution sharpens)
- Softmax is differentiable
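The properties above can be checked numerically. The following is a small sketch using only the standard library; the input vector `x` is an arbitrary example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

x = [1.0, 2.0, 3.0]
p = softmax(x)

# Property: the outputs sum to one.
total = sum(p)

# Property: permuting the inputs permutes the outputs in the same way.
p_reversed = softmax(x[::-1])

# Property: not linear. softmax(2x) is not 2 * softmax(x);
# instead, doubling the inputs moves mass toward the largest input.
p_doubled = softmax([2 * v for v in x])
```

Comparing `p[2]` with `p_doubled[2]` shows the largest input gaining probability mass after doubling.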
Softmax Attention vs Final Layer of MLP - ANSWER-Attention:
- q is an internal hidden state, U holds the embeddings of the inputs (previous layer)
- distribution corresponds to a summary of U
MLP:
- q is last hidden state, U is embedding of class labels
- distribution corresponds to labelings (outputs)
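To make the comparison concrete, here is a hedged sketch in which the same softmax-over-scores operation is used in both roles. The function name `softmax_over`, the dot-product scoring, and all vectors are illustrative assumptions.

```python
import math

def softmax_over(q, U):
    """Softmax of the dot products between q and each row of U."""
    scores = [sum(a * b for a, b in zip(q, u)) for u in U]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

hidden = [1.0, -1.0]  # toy hidden state q

# Attention: U holds input embeddings; the distribution is used to summarize U.
input_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attn = softmax_over(hidden, input_embeddings)
summary = [sum(attn[j] * input_embeddings[j][c] for j in range(3)) for c in range(2)]

# MLP final layer: U holds one embedding per class label; the distribution is the output.
class_embeddings = [[2.0, 0.0], [0.0, 2.0]]
class_probs = softmax_over(hidden, class_embeddings)
predicted = max(range(len(class_probs)), key=class_probs.__getitem__)
```

The arithmetic is identical; only the meaning of U and the use of the resulting distribution differ.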
Position Embedding - ANSWER-- A vector that depends only on the location in the sequence and is
added to the input placed at that location in the sequence.
- Adds information about the absolute and relative locations of inputs
--> Needed in transformer architectures because they are attention-based, not sequential, so order is otherwise invisible to the model
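One common way to build such a vector is the sinusoidal scheme; the notes do not say which scheme is meant, so treat the choice below as an assumption. The helper names are hypothetical.

```python
import math

def sinusoidal_position(pos, dim):
    """Position vector that depends only on pos (sinusoidal scheme, assumed here)."""
    vec = []
    for i in range(dim):
        angle = pos / (10000 ** (2 * (i // 2) / dim))
        # even indices use sine, odd indices use cosine
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec

def add_positions(embeddings):
    """Add the position vector for location pos to the input placed at pos."""
    dim = len(embeddings[0])
    return [
        [x + p for x, p in zip(emb, sinusoidal_position(pos, dim))]
        for pos, emb in enumerate(embeddings)
    ]

# The same token embedding at two different locations...
tokens = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
# ...becomes two different vectors once position is added.
out = add_positions(tokens)
```

Without the added vectors the two rows would be identical, and attention could not tell the positions apart.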
Transformers - ANSWER-- multi-layer attention model that is state of the art in most language tasks
- Superior compared to previous attention architectures because:
1. Multi-query hidden-state propagation ("Self-attention") (MOST IMPORTANT THING)
2. Multi-head attention
3. Residual Connections, LayerNorm
Transformers: Self Attention (Multi-query hidden-state propagation) - ANSWER-- improves on single-query softmax
attention by having a controller (query) for every input, so the size of the controller state grows with the input size
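A minimal sketch of the idea: every input acts as its own query over all inputs. For brevity this omits the learned query, key and value projections and uses the inputs directly, with the usual 1/sqrt(d) scaling; both choices are simplifying assumptions.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """One output per input: each row of X queries every row of X."""
    d = len(X[0])
    outputs = []
    for q in X:  # one controller (query) per input
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        w = softmax(scores)
        outputs.append([sum(w[j] * X[j][c] for j in range(len(X))) for c in range(d)])
    return outputs

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy hidden states, one per token
Y = self_attention(X)                      # same shape as X
```

The output has one row per input, which is why the controller state grows with the input size.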
Transformers: Multi-head attention - ANSWER-- combines multiple attention 'heads' that are trained in the
same way on the same data but with different weight matrices
- each of the L attention heads yields values for each token; these values are then multiplied by trained
parameters and added (equivalently, concatenated and passed through a trained linear projection)
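The combination step can be sketched as follows. Random matrices stand in for trained weights, and the sizes (`d_model`, `d_head`, `num_heads`) and helper names are assumptions chosen for the example.

```python
import math
import random

def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: one output row per query row."""
    d = len(Q[0])
    out = []
    for q in Q:
        w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K])
        out.append([sum(w[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))])
    return out

def random_matrix(rows, cols, rng):
    """Stand-in for a trained weight matrix."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, heads, W_o):
    # Each head sees the same data but has its own weight matrices.
    head_outputs = [
        attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
        for W_q, W_k, W_v in heads
    ]
    # Concatenate the heads' values per token, then mix them with trained parameters.
    concat = [sum((h[t] for h in head_outputs), []) for t in range(len(X))]
    return matmul(concat, W_o)

rng = random.Random(0)
d_model, d_head, num_heads = 4, 2, 2
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3)) for _ in range(num_heads)]
W_o = random_matrix(num_heads * d_head, d_model, rng)
X = random_matrix(3, d_model, rng)       # three toy tokens
Y = multi_head_attention(X, heads, W_o)  # one d_model-sized row per token
```

Multiplying the concatenated values by `W_o` is the same as multiplying each head's values by its own block of `W_o` and adding the results.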
Causal Attention - ANSWER-- Attention mask (a way of putting a graph structure on a transformer)
- Masks out attention weights that don't go from left to right, so each token attends only to itself and earlier tokens
--> training code outputs a prediction at each token simultaneously (and takes gradients simultaneously)
--> massively speeds up training (by a factor on the order of the context size)
--> Not necessary for masked language models like BERT, which attend in both directions
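A causal mask can be sketched as a lower-triangular table of allowed positions applied before the softmax. The helper names and the all-zero toy scores are assumptions for illustration.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when token i may attend to token j (j at or before i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    """Softmax that gives zero weight to disallowed positions."""
    masked = [s if ok else -math.inf for s, ok in zip(scores, allowed)]
    m = max(masked)  # finite, because each token may always attend to itself
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

n = 3
scores = [[0.0] * n for _ in range(n)]  # toy attention scores, all equal
mask = causal_mask(n)
weights = [masked_softmax(row, allowed) for row, allowed in zip(scores, mask)]
# Row i spreads its weight over tokens 0..i only, so every row is a
# valid next-token prediction context and all rows are computed at once.
```

Because row i never looks at later tokens, one forward pass yields a training signal at every position.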
Attention vs. Seq2Seq Modeling - ANSWER-- Plain Seq2Seq passes a single context (the last encoder hidden state) to
the decoder; attention passes all encoder hidden states to the decoder
- At each step the decoder computes a weighted sum of all hidden states to form a single context vector
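The two ways of building the context vector can be contrasted in a short sketch. The dot-product scoring and the toy encoder and decoder states are assumptions made for the example.

```python
import math

def seq2seq_context(encoder_states):
    """Plain Seq2Seq: the decoder only receives the last hidden state."""
    return encoder_states[-1]

def attention_context(decoder_state, encoder_states):
    """Attention: weighted sum of all hidden states, weights from dot-product scores."""
    scores = [sum(a * b for a, b in zip(decoder_state, h)) for h in encoder_states]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(encoder_states[0])
    return [sum(w * h[c] for w, h in zip(weights, encoder_states)) for c in range(dim)]

encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # toy encoder hidden states
decoder_state = [0.0, 2.0]                              # toy decoder state

fixed = seq2seq_context(encoder_states)                 # same for every decoder step
attended = attention_context(decoder_state, encoder_states)  # changes with decoder_state
```

The fixed context ignores the decoder state, while the attended context leans toward the encoder states most similar to it.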
BERT is a stack of - ANSWER-Encoder Modules