CS7643 QUIZ 5 EXAM WITH CORRECT
QUESTIONS AND ANSWERS 2025
Neural Attention - CORRECT-ANSWERS- A weighting or probability distribution over inputs
that depends on the computational state and the inputs
- How is it computed?
1. "Hard" - where samples are drawn from the distribution over the input
2. "Soft" - where the distribution is used directly as a weighted average
- Allows information to propagate between distant computational nodes while making minimal
structural assumptions
- The most common form of attention is softmax attention
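The hard/soft distinction above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the dot-product scoring, the names `attention_weights`, `soft_attention`, `hard_attention`, and the toy vectors `q` and `U` are all assumptions made for the example.

```python
import math
import random

def softmax(scores):
    # Subtract the max for numerical stability; the result is unchanged.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_weights(q, U):
    # Assumed scoring rule: dot product between the state q and each input u.
    return softmax([dot(q, u) for u in U])

def soft_attention(q, U):
    # "Soft": use the distribution directly as a weighted average of the inputs.
    w = attention_weights(q, U)
    dim = len(U[0])
    return [sum(w[i] * U[i][d] for i in range(len(U))) for d in range(dim)]

def hard_attention(q, U, rng):
    # "Hard": draw a single input by sampling from the distribution.
    w = attention_weights(q, U)
    idx = rng.choices(range(len(U)), weights=w, k=1)[0]
    return U[idx]

U = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy inputs (hypothetical)
q = [2.0, 0.5]                             # toy computational state (hypothetical)
weights = attention_weights(q, U)
soft = soft_attention(q, U)
hard = hard_attention(q, U, random.Random(0))
```

Soft attention returns a blend of all inputs and stays differentiable end to end; hard attention returns exactly one of the inputs, so training it typically needs a sampling-based gradient estimator.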
Softmax Properties - CORRECT-ANSWERS- Probabilities sum to one (the output is a valid probability
distribution regardless of the input)
- Performed on sets: permuting the inputs permutes the weights the same way, so the
attention-weighted sum does not depend on input order (permutation invariant)
- Not linear
- Scaling the inputs up (e.g., doubling them) puts more mass on the largest input
- Softmax is differentiable
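These properties are easy to check numerically. The snippet below is a small sketch with a hand-written `softmax` and an arbitrary example vector `x`; both are chosen only for illustration.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability; the result is unchanged.
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]

x = [1.0, 2.0, 3.0]
p = softmax(x)

# Sums to one, whatever the input values are.
total_mass = sum(p)

# Doubling the inputs sharpens the distribution toward the largest input.
p_doubled = softmax([2 * v for v in x])

# Not linear: softmax(2x) differs from 2 * softmax(x).
twice_p = [2 * v for v in p]

# Reordering the inputs reorders the weights the same way, so a weighted
# sum over the set comes out the same for any ordering.
order = [2, 0, 1]
x_perm = [x[i] for i in order]
p_perm = softmax(x_perm)
weighted_sum = sum(w * v for w, v in zip(p, x))
weighted_sum_perm = sum(w * v for w, v in zip(p_perm, x_perm))
```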
Softmax Attention vs Final Layer of MLP - CORRECT-ANSWERS- Attention:
- q is an internal hidden state, U is the embeddings of input (previous layer)
- distribution corresponds to a summary of U
MLP:
- q is last hidden state, U is embedding of class labels
- distribution corresponds to labelings (outputs)
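One way to read this comparison is that both cases compute a softmax over scores between q and the rows of U; only the meaning of q and U changes. The sketch below assumes dot-product scores, and all names and toy values are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def distribution(q, U):
    # Same computation in both settings: softmax over dot products of q with rows of U.
    return softmax([sum(a * b for a, b in zip(q, u)) for u in U])

# Attention: q is an internal hidden state, U holds embeddings of the inputs.
hidden_state = [1.0, 0.0]
input_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
attn = distribution(hidden_state, input_embeddings)
# The distribution is used to build a summary (weighted average) of U.
summary = [sum(w * u[d] for w, u in zip(attn, input_embeddings)) for d in range(2)]

# MLP final layer: q is the last hidden state, U holds one embedding per class label.
last_hidden = [1.0, 0.0]
class_embeddings = [[2.0, 0.0], [0.0, 2.0]]
class_probs = distribution(last_hidden, class_embeddings)
# The distribution itself is the output: a probability for each label.
predicted = max(range(len(class_probs)), key=class_probs.__getitem__)
```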
Position Embedding - CORRECT-ANSWERS- A vector that depends only on the location in the
sequence and is added to the input placed at that location.
- Adds information about the absolute and relative locations of inputs
--> Needed in transformer architectures because they are attention-based rather than sequential,
so without it they have no notion of input order
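As a concrete illustration, the sketch below uses the sinusoidal scheme from the original Transformer paper. That choice is an assumption for the example (the notes do not name a scheme, and learned position embeddings are another common option); the function names are hypothetical.

```python
import math

def sinusoidal_position_embedding(pos, dim):
    # Depends only on the position, never on the token at that position.
    emb = []
    for i in range(dim):
        angle = pos / (10000 ** ((2 * (i // 2)) / dim))
        # Even indices use sine, odd indices use cosine of the same angle.
        emb.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return emb

def add_position(inputs):
    # Add the position vector to the input embedding at each location.
    dim = len(inputs[0])
    return [
        [x + p for x, p in zip(vec, sinusoidal_position_embedding(pos, dim))]
        for pos, vec in enumerate(inputs)
    ]

# Two identical token embeddings at positions 0 and 1.
tokens = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
encoded = add_position(tokens)
```

After the addition, the two identical tokens have different representations, which is what lets an order-agnostic attention layer tell them apart.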
Transformers - CORRECT-ANSWERS- A multi-layer attention model that is state of the art in
most language tasks
- Improves on previous attention architectures because of:
1. Multi-query hidden-state propagation ("Self-attention") (MOST IMPORTANT THING)
2. Multi-head attention
3. Residual Connections, LayerNorm
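The sketch below shows how items 1 and 3 fit together in one simplified layer. It is a toy under stated assumptions: a single head, no learned query/key/value projections, dot-product scores scaled by the square root of the dimension, and a LayerNorm without learned gain or bias. Multi-head attention (item 2) would run several such attention computations in parallel on different learned projections and combine their outputs.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(vec, eps=1e-5):
    # Normalize one vector to zero mean and (roughly) unit variance.
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def self_attention_layer(H):
    d = len(H[0])
    out = []
    for q in H:  # multi-query: every position issues its own query
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in H]
        w = softmax(scores)
        # Summary of all hidden states, weighted by this position's attention.
        summary = [sum(w[j] * H[j][c] for j in range(len(H))) for c in range(d)]
        # Residual connection followed by LayerNorm.
        residual = [x + s for x, s in zip(q, summary)]
        out.append(layer_norm(residual))
    return out

H = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # toy hidden states
H_next = self_attention_layer(H)
```

Each output row has the same shape as its input row, so layers like this can be stacked to form the multi-layer model.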