EMBEDDINGS & SEQUENCE MODELING
SECTION A: RECURRENT NEURAL NETWORKS (10 Questions)
Q1: In a vanilla RNN with update rule h(t) = tanh(U·x(t) + V·h(t-1) + b), what is the
primary computational disadvantage during training?
A. The model requires O(T²) memory to store all intermediate hidden states.
B. The forward pass cannot be parallelized across time steps due to sequential
dependency. [CORRECT]
C. The backward pass can be fully parallelized using modern GPU architectures.
D. The number of parameters scales linearly with sequence length T.
Correct Answer: B
Rationale: Correct because the hidden state h(t) depends on h(t-1), forcing
sequential computation with runtime O(T) that cannot be parallelized across the
time dimension.
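Illustration (Q1): a minimal sketch of the sequential dependency, assuming a
scalar RNN for readability. The function name rnn_forward and the lowercase
scalars u, v, b are illustrative stand-ins for U, V, b and are not part of the
question.

```python
import math

def rnn_forward(xs, u, v, b, h0=0.0):
    """Scalar vanilla RNN: h(t) = tanh(u*x(t) + v*h(t-1) + b)."""
    h = h0
    states = []
    for x in xs:
        # Each step reads the h produced by the previous iteration,
        # so the T iterations cannot run side by side.
        h = math.tanh(u * x + v * h + b)
        states.append(h)
    return states

states = rnn_forward([1.0, 0.5, -0.3], u=0.8, v=0.5, b=0.1)
```

The loop body is cheap, but iteration t cannot start until iteration t-1 has
finished, which is the disadvantage option B describes.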
Q2: A vanilla RNN is trained on sequences of length T=100. Analysis shows that
gradients with respect to early time step inputs are approximately zero. What is
the most likely cause?
A. The learning rate is too high, causing gradient descent to oscillate.
B. The weight matrix V has spectral radius less than 1, causing vanishing
gradients. [CORRECT]
C. The activation function is ReLU rather than tanh.
D. The input dimension is larger than the hidden dimension.
Correct Answer: B
Rationale: Correct because backpropagating from step t to an early step multiplies
one Jacobian ∂h(k)/∂h(k-1) = diag(1 - h(k)²)·V per step; when the spectral radius
of V is less than 1 (and the tanh derivatives are at most 1), this product shrinks
roughly like V^t, producing vanishing gradients for early time steps.
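Illustration (Q2): a minimal sketch of the decay, assuming a scalar RNN in which
the recurrent weight v = 0.5 plays the role of a matrix V with spectral radius
below 1. The function name and the constant inputs are hypothetical choices made
for the example.

```python
import math

def grad_last_wrt_first(xs, u, v, b):
    """Return d h(T) / d h(0) for h(t) = tanh(u*x(t) + v*h(t-1) + b)."""
    h, grad = 0.0, 1.0
    for x in xs:
        h = math.tanh(u * x + v * h + b)
        # Chain rule: d h(t) / d h(t-1) = v * tanh'(.) = v * (1 - h(t)^2)
        grad *= v * (1.0 - h * h)
    return grad

short_grad = grad_last_wrt_first([0.1] * 5, u=1.0, v=0.5, b=0.0)
long_grad = grad_last_wrt_first([0.1] * 100, u=1.0, v=0.5, b=0.0)
```

Every factor has magnitude at most 0.5 here, so the product over 100 steps is
bounded by 0.5^100, which is effectively zero at floating-point scale.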
Q3: Which RNN architecture is most appropriate for sentiment classification,
where a single sentiment label must be produced for an input sentence of
variable length?
A. N-to-N architecture with one output per word.
B. N-to-1 architecture that maps the final hidden state to a single output.
[CORRECT]
C. 1-to-N architecture that generates a sequence from a single input vector.
D. Encoder-decoder with attention over all intermediate states.
Correct Answer: B
Rationale: Correct because sentiment classification requires mapping a variable-
length input sequence to a single output label, which is precisely the N-to-1
architecture where the final hidden state encodes the entire sequence.
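Illustration (Q3): a minimal sketch of the N-to-1 pattern, assuming scalar word
features and a logistic output layer. The names classify_sequence, w, and c are
hypothetical and only serve the example.

```python
import math

def classify_sequence(xs, u, v, b, w, c):
    """Run a scalar RNN over xs and map only the final state to one probability."""
    h = 0.0
    for x in xs:
        h = math.tanh(u * x + v * h + b)
    # A single output is read from the last hidden state, whatever len(xs) is.
    return 1.0 / (1.0 + math.exp(-(w * h + c)))

params = dict(u=1.0, v=0.5, b=0.0, w=2.0, c=-0.1)
p_short = classify_sequence([0.2, -0.1], **params)
p_long = classify_sequence([0.2, -0.1, 0.4, 0.9, -0.5], **params)
```

Both calls return one number even though the inputs have different lengths.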
Q4: During training of a vanilla RNN, gradient norms suddenly spike to values
exceeding 1000. Which technique should be applied?
A. Reduce the learning rate by a factor of 10.
B. Apply gradient clipping to bound the maximum gradient norm. [CORRECT]
C. Switch from SGD to Adam optimizer immediately.
D. Increase the hidden state dimension to absorb larger gradients.
Correct Answer: B
Rationale: Correct because exploding gradients typically arise when the spectral
radius of the recurrent weights exceeds 1; gradient clipping rescales the gradient
whenever its norm exceeds a threshold, bounding the size of each update without
modifying the architecture.
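Illustration (Q4): a minimal sketch of clipping by global L2 norm, assuming the
gradients are available as a flat list of floats. The function name
clip_by_global_norm and the threshold of 5.0 are illustrative choices.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale grads so their L2 norm is at most max_norm; return (grads, old norm)."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads), norm       # small gradients pass through unchanged
    scale = max_norm / norm            # shrink the length, keep the direction
    return [g * scale for g in grads], norm

clipped, original_norm = clip_by_global_norm([300.0, -400.0], max_norm=5.0)
```

Only the magnitude of the gradient is changed, so the update still points in the
same direction as the unclipped gradient.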
Q5: In teacher forcing during RNN training, what input is fed at time step t+1?
A. The model's own predicted output from time step t.
B. The ground-truth target value from the training data for time step t.
[CORRECT]
C. A weighted average of the prediction and ground truth.
D. The hidden state from time step t passed through the output layer.
Correct Answer: B
Rationale: Correct because teacher forcing feeds the ground-truth output for step
t, rather than the model's own prediction, as the input at step t+1. The procedure
follows from the maximum likelihood criterion and keeps early prediction errors
from compounding during training.
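Illustration (Q5): a minimal sketch contrasting teacher forcing with free-running
generation. The names run_decoder and toy_step are hypothetical, the hidden state
is omitted for brevity, and the toy model is deliberately wrong so that the
effect on the fed inputs is visible.

```python
def run_decoder(step, targets, start_token, teacher_forcing):
    """Return (inputs fed, predictions) for a hypothetical step(prev) -> prediction."""
    fed, preds = [], []
    prev = start_token
    for y_true in targets:
        fed.append(prev)
        y_pred = step(prev)
        preds.append(y_pred)
        # Teacher forcing: the next input is the true value, not the prediction.
        prev = y_true if teacher_forcing else y_pred
    return fed, preds

def toy_step(prev):
    """Toy model that always overshoots: the true sequence rises by 1, this by 2."""
    return prev + 2

targets = [1, 2, 3, 4]
fed_tf, preds_tf = run_decoder(toy_step, targets, start_token=0, teacher_forcing=True)
fed_free, preds_free = run_decoder(toy_step, targets, start_token=0, teacher_forcing=False)
```

With teacher forcing the fed inputs stay on the true sequence, so the error per
step stays constant; in free-running mode each overshoot is fed back in and the
error grows with every step.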
Q6: A researcher replaces hidden-to-hidden recurrence with teacher forcing at
every time step during both training and inference. What is the primary
consequence?
A. The model becomes unable to handle variable-length sequences.
B. The model can be parallelized across time steps but loses the ability to
propagate information through hidden states. [CORRECT]
C. The vanishing gradient problem is completely eliminated.
D. The model requires twice as many parameters as a standard RNN.
Correct Answer: B
Rationale: Correct because removing hidden-to-hidden recurrence eliminates the
sequential dependency chain, enabling parallelization, but the model loses the
recurrent path for propagating information across time steps, making it less
powerful than a true RNN.
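Illustration (Q6): a minimal sketch of why the steps decouple, assuming the
recurrence runs only from the previous output to the current hidden state and
that ground-truth outputs are available, as they are during training. The update
rule h(t) = tanh(u*x(t) + w*y(t-1) + b) and all names here are assumptions made
for the example.

```python
import math

def hidden_states_teacher_forced(xs, ys_true, u, w, b, y_start=0.0):
    """h(t) = tanh(u*x(t) + w*y(t-1) + b), with y(t-1) read from the training data."""
    prev_targets = [y_start] + list(ys_true[:-1])
    # There is no h(t-1) term, so each step depends only on known data.
    return [math.tanh(u * x + w * y_prev + b)
            for x, y_prev in zip(xs, prev_targets)]

xs = [0.5, -1.0, 0.25, 2.0]
ys_true = [1.0, 0.0, 1.0, 1.0]
states = hidden_states_teacher_forced(xs, ys_true, u=0.7, w=0.3, b=0.0)

# Evaluating the steps in reverse order yields the same values, which is the
# property that allows the time steps to be computed in parallel.
prev_targets = [0.0] + ys_true[:-1]
reordered = {t: math.tanh(0.7 * xs[t] + 0.3 * prev_targets[t])
             for t in [3, 2, 1, 0]}
```

The price is visible in the update rule: the only information carried across
time steps is whatever the previous output y(t-1) contains.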
Q7: Truncated backpropagation through time (BPTT) with truncation parameter
k=10 on sequences of length T=100 means:
A. Only the first 10 time steps are used in the forward pass.
B. Gradients are backpropagated through at most 10 time steps before
truncation. [CORRECT]
C. The hidden state is reset to zero every 10 time steps.
D. The model processes the sequence in 10 non-overlapping chunks.
Correct Answer: B
Rationale: Correct because truncated BPTT limits the temporal span of gradient
computation to k steps, approximating full BPTT while controlling computational
cost and bounding the length of the Jacobian product through which gradients can
explode; the trade-off is that dependencies longer than k steps receive no
gradient signal.
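Illustration (Q7): a minimal sketch of the truncation, assuming a scalar RNN and
taking the gradient of the final hidden state with respect to the recurrent
weight v. The function names and the sine-wave input are hypothetical. Practical
implementations usually also process the sequence in chunks while carrying the
hidden state forward, which this sketch leaves out.

```python
import math

def forward(xs, u, v, b):
    """Return [h(0), h(1), ..., h(T)] for h(t) = tanh(u*x(t) + v*h(t-1) + b)."""
    hs = [0.0]
    for x in xs:
        hs.append(math.tanh(u * x + v * hs[-1] + b))
    return hs

def grad_v_truncated(hs, v, k):
    """d h(T) / d v, backpropagating through at most k time steps."""
    T = len(hs) - 1
    grad, delta = 0.0, 1.0                  # delta holds d h(T) / d h(t)
    for t in range(T, max(T - k, 0), -1):
        local = 1.0 - hs[t] ** 2            # tanh'(.) at step t
        grad += delta * local * hs[t - 1]   # contribution of v as used at step t
        delta *= local * v                  # move one step further back in time
    return grad

xs = [math.sin(0.3 * t) for t in range(100)]
hs = forward(xs, u=1.0, v=0.5, b=0.0)
full = grad_v_truncated(hs, v=0.5, k=100)      # full BPTT over all 100 steps
truncated = grad_v_truncated(hs, v=0.5, k=10)  # stops after 10 steps
```

The forward pass still covers all 100 steps; only the backward loop is cut short.
With v = 0.5 the skipped terms are small, so the truncated value stays close to
the full gradient.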
Q8: Which of the following is NOT a valid criticism of using MLPs for NLP tasks
compared to RNNs?
A. MLPs cannot easily support variable-sized input sequences.
B. MLPs have no inherent mechanism for modeling temporal structure.
C. MLPs require network size to grow with maximum allowed sequence length.
D. MLPs suffer from vanishing gradients across time steps. [CORRECT]
Correct Answer: D
Rationale: Correct because vanishing gradients across time steps is a problem
specific to recurrent architectures with repeated weight multiplication; MLPs