EXAM QUESTIONS WITH CORRECT
SOLUTIONS||UPDATED 2026/2027
SYLLABUS||<<NEWEST VERSION>>
Which ones of the following statements on Residual Networks are true? (Check all
that apply.)
A: Using a skip-connection helps the gradient to backpropagate and thus helps you
to train deeper networks
B: A ResNet with $L$ layers would have on the order of $L^2$ skip connections in total.
C: The skip-connections compute a complex non-linear function of the input to
pass to a deeper layer in the network.
D: The skip-connection makes it easy for the network to learn an identity mapping
between the input and the output within the ResNet block. - ANSWER ✓ AD
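Option D can be checked with a minimal NumPy sketch. The two-layer fully connected form and the name `residual_block` are assumptions for illustration, not taken from the question. When the weights and biases of the residual branch are zero, the block passes a non-negative input through unchanged:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(a, W1, b1, W2, b2):
    """Hypothetical two-layer residual block: relu(W2 @ relu(W1 @ a + b1) + b2 + a)."""
    a1 = relu(W1 @ a + b1)
    z2 = W2 @ a1 + b2
    # The skip connection adds the block input before the final activation.
    return relu(z2 + a)

n = 4
a = np.array([0.5, 1.0, 0.0, 2.0])  # non-negative, as after an earlier ReLU
zero_W = np.zeros((n, n))
zero_b = np.zeros(n)

# With the residual branch zeroed out, the block reduces to relu(a) = a.
out = residual_block(a, zero_W, zero_b, zero_W, zero_b)
```

The skip path itself is a plain addition of the input, which is also why option C (a complex non-linear function) is false.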
Suppose you have an input volume of dimension 64x64x16. How many
parameters would a single 1x1 convolutional filter have (including the bias)?
A: 2
B: 4097
C: 1
D: 17 - ANSWER ✓ D
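The count can be verified with a short sketch (the helper name `conv_filter_params` is an assumption). A single filter has one weight per spatial position per input channel, plus one bias, and does not depend on the 64x64 spatial size of the input:

```python
def conv_filter_params(f, n_c_in, bias=True):
    """Parameters in ONE f x f convolutional filter over n_c_in input channels."""
    return f * f * n_c_in + (1 if bias else 0)

# 1x1 filter over 16 input channels: 1 * 1 * 16 weights + 1 bias
params = conv_filter_params(f=1, n_c_in=16)
```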
Suppose you have an input volume of dimension $n_H \times n_W \times n_C$.
Which of the following statements do you agree with? (Assume that the 1x1
convolutional layer below always uses a stride of 1 and no padding.)
A: You can use a 1x1 convolutional layer to reduce $n_C$ but not $n_H$, $n_W$.
B: You can use a 1x1 convolutional layer to reduce $n_H$, $n_W$, and $n_C$.
C: You can use a pooling layer to reduce $n_H, n_W$, but not $n_C$.
D: You can use a pooling layer to reduce $n_H, n_W$, and $n_C$. - ANSWER ✓ AC
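A minimal sketch of the shape arithmetic behind this answer, using the standard output-size formula floor((n + 2p - f) / s) + 1. The helper names and the example sizes (a 64x64x16 input, 8 filters, 2x2 pooling) are assumptions for illustration:

```python
def conv_output_shape(n_h, n_w, f, n_filters, stride=1, pad=0):
    """Output shape of a conv layer; the channel count equals the number of filters."""
    out_h = (n_h + 2 * pad - f) // stride + 1
    out_w = (n_w + 2 * pad - f) // stride + 1
    return (out_h, out_w, n_filters)

def pool_output_shape(n_h, n_w, n_c, f, stride):
    """Output shape of a pooling layer; it acts per channel, so n_c is unchanged."""
    out_h = (n_h - f) // stride + 1
    out_w = (n_w - f) // stride + 1
    return (out_h, out_w, n_c)

# 1x1 conv, stride 1, no padding: height and width kept, channels 16 -> 8
conv_shape = conv_output_shape(64, 64, f=1, n_filters=8)
# 2x2 pooling with stride 2: height and width halved, channels kept at 16
pool_shape = pool_output_shape(64, 64, 16, f=2, stride=2)
```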
Which ones of the following statements on Inception Networks are true? (Check
all that apply.)
A: A single inception block allows the network to use a combination of 1x1, 3x3,
5x5 convolutions and pooling.
B: Making an inception network deeper (by stacking more inception blocks
together) should not hurt training set performance.
C: Inception blocks usually use 1x1 convolutions to reduce the input data volume's
size before applying 3x3 and 5x5 convolutions.
D: Inception networks incorporate a variety of network architectures (similar to
dropout, which randomly chooses a network architecture on each step) and thus
has a similar regularizing effect as dropout. - ANSWER ✓ AC
Which of the following are common reasons for using open-source
implementations of ConvNets (both the model and/or weights)? Check all that
apply.
A: A model trained for one computer vision task can usually be used to perform
data augmentation even for a different computer vision task.
B: It is a convenient way to get a working implementation of a complex ConvNet
architecture.
C: The same techniques for winning computer vision competitions, such as using
multiple crops at test time, are widely used in practical deployments (or production
system deployments) of ConvNets.
D: Parameters trained for one computer vision task are often useful as pretraining
for other computer vision tasks. - ANSWER ✓ BD
You are building a 3-class object classification and localization algorithm. The
classes are: pedestrian ($c=1$), car ($c=2$), motorcycle ($c=3$). What would be
the label for the following image? Recall $y=[p_c,b_x,b_y,b_h,b_w,c_1,c_2,c_3]$
A: $y=[1,0.3,0.7,0.3,0.3,0,1,0]$
B: $y=[1,0.7,0.5,0.3,0.3,0,1,0]$
C: $y=[1,0.3,0.7,0.5,0.5,0,1,0]$
D: $y=[1,0.3,0.7,0.5,0.5,1,0,0]$
E: $y=[0,0.2,0.4,0.5,0.5,0,1,0]$ - ANSWER ✓ A
Continuing from the previous problem, what should y be for the image below?
Remember that "?" means don't care, which means that the neural network loss
function won't care what the neural network gives for that component of the
output. As before, $y=[p_c,b_x,b_y,b_h,b_w,c_1,c_2,c_3]$.
A: $[?,?,?,?,?,?,?,?]$
B: $[1,?,?,?,?,0,0,0]$
C: $[1,?,?,?,?,?,?,?]$
D: $[0,?,?,?,?,?,?,?]$
E: $[0,?,?,?,?,0,0,0]$ - ANSWER ✓ D
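The label layout from the two questions above can be sketched as a small helper. The name `make_label` and the `"?"` string placeholder are assumptions for illustration. The images are not reproduced here, so the box values below are simply those of option A in the previous question:

```python
DONT_CARE = "?"  # placeholder for components the loss function ignores

def make_label(box=None, class_index=None, num_classes=3):
    """Build y = [p_c, b_x, b_y, b_h, b_w, c_1, ..., c_num_classes]."""
    if box is None:
        # No object present: p_c = 0 and every other component is a don't-care.
        return [0] + [DONT_CARE] * (4 + num_classes)
    b_x, b_y, b_h, b_w = box
    # One-hot class vector; classes are numbered from 1 as in the question.
    one_hot = [1 if k == class_index else 0 for k in range(1, num_classes + 1)]
    return [1, b_x, b_y, b_h, b_w] + one_hot

car_label = make_label(box=(0.3, 0.7, 0.3, 0.3), class_index=2)  # car is c = 2
empty_label = make_label()                                       # no object
```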
You are working on a factory automation task. Your system will see a soft-drink
can coming down a conveyor belt, and you want it to take a picture and decide
whether (i) there is a soft-drink can in the image, and if so (ii) its bounding box.
Since the soft-drink can is round, the bounding box is always square, and the
soft-drink can always appears as the same size in the image. There is at most one
soft-drink can in each image. Here are some typical images in your training set:
What is the most appropriate set of output units for your neural network?
A: Logistic unit (for classification)
B: Logistic unit, $b_x$ and $b_y$
C: Logistic unit, $b_x$, $b_y$, $b_h$ (since $b_w = b_h$)
D: Logistic unit, $b_x$, $b_y$, $b_h$, $b_w$ - ANSWER ✓ B
If you build a neural network that inputs a picture of a person's face and outputs
$N$ landmarks on the face (assume the input image always contains exactly one
face), how many output units will the network have?
A: $N$
B: $2N$
C: $3N$
D: $N^2$ - ANSWER ✓ B
When training one of the object detection systems described in lecture, you need a
training set that contains many pictures of the object(s) you wish to detect.
However, bounding boxes do not need to be provided in the training set, since the
algorithm can learn to detect the objects by itself.
A: True
B: False - ANSWER ✓ B
Suppose you are applying a sliding windows classifier (non-convolutional
implementation). Increasing the stride would tend to increase accuracy, but
decrease computational cost.
A: True
B: False - ANSWER ✓ B
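A rough sketch of why a larger stride lowers cost: fewer windows are classified, but they are also positioned more coarsely, which tends to hurt accuracy rather than help it. The helper name and the 100x100 image with a 20x20 window are hypothetical values chosen for illustration:

```python
def count_windows(image_size, window_size, stride):
    """Number of window positions on a square image (no padding)."""
    per_axis = (image_size - window_size) // stride + 1
    return per_axis * per_axis

dense = count_windows(image_size=100, window_size=20, stride=1)   # 81 * 81 positions
coarse = count_windows(image_size=100, window_size=20, stride=4)  # 21 * 21 positions
```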
In the YOLO algorithm, at training time, only one cell (the one containing the
center/midpoint of an object) is responsible for detecting this object.
A: True
B: False - ANSWER ✓ A
What is the IoU between these two boxes? The upper-left box is 2x2, and the
lower-right box is 2x3. The overlapping region is 1x1.
A: 1/6
B: 1/9
C: 1/10
D: None of the above - ANSWER ✓ B
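The arithmetic is intersection over union: 1 / (4 + 6 - 1) = 1/9. A minimal sketch follows. The figure is not reproduced here, so the corner coordinates are assumptions, chosen only to match the stated box sizes and the 1x1 overlap:

```python
def iou(box_a, box_b):
    """Intersection over union for boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

upper_left = (0, 0, 2, 2)   # 2x2 box, area 4
lower_right = (1, 1, 3, 4)  # 2 wide by 3 tall, area 6, overlapping in a 1x1 region
value = iou(upper_left, lower_right)  # 1 / (4 + 6 - 1)
```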
Suppose you run non-max suppression on the predicted boxes above. The
parameters you use for non-max suppression are that boxes with probability <= 0.4
are discarded, and the IoU threshold for deciding if two boxes overlap is 0.5. How
many boxes will remain after non-max suppression?
A: 3
B: 4
C: 5
D: 6
E: 7 - ANSWER ✓ C
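The predicted boxes referred to above are not reproduced here, so the answer cannot be recomputed. The procedure itself can be sketched for a single class. All boxes and scores below are hypothetical, and a full detector would typically apply this separately per class:

```python
def iou(a, b):
    """Intersection over union for boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, score_threshold=0.4, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first."""
    # Step 1: discard boxes with probability <= score_threshold.
    candidates = [i for i, s in enumerate(scores) if s > score_threshold]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept = []
    # Step 2: keep the best remaining box, then drop boxes overlapping it too much.
    while candidates:
        best = candidates.pop(0)
        kept.append(best)
        candidates = [i for i in candidates
                      if iou(boxes[best], boxes[i]) < iou_threshold]
    return kept

# Hypothetical single-class predictions
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30),
         (21, 20, 31, 30), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7, 0.3, 0.4]
kept = non_max_suppression(boxes, scores)
```

In this example, boxes 3 and 4 are dropped by the probability threshold, and box 1 is suppressed because its IoU with box 0 is about 0.68.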
Suppose you are using YOLO on a 19x19 grid, on a detection problem with 20
classes, and with 5 anchor boxes. During training, for each image you will need to
construct an output volume y as the target value for the neural network; this
corresponds to the last layer of the neural network. (y may include some "?", or
"don't cares"). What is the dimension of this output volume?
A: 19x19x(5x25)
B: 19x19x(5x20)
C: 19x19x(25x20)
D: 19x19x(20x25) - ANSWER ✓ A
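The depth follows from each anchor box carrying $p_c$, four box coordinates, and one score per class: 5 + 20 = 25 values per anchor, times 5 anchors. A minimal sketch (the helper name is an assumption):

```python
def yolo_target_shape(grid, num_classes, num_anchors):
    """Shape of the YOLO target volume y for a grid x grid layout."""
    per_anchor = 5 + num_classes  # p_c, b_x, b_y, b_h, b_w, then class scores
    return (grid, grid, num_anchors * per_anchor)

shape = yolo_target_shape(grid=19, num_classes=20, num_anchors=5)
```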
Face verification requires comparing a new picture against one person's face,
whereas face recognition requires comparing a new picture against K persons'
faces.
A: True
B: False - ANSWER ✓ A
Why do we learn a function $d(img1,img2)$ for face verification? (Select all that
apply.)