100 UNIQUE COMPREHENSIVE EXAM QUESTIONS
(MIDTERM 1)
Expert-Verified Rationales & High-Yield Concepts (2026 Edition)
Q1: When building a Support Vector Machine (SVM) model, what is the
'classifier' actually representing mathematically?
Key Answer: A hyperplane that separates data points into different classes.
Detailed Rationale: An SVM finds the linear boundary (hyperplane) that maximizes the
margin, i.e. the distance from the boundary to the nearest points of each class.
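Worked sketch: a linear classifier of this kind reduces to checking the sign of w.x + b. The weights `w` and offset `b` below are made-up values for illustration, not the output of a trained SVM.

```python
def classify(x, w, b):
    """Return +1 or -1 depending on which side of the hyperplane w.x + b = 0 the point lies."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Hypothetical hyperplane (a line in 2D): x1 + x2 - 3 = 0
w = [1.0, 1.0]
b = -3.0

label_a = classify([4.0, 2.0], w, b)  # score = 4 + 2 - 3 = 3, so +1
label_b = classify([0.5, 1.0], w, b)  # score = 0.5 + 1 - 3 = -1.5, so -1
```

Training an SVM amounts to choosing `w` and `b`; once they are fixed, prediction is only this sign check.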
Q2: Why is it necessary to scale data (e.g., to a range of 0-1) before using it in
an SVM model?
Key Answer: To ensure variables with larger units don't unfairly dominate the distance
calculation.
Detailed Rationale: Distance-based models like SVM are sensitive to the magnitude of
numbers. Scaling puts all features on a comparable range so each can contribute.
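Worked sketch: the effect can be seen with min-max scaling on two features measured in very different units. The income and age values below are invented for illustration.

```python
import math

def min_max_scale(column):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

# Hypothetical data: income in dollars, age in years
incomes = [30000.0, 60000.0, 90000.0]
ages = [25.0, 65.0, 45.0]

# Unscaled distance between the first two people is dominated by income.
raw_dist = math.dist((incomes[0], ages[0]), (incomes[1], ages[1]))

scaled_incomes = min_max_scale(incomes)  # [0.0, 0.5, 1.0]
scaled_ages = min_max_scale(ages)        # [0.0, 1.0, 0.5]

# After scaling, the 40-year age gap matters as much as the income gap.
scaled_dist = math.dist(
    (scaled_incomes[0], scaled_ages[0]),
    (scaled_incomes[1], scaled_ages[1]),
)
```

Before scaling, the distance is roughly 30,000 and the age difference is invisible; after scaling, both features move the distance.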
Q3: What happens to the 'margin' in an SVM if we increase the value of the 'C'
parameter (soft margin)?
Key Answer: The margin becomes narrower so that fewer training points are misclassified.
Detailed Rationale: A high C value penalizes margin violations heavily, leading to a
smaller margin and a higher risk of overfitting.
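For reference, the standard textbook soft-margin objective makes the role of C explicit. The symbols w, b, the slack variables ξ_i, and the sample count n are the usual notation and are not taken from the question itself.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i\,(w^{\top} x_i + b) \ge 1 - \xi_i,\qquad \xi_i \ge 0 .
```

The margin width is 2/‖w‖. A larger C puts more weight on the slack term, so the optimizer accepts a larger ‖w‖ (a narrower margin) in exchange for fewer violations.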
Q4: Which kernel would you use in SVM if your data points were clearly
separated by a circular boundary rather than a straight line?
Key Answer: Radial Basis Function (RBF) or Polynomial kernel.
Detailed Rationale: A linear kernel can only produce straight-line boundaries. RBF and
polynomial kernels implicitly map the data into a higher-dimensional space, where a
linear separator corresponds to a non-linear boundary in the original space.
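Worked sketch: the idea of moving to a higher dimension can be shown with an explicit feature map. Note that this is a hand-written polynomial-style map, not the RBF kernel itself (a kernel does the equivalent work implicitly), and the points and threshold are invented for illustration.

```python
def lift(point):
    """Map (x1, x2) to (x1, x2, x1^2 + x2^2), adding squared radius as a third feature."""
    x1, x2 = point
    return (x1, x2, x1 * x1 + x2 * x2)

# Hypothetical data: one class inside the unit circle, one class outside it.
inner = [(0.5, 0.0), (0.0, -0.5), (-0.3, 0.4)]
outer = [(2.0, 0.0), (0.0, -2.0), (-1.5, 1.5)]

# In the lifted space the flat plane z = 1 separates the two classes,
# even though no straight line separates them in the original 2D space.
threshold = 1.0
inner_below = [lift(p)[2] < threshold for p in inner]
outer_above = [lift(p)[2] > threshold for p in outer]
```

Every inner point falls below the plane and every outer point above it, so a linear separator in three dimensions reproduces the circular boundary in two.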
Q5: In SVM, what are 'Support Vectors'?
Key Answer: The data points located closest to the separating hyperplane.
Detailed Rationale: Support vectors are the critical points that define the position and
orientation of the margin.
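Worked sketch: given a hyperplane, the points nearest to it can be found with the distance formula |w.x + b| / ||w||. The hyperplane and points below are assumed values for illustration; a real SVM solver identifies the support vectors during training rather than by this after-the-fact search.

```python
import math

def distance_to_hyperplane(x, w, b):
    """Perpendicular distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

# Hypothetical separating hyperplane: the vertical line x1 = 3
w, b = [1.0, 0.0], -3.0
points = [(1.0, 1.0), (2.0, 5.0), (4.0, 0.0), (6.0, 2.0)]

distances = [distance_to_hyperplane(p, w, b) for p in points]
closest = min(distances)

# The points that sit at the minimum distance are the ones that pin down the margin.
nearest_points = [p for p, d in zip(points, distances) if math.isclose(d, closest)]
```

Moving or deleting any of the farther points would leave this boundary unchanged, which is why only the nearest points matter.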
Q6: What is the primary risk of using the same data for both training and
evaluating a model?
Key Answer: Overfitting that goes undetected.
Detailed Rationale: The model may 'memorize' the noise in the training set, so it scores
well on that same data but performs poorly on new, unseen data.
Q7: In K-fold Cross-Validation, if K=5, how many times is the model trained?
Key Answer: 5 times.
Detailed Rationale: The data is split into 5 parts; each part acts as the validation set
once while the other 4 act as training data.
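Worked sketch: the fold rotation can be written as a small index generator. This version does not shuffle the data first, which is an assumption made to keep the example short; in practice the rows are usually shuffled before splitting.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, validation_indices) once for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

# 10 samples, K = 5: five (train, validation) pairs, so five training runs.
splits = list(k_fold_splits(n_samples=10, k=5))
```

Each sample index appears in exactly one validation fold, and each of the five iterations trains on the remaining eight samples.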
Q8: What is the 'Validation Set' used for in the Training-Validation-Testing
workflow?
Key Answer: To tune hyperparameters and select the best version of the model.
Detailed Rationale: The training set builds the model, but the validation set helps you
decide which settings (like 'K' in K-nearest neighbors) work best.
Q9: What does a C-statistic (AUC) of 0.5 indicate about a classifier's
performance?
Key Answer: The model is no better than random guessing.
Detailed Rationale: An AUC of 0.5 means the model has zero discriminative power. An AUC
of 1.0 means the model ranks every positive case above every negative case.
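Worked sketch: AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as half. The labels and scores below are invented for illustration.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [1, 1, 0, 0]

# Every positive outranks every negative.
perfect_auc = auc(labels, [0.9, 0.8, 0.2, 0.1])

# A model that gives everyone the same score cannot rank anyone.
constant_auc = auc(labels, [0.5, 0.5, 0.5, 0.5])
```

The first model reaches 1.0 and the uninformative one lands on 0.5, matching the two reference points in the rationale.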
Q10: Why should the 'Test Set' be kept in a 'vault' until the very end of the
project?
Key Answer: To provide a completely unbiased estimate of real-world performance.
Detailed Rationale: If you use the test set during tuning, information from it leaks into
your model choices and the reported results become overly optimistic.
Q11: What is the main difference between Classification and Clustering?
Key Answer: Classification is supervised (labeled data); Clustering is unsupervised
(unlabeled data).
Detailed Rationale: In clustering, you don't know the 'correct' groups beforehand; the
algorithm finds patterns on its own.
Q12: How does the K-Means algorithm decide which cluster a data point
belongs to?
Key Answer: It assigns the point to the cluster with the nearest mean (centroid).
Detailed Rationale: K-Means minimizes the total squared distance between points and their
respective cluster centers.
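Worked sketch: one K-Means iteration has an assignment step and an update step. The starting centroids and points below are assumed values for illustration; a full run repeats both steps until the assignments stop changing.

```python
import math

def assign_to_nearest(point, centroids):
    """Return the index of the centroid closest to the point (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

def update_centroid(points):
    """Return the mean of a non-empty list of points, coordinate by coordinate."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

# Hypothetical starting centroids and data
centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(1.0, 2.0), (9.0, 8.0), (0.0, 1.0), (11.0, 12.0)]

# Assignment step: each point joins the cluster with the nearest centroid.
assignments = [assign_to_nearest(p, centroids) for p in points]

# Update step: each centroid moves to the mean of its assigned points.
new_centroid_0 = update_centroid([p for p, a in zip(points, assignments) if a == 0])
new_centroid_1 = update_centroid([p for p, a in zip(points, assignments) if a == 1])
```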
Q13: What is the 'Elbow Method' used for in clustering?
Key Answer: To determine the optimal number of clusters (K).
Detailed Rationale: It plots a fit measure (such as the within-cluster sum of squares or
the variance explained) as a function of K; the 'elbow' is where adding more clusters
gives diminishing returns.
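Worked sketch: the curve behind the elbow can be computed as the within-cluster sum of squares (WCSS) for each K. To keep the example short, the groupings below are hand-picked to stand in for what K-Means might return on this one-dimensional data; they are not the output of an actual K-Means run.

```python
def wcss(clusters):
    """Within-cluster sum of squares: squared distance of each value to its cluster mean."""
    total = 0.0
    for cluster in clusters:
        mean = sum(cluster) / len(cluster)
        total += sum((x - mean) ** 2 for x in cluster)
    return total

# Hypothetical groupings of nine 1D values for K = 1..4
partitions = {
    1: [[1.0, 2.0, 3.0, 11.0, 12.0, 13.0, 21.0, 22.0, 23.0]],
    2: [[1.0, 2.0, 3.0], [11.0, 12.0, 13.0, 21.0, 22.0, 23.0]],
    3: [[1.0, 2.0, 3.0], [11.0, 12.0, 13.0], [21.0, 22.0, 23.0]],
    4: [[1.0, 2.0], [3.0], [11.0, 12.0, 13.0], [21.0, 22.0, 23.0]],
}

curve = {k: wcss(clusters) for k, clusters in partitions.items()}

# How much WCSS falls when going from K to K + 1
drops = {k: curve[k] - curve[k + 1] for k in (1, 2, 3)}
```

WCSS falls sharply up to K = 3 and barely moves at K = 4, so the elbow for this data sits at K = 3.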