2026/2027 | Actual Questions | Complete Solution | Georgia
Tech | Pass Guaranteed - A+ Graded
Section 1: Data Preparation, Validation & Basic Modeling (Questions
1-12)
Q1. A data scientist is preparing a dataset with 10,000 observations and 150 features
for predictive modeling. Approximately 8% of values are missing at random across
multiple features. Which missing data handling strategy is MOST appropriate?
A. Listwise deletion (remove any row with any missing value)
B. Mean imputation for all missing values followed by standardization
C. Multiple imputation or model-based imputation (e.g., mice, missForest) that
preserves uncertainty and relationships
D. Replace all missing values with zero
Rationale: With 8% missingness across many features, listwise deletion (Option A)
would discard ~50%+ of data. Mean imputation (Option B) distorts variance and
covariances. Zero imputation (Option D) introduces severe bias. Multiple imputation
(Option C) accounts for uncertainty and preserves relationships between variables.
Correct Answer: C
,Q2. A modeler partitions data into 70% training, 15% validation, and 15% test sets. After
tuning hyperparameters on the validation set, the test set accuracy is 82%. The model is
then retrained on the full dataset (100%) with the selected hyperparameters and
deployed. A colleague argues the 82% is an unbiased estimate of future performance.
Which statement is CORRECT?
A. The colleague is correct; the test set provides an unbiased estimate of the final
model's performance
B. The test set accuracy is biased upward because the final model was trained on more
data, including the test set
C. The test set accuracy is a reasonable but slightly optimistic estimate; the final model
may perform slightly better due to more training data, but the test set was never used
for model selection, so 82% remains a valid estimate
D. The test set is now invalid because the model was retrained; a new test set must be
collected
Rationale: The test set was never used for model selection or hyperparameter tuning, so
the 82% remains a valid, unbiased estimate of the model class's performance.
Retraining on the full dataset is standard practice and typically improves performance
slightly, but the test set estimate is still valid for the model architecture selected. Option
A overstates by saying "unbiased" for the exact final model. Option B is wrong—the test
set wasn't used for training. Option D is excessive.
Correct Answer: C
Q3. An analyst standardizes features using z-score normalization: x
,′
=
σ
x−μ
. Which statement about the transformed data is TRUE?
A. The transformed data will have mean = 0 and standard deviation = 1, but outliers
remain unchanged in relative position
B. The transformed data will have median = 0 and interquartile range = 1
C. The transformed data will have mean = 0 and standard deviation = 1, and outliers are
automatically removed
D. The transformed data will have minimum = 0 and maximum = 1
Rationale: Z-score standardization produces mean = 0 and standard deviation = 1. It
does not remove outliers—it only rescales them. The relative position (z-score) of
outliers remains extreme. Option B describes robust scaling, not z-score. Option C is
false—outliers are not removed. Option D describes min-max scaling.
Correct Answer: A
Q4. A modeler uses 5-fold cross-validation to estimate model performance on a dataset
with 500 observations. Which statement accurately describes the procedure?
, A. The data is split into 5 equal folds; the model is trained on 4 folds and tested on 1
fold; this is repeated 5 times so each fold serves as the test set once; the 5 test errors
are averaged
B. The data is split into 5 folds; the model is trained on 1 fold and tested on 4 folds; this
is repeated 5 times
C. The data is split into 5 folds; 4 models are trained, each on a different fold, and tested
on the remaining data
D. The data is split randomly 5 times into 50% training and 50% test sets, and the errors
are averaged
Rationale: In k-fold CV, data is split into k folds; the model trains on k-1 folds and
validates on the remaining fold. This rotates k times so each fold is the validation set
once. The k validation errors are averaged. Option B reverses train/test sizes. Option C
describes a different procedure. Option D describes repeated random subsampling, not
k-fold CV.
Correct Answer: A
Q5. A data scientist notices that after feature scaling, a k-NN model's accuracy
improved from 68% to 89%. Which explanation BEST accounts for this improvement?
A. Scaling reduced the dimensionality of the feature space
B. Scaling ensured that features with larger original scales did not dominate the
distance metric
C. Scaling introduced nonlinearity into the model
D. Scaling removed multicollinearity between features