Modeling | 100% Correct Certification | Complete Q&A | Pass
Guaranteed - A+ Graded
Section 1: Data Preparation & Basic Modeling (Q1-12)
Q1
A data scientist is preparing a dataset for modeling and discovers that 15% of values in
the "customer_age" column are missing completely at random (MCAR). The remaining
values are normally distributed. Which imputation method is most appropriate?
A. Replace all missing values with the mode of the column
B. Use mean imputation or multiple imputation by chained equations (MICE) [CORRECT]
C. Delete all rows with missing values to preserve model integrity
D. Replace missing values with a constant value of 999 to flag them
Rationale: For MCAR, normally distributed data, mean imputation leaves the column mean
unbiased (although it understates the variance), while MICE also carries the uncertainty
of the imputed values into the analysis. Mode imputation is intended for categorical
data. Listwise deletion is unbiased under MCAR but discards 15% of the rows and reduces
statistical power. Arbitrary constants (999) distort distributions and model behavior.
Correct Answer: B
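
Illustration for Q1: a minimal sketch, using only the Python standard library, of what mean imputation does to a column with values missing completely at random. The ages, the normal distribution parameters, and the variable names are made up for the example. It shows that the mean is unchanged while the standard deviation shrinks, which is the main argument for preferring multiple imputation when variance matters.

```python
import random
import statistics

random.seed(0)

# Hypothetical ages drawn from a normal distribution (made-up parameters).
ages = [random.gauss(40, 10) for _ in range(1000)]

# Remove exactly 15% of the values at random, independent of the data (MCAR).
missing_idx = set(random.sample(range(len(ages)), k=150))
observed = [None if i in missing_idx else a for i, a in enumerate(ages)]

# Mean imputation: fill every gap with the mean of the observed values.
present = [a for a in observed if a is not None]
fill_value = statistics.fmean(present)
imputed = [fill_value if a is None else a for a in observed]

mean_before, mean_after = fill_value, statistics.fmean(imputed)
sd_before, sd_after = statistics.stdev(present), statistics.stdev(imputed)

print(f"mean: {mean_before:.2f} -> {mean_after:.2f}")
print(f"sd:   {sd_before:.2f} -> {sd_after:.2f}")
```

The imputed rows all sit exactly at the mean, so they add nothing to the sum of squared deviations while increasing the row count; that is why the spread is understated.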
Q2
A modeler is comparing two scaling approaches for a k-NN classifier: min-max scaling
to [0,1] and z-score standardization. The dataset contains features with vastly different
scales (income in thousands, age in years, binary flags). Which statement about scaling
is correct?
A. Scaling is unnecessary for k-NN because Euclidean distance is scale-invariant
B. Min-max scaling is preferred when outliers are present because it is robust to
extreme values
C. Z-score standardization is generally preferred when features have outliers, as it is
less sensitive to extreme values than min-max [CORRECT]
D. Decision trees require z-score standardization to perform feature splits correctly
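Rationale: Z-score standardization (subtract the mean, divide by the standard deviation)
is less affected by a few outliers than min-max scaling, whose range is set entirely by
the most extreme values, so a single outlier squeezes the remaining data into a narrow
band. Z-scores are still influenced by outliers, only less so. k-NN is highly
scale-sensitive because it relies on distances. Decision trees are scale-invariant and
do not require standardization.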
Correct Answer: C
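
Illustration for Q2: a rough sketch of how one extreme value affects each scaling method. The income figures and helper names are hypothetical. For each method it measures how much the spread of the typical values shrinks once a single outlier is appended; a larger factor means the method is more distorted by the outlier.

```python
import statistics

def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mu) / sd for v in values]

def typical_spread(scaled, n_typical):
    """Range covered by the first n_typical (non-outlier) scaled values."""
    head = scaled[:n_typical]
    return max(head) - min(head)

# Made-up incomes in thousands: 200 typical values between 30 and 70.
typical = [30 + (i % 41) for i in range(200)]
with_outlier = typical + [5000]  # one extreme value appended at the end
n = len(typical)

# Factor by which the typical values are squeezed once the outlier is present.
mm_shrink = typical_spread(min_max(typical), n) / typical_spread(min_max(with_outlier), n)
z_shrink = typical_spread(z_score(typical), n) / typical_spread(z_score(with_outlier), n)

print(f"min-max squeeze factor: {mm_shrink:.1f}")
print(f"z-score squeeze factor: {z_shrink:.1f}")
```

Both factors are greater than 1, so neither method is immune; the z-score factor is simply smaller because one outlier moves the standard deviation less than it moves the range.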
Q3
A team is building a predictive model using customer transaction data from 2019-2025.
They plan to use data from 2019-2023 for training and 2024-2025 for testing. A
colleague suggests randomly shuffling all data before splitting. What is the primary
issue with this suggestion?
A. Random shuffling ensures better class balance and should always be preferred
B. For time-dependent data, random shuffling destroys temporal structure and may
cause data leakage from future to past [CORRECT]
C. The proposed chronological split is invalid because it does not use stratified
sampling
D. Time series data should never be split; all data must be used for both training and
testing
Rationale: Random shuffling of temporal data creates look-ahead bias: the model may
train on future observations and then be evaluated on earlier ones. Chronological
splitting preserves temporal causality. Stratified sampling addresses class imbalance
but does not solve temporal leakage. Using all data for both training and testing
leaves no held-out data for an honest evaluation.
Correct Answer: B
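
Illustration for Q3: a small sketch contrasting the chronological split from the question with a shuffled split. The record layout, the `txn_id` field, and the function names are hypothetical. The `has_lookahead` check reports whether any training row is as recent as, or more recent than, the earliest test row.

```python
import random

# Hypothetical transactions: three per year from 2019 through 2025.
records = [
    {"year": year, "txn_id": f"{year}-{k}"}
    for year in range(2019, 2026)
    for k in range(3)
]

def chronological_split(rows, last_train_year):
    """Train on rows up to and including last_train_year, test on later rows."""
    train = [r for r in rows if r["year"] <= last_train_year]
    test = [r for r in rows if r["year"] > last_train_year]
    return train, test

def has_lookahead(train, test):
    """True if some training row is not strictly earlier than every test row."""
    return max(r["year"] for r in train) >= min(r["year"] for r in test)

train, test = chronological_split(records, last_train_year=2023)

# The colleague's suggestion: shuffle first, then cut at the same size.
rng = random.Random(0)
shuffled = records[:]
rng.shuffle(shuffled)
shuffled_train, shuffled_test = shuffled[:len(train)], shuffled[len(train):]

print("chronological look-ahead:", has_lookahead(train, test))
print("shuffled look-ahead:     ", has_lookahead(shuffled_train, shuffled_test))
```

A shuffled split avoids look-ahead only in the unlikely case that the shuffle happens to place every 2019-2023 row in the training portion.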
Q4
An analyst encounters a dataset where 40% of rows have at least one missing value,
and the missingness appears related to higher income levels (MAR—missing at random
conditional on observed data). Which approach is most statistically sound?
A. Delete all rows with missing values since 40% is manageable
B. Use multiple imputation that models the missingness mechanism conditional on
observed variables [CORRECT]
C. Replace all missing values with zeros to maintain dataset size
D. Use mean imputation without considering the missingness mechanism
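Rationale: Under MAR, missingness depends on observed variables, so the imputation must
condition on them. Multiple imputation (e.g., MICE) draws each missing value from a
model fitted on the observed variables and carries the imputation uncertainty into the
final estimates. Deleting incomplete rows discards 40% of the data and can bias results
when missingness is not completely at random. Zero imputation and naive mean imputation
distort distributions and the relationships between variables.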
Correct Answer: B
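
Illustration for Q4: a simplified sketch of why conditioning on observed variables matters under MAR. It is not multiple imputation: it performs a single regression imputation, which shows the bias correction but omits the repeated random draws and pooling that MICE adds to reflect uncertainty. The `income` and `spend` variables, their relationship, and the missingness probabilities are all assumptions made for the example.

```python
import random

rng = random.Random(0)

# Hypothetical data: income is always observed; spend depends on income.
rows = []
for _ in range(2000):
    income = rng.uniform(20, 100)
    spend = 2.0 * income + rng.gauss(0, 5)
    # MAR: the chance that spend is missing depends only on observed income.
    p_missing = 0.8 if income > 60 else 0.05
    rows.append((income, spend, rng.random() < p_missing))

# Reference value, available here only because the data are simulated.
true_mean = sum(spend for _, spend, _ in rows) / len(rows)

# Listwise deletion: keep only rows where spend is observed.
complete = [(inc, sp) for inc, sp, miss in rows if not miss]
complete_case_mean = sum(sp for _, sp in complete) / len(complete)

# Ordinary least squares of spend on income, fitted on complete cases only.
n = len(complete)
mean_x = sum(x for x, _ in complete) / n
mean_y = sum(y for _, y in complete) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in complete)
sxx = sum((x - mean_x) ** 2 for x, _ in complete)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Fill each missing spend with its prediction from the observed income.
filled = [intercept + slope * inc if miss else sp for inc, sp, miss in rows]
imputed_mean = sum(filled) / len(filled)

print(f"true mean:          {true_mean:.1f}")
print(f"complete-case mean: {complete_case_mean:.1f}")
print(f"regression-imputed: {imputed_mean:.1f}")
```

Because high-income rows are the ones most often missing, the complete-case mean is pulled downward, while the income-conditioned imputation lands much closer to the reference value.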
Q5
A modeler needs to create a validation set from 10,000 observations. The target
variable is binary with a 95:5 class imbalance. Which splitting strategy best ensures the
validation set is representative?
A. Simple random sampling with 80/20 split
B. Stratified random sampling that preserves the 95:5 class ratio in both sets
[CORRECT]
C. Use all majority class for training and all minority class for validation
D. Oversample the minority class before splitting to create 50:50 balance in the original
data
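Rationale: Stratified sampling ensures both training and validation sets maintain the
original class distribution, enabling reliable performance estimation. Simple random
sampling lets the minority share drift by chance, which matters most when the minority
class is small. Putting each class entirely in one set means the model never trains on
the minority class and is never validated on the majority class. Oversampling before
splitting places copies of the same observations in both sets, leaking validation
information into training.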
Correct Answer: B
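
Illustration for Q5: one possible way to write a stratified split by hand, using the 10,000 observations and 95:5 ratio from the question. The function name and its parameters are hypothetical; libraries such as scikit-learn offer equivalent functionality. Each class is shuffled and cut separately, so both sets keep the same class ratio.

```python
import random
from collections import defaultdict

def stratified_split(labels, val_fraction, seed=0):
    """Return (train_idx, val_idx), splitting each class by val_fraction."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)  # randomize within the class
        n_val = round(len(idxs) * val_fraction)
        val_idx.extend(idxs[:n_val])
        train_idx.extend(idxs[n_val:])
    return train_idx, val_idx

# 10,000 observations with a 95:5 imbalance (1 = minority class).
labels = [1] * 500 + [0] * 9500
train_idx, val_idx = stratified_split(labels, val_fraction=0.2)

train_pos = sum(labels[i] for i in train_idx)
val_pos = sum(labels[i] for i in val_idx)

print(f"train: {len(train_idx)} rows, minority share {train_pos / len(train_idx):.2%}")
print(f"val:   {len(val_idx)} rows, minority share {val_pos / len(val_idx):.2%}")
```

Any resampling to rebalance the classes would then be applied to the training indices only, after this split.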
Q6
An R user runs summary(df) and observes that a numeric feature has minimum =
-999, maximum = 150, mean = 42, and median = 45. What is the most likely data quality
issue?
A. The feature has a normal distribution with slight left skew
B. -999 is likely a sentinel value for missing data that was not properly coded as NA
[CORRECT]
C. The mean being lower than the median indicates right-skewed data
D. The maximum value of 150 is an outlier that should be removed
Rationale: -999 is a common sentinel value for missing data in older datasets or
systems that do not support NA. It artificially depresses the mean and creates a
spurious minimum. The mean (42) being less than the median (45) is consistent with left
skew, so option C has the direction reversed, but the implausible minimum is the
dominant signal.
Correct Answer: B
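
Illustration for Q6: a brief sketch of recoding a sentinel value before computing summary statistics. The sample values are made up and do not reproduce the figures in the question; `None` stands in for R's NA here.

```python
import statistics

SENTINEL = -999  # assumed missing-data code, as in the question

# Made-up feature values containing two sentinel entries.
values = [45, 50, 38, -999, 61, 47, 150, -999, 44, 52]

# Recode the sentinel as missing, then summarize only the observed values.
cleaned = [None if v == SENTINEL else v for v in values]
present = [v for v in cleaned if v is not None]
n_missing = len(cleaned) - len(present)

raw_mean = statistics.fmean(values)
clean_mean = statistics.fmean(present)

print(f"raw:     min={min(values)}, mean={raw_mean:.1f}")
print(f"cleaned: min={min(present)}, mean={clean_mean:.1f}, missing={n_missing}")
```

Whether 150 is a genuine observation or a separate data-entry problem is a different question and would need domain knowledge to settle.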