Questions | Georgia Tech 2026 | Grade A –
Verified Solutions
SECTION 1: INTRODUCTION & MODELING FRAMEWORK
(Questions 1–20)
1. Which of the following best defines “analytics modeling”?
A. The process of storing large amounts of data
B. The use of mathematical and statistical methods to extract insights and support
decision-making from data
C. The design of efficient database schemas
D. The creation of data visualization dashboards
Rationale: Analytics modeling focuses on building models (statistical, machine learning,
optimization) to understand patterns, predict outcomes, or prescribe actions. Storage,
schemas, and dashboards are related but not definitions of modeling.
2. In the analytics modeling framework, what is the first step after defining the business
problem?
A. Build a complex model immediately
B. Collect and prepare relevant data
C. Deploy the model into production
D. Validate model assumptions
Rationale: Once the problem is defined, the next critical step is obtaining and cleaning
the data that will be used to build the model. Data preparation is usually the most
time-consuming part.
3. A model that is too complex for the amount of available data is most likely to suffer
from:
A. Underfitting
B. Overfitting
C. Bias-variance tradeoff balance
D. Cold start problem
Rationale: Overfitting occurs when a model learns noise and random fluctuations in the
training data rather than the underlying pattern. This happens when model complexity is
high relative to data size.
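The gap between training and test error can be illustrated with a small sketch. It is not part of the original question: the synthetic data, the noise level, and the helper names (one_nn_predict, fit_line, mse) are assumptions chosen for illustration. A 1-nearest-neighbor predictor memorizes every training point, while a straight line is a much simpler model.

```python
import random

def one_nn_predict(train, x):
    # Return the y of the training point whose x is closest to the query.
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]

def fit_line(train):
    # Closed-form least squares for y = intercept + slope * x.
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def mse(pairs):
    # pairs: iterable of (actual, predicted) values.
    pairs = list(pairs)
    return sum((a - p) ** 2 for a, p in pairs) / len(pairs)

# Hypothetical data: a linear trend plus Gaussian noise.
rng = random.Random(0)
train = [(float(x), 2.0 * x + rng.gauss(0, 3)) for x in range(30)]
test = [(x + 0.5, 2.0 * (x + 0.5) + rng.gauss(0, 3)) for x in range(30)]

intercept, slope = fit_line(train)
nn_train_mse = mse((y, one_nn_predict(train, x)) for x, y in train)
nn_test_mse = mse((y, one_nn_predict(train, x)) for x, y in test)
line_train_mse = mse((y, intercept + slope * x) for x, y in train)
line_test_mse = mse((y, intercept + slope * x) for x, y in test)

print("1-NN  train/test MSE:", nn_train_mse, nn_test_mse)
print("line  train/test MSE:", line_train_mse, line_test_mse)
```

The 1-NN model reaches zero training error because it reproduces the noise in each training label, yet its error on new points is clearly nonzero. That gap is the signature of overfitting.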
4. Which of the following tasks is an example of supervised learning?
A. Customer segmentation using k-means
B. Predicting house prices using historical sales data
C. Dimensionality reduction with PCA
D. Association rule mining for market basket analysis
Rationale: Supervised learning requires a labeled output (target) variable. House price
prediction uses past sales, with their known prices, as labels. Clustering, PCA, and
association rule mining are unsupervised.
5. Which of the following tasks is an example of unsupervised learning?
A. Classification of emails as spam or not spam
B. Clustering customers into groups based on purchasing behavior
C. Predicting temperature from weather features
D. Estimating the probability of loan default
Rationale: Unsupervised learning finds hidden structures without labeled outcomes.
Clustering is a classic unsupervised task.
6. The “bias-variance tradeoff” implies that:
A. Increasing model complexity always reduces test error
B. As model complexity increases, bias typically decreases and variance increases
C. Bias and variance move in the same direction
D. Simple models always have high variance
Rationale: The tradeoff: simple models (high bias, low variance) underfit; complex
models (low bias, high variance) overfit. The goal is to find a balance that minimizes
total error.
7. Holdout validation splits the data into:
A. Only training and test sets
B. Training, validation, and test sets (often training+validation vs test)
C. Only training set
D. K equal sized folds
Rationale: Holdout validation uses a single split of the data, either into training and
test sets (e.g., 70% training, 30% testing) or into training, validation, and test sets.
K-fold cross-validation, by contrast, rotates through multiple folds.
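A single holdout split can be sketched as follows. The function name holdout_split, the 30% test fraction, and the fixed seed are illustrative assumptions, not part of the original material.

```python
import random

def holdout_split(rows, test_fraction=0.3, seed=0):
    # Shuffle a copy so the caller's data keeps its original order.
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    # The first n_test shuffled rows become the test set; the rest train.
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(10))
train, test = holdout_split(rows, test_fraction=0.3)
print(len(train), len(test))
```

Every row lands in exactly one of the two sets, and the split is made only once. To get a three-way split, the same function could be applied again to the training portion.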
8. In k-fold cross-validation, what is the purpose of the validation folds?
A. To train the final model
B. To estimate the model’s performance on unseen data
C. To increase the training set size
D. To select features
Rationale: Each fold is used once as validation to compute an out-of-sample error
estimate; the model is trained on the other k-1 folds. This provides a more robust
performance estimate than a single holdout.
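The fold rotation described above can be sketched as an index generator. The name k_fold_indices and the round-robin fold assignment are assumptions for illustration; in practice the rows are usually shuffled first.

```python
def k_fold_indices(n_rows, k):
    # Assign row i to fold i % k, so fold sizes differ by at most one.
    folds = [list(range(start, n_rows, k)) for start in range(k)]
    for held_out in range(k):
        validation = folds[held_out]
        # Train on every fold except the one currently held out.
        training = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield training, validation

splits = list(k_fold_indices(10, 5))
for training, validation in splits:
    print("train on", training, "validate on", validation)
```

Each row is used for validation exactly once and for training k-1 times. The k validation scores are then typically averaged into one performance estimate.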
9. Which statement about the confusion matrix is correct?
A. It is only used for regression problems
B. It shows the counts of true positives, false positives, true negatives, and false
negatives
C. It cannot be used for multi-class classification
D. It does not depend on the chosen threshold
Rationale: A confusion matrix is a table for classification. For binary classification it
contains TP, FP, TN, and FN. Multi-class extensions exist. For classifiers that output
scores or probabilities, the counts depend on the decision threshold.
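The threshold dependence can be seen in a small sketch. The labels, scores, and the function name confusion_counts are made up for illustration.

```python
def confusion_counts(y_true, scores, threshold=0.5):
    # Predict positive when the score reaches the threshold, then tally outcomes.
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for actual, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1:
            counts["TP" if actual == 1 else "FP"] += 1
        else:
            counts["FN" if actual == 1 else "TN"] += 1
    return counts

# Hypothetical labels (1 = positive) and classifier scores.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.1]

at_050 = confusion_counts(y_true, scores, threshold=0.5)
at_035 = confusion_counts(y_true, scores, threshold=0.35)
print(at_050)  # {'TP': 2, 'FP': 1, 'TN': 2, 'FN': 1}
print(at_035)  # {'TP': 3, 'FP': 1, 'TN': 2, 'FN': 0}
```

Lowering the threshold from 0.5 to 0.35 turns one false negative into a true positive, so the same model and data yield a different confusion matrix.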
10. Sensitivity (recall) is defined as:
A. TP / (TP + FP)
B. TP / (TP + FN)
C. TN / (TN + FP)
D. (TP + TN) / (TP + TN + FP + FN)
Rationale: Sensitivity measures the proportion of actual positives correctly identified.
Formula: TP / (TP + FN). Precision is TP/(TP+FP); specificity is TN/(TN+FP).
11. Precision is defined as:
A. TP / (TP + FP)
B. TP / (TP + FN)
C. TN / (TN + FP)
D. (TP + TN) / total
Rationale: Precision answers: "Of all predicted positives, how many were actually
positive?" High precision means few false positives among the predicted positives. It is
not the same as the false positive rate, which is FP / (FP + TN).
12. The F1 score is the harmonic mean of:
A. Accuracy and recall
B. Precision and recall
C. Sensitivity and specificity
D. Precision and accuracy
Rationale: F1 = 2 × (precision × recall) / (precision + recall). It balances precision and
recall, which makes it useful for imbalanced classes.
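The formulas from questions 10 to 12 can be checked against each other with a short sketch. The counts are invented for illustration, and returning 0.0 when a denominator is zero is an assumed convention rather than something stated in the source.

```python
def classification_metrics(tp, fp, tn, fn):
    # Guard each ratio against an empty denominator (assumed convention: 0.0).
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)            # sensitivity
    specificity = ratio(tn, tn + fp)
    accuracy = ratio(tp + tn, tp + fp + tn + fn)
    f1 = ratio(2 * precision * recall, precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
    }

# Hypothetical counts: 60 actual positives, 40 actual negatives.
metrics = classification_metrics(tp=40, fp=10, tn=30, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, precision is 0.8 and recall is about 0.667. F1 is about 0.727, which lies between the two and closer to the lower value, as expected of a harmonic mean.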
13. Underfitting is characterized by:
A. High variance on test data
B. High bias and poor performance on both training and test data
C. Zero training error
D. Complex decision boundaries
Rationale: Underfitting occurs when a model is too simple to capture the pattern,
leading to high bias and poor fit on training data, which extends to test data.
14. The ROC curve plots:
A. Precision vs recall
B. True positive rate (TPR) vs false positive rate (FPR)
C. Accuracy vs model complexity
D. Sensitivity vs specificity
Rationale: The ROC (Receiver Operating Characteristic) curve plots TPR (sensitivity) on
the y-axis against FPR (1 - specificity) on the x-axis as the decision threshold varies.
The area under the curve (AUC) summarizes performance in a single number.
15. A model with AUC = 0.5 indicates:
A. Perfect classification
B. Random guessing (no discriminative power)
C. Slightly better than random
D. The model always predicts the majority class
Rationale: An AUC of 0.5 means the classifier ranks positives above negatives no better
than a coin flip. As a rough rule of thumb, values above 0.7 are often considered
acceptable, though this depends on the application; 1.0 is perfect.
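A minimal sketch of how ROC points and AUC could be computed from scores. It uses the pairwise-ranking interpretation of AUC (the chance that a random positive outscores a random negative, with ties counted as one half). The function names and example data are assumptions for illustration.

```python
def roc_points(y_true, scores):
    # One (FPR, TPR) point per distinct threshold, from strictest to loosest.
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc_by_pairs(y_true, scores):
    # Fraction of (positive, negative) pairs ranked correctly; ties count 0.5.
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0]
perfect = auc_by_pairs(labels, [0.9, 0.8, 0.2, 0.1])
uninformative = auc_by_pairs(labels, [0.5, 0.5, 0.5, 0.5])
curve = roc_points(labels, [0.9, 0.8, 0.2, 0.1])
print(perfect, uninformative, curve)
```

A scorer that ranks every positive above every negative gets an AUC of 1.0. One that assigns the same score to everything gets 0.5, matching the "no discriminative power" reading.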
16. Which is a common way to handle missing numeric data?
A. Delete any row with a missing value regardless of context
B. Imputation using mean, median, or model-based methods
C. Replace missing values with zero
D. Ignore missing values
Rationale: Imputation is a standard technique to retain data while handling
missingness. Mean/median imputation is simple; more advanced methods (k-NN,
regression) can be used.
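Mean and median imputation can be sketched with the standard library. The impute helper, the use of None to mark missing values, and the sample ages are assumptions for illustration.

```python
from statistics import mean, median

def impute(values, strategy="mean"):
    # Compute the fill value from observed entries only, then fill the gaps.
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]

# Hypothetical column with two missing entries and one large value.
ages = [20, None, 30, 100, None]
mean_filled = impute(ages, strategy="mean")
median_filled = impute(ages, strategy="median")
print(mean_filled)    # [20, 50, 30, 100, 50]
print(median_filled)  # [20, 30, 30, 100, 30]
```

The large value 100 pulls the mean fill up to 50 while the median fill stays at 30. This is one reason the median is often preferred for skewed columns.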
17. Outliers in a dataset can be detected using:
A. Correlation matrix
B. Boxplots (IQR method) or z-scores
C. Linear regression coefficients
D. Confusion matrix
Rationale: Outliers are often identified as values more than 1.5×IQR below the first
quartile or above the third quartile, or as values with |z-score| > 3. Boxplots show
such points visually.
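Both rules can be sketched with the standard library. The function names, the sample data, and the choice of the "inclusive" quartile method and the population standard deviation are assumptions; other quartile conventions can give slightly different fences.

```python
from statistics import mean, pstdev, quantiles

def iqr_outliers(values, k=1.5):
    # Flag values beyond k * IQR below Q1 or above Q3.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

def z_score_outliers(values, cutoff=3.0):
    # Flag values whose absolute z-score exceeds the cutoff.
    mu = mean(values)
    sigma = pstdev(values)
    return [v for v in values if abs(v - mu) / sigma > cutoff]

# Hypothetical measurements: twenty ordinary values and one extreme value.
data = [10, 11, 12, 13] * 5 + [95]
print(iqr_outliers(data))      # [95]
print(z_score_outliers(data))  # [95]
```

Here both rules agree on the single extreme value. On small or heavily skewed samples they can disagree, because the outlier itself inflates the mean and standard deviation that the z-score relies on.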
18. Normalization (min-max scaling) transforms features to the range:
A. [-1, 1]
B. Typically [0, 1]
C. (-∞, ∞)