DATA MINING AND STAT LEARN FINAL PAPER
2026 COMPLETE QUESTIONS AND ANSWERS
GRADED A+
◉ SVM Pros/Cons. Answer: Pros: Works well when there is a clear
margin of separation between classes.
It is effective in high dimensional spaces.
It is effective in cases where the number of dimensions is greater
than the number of samples.
It uses a subset of training points in the decision function (called
support vectors), so it is also memory efficient.
Cons: Not good for very large data sets, because training becomes
slow.
Not good when the data set is noisy, i.e. the target classes
overlap.
Doesn't directly provide probability estimates.
◉ K-nearest neighbor (K-NN). Answer: A supervised
classification algorithm. Looks at the K closest labeled points to
the new one and classifies it as whichever class is most common among
them.
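The voting rule described above can be sketched in a few lines. This is a minimal illustration, not a tuned implementation: the function name knn_classify, the Euclidean distance, and the toy data are all assumptions made for the example, and the number of neighbors is called k.

```python
from collections import Counter
import math


def knn_classify(train_points, train_labels, new_point, k=3):
    """Classify new_point by majority vote among its k nearest training points."""
    # Pair the distance to each training point with that point's label.
    distances = [
        (math.dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points and count how often each label appears.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups labeled "a" and "b".
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]
prediction = knn_classify(points, labels, (1.1, 1.0), k=3)
```

Note that every prediction scans the full training set, which is where the computational cost of K-NN comes from.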
◉ K-nearest neighbor (K-NN) Pros/Cons. Answer: Pros: No
assumptions about data
Easy to understand/interpret
Versatile
Cons: Computationally expensive because the algorithm stores all
training data and compares each new point against it
Sensitive to irrelevant features and scale of data
◉ k-fold cross validation. Answer: Validation technique where the
data is divided into K subsets (folds). Each subset is then used once
for testing while the rest are used for training. The algorithm
rotates through each subset and averages the results.
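The rotation through subsets can be sketched as follows. This is a rough illustration under stated assumptions: the helper names k_fold_splits and cross_validate, the mean-predictor model, and the toy data are all made up for the example, and the data is split in its given order (in practice it is usually shuffled first).

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validate(xs, ys, k, fit, score):
    """Train on k-1 folds, score on the held-out fold, and average the scores."""
    scores = []
    for train, test in k_fold_splits(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(score(model, [xs[i] for i in test], [ys[i] for i in test]))
    return sum(scores) / len(scores)


# Hypothetical toy model: always predict the mean of the training targets.
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)


def mean_abs_error(model, test_x, test_y):
    return sum(abs(y - model) for y in test_y) / len(test_y)


xs = list(range(10))
ys = [2.0 * x for x in xs]
average_error = cross_validate(xs, ys, k=5, fit=fit_mean, score=mean_abs_error)
```

Because contiguous blocks are held out, this plain version would leak future values into training on time series data, which is one reason the technique is a poor fit there.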
◉ K Fold cross Validation Pros/Cons. Answer: Pros: Validates
the performance of the model
Can create balance across the predicted classes
Cons: Doesn't work well with time series data
The aggregate scores of your model could miss some important
extreme values or overpower them so they're harder to pick up on
◉ k-means clustering. Answer: Unsupervised learning heuristic that
starts by assigning K cluster centers, then assigns every data point
to the nearest center based on distance. The center point of each
cluster is then recalculated and all data points are reassigned.
Repeat the process until no data points change clusters. The ideal
number of clusters can be identified via an elbow diagram.
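The assign-then-recompute loop can be sketched as below. It is a minimal illustration, not a production implementation: the function name k_means, the optional initial_centers argument, the random choice of starting centers from the data, and the toy points are assumptions made for the example.

```python
import math
import random


def k_means(points, k, initial_centers=None, seed=0, max_iter=100):
    """Cluster points into k groups; return (centers, assignment)."""
    if initial_centers is None:
        # Assumption: start from k distinct data points picked at random.
        centers = random.Random(seed).sample(points, k)
    else:
        centers = list(initial_centers)
    assignment = None
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        # Stop once no point changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Move each center to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centers, assignment


# Hypothetical toy data: two compact groups far apart.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, assignment = k_means(data, k=2, initial_centers=[(0.0, 0.0), (0.0, 1.0)])
```

The starting centers here are deliberately poor (both in the same group) to show that the loop still separates the two groups on this easy data; on harder data a different start can give a different result.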
◉ k-means pros and cons. Answer: Pros: Simple to implement
Scales well to large data sets
Easily adaptable
Cons: K must be chosen manually, and results can be biased by the
initial center values
Sensitive to outliers
◉ Grubbs Outlier Test. Answer: A test that uses a suspected
outlier's value, the mean of the data, and the standard deviation to
determine whether the data point is within the confidence interval
for a normal distribution or should be thrown out
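The calculation can be sketched as follows. The statistic is the distance of the most extreme point from the mean, measured in sample standard deviations, and it is compared against a critical value built from a t-distribution quantile. The function names, the toy readings, and the t_crit argument are assumptions for this example; t_crit is meant to be looked up by the reader (for a two-sided test, the upper critical value of the t distribution with n - 2 degrees of freedom at significance alpha / (2n)), since the standard library has no t-distribution.

```python
import math
import statistics


def grubbs_statistic(data):
    """Return (G, index) for the point farthest from the mean."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)  # sample standard deviation
    suspect = max(range(len(data)), key=lambda i: abs(data[i] - mean))
    return abs(data[suspect] - mean) / sd, suspect


def grubbs_critical(n, t_crit):
    """Critical value of G for n points, given a looked-up t quantile."""
    return (n - 1) / math.sqrt(n) * math.sqrt(t_crit ** 2 / (n - 2 + t_crit ** 2))


def is_outlier(data, t_crit):
    """Return (flag, index): flag is True when G exceeds the critical value."""
    g, suspect = grubbs_statistic(data)
    return g > grubbs_critical(len(data), t_crit), suspect


# Hypothetical readings with one suspiciously large value at the end.
readings = [2.0, 2.1, 1.9, 2.0, 2.2, 1.8, 5.0]
g, suspect_index = grubbs_statistic(readings)
```

The test assumes the remaining data is approximately normal and checks one suspected point at a time.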
◉ CUSUM. Answer: Change detection model that keeps a running
total of the amount that observations vary above the expected value.
If the running total exceeds a preset threshold value, it indicates
there has been a change.
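The running total can be sketched as below for detecting an increase. This follows the common formulation in which a slack value C is subtracted from each deviation and the total is never allowed to drop below zero; the function name cusum_detect, the parameter names, and the toy observations are assumptions made for the example, with T standing for the preset threshold.

```python
def cusum_detect(observations, expected, C, T):
    """Return the index at which the running total first exceeds T, else None."""
    running_total = 0.0
    for i, x in enumerate(observations):
        # Add how far x sits above the expected value, minus the slack C,
        # and reset to zero whenever the total would go negative.
        running_total = max(0.0, running_total + (x - expected - C))
        if running_total > T:
            return i
    return None


# Hypothetical process that shifts from 10 to 12 at index 4.
observations = [10, 10, 10, 10, 12, 12, 12, 12]
change_index = cusum_detect(observations, expected=10, C=0.5, T=4)
```

A larger C makes the detector ignore more ordinary variation, and a larger T delays the alarm but reduces false alarms.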
◉ CUSUM Pros/Cons. Answer: Pros: Well suited to detecting small
shifts of the process mean, especially 0.5 to 2 SD from the target
mean
Shifts in the process mean are easy to identify visually
Cons: Cumbersome to establish and maintain
Tough to interpret the patterns.
Choosing C and T values is both a pro and a con, as it can cause bias
but creates more flexibility