COMMONLY ASKED QUESTIONS FOR DATA SCIENCE.
ANSWERS RESEARCHED AND PROVIDED. 2024/2025
UPDATE
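1. What is type 1 error and type 2 error? Type 1 error: falsely
concluding that an intervention was successful (rejecting a true null
hypothesis). Known as a false positive result.
Type 2 error: falsely concluding that an intervention was not
successful (failing to reject a false null hypothesis). Known as a
false negative result.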
2. What can we do about overfitting? > Regularization (penalizing
model complexity while we're training)
> L2 regularization penalizes really big weights - complexity(model) =
sum of squares of weights
> Regularization means that instead of minimizing only loss, we
minimize loss + complexity, which is called structural risk
minimization
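The loss + complexity idea above can be sketched in a few lines of Python. This is only an illustration: the function names and the regularization strength lam are hypothetical, and the data loss is passed in as a plain number rather than computed from a model.

```python
def l2_penalty(weights):
    """Model complexity measured as the sum of squared weights."""
    return sum(w * w for w in weights)


def regularized_loss(data_loss, weights, lam=0.1):
    """Structural risk: data loss plus lam times the L2 complexity term.

    lam is a hypothetical regularization strength chosen for illustration.
    """
    return data_loss + lam * l2_penalty(weights)


# Two models with the same data loss but very different weight sizes.
small = regularized_loss(1.0, [0.5, -0.5])   # penalty = 0.1 * 0.5
large = regularized_loss(1.0, [5.0, -5.0])   # penalty = 0.1 * 50.0
print(small, large)
```

With the same data loss, the model with the bigger weights gets the larger total, which is why training pushes towards smaller weights.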
3. Describe true positive, false positive, false negative, true negative:
> True positive - we correctly called wolf; the town is saved.
> False positive - we called wolf falsely; the town is mad.
> False negative - there was a wolf but we didn't spot it. Chickens
are eaten.
> True negative - no wolf, no alarm. All is well.
4. What is precision? True Positive / (True Positive + False Positive)
When you classify something as positive, how often are you right?
5. What is recall? True positive / (True positive + False Negative)
Out of everything that is actually positive, how many did you
correctly classify as positive?
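As a quick check of the two formulas, here is a minimal Python sketch. The counts are made up for illustration (8 wolves correctly called, 2 false alarms, 4 wolves missed), and the helper names are hypothetical.

```python
def precision(tp, fp):
    """True positives divided by everything predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """True positives divided by everything that is actually positive."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts for illustration only.
tp, fp, fn = 8, 2, 4
p = precision(tp, fp)   # 8 / 10
r = recall(tp, fn)      # 8 / 12
print(p, r)
```

Returning 0.0 when the denominator is zero is a convention chosen for this sketch; other conventions exist.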
6. What is an ROC curve? A graph showing the performance of a
classification model at all classification thresholds. The curve plots
two parameters: true positive rate (recall) and false positive rate,
which equals 1 - specificity (specificity is the true negative rate:
true negative / (true negative + false positive)), each ranging from 0 to 1
i.e. TPR on the y axis, and FPR on the x axis
7. What is false positive rate? (false positive / (false positive + true
negative))
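To see how the points of an ROC curve come from TPR and FPR, the sketch below computes one (FPR, TPR) pair per threshold. The labels, scores, and thresholds are invented for illustration, and roc_points is a hypothetical helper, not a library function.

```python
def roc_points(labels, scores, thresholds):
    """Return a list of (FPR, TPR) pairs, one per threshold.

    An example is predicted positive when its score is >= the threshold.
    """
    points = []
    for t in thresholds:
        pairs = list(zip(labels, scores))
        tp = sum(1 for y, s in pairs if y == 1 and s >= t)
        fp = sum(1 for y, s in pairs if y == 0 and s >= t)
        fn = sum(1 for y, s in pairs if y == 1 and s < t)
        tn = sum(1 for y, s in pairs if y == 0 and s < t)
        tpr = tp / (tp + fn) if (tp + fn) else 0.0
        fpr = fp / (fp + tn) if (fp + tn) else 0.0
        points.append((fpr, tpr))
    return points


# Hypothetical data: 1 = wolf, 0 = no wolf, with model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
points = roc_points(labels, scores, thresholds=[1.0, 0.75, 0.5, 0.0])
print(points)
```

Lowering the threshold moves the point from (0, 0) towards (1, 1): more true positives are caught, but more false alarms are raised as well.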
8. What is the bias? An error from erroneous assumptions in the
learning algorithm. High bias can cause an algorithm to miss the relevant
relations between features and target outputs (underfitting).
Bias can also mean sampling bias: the effect on the model when the
sample systematically misrepresents the 'real' data. Most datasets are
a convenience sample - the data easiest to collect.
9. What is variance? An error from sensitivity to small fluctuations in
the training set. High variance can cause an algorithm to model the
random noise in the training data, rather than the intended outputs
(overfitting).
In other words, it is the effect on the model of being built from this
sample rather than that sample.
Variance measures how inconsistent the predictions are with one
another.
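One rough way to picture the difference between bias and variance: imagine several models, each trained on a different sample, all predicting the same input. The sketch below uses made-up predictions and a hypothetical helper, bias_and_variance; it is not a full bias-variance decomposition.

```python
from statistics import mean, pvariance


def bias_and_variance(predictions, true_value):
    """Bias: how far the average prediction is from the truth.
    Variance: how much the predictions disagree with one another."""
    return mean(predictions) - true_value, pvariance(predictions)


true_value = 5.0
# Hypothetical predictions from models trained on different samples.
steady_but_off = [7.0, 7.1, 6.9]   # consistent, but consistently wrong
scattered = [3.0, 5.0, 7.0]        # right on average, but inconsistent

bias_a, var_a = bias_and_variance(steady_but_off, true_value)
bias_b, var_b = bias_and_variance(scattered, true_value)
print(bias_a, var_a)
print(bias_b, var_b)
```

The first set shows high bias with low variance (the underfitting pattern), the second low bias with high variance (the overfitting pattern).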
10. What is skewness? Asymmetry in a statistical distribution, in which
the curve appears distorted or skewed either to the left or to the right.
Skewness can be quantified to define the extent to which a distribution
differs from a normal distribution.
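A distribution whose tail goes towards the negative side (to the left)
has negative skewness; a tail towards the positive side (to the right)
gives positive skewness.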
11. What is kurtosis? The sharpness of the peak of a
frequency-distribution curve. More precisely, it reflects how heavy the
tails are compared with a normal distribution.
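Both skewness and kurtosis can be quantified from the central moments of the data. The sketch below uses the simple population (moment-based) formulas; statistical packages often apply small-sample corrections, so their numbers can differ slightly. The function names and sample data are illustrative.

```python
from statistics import mean


def central_moment(xs, k):
    """k-th central moment: average of (x - mean) ** k."""
    m = mean(xs)
    return sum((x - m) ** k for x in xs) / len(xs)


def skewness(xs):
    """Third central moment divided by the standard deviation cubed."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5


def excess_kurtosis(xs):
    """Fourth central moment over variance squared, minus 3 so that a
    normal distribution scores 0."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2 - 3.0


symmetric = [1, 2, 3, 4, 5]
left_tailed = [1, 8, 9, 9, 10, 10, 10]   # one low value drags the tail left

print(skewness(symmetric), skewness(left_tailed))
print(excess_kurtosis(symmetric))
```

The symmetric sample has skewness 0, while the sample with a long left tail comes out negative, matching the definition of negative skewness.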
12. What are the different ways to handle missing values?
1. Delete the entire row/column
2. Replace by a fixed value (e.g. "unknown")
3. General statistic replacement (replace values by a statistic
associated with a particular column, like mean or median)
4. Grouped statistic replacement (replace values by a statistic
associated with a particular group)
5. Imputation - predict values based on nearest neighbors or
likelihood
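The first four options in the list above can be sketched with plain Python on a tiny made-up table (a list of dicts with None marking a missing age). The column names and values are hypothetical; in practice a data-frame library would usually be used, and model-based imputation (option 5) is not shown here.

```python
from statistics import mean, median

# Hypothetical data: None marks a missing value.
rows = [
    {"city": "A", "age": 30},
    {"city": "A", "age": None},
    {"city": "B", "age": 50},
    {"city": "B", "age": 70},
    {"city": "B", "age": None},
]


def fill_age(row, value):
    """Return a copy of row with a missing age replaced by value."""
    return dict(row, age=value if row["age"] is None else row["age"])


# 1. Delete rows that have a missing value.
dropped = [r for r in rows if r["age"] is not None]

# 2. Replace by a fixed value.
fixed = [fill_age(r, "unknown") for r in rows]

# 3. General statistic replacement: median of the whole column.
col_median = median(r["age"] for r in rows if r["age"] is not None)
general = [fill_age(r, col_median) for r in rows]

# 4. Grouped statistic replacement: mean age within each city.
group_means = {}
for city in {r["city"] for r in rows}:
    known = [r["age"] for r in rows if r["city"] == city and r["age"] is not None]
    group_means[city] = mean(known)
grouped = [fill_age(r, group_means[r["city"]]) for r in rows]

print(dropped, fixed, general, grouped, sep="\n")
```

Each strategy builds a new list, so the original rows stay untouched and the results can be compared side by side.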
13. What kind of feature transformation can you perform on
numeric features?
1. Rounding: round a numeric value to a given number of decimal
places, or to a discrete value that can be made categorical later
2. Discretization: binning of a variable to become categorical for
better value management
3. Scaling (change the scale of the variable for better
understanding), e.g. min-max, z-score, etc.
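The two scaling methods named above can be written out directly. This is a minimal sketch with hypothetical helper names; it assumes the values are not all identical (otherwise both formulas divide by zero) and uses the population standard deviation for the z-score.

```python
from statistics import mean, pstdev


def min_max_scale(xs):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def z_score_scale(xs):
    """Center on the mean and divide by the (population) standard deviation."""
    m, s = mean(xs), pstdev(xs)
    return [(x - m) / s for x in xs]


values = [10.0, 20.0, 30.0, 40.0, 50.0]
scaled = min_max_scale(values)
standardized = z_score_scale(values)
print(scaled)
print(standardized)
```

Min-max keeps everything inside a fixed range, while z-score expresses each value as a number of standard deviations away from the mean.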
14. What are some types of discretization methods? 1. Equal-width
binning (bins have equal ranges, roughly same distribution as
original variable)
2. Equal-density (frequency) binning (bins have an equal number of
examples/records/rows, giving a roughly uniform distribution across bins)
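The difference between the two binning methods shows up clearly on skewed data. The sketch below is a simplified illustration with hypothetical helper names: it returns a bin index per value, assumes the values are not all identical, and in the equal-frequency version tied values may land in different bins.

```python
def equal_width_bins(xs, n_bins):
    """Split the range [min, max] into n_bins intervals of equal width."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins
    # The maximum value is clamped into the last bin.
    return [min(int((x - lo) / width), n_bins - 1) for x in xs]


def equal_frequency_bins(xs, n_bins):
    """Assign bins by rank so each bin holds about the same number of rows."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    bins = [0] * len(xs)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(xs)
    return bins


# Hypothetical skewed data with one large outlier.
values = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print(equal_width_bins(values, 3))
print(equal_frequency_bins(values, 3))
```

With equal-width bins the outlier stretches the range, so almost every value falls into the first bin; equal-frequency bins put three values in each bin regardless of how far apart they are.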
15. What are the 5 categories of feature generation? 1. Indicator
features (Attributes that isolate key information)