2024 (GRADED A)
Describe the steps for developing a stratified sample - ✔✔1) Identify the
strata (the strata are defined by each combination of variables).
2) Draw a random sample from each stratum. Each sample size should be
the same proportion of the total number of records in the stratum to ensure
representativeness.
3) Combine all these samples to create a stratified sample.
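The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function name `stratified_sample`, the `region` and `sex` variables, and the records themselves are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(records, strata_keys, fraction, seed=0):
    """Draw the same fraction of records from every stratum."""
    rng = random.Random(seed)
    # Step 1: group records by each combination of the stratifying variables.
    strata = defaultdict(list)
    for rec in records:
        strata[tuple(rec[k] for k in strata_keys)].append(rec)
    # Step 2: sample the same proportion from each stratum (at least one record).
    sample = []
    for members in strata.values():
        n = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, n))
    # Step 3: the combined draws form the stratified sample.
    return sample

# Hypothetical records: 60 urban/M, 20 urban/F, 20 rural/F.
records = (
    [{"region": "urban", "sex": "M"} for _ in range(60)]
    + [{"region": "urban", "sex": "F"} for _ in range(20)]
    + [{"region": "rural", "sex": "F"} for _ in range(20)]
)
sample = stratified_sample(records, ["region", "sex"], fraction=0.1)
```

Because every stratum contributes the same fraction, the sample keeps the 60/20/20 mix of the full dataset.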
What are advantages and disadvantages of unstructured data -
✔✔Advantage: Unstructured data includes information that cannot be
stored in a tabular format. It can provide insights and qualitative
information that cannot be included in a structured dataset.
Disadvantage: Unstructured data often requires more complex methods to
process for input into a predictive model. It can also be more time-
consuming and resource-intensive to analyze unstructured data.
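As one example of the extra processing, free text can be converted into word-count features before modeling. The sketch below is a simple bag-of-words, which is only one of many possible approaches; the function name `bag_of_words` and the example notes are hypothetical.

```python
import re
from collections import Counter

def bag_of_words(documents):
    """Turn free-text documents into rows of word counts over a shared vocabulary."""
    # Lowercase each document and split it into word tokens.
    tokenized = [re.findall(r"[a-z']+", doc.lower()) for doc in documents]
    # The vocabulary (one column per distinct word) is only known after reading all text.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

# Hypothetical free-text notes.
notes = ["Water damage in kitchen", "Kitchen fire, smoke damage"]
vocabulary, rows = bag_of_words(notes)
```

Even this small step drops word order and context, which hints at why richer text methods take more time and resources.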
Describe two similarities and two differences between K-means clustering
and hierarchical clustering. - ✔✔Similarities:
- Both can be used to generate new features from multiple predictor
variables.
- Both are unsupervised learning techniques that group observations to
show structures and relationships in the data without reference to a target
variable.
Differences:
- K-means clustering requires preselecting k.
- Hierarchical clustering produces nested clusters.
- K-means only considers dissimilarity among observations and does not
have a notion of dissimilarity among clusters, whereas hierarchical
clustering also needs a measure of dissimilarity among clusters (a linkage).
Explain the tradeoff between selecting a value of K=2 and K=4 - ✔✔There
is a tradeoff between the percent of total variance explained by the clusters
and the complexity of the clustering model. K=2 explains a lower percentage
of total variance but represents a simpler model. K=4 explains a higher
percentage of total variance but represents a more complex model.
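The tradeoff can be made concrete by computing the share of variance explained (between-cluster sum of squares over total sum of squares) for each K. The sketch below runs a plain Lloyd's algorithm on hypothetical 1-D data. The helper names, the quantile-based starting centers, and the data values are all assumptions made for illustration.

```python
def kmeans_1d(values, k, iters=50):
    """Plain Lloyd's algorithm on 1-D data with quantile-based starting centers."""
    data = sorted(values)
    # Deterministic start: centers at evenly spaced positions in the sorted data.
    centers = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    clusters = []
    for _ in range(iters):
        # Assignment step: each value goes to its nearest center.
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: abs(x - centers[j]))
            clusters[nearest].append(x)
        # Update step: move each center to its cluster mean (keep it if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[j] for j, c in enumerate(clusters)]
    return clusters

def variance_explained(values, k):
    """Share of the total sum of squares accounted for by the k clusters."""
    mean = sum(values) / len(values)
    total_ss = sum((x - mean) ** 2 for x in values)
    within_ss = 0.0
    for cluster in kmeans_1d(values, k):
        if cluster:
            center = sum(cluster) / len(cluster)
            within_ss += sum((x - center) ** 2 for x in cluster)
    return 1 - within_ss / total_ss

# Hypothetical data: four tight groups that pair up into two broader groups.
values = [0.0, 0.5, 1.0, 3.0, 3.5, 4.0, 10.0, 10.5, 11.0, 13.0, 13.5, 14.0]
explained_k2 = variance_explained(values, 2)
explained_k4 = variance_explained(values, 4)
```

On this data K=4 explains more variance than K=2, but it needs twice as many clusters to describe and interpret.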
Issues with too many features in K-means analysis - ✔✔- The clusters
become harder to interpret, which makes them less useful for a predictive
model where the features need to be interpreted.
- Outliers in any of the features should be considered. If an outlier is far
enough from the other observations, it may be assigned its own cluster.
- It becomes harder to differentiate between observations that are close
and those that are far apart.
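The last point can be illustrated by simulation: as the number of features grows, the nearest and farthest neighbors of a point end up at nearly the same distance. The sketch below uses uniformly random points, which is an assumption for illustration; `relative_contrast` is a hypothetical helper name.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one query point to the rest."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query, others = points[0], points[1:]
    # Euclidean distance from the query point to every other point.
    dists = [math.dist(query, p) for p in others]
    return (max(dists) - min(dists)) / min(dists)

low_dim = relative_contrast(dim=2)     # few features
high_dim = relative_contrast(dim=500)  # many features
```

A small contrast in the many-feature case means distance-based cluster assignments carry less information.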
What are the differences between a GLM and linear models on transformed
data? - ✔✔1. The normal linear regression has a log transformation
applied to the response variable, while the GLM models the untransformed
response. The log transformation is reasonable for a right-skewed variable.
2. The GLM has the flexibility to select the probability distribution that best
fits the shape of the response variable, whereas the normal linear regression
model allows only the normal distribution.
3. In the normal linear model the variance of the (transformed) response
variable is constant, while in the GLM the variance can be a function of the
mean.
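Points 1 and 3 can be written side by side. The notation here is an assumption, not part of the original answer: $x_i$ is the predictor vector, $\beta$ the coefficients, $\phi$ a dispersion parameter, $V(\cdot)$ the variance function, and the GLM is assumed to use a log link.

```latex
\begin{aligned}
\text{Linear model on the log-transformed response:}\quad
  & \log Y_i = x_i^\top \beta + \varepsilon_i,\ \ \varepsilon_i \sim N(0, \sigma^2)
  && \Rightarrow\ \operatorname{Var}(\log Y_i) = \sigma^2 \\
\text{GLM with log link:}\quad
  & \log \mathrm{E}[Y_i] = x_i^\top \beta
  && \Rightarrow\ \operatorname{Var}(Y_i) = \phi\, V(\mu_i),\ \ \mu_i = \mathrm{E}[Y_i]
\end{aligned}
```

For example, $V(\mu) = \mu$ for a Poisson response and $V(\mu) = \mu^2$ for a gamma response, so the variance changes with the mean in both cases.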
How to identify heteroscedasticity - ✔✔Residuals vs. Fitted plot. Does the
spread of the residuals change as the prediction increases? A spread that
widens with the fitted values (funnel shape) indicates heteroscedasticity.
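A numeric stand-in for reading the plot is to compare the spread of the residuals at low and high fitted values. The sketch below uses simulated data whose noise grows with the predictor. The data-generating process and the median split are assumptions for illustration, not a formal test.

```python
import random
import statistics

rng = random.Random(0)
x = [1 + 9 * i / 199 for i in range(200)]
# Simulated response (an assumption): the noise standard deviation grows with x.
y = [2 * xi + rng.gauss(0, 0.5 * xi) for xi in x]

# Ordinary least squares for a single predictor, in closed form.
x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
slope = (
    sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    / sum((a - x_bar) ** 2 for a in x)
)
intercept = y_bar - slope * x_bar
fitted = [intercept + slope * a for a in x]
residuals = [b - f for b, f in zip(y, fitted)]

# Split the residuals at the median fitted value and compare their spread.
pairs = sorted(zip(fitted, residuals))
half = len(pairs) // 2
spread_low = statistics.pstdev([r for _, r in pairs[:half]])
spread_high = statistics.pstdev([r for _, r in pairs[half:]])
```

A clearly larger `spread_high` than `spread_low` is the numeric counterpart of the funnel shape.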
Interpret the Complexity Parameter table - ✔✔The complexity parameter
determines the threshold of improvement needed to produce an additional
split in the tree. This table lists the impact that changing the complexity
parameter value has on performance metrics, including the cross-validation error
(xerror). Cross-validation error measures how the model performs on
unseen data, which penalizes both underfit and overfit models.
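One common way to read such a table is to pick the row with the lowest xerror. The sketch below uses entirely hypothetical values, chosen only to show the typical pattern; the tuple layout is an assumption.

```python
# Hypothetical CP table rows: (cp, number of splits, relative error, xerror).
cp_table = [
    (0.200, 0, 1.00, 1.02),
    (0.080, 1, 0.80, 0.85),
    (0.030, 2, 0.72, 0.78),
    (0.010, 4, 0.66, 0.81),
    (0.001, 9, 0.58, 0.90),
]

# Relative (training) error keeps falling as cp shrinks and the tree grows...
rel_errors = [row[2] for row in cp_table]
# ...but cross-validation error bottoms out, then rises once the tree overfits.
best_cp, best_splits, _, best_xerror = min(cp_table, key=lambda row: row[3])
```

In this made-up table the row with two splits has the lowest xerror: rows above it underfit, and rows below it overfit.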