Questions and Answers.
Data - Answer Facts and statistics collected together for reference or analysis
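Generalized linear model (GLM) - Answer a statistical technique that increases the flexibility
of a linear model by relating the mean of the target to the linear predictor through a (possibly
nonlinear) link function and by allowing non-normal target distributions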
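As a rough illustration of the link-function idea, here is a minimal sketch in base R. The simulated count data and the variable names (x, y, fit, pred) are assumptions made for the example, not part of the original notes.

```r
# Simulated count data (hypothetical, for illustration only)
set.seed(1)
x <- runif(200, 0, 2)
y <- rpois(200, lambda = exp(0.5 + 1.2 * x))  # the mean of y is log-linear in x

# Poisson GLM with a log link: log(E[y]) = b0 + b1 * x
fit <- glm(y ~ x, family = poisson(link = "log"))
coef(fit)

# Predictions on the response scale are exp(linear predictor), so always positive
pred <- predict(fit, newdata = data.frame(x = c(0, 1, 2)), type = "response")
pred
```

The log link keeps predicted counts positive, which a plain linear model would not guarantee.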
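Principal component analysis - Answer An unsupervised technique, related to factor analysis, that
transforms the original variables into uncorrelated linear combinations (principal components)
ordered by how much of the variance each explains. The loadings show each variable's relative
strength/position within a component.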
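A minimal sketch of this in base R, assuming the built-in USArrests dataset as stand-in data (it is not mentioned in the notes) and using prcomp:

```r
# Built-in dataset, used here for illustration only
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component, in decreasing order
pve <- pca$sdev^2 / sum(pca$sdev^2)
round(pve, 3)

# Loadings: the weight of each original variable in each component
pca$rotation

# The component scores are uncorrelated with each other
score_cor <- cor(pca$x)
round(score_cor, 3)
```

Scaling the variables first (scale. = TRUE) is a common choice when they are measured in different units.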
Factor Variable vs keeping it numeric - Answer If the different "categories" have no sense of
order then it's obviously a factor. The second question is whether there is a potentially
compelling reason why the differences in the numerical values are meaningful. (A difference in
hours could be useful in predicting y throughout the day, but fall coming after spring may not be
helpful.)
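To make the choice concrete, here is a small sketch with simulated data. The season codes 1 to 4 and the effect sizes are hypothetical values picked for the example.

```r
set.seed(2)
season <- sample(1:4, 300, replace = TRUE)  # hypothetical codes for four seasons
effect <- c(5, 20, 8, 15)                   # no steady pattern across the codes
y <- effect[season] + rnorm(300)

fit_num <- lm(y ~ season)          # one slope: assumes a constant change per code step
fit_fac <- lm(y ~ factor(season))  # separate effect per level: no ordering assumed

length(coef(fit_num))  # intercept plus one slope
length(coef(fit_fac))  # intercept plus three level effects
summary(fit_num)$r.squared
summary(fit_fac)$r.squared
```

When the numeric spacing carries no meaning, the factor version can capture level-specific effects that a single slope cannot.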
what is an interaction and how do you justify it - Answer An interaction is when the
dependency of the target variable on a predictor is itself dependent on a third variable (i.e.
using both predictors together creates a result that wouldn't be expected from their separate
effects). Justify it by using boxplots and descriptive statistics to show how the mean and
median of the target differ overall versus when the interaction occurs.
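One way to check an interaction numerically, alongside the plots and summary statistics, is to compare a model with and without the interaction term. This is a sketch on simulated data, and all names (x, g, y, fit_main, fit_int) are made up for the example.

```r
set.seed(3)
n <- 400
x <- runif(n, 0, 10)
g <- factor(sample(c("a", "b"), n, replace = TRUE))

# The slope of x is 1 in group "a" and 3 in group "b"
y <- 2 + ifelse(g == "b", 3, 1) * x + rnorm(n)

fit_main <- lm(y ~ x + g)  # main effects only
fit_int  <- lm(y ~ x * g)  # adds the x:g interaction term

coef(fit_int)              # "x:gb" estimates the extra slope in group "b"
cmp <- anova(fit_main, fit_int)
cmp
```

A small p-value in the comparison suggests the effect of x really does differ by group.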
k-means clustering - Answer unsupervised learning with the goal of assigning each record to the
group (one of the k created) that it is most similar to. k is specified at the beginning and k
randomly chosen observations serve as the starting centroids of the clusters. From there all
observations are assigned to the cluster with the nearest centroid and new centroids are
calculated. The process of assigning observations to a cluster and recalculating centroids is
repeated until the assignments are stable.
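The procedure described above can be tried with base R's kmeans. This sketch uses simulated points and requests algorithm = "Lloyd" because that variant matches the assign-then-recompute loop (R's default is Hartigan-Wong). The data and names are assumptions for the example.

```r
set.seed(4)
# Two well-separated groups of 50 points each (hypothetical data)
pts <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 4, sd = 0.5), ncol = 2)
)

# centers = 2 picks two random rows as starting centroids;
# nstart = 10 repeats this from 10 random starts and keeps the best result
km <- kmeans(pts, centers = 2, nstart = 10, algorithm = "Lloyd")

table(km$cluster)  # cluster sizes
km$centers         # final centroids
```

Using several random starts is a common precaution, since a single start can settle on a poor local solution.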
Correlation Matrix (Rcode) - Answer cor(dataframe[, sapply(dataframe, is.numeric)]), which
passes all rows and only the numeric columns into the cor function
Density Charts (Rcode) - Answer # Density charts for numeric variables
library(dplyr); library(tidyr); library(ggplot2)
data.all %>%
  select_if(is.numeric) %>%
  gather() %>% # Make key-value pairs, which allows the use of facet_wrap
  ggplot(aes(value)) +
  facet_wrap(~key, scales = "free") +
  geom_density()
creating a table (rcode) - Answer table(dataframe$desiredrow, dataframe$desiredcolumn)
get descriptive stats from a df subset (rcode) - Answer data.all %>%
  filter(numeric >= 5 & numeric <= 9) %>% # "numeric" and "factor" stand in for column names
  group_by(factor) %>% # group_by_() is deprecated in current dplyr
  summarise(
    mean = mean(bikes_per_hour),
    median = median(bikes_per_hour),
    n = n()
  )
Describe and interpret a K means elbow plot - Answer In an elbow plot the proportion of
variance explained (PVE), i.e. the between-cluster variance as a share of the total variance, is
calculated and plotted for successive values of k. Increases in k generally lead to diminishing
increases in the PVE until the curve forms an "elbow" where the gain in PVE from adding a new
cluster drops off sharply. The value of k just before this drop-off is the one that is generally
considered a good choice.
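A minimal sketch of building such a plot in base R, assuming simulated data with three clumps. The PVE is taken as betweenss / totss from the kmeans result, and the data and names are hypothetical.

```r
set.seed(5)
# Three clumps of 50 points each (hypothetical data)
pts <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 4, sd = 0.5), ncol = 2),
  cbind(rnorm(50, mean = 0, sd = 0.5), rnorm(50, mean = 4, sd = 0.5))
)

# Proportion of variance explained for k = 1, ..., 8
ks <- 1:8
pve <- sapply(ks, function(k) {
  km <- kmeans(pts, centers = k, nstart = 20)
  km$betweenss / km$totss
})

plot(ks, pve, type = "b", xlab = "k",
     ylab = "Proportion of variance explained")
```

With data like this, the gains should flatten after k = 3, which is where the elbow would be read off.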
when should and shouldn't you use the cluster assignment to replace the original variables -
Answer Replace them when the intracluster dissimilarity is relatively low and the groups are well
separated from other groups. If most of your clusters are just partitions of a big blob of data
then it's probably not a good idea to replace. If they appear to be finding clumps then it may be
valuable. Finally, you may be reducing dimensions/eliminating a continuous relationship, which
can be good and bad.
Describe Bias - Answer Expected loss caused by the model not being complex enough to capture
the signal in the data
Describe Variance - Answer Expected loss caused by the model being too complex and
overfitting the data
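The two definitions above can be illustrated by fitting polynomials of increasing degree to simulated data and comparing training and test error. This is only a sketch: the sine-shaped signal, the noise level, and the degrees 1, 3 and 15 are all assumptions chosen for the example.

```r
set.seed(6)
# Hypothetical data generator: a sine-shaped signal plus noise
sim <- function(n) {
  x <- runif(n, -3, 3)
  data.frame(x = x, y = sin(x) + rnorm(n, sd = 0.3))
}
train <- sim(30)
test  <- sim(1000)

# Mean squared error of a fitted model on a dataset
mse <- function(fit, d) mean((d$y - predict(fit, newdata = d))^2)

degrees <- c(1, 3, 15)
res <- sapply(degrees, function(p) {
  fit <- lm(y ~ poly(x, p), data = train)
  c(train = mse(fit, train), test = mse(fit, test))
})
colnames(res) <- paste0("degree_", degrees)
round(res, 3)
```

The degree 1 fit is too simple to follow the curve (bias). The degree 15 fit has the lowest training error but typically does worse on the test set than its training error suggests (variance).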
Describe the bias variance trade off - Answer Total expected loss of a model is bias + variance +
unavoidable error (for squared-error loss: squared bias + variance + irreducible error), with
variance increasing and bias decreasing as model complexity/additional predictors grow. The total
therefore forms a U-shaped curve whose minimum is at the point where the model is complex enough
to accurately portray the signal in the data but not so complex that it overfits the data. That
said, as a general statement we think of bias and variance as a trade-off whereby reducing
variance you gain more bias and vice versa. When comparing different models their performance on the test
set is the best way to see