The holdout dataset is used to:
a. train the algorithms
b. validate the algorithms
c. make sure your validation dataset is representative of all of your
data
d. nothing, unless you have dirty data in your dataset - ANSWERS-c.
to make sure your validation dataset is representative of all of your
data
k-fold cross validation - ANSWERS-partition the dataset (less the
holdout) into k equal-sized subsets; each subset is called a fold
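Note: the partitioning step can be sketched in a few lines of Python. This is only an illustration; the row counts (100 rows, 20 held out, k = 5) and the helper name k_fold_indices are made up for the example, and real tools usually shuffle the rows first.

```python
def k_fold_indices(n_rows, k):
    """Split row indices 0..n_rows-1 into k folds of near-equal size."""
    base, extra = divmod(n_rows, k)
    folds, start = [], 0
    for i in range(k):
        # The first `extra` folds take one additional row when n_rows % k != 0.
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

# Hypothetical numbers: 100 rows, the last 20 set aside as the holdout.
rows = list(range(100))
holdout, remaining = rows[80:], rows[:80]
folds = k_fold_indices(len(remaining), k=5)

for i, fold in enumerate(folds):
    print(f"fold {i}: {len(fold)} rows")
```

Each fold takes one turn as the validation set while the other k - 1 folds are used for training.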
Why is cross validation better than validation? - ANSWERS-across the
folds, all of the data (less the holdout) is used both to train and to
validate the model, so the score is a more reliable estimate of accuracy
Cross validation score - ANSWERS-average of the k validation
scores, one per fold (five scores when k = 5)
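Note: a minimal sketch of how the averaged score comes about. The "model" here is a deliberately trivial baseline that predicts the training mean, and RMSE is an assumed metric; both are stand-ins chosen for the example, not what any particular tool uses.

```python
from statistics import mean

def cross_validation_score(y, k):
    """Average validation RMSE over k folds for a mean-only baseline model."""
    n = len(y)
    fold_scores = []
    for i in range(k):
        # Fold i holds every k-th row starting at i; the rest is training data.
        val_idx = range(i, n, k)
        train = [y[j] for j in range(n) if j not in val_idx]
        val = [y[j] for j in val_idx]
        prediction = mean(train)  # "fit" the baseline on the training folds
        rmse = mean((v - prediction) ** 2 for v in val) ** 0.5
        fold_scores.append(rmse)
    return mean(fold_scores), fold_scores

# Hypothetical target values.
y = [float(v) for v in range(10)]
score, per_fold = cross_validation_score(y, k=5)
print("per-fold scores:", per_fold)
print("cross validation score:", score)
```

Every row is used for validation exactly once and for training k - 1 times.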
Speed vs Accuracy answers the question: - ANSWERS-How rapidly
(how many nanoseconds) will the model evaluate new cases after
being put into production?
Where would you find the most accurate and fastest model on the
speed vs accuracy graph? - ANSWERS-close to the origin
When evaluating ML results, I should always choose the fastest
model: T or F - ANSWERS-False
Whether speed or accuracy is more important depends on what we are
using the model for: T or F - ANSWERS-True
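Note: one way to picture the trade-off is to pick the most accurate model that still fits a time budget. The leaderboard below is invented for illustration (names, times in arbitrary units, and error values are all hypothetical), and pick_model is just a helper for this sketch.

```python
# Hypothetical leaderboard: (model name, prediction time, error metric; lower is better)
models = [
    ("fast_linear", 2.0, 0.30),
    ("boosted_trees", 15.0, 0.21),
    ("big_ensemble", 120.0, 0.20),
]

def pick_model(models, max_time):
    """Return the lowest-error model whose prediction time fits the budget."""
    eligible = [m for m in models if m[1] <= max_time]
    if not eligible:
        return None
    return min(eligible, key=lambda m: m[2])

realtime_choice = pick_model(models, max_time=20.0)   # tight time budget
batch_choice = pick_model(models, max_time=1000.0)    # speed barely matters

print("real-time use:", realtime_choice[0])
print("batch use:", batch_choice[0])
```

The same leaderboard gives a different "best" model depending on the use case, which is why the fastest model is not always the right choice.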
Learning curves - ANSWERS-show how the model's predictive
ability changes with sample size
Learning curves answer the question: - ANSWERS-will more data
help my model? (make it more predictive)
If a point on the learning curves graph is lower, that means... -
ANSWERS-it's more predictive (the score plotted is an error metric, so
lower is better)
If there is a steep line in the learning curves graph, that means... -
ANSWERS-adding more data will make it more predictive
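Note: a learning curve can be sketched by training on growing samples and scoring each model on the same validation set. Everything here is an assumption made for illustration: the synthetic data (y = 3x + 1 plus noise), the one-feature least-squares model, the sample sizes, and RMSE as the error metric.

```python
import random
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b on one feature."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx

def rmse(a, b, xs, ys):
    """Root mean squared error of the line a*x + b on the given rows."""
    return mean((y - (a * x + b)) ** 2 for x, y in zip(xs, ys)) ** 0.5

random.seed(0)
# Synthetic data: 500 training rows and 100 validation rows.
xs = [random.uniform(0, 10) for _ in range(600)]
ys = [3 * x + 1 + random.gauss(0, 2) for x in xs]
train_x, train_y = xs[:500], ys[:500]
val_x, val_y = xs[500:], ys[500:]

learning_curve = []
for size in (10, 50, 100, 250, 500):
    a, b = fit_line(train_x[:size], train_y[:size])
    learning_curve.append((size, rmse(a, b, val_x, val_y)))

for size, error in learning_curve:
    print(f"{size:>4} rows -> validation RMSE {error:.3f}")
```

If the error is still dropping sharply between the last two sample sizes, more data is likely to help; if the curve has flattened, it probably will not.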
three types of feature relationships - ANSWERS-importance, impact,
effects
importance (the green bar) - ANSWERS-the overall impact of a
feature without consideration of the impact of other features
feature impact (models, understand, feature impact, enable feature
impact) - ANSWERS-the overall impact of a feature adjusted for the
impact of other features
feature effects - ANSWERS-feature impact for a specific feature
value (you can see where DataRobot struggles by looking at the
distance between predicted and actual)
if a feature is listed as 100% impactful it means that it explains all of
the variation in values of the target: yes or no - ANSWERS-no
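Note: the reason is easier to see with numbers, assuming the percentages are reported relative to the most impactful feature (so the top feature always reads 100%). The raw scores and feature names below are hypothetical.

```python
# Hypothetical raw impact scores (e.g. the drop in accuracy when a feature is shuffled).
raw_impact = {"distance": 0.08, "price": 0.04, "weekday": 0.01}

def normalize_impact(raw):
    """Scale impacts so the most impactful feature reads 100%."""
    top = max(raw.values())
    return {name: 100.0 * (value / top) for name, value in raw.items()}

relative = normalize_impact(raw_impact)
for name, pct in relative.items():
    print(f"{name}: {pct:.1f}%")
```

Under that assumption, 100% only says "this is the top feature"; the other features still carry impact of their own, so the top feature does not explain all of the variation in the target.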
your holdout score is good if... - ANSWERS-it is close or similar to
your cross validation score
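Note: "close" needs a threshold in practice. A minimal sketch, where the 5% relative tolerance and the example scores are arbitrary assumptions rather than a standard rule:

```python
def holdout_looks_consistent(cv_score, holdout_score, tolerance=0.05):
    """True when the holdout score is within a relative tolerance of the CV score."""
    return abs(holdout_score - cv_score) <= tolerance * abs(cv_score)

# Hypothetical scores on the same metric.
print(holdout_looks_consistent(0.80, 0.79))  # small gap
print(holdout_looks_consistent(0.80, 0.60))  # large gap, worth investigating
```

A large gap suggests the model does not generalize to unseen data as well as the cross validation score implied.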
what are the 4 V's of data? - ANSWERS-Velocity, veracity, volume,
variety
Velocity - ANSWERS-the analysis of streaming data as it travels
around the internet
veracity - ANSWERS-the uncertainty of data, including biases, noise,
and abnormalities; the untrustworthiness of data
volume - ANSWERS-the scale of data