What is validation? - The process of checking to see how "good" a model is
Any data has two types of patterns - name and define them. - Real effects: real relationships between
attributes and the response. Random effects: random variation that looks like a real effect.
If we use the same data to fit a model as we do to estimate how good it is, what is likely to happen? -
The model will appear to be better than it really is, because it was fit to the random effects in
that data as well as the real ones.
What are training, validation, and test sets used for? - The training set is used to fit the models,
the validation set is used to choose the best model, and the test set is used to estimate the
performance of the chosen model.
Why can't we use a chosen model's performance on the validation set to measure its quality? -
Because the model that does best on the validation set is more likely to have benefited from lucky
randomness, so its validation score is likely to be too optimistic.
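A small simulation can make the "lucky randomness" point concrete. This is only a sketch under
stated assumptions: the 20 "models" are hypothetical coin-flip guessers whose true accuracy is 50%,
and the set sizes and the seed are arbitrary. The model picked as best on the validation set owes
its lead to luck alone, so its score on fresh test data falls back toward 50%.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable (arbitrary choice)
n_models, n_val, n_test = 20, 100, 100  # hypothetical sizes

def coin_flip_accuracy(n_points):
    """Accuracy of a model that guesses at random: the fraction of lucky hits."""
    return sum(rng.random() < 0.5 for _ in range(n_points)) / n_points

# Every model has the same true quality (50%); only random effects differ.
val_acc = [coin_flip_accuracy(n_val) for _ in range(n_models)]

# Choose the model that looks best on the validation set.
best = max(range(n_models), key=lambda m: val_acc[m])

# Score the chosen model on fresh test data it has never seen.
test_acc = coin_flip_accuracy(n_test)

print("validation accuracy of chosen model:", val_acc[best])
print("test accuracy of chosen model:      ", test_acc)
```

The validation number is the maximum of 20 noisy scores, so it tends to sit above 50%, while the
test number is a single fresh draw around the true 50%.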
What's the general rule of thumb for how much data goes into training and testing sets (assuming
only one model, so no validation)? - 70-90% training, 10-30% test
What's the rule of thumb for how much data goes into training, validation, and test sets? - 50-70%
training, split the rest equally between validation and test
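The rule of thumb above can be sketched with random selection as follows. The function name
random_split, the 60/20/20 fractions (one choice inside the 50-70% range), and the seed are
assumptions made for illustration, not part of the original notes.

```python
import random

def random_split(data, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle a copy of the data, then cut it into training, validation, and test lists."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)

    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever is left goes to the test set
    return train, val, test

train, val, test = random_split(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```

Every data point lands in exactly one of the three sets, so no point is used both to fit a model
and to judge it.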
What are the 2 general approaches to splitting data? - random selection and rotation
What is the benefit of using rotation over random selection when splitting data? - Rotation spreads
the data evenly across the sets, whereas random selection could by chance give one set more early
or late data
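What is the drawback of rotation in comparison to random selection? - Rotation may introduce bias
if the rotation cycle lines up with a pattern in the data (for example, a 5-part rotation on
workday data could put every Monday in the same set)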
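Rotation and its possible bias can be sketched as follows. The function name rotation_split, the
5-step pattern (which gives a 60/20/20 split), and the weekday data are all hypothetical choices
for illustration. Because the pattern length matches the 5-day workweek, each set only ever sees
certain weekdays.

```python
def rotation_split(data, pattern=("train", "val", "train", "test", "train")):
    """Assign points to sets in turn, cycling through the pattern."""
    sets = {"train": [], "val": [], "test": []}
    for i, point in enumerate(data):
        sets[pattern[i % len(pattern)]].append(point)
    return sets["train"], sets["val"], sets["test"]

# Four weeks of workday data, in time order.
days = ["Mon", "Tue", "Wed", "Thu", "Fri"] * 4
train, val, test = rotation_split(days)

print(len(train), len(val), len(test))  # 12 4 4
print(set(val))   # {'Tue'}: every validation point is a Tuesday
print(set(test))  # {'Thu'}: every test point is a Thursday
```

The split is even across early and late data, which is the benefit of rotation, but any weekday
effect is now confounded with set membership. A pattern length that does not divide the cycle in
the data, or random selection, would avoid this particular problem.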