1. Why do we split data into training/validation/test?: To make sure the results are generalizable. We don't want to
overfit to the training sample.
2. How should you split into training/validation/test?: You can really split it however you want, but some
suggested methods are:
50-40-10 split for lots of data
70-20-10 split for smaller data sets
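The 50-40-10 split can be sketched in pure Python by shuffling row indices (the function name and seed are illustrative):

```python
import random

def split_indices(n, train=0.5, val=0.4, seed=42):
    """Shuffle row indices, then cut them into train/validation/test."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(n * train)
    n_val = int(n * val)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = split_indices(1000)
```

Shuffling before cutting matters: if the rows are sorted (by date, by outcome), a straight slice gives non-representative splits.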
3. Do you have to split the data for unsupervised techniques?: Nope! Use the whole data set for unsupervised learning
(like clustering)
4. How do you know when you don't have enough data to split it? What approach should you then take?: It is
good to have at least 10 observations per variable.
Use cross-validation (with a test set) if you don't feel that you have enough data to do a normal split
5. Should you ever report accuracy stats on the training data set to a client?: NO - report on test ideally, validation if you must
6. Steps for model creation: 1. Use training data to build model
2. Evaluate/tune models on validation data (but don't train model on validation, only use it to fine tune)
3. Once a final model is chosen, re-run this model on the training and validation TOGETHER to finalize
parameters.
4. Use this model on the test data set. Report this accuracy statistic!
5. If model is going to be deployed, use ALL data to update to final parameters.
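The five steps can be sketched with toy constant-predictor "models" (mean vs. median) on made-up data; everything here is illustrative:

```python
import statistics

def mae(pred, data):
    """Mean absolute error of a constant prediction."""
    return sum(abs(pred - x) for x in data) / len(data)

# Hypothetical data, already split
train = [1.0, 2.0, 2.0, 3.0, 10.0]
val = [2.0, 2.0, 3.0]
test = [1.0, 2.0, 4.0]

# Steps 1-2: fit candidates on train, tune (choose) on validation only
candidates = {"mean": statistics.mean, "median": statistics.median}
fits = {name: fit(train) for name, fit in candidates.items()}
best = min(fits, key=lambda name: mae(fits[name], val))

# Step 3: re-fit the chosen model on training and validation TOGETHER
final_pred = candidates[best](train + val)

# Step 4: report accuracy on the untouched test set
test_mae = mae(final_pred, test)

# Step 5: for deployment, re-fit on ALL the data
deployed_pred = candidates[best](train + val + test)
```

The point of the sketch is the data discipline, not the models: the test set is touched exactly once, at step 4.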
7. What is k-fold cross-validation?: Divide the data into k equally sized samples. Then, for each fold, train the model on
all data except the one fold that is left out as the validation data. Record accuracy measures for each fold left out,
and at the end take the average/stdev of all accuracy stats. Use the summary to choose a model.
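A minimal pure-Python sketch of k-fold CV, assuming a toy "model" that just predicts the training mean and is scored by mean absolute error (all names illustrative):

```python
import statistics

def kfold_scores(data, k, fit, score):
    """k-fold CV: train on k-1 folds, score the held-out fold, summarize."""
    fold_size = len(data) // k
    scores = []
    for i in range(k):
        held_out = data[i * fold_size:(i + 1) * fold_size]
        train = data[:i * fold_size] + data[(i + 1) * fold_size:]
        scores.append(score(fit(train), held_out))
    return statistics.mean(scores), statistics.stdev(scores)

# Toy model: predict the training mean; score: mean absolute error
fit = lambda train: sum(train) / len(train)
score = lambda m, held: sum(abs(m - x) for x in held) / len(held)
avg, sd = kfold_scores([1, 2, 3, 4, 5, 6], k=3, fit=fit, score=score)
```

A real version would shuffle the rows before cutting folds; the fixed order here keeps the arithmetic checkable.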
8. Can you use cross-validation as a splitting technique in any situation?: Yes - can always be used. However, it is
most commonly used when there are not sufficient observations to break into training/validation/test
9. What is Jackknifing?: Leave-one-out cross-validation - n-fold cross-validation
where n = sample size.
Use only one observation as the validation set, and repeat for each observation in the data set.
**Can be really time consuming; use only for really small data sets
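Leave-one-out is just cross-validation with one observation per fold; a sketch with the same toy mean-predictor "model" (names illustrative):

```python
def loo_scores(data, fit, score):
    """Leave-one-out CV: n folds, each holding out a single observation."""
    return [
        score(fit(data[:i] + data[i + 1:]), [data[i]])
        for i in range(len(data))
    ]

fit = lambda train: sum(train) / len(train)   # toy model: training mean
score = lambda m, held: abs(m - held[0])      # absolute error on the one held-out point
errors = loo_scores([2.0, 4.0, 6.0], fit, score)
```

Note the cost: the model is re-fit n times, which is why this is reserved for small data sets.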
10. What is bootstrapping? Why is it used?: A non-parametric procedure that can estimate the standard error,
compute confidence intervals, or perform hypothesis tests on a statistic.
The data is assumed to be the population, and sampling WITH replacement is used
to create samples of the same size. A distribution can be made with many of these samples.
It is used when the data set does not meet assumptions (like a normal distribution), so other (parametric) techniques cannot be used.
11.What are the assumptions of bootstrapping?: Independent observations and the samples are representative of the
population.
12. What are two applications of bootstrapping that were examples in class?: 1. Finding variability in the median (or any statistic)
2. Test if two medians (or other stats) are significantly different
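Application 1 can be sketched as a percentile bootstrap for the median; the data, seed, and resample count are made up:

```python
import random
import statistics

def bootstrap_medians(data, n_boot=2000, seed=0):
    """Resample WITH replacement (same size as data); record each resample's median."""
    rng = random.Random(seed)
    return [
        statistics.median(rng.choices(data, k=len(data)))
        for _ in range(n_boot)
    ]

data = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
meds = sorted(bootstrap_medians(data))
lo, hi = meds[49], meds[1949]   # rough 95% percentile interval (2.5% / 97.5% cut points)
```

The spread of `meds` estimates the variability of the median; the percentile cut points give a confidence interval without any normality assumption.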
13. If you do a lot of hypothesis testing, what happens to your Type I error rates?: They become inflated, so
you need to adjust the p-values
14. If you do a lot of hypothesis testing, what happens? What is one technique you can use to control the family-wise error rate? How does it work?: Lots of hypothesis testing -> inflated Type I error (rejecting the null when it is
actually true).
You can use the Bonferroni technique - it controls the family-wise error rate. You multiply each p-value by the number of
tests you ran (inflates p-vals so the null is less likely to be rejected)
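The Bonferroni adjustment described above fits in a few lines (function name is illustrative; adjusted p-values are capped at 1):

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

adjusted = bonferroni([0.01, 0.04, 0.30])
```

After adjustment you compare to the usual alpha (e.g., 0.05), so fewer tests cross the threshold.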
15. What is transactional data?: Each row represents a transaction, so the data is very long (one person can have a lot
of rows of different purchases/visits/etc.). Want to roll up the data so it has one row per unit being modeled (e.g., one row per customer).
Transform long -> wide
16. What is feature creation in the context of transactional data?: Create features (columns) for the data that you
can pull from the transactions that matter. You need to THINK ABOUT THIS; it will be different in each context.
EX of things that might be important:
- date of first/last transaction
- total amount of transactions
- max/min/avg cost of purchases
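The long -> wide roll-up with a few of the features above can be sketched in pure Python (the transactions and column names are made up):

```python
from collections import defaultdict

# Hypothetical transactional data: (customer_id, date, amount), one row per purchase
transactions = [
    ("A", "2024-01-05", 20.0),
    ("A", "2024-03-12", 35.0),
    ("B", "2024-02-01", 10.0),
]

# Group the long data by customer
by_customer = defaultdict(list)
for cust, date, amount in transactions:
    by_customer[cust].append((date, amount))

# Roll up: one row (dict) per customer, with created features as columns
features = {
    cust: {
        "first_date": min(d for d, _ in txns),   # ISO dates compare correctly as strings
        "last_date": max(d for d, _ in txns),
        "n_transactions": len(txns),
        "total_amount": sum(a for _, a in txns),
        "max_amount": max(a for _, a in txns),
    }
    for cust, txns in by_customer.items()
}
```

Which aggregates to compute is the thinking step: the right features depend on what is being modeled.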
17. What are some different approaches to handling missing values?: - Create a flag variable that indicates whether
a value is missing
- For continuous variables, if you want to keep the variable (over 50% of values are present), you can impute values and add a flag
variable, or bin the variable and add a missing bin.
- For categorical variables, create a missing bin
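The impute-and-flag approach for a continuous variable can be sketched as follows (`None` stands in for a missing value; names are illustrative):

```python
def impute_with_flag(values):
    """Mean-impute missing (None) values and add a 0/1 missing flag."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    imputed = [mean if v is None else v for v in values]
    flags = [1 if v is None else 0 for v in values]
    return imputed, flags

imputed, flags = impute_with_flag([4.0, None, 8.0])
```

The flag column preserves the information that the value was missing, which the imputed value alone throws away.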
18. What are 3 ways to bin numeric variables?: Equal width - each bin has the same range of the variable value but
different numbers of observations within each bin (EX: 0-10, 11-20, 21-30...)
Equal depth - each bin has the same number of observations (take percentiles of the population and bin them)
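Both binning schemes can be sketched in pure Python (function names are illustrative; the equal-width sketch assumes the variable is not constant):

```python
def equal_width_bins(values, k):
    """Equal width: each bin covers the same range of the variable."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # Clamp to k-1 so the maximum value lands in the last bin
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_depth_bins(values, k):
    """Equal depth: each bin holds (roughly) the same number of observations."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = min(int(rank * k / len(values)), k - 1)
    return bins

vals = [1, 2, 3, 10, 20, 30]
```

On skewed data like `vals`, equal width piles most observations into the first bin, while equal depth spreads them evenly - the trade-off between the two schemes.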