Factor Based Models
- Classification, clustering, regression. These models implicitly assume that the final model contains many factors.
Why limit number of factors in a model? 2 reasons
- Overfitting: when the number of factors is close to or larger than the number of data points, the model may fit too closely to random effects
- Simplicity: models with fewer factors are usually easier to interpret and explain
Classical variable selection approaches
- 1. Forward selection
2. Backward elimination
3. Stepwise regression
All three are greedy algorithms.
Backward elimination
- variable selection; classical
Opposite of forward selection. Start with a model containing all factors; at each step find the worst factor and remove it from the model if it is bad enough. Continue until no factor is bad enough to remove and the threshold on the number of factors is satisfied. At the end, remove any remaining factors that are not good enough.
Forward selection
- variable selection; classical
Start with a model containing no factors; at each step find the best new factor and add it if it is good enough. Continue until no remaining factor is good enough to add or the threshold on the number of factors is reached. At the end, remove any factors that turned out not to be good enough.
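The forward selection loop described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes ordinary least squares with an intercept, uses AIC as the "good enough" test (any of the criteria in these notes could be swapped in), and the helper names `ols_sse`, `aic`, and `forward_selection` as well as the toy data are made up for the example.

```python
import math

def ols_sse(rows, y, cols):
    """Sum of squared errors of a least-squares fit: intercept plus the given columns."""
    X = [[1.0] + [r[c] for c in cols] for r in rows]
    p = len(X[0])
    # Normal equations (X^T X) b = X^T y, stored as an augmented matrix.
    A = [[sum(x[i] * x[j] for x in X) for j in range(p)]
         + [sum(x[i] * yi for x, yi in zip(X, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting (assumes no collinear columns).
    for k in range(p):
        piv = max(range(k, p), key=lambda r: abs(A[r][k]))
        A[k], A[piv] = A[piv], A[k]
        for r in range(p):
            if r != k:
                f = A[r][k] / A[k][k]
                A[r] = [a - f * b for a, b in zip(A[r], A[k])]
    beta = [A[i][p] / A[i][i] for i in range(p)]
    return sum((yi - sum(b * v for b, v in zip(beta, x))) ** 2 for x, yi in zip(X, y))

def aic(sse, n, k):
    """AIC of a Gaussian linear model with k coefficients, up to an additive constant."""
    return n * math.log(sse / n) + 2 * k

def forward_selection(rows, y, max_factors=None):
    n, m = len(rows), len(rows[0])
    limit = m if max_factors is None else max_factors
    chosen = []                                  # start with no factors
    best = aic(ols_sse(rows, y, chosen), n, 1)   # intercept-only model
    while len(chosen) < limit:                   # factor-count threshold
        scored = [(aic(ols_sse(rows, y, chosen + [c]), n, len(chosen) + 2), c)
                  for c in range(m) if c not in chosen]
        score, c = min(scored)                   # best new factor at this step
        if score >= best:                        # nothing good enough to add
            break
        chosen.append(c)
        best = score
    return chosen

# Toy data (hypothetical): y depends on columns 0 and 2, not on column 1.
rows = [(a, b, c) for a in (-1, 1) for b in (-1, 1) for c in (-1, 1)]
y = [3 * a - 2 * c + 0.1 * a * b * c for a, b, c in rows]
selected = forward_selection(rows, y)
print(selected)
```

The loop is greedy: once a factor is added it is never reconsidered, which is the gap that stepwise regression tries to close.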
Stepwise regression
- variable selection; classical
Combination of forward selection and backward elimination. Start with all or no factors. At each step, add or remove a factor. As it continues, after adding a new factor we immediately eliminate any factors that no longer appear good enough. This helps the model adjust when new factors are added and goodness values change.
Ways of determining if factors are good enough in variable selection
- p-value, R-squared, AIC, BIC
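For reference, the usual textbook definitions of three of these criteria are given below. The symbols are assumptions of this sketch, not notation from the notes: k is the number of fitted parameters, n the number of data points, and L* the maximized likelihood. Lower AIC or BIC is better; higher adjusted R-squared is better.

```latex
\mathrm{AIC} = 2k - 2\ln L^{*}
\qquad
\mathrm{BIC} = k\ln n - 2\ln L^{*}
\qquad
R^{2}_{\mathrm{adj}} = 1 - (1 - R^{2})\,\frac{n-1}{n-k-1}
```

In the adjusted R-squared formula, k counts the factors only (the intercept is excluded). BIC penalizes extra factors more heavily than AIC whenever ln n > 2, so it tends to select smaller models.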
Greedy algorithm
- At each step, it does the one thing that looks best without taking future options into consideration. Good for initial analysis. Examples:
1. Forward selection
2. Backward elimination
3. Stepwise regression
Global variable selection approaches
- 1. LASSO
2. Elastic Net
Slower than the classical greedy approaches, but tend to give better predictive models
LASSO
- variable selection; global
- SCALE the data (as with any constraint on the sum of coefficients)
- Add a constraint to the standard regression problem: still minimize the sum of squared errors, but require the sum of the coefficients' absolute values to be at most T
- T = limit or "budget" on how large the sum of the coefficients' absolute values can get. The budget will be spent on the most important coefficients; the others tend to be set to zero
- Method for limiting the number of variables in a model by limiting the sum of all coefficients' absolute values. Can be very helpful when the number of data points is less than the number of factors.
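To see the variable-dropping effect concretely, here is a minimal coordinate-descent sketch. It is an illustration under stated assumptions: it solves the penalized form (squared error divided by 2n, plus `alpha` times the sum of absolute coefficients) rather than the budget form with T, where a larger `alpha` plays the role of a smaller T. The function names and toy data are hypothetical.

```python
def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; values within [-alpha, alpha] become exactly 0."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0

def standardize(col):
    """Scale a column to mean 0 and standard deviation 1 (assumes it is not constant)."""
    n = len(col)
    mean = sum(col) / n
    sd = (sum((v - mean) ** 2 for v in col) / n) ** 0.5
    return [(v - mean) / sd for v in col]

def lasso(columns, y, alpha, sweeps=200):
    """Coefficients on the standardized columns, fitted by cyclic coordinate descent."""
    n = len(y)
    cols = [standardize(c) for c in columns]   # scaling matters for the constraint
    y_mean = sum(y) / n
    resid = [v - y_mean for v in y]            # centering y replaces the intercept
    beta = [0.0] * len(cols)
    for _ in range(sweeps):
        for j, x in enumerate(cols):
            # Correlation of column j with the residual that ignores column j itself.
            rho = sum(xi * (ri + beta[j] * xi) for xi, ri in zip(x, resid)) / n
            new = soft_threshold(rho, alpha)
            resid = [ri - (new - beta[j]) * xi for xi, ri in zip(x, resid)]
            beta[j] = new
    return beta

# Toy data (hypothetical): y depends on columns 0 and 2, not on column 1.
rows = [(a, b, c) for a in (-1, 1) for b in (-1, 1) for c in (-1, 1)]
y = [3 * a - 2 * c + 0.1 * a * b * c for a, b, c in rows]
columns = list(zip(*rows))
loose = lasso(columns, y, alpha=0.5)   # generous budget: two factors survive
tight = lasso(columns, y, alpha=2.5)   # small budget: only the strongest survives
print(loose, tight)
```

With the small penalty the irrelevant column gets a coefficient of exactly zero while the other two are shrunk; with the large penalty only the strongest factor keeps a nonzero coefficient.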
Elastic Net
- variable selection; global
- SCALE the data (as with any constraint on the sum of coefficients)
- T = limit or "budget" on a combination of the coefficients' absolute values and their squares. The budget will be spent on the most important coefficients
- Combination of lasso and ridge regression.
- Variable selection benefits of LASSO
- Predictive benefits of ridge regression
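One common way to write the three constrained problems side by side is shown below. The notation is an assumption of this sketch: a_0 is the intercept, a_j the coefficient of factor j, x_ij the scaled value of factor j for data point i, and lambda a mixing weight between 0 and 1.

```latex
\min_{a_0,\dots,a_m} \; \sum_{i=1}^{n} \Bigl( y_i - \bigl( a_0 + \sum_{j=1}^{m} a_j x_{ij} \bigr) \Bigr)^{2}
\quad \text{subject to} \quad
\begin{cases}
\sum_{j=1}^{m} |a_j| \le T & \text{(LASSO)} \\[4pt]
\sum_{j=1}^{m} a_j^{2} \le T & \text{(ridge regression)} \\[4pt]
\lambda \sum_{j=1}^{m} |a_j| + (1-\lambda) \sum_{j=1}^{m} a_j^{2} \le T & \text{(elastic net)}
\end{cases}
```

Setting lambda = 1 recovers LASSO and lambda = 0 recovers ridge regression, which is why elastic net is described as a combination of the two.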
Ridge Regression
- Method of regularization by limiting the sum of the squares of the coefficients. Will reduce
the magnitude of coefficients, not the number of variables chosen.
- The quadratic term in ridge regression tends to shrink the coefficient values, i.e., whatever the basic regression model coefficients would be, the quadratic constraint pushes them toward zero, or regularizes them.
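The shrinkage is easiest to see with a single factor, where the ridge solution has a closed form. The sketch below assumes the penalized form (sum of squared errors plus `lam` times the squared coefficient), which corresponds to the constrained form for a suitable budget; the function name and data are made up for the example.

```python
def ridge_slope(x, y, lam):
    """Ridge slope for one centered predictor: Sxy / (Sxx + lam)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / (sxx + lam)

# Toy data (hypothetical): the ordinary least-squares slope is about 2.
x = [-2, -1, 0, 1, 2]
y = [-4.1, -1.9, 0.0, 2.1, 3.9]
slopes = [ridge_slope(x, y, lam) for lam in (0, 10, 90)]
print(slopes)
```

As `lam` grows the slope moves toward zero but never reaches it exactly, which matches the note that ridge regression reduces the magnitude of coefficients rather than the number of variables.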
Design of Experiments (DOE)
- How can we still have a representative sample of each combination of factors, while only
surveying 600 people?
How can we determine which of several factors are most important for predicting someone's answers?
Comparison: measure the difference between alternatives
Control: account for other factors and effects
Blocking: a blocking factor is something other than the factor of interest that could create variation, so it is accounted for in the design (red sports car vs red minivan example)
A/B testing
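For the survey question above, one simple starting point is a full factorial design: list every combination of factor levels and split the respondents evenly across them. The factors and levels below are hypothetical, chosen only to make the arithmetic visible; when the number of combinations is too large for the budget, a fractional design that covers a balanced subset is the usual alternative.

```python
from itertools import product

# Hypothetical factors and levels for a 600-person survey.
factors = {
    "color": ["red", "blue"],
    "body": ["sports car", "minivan"],
    "price": ["low", "medium", "high"],
}

# Every combination of levels (full factorial design).
combos = list(product(*factors.values()))

respondents = 600
per_combo, leftover = divmod(respondents, len(combos))

for combo in combos:
    print(dict(zip(factors, combo)), per_combo)
```

With 2 x 2 x 3 = 12 combinations, each one gets 50 of the 600 respondents and none are left over.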