Intro to Analytics Modeling | Questions & Answers | 100% Correct | Grade A | Georgia Tech
1. Factor-based models: classification, clustering, regression. So far we have
implicitly assumed that we do not have too many factors in the final model
2. Why limit number of factors in a model? 2 reasons:
- Overfitting: when # of factors is close to or larger than # of data points,
the model may fit too closely to random effects
- Simplicity: simple models are usually better
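
To make the overfitting risk concrete, here is a minimal sketch, assuming NumPy is available. The data are random noise invented for illustration, so there is nothing real to explain. Even so, the fit looks perfect once the number of estimated coefficients reaches the number of data points.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
n_points = 10
y = rng.normal(size=n_points)    # pure noise: no factor truly explains it

def r_squared(n_factors):
    """R-squared of a least-squares fit of y on n_factors random factors."""
    X = rng.normal(size=(n_points, n_factors))
    A = np.column_stack([np.ones(n_points), X])       # add intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    sse = float(np.sum((y - A @ beta) ** 2))          # sum of squared errors
    sst = float(np.sum((y - y.mean()) ** 2))          # total sum of squares
    return 1.0 - sse / sst

r2_few = r_squared(2)    # few factors: R-squared stays below 1
r2_many = r_squared(9)   # 9 factors + intercept = 10 coefficients, 10 points
print(round(r2_few, 3), round(r2_many, 3))
```

The second model reproduces the noise exactly, which is what "fitting too closely to random effects" means.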
3. Classical variable selection approaches: 1. Forward selection
2. Backward elimination
3. Stepwise regression
All three are greedy algorithms
4. Backward elimination: variable selection; classical
Opposite of forward selection. Start with a model with all factors; at each
step find the worst factor and remove it from the model. Continue until none is
bad enough to remove or the # of factors threshold is satisfied. Remove factors
at the end that were not good enough
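
A minimal sketch of backward elimination, assuming NumPy. It uses AIC as the "good enough" measure, which is only one of several possible choices. The helper names (`aic`, `backward_elimination`, `min_factors`) and the synthetic data are hypothetical, made up for illustration.

```python
import numpy as np

def aic(X, y, cols):
    """AIC of a least-squares fit on the given columns plus an intercept."""
    n = len(y)
    A = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    sse = float(np.sum((y - A @ beta) ** 2))
    return n * np.log(sse / n) + 2 * A.shape[1]   # lower is better

def backward_elimination(X, y, min_factors=0):
    cols = list(range(X.shape[1]))                # start with every factor
    current = aic(X, y, cols)
    while len(cols) > min_factors:
        # Score the model that results from dropping each factor in turn.
        trials = [(aic(X, y, [c for c in cols if c != d]), d) for d in cols]
        best_score, worst_factor = min(trials)
        if best_score >= current:                 # no removal helps: stop
            break
        cols.remove(worst_factor)
        current = best_score
    return cols

# Synthetic data (an assumption): only factors 0 and 2 drive the response.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=200)
kept = backward_elimination(X, y)
print(kept)
```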
5. Forward selection: variable selection; classical
Start with a model with no factors; at each step find the best new factor to
add. Continue until none is good enough to add or the # of factors threshold is
satisfied. Remove factors at the end that were not good enough
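
A minimal sketch of forward selection, assuming NumPy. It uses AIC as the "good enough" measure, which is only one of several possible choices. The helper names (`aic`, `forward_selection`, `max_factors`) and the synthetic data are hypothetical, made up for illustration.

```python
import numpy as np

def aic(X, y, cols):
    """AIC of a least-squares fit on the given columns plus an intercept."""
    n = len(y)
    A = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    sse = float(np.sum((y - A @ beta) ** 2))
    return n * np.log(sse / n) + 2 * A.shape[1]   # lower is better

def forward_selection(X, y, max_factors=None):
    remaining = list(range(X.shape[1]))
    chosen = []                                   # start with no factors
    current = aic(X, y, chosen)
    limit = len(remaining) if max_factors is None else max_factors
    while remaining and len(chosen) < limit:
        # Score the model that results from adding each unused factor.
        trials = [(aic(X, y, chosen + [c]), c) for c in remaining]
        best_score, best_factor = min(trials)
        if best_score >= current:                 # nothing good enough to add
            break
        chosen.append(best_factor)
        remaining.remove(best_factor)
        current = best_score
    return chosen

# Synthetic data (an assumption): only factors 0 and 2 drive the response.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=200)
chosen = forward_selection(X, y)
top_two = forward_selection(X, y, max_factors=2)
print(chosen, top_two)
```

The `max_factors` argument plays the role of the factor-count threshold: the loop stops either when no addition improves the score or when the threshold is reached.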
6. Stepwise regression: variable selection; classical
Combination of forward selection and backward elimination. Start with all or
no factors. At each step remove or add a factor. As it continues, after adding
a new factor we eliminate right away any factors that no longer appear good
enough. This helps the model adjust: when new factors are added, the goodness
values of the existing factors change
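
A minimal sketch of stepwise regression starting from no factors, assuming NumPy and AIC as the goodness measure (one choice among several). After every addition it immediately re-checks the factors already in the model. The helper names (`aic`, `stepwise`) and the synthetic data are hypothetical, made up for illustration.

```python
import numpy as np

def aic(X, y, cols):
    """AIC of a least-squares fit on the given columns plus an intercept."""
    n = len(y)
    A = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    sse = float(np.sum((y - A @ beta) ** 2))
    return n * np.log(sse / n) + 2 * A.shape[1]   # lower is better

def stepwise(X, y):
    chosen, remaining = [], list(range(X.shape[1]))
    current = aic(X, y, chosen)
    while remaining:
        # Forward step: add the best unused factor if it improves AIC.
        score, add = min((aic(X, y, chosen + [c]), c) for c in remaining)
        if score >= current:
            break
        chosen.append(add)
        remaining.remove(add)
        current = score
        # Backward step: drop any factor that no longer earns its place.
        improved = True
        while improved and len(chosen) > 1:
            score, drop = min(
                (aic(X, y, [c for c in chosen if c != d]), d) for d in chosen
            )
            improved = score < current
            if improved:
                chosen.remove(drop)
                remaining.append(drop)
                current = score
    return chosen

# Synthetic data (an assumption): only factors 0 and 2 drive the response.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=200)
selected = stepwise(X, y)
print(selected)
```

Every add or drop strictly lowers AIC, so the loop cannot cycle and always terminates.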
7. Ways of determining if factors are good enough in variable selection:
p-value, R-squared, AIC, BIC
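
A minimal sketch of three of these measures for a least-squares model, using the common forms AIC = n ln(SSE/n) + 2k and BIC = n ln(SSE/n) + k ln(n), which drop additive constants. The function name `fit_metrics` and the numbers fed to it are hypothetical, chosen only to show the trade-off.

```python
import math

def fit_metrics(sse, sst, n, k):
    """Goodness measures for a least-squares model.

    sse: sum of squared errors, sst: total sum of squares,
    n: number of data points, k: number of estimated coefficients.
    """
    r2 = 1.0 - sse / sst                             # higher is better
    aic = n * math.log(sse / n) + 2 * k              # lower is better
    bic = n * math.log(sse / n) + k * math.log(n)    # lower is better
    return {"r2": r2, "aic": aic, "bic": bic}

# Made-up numbers: the larger model fits slightly better but uses 5 more
# coefficients.
small = fit_metrics(sse=52.0, sst=400.0, n=100, k=3)
large = fit_metrics(sse=50.0, sst=400.0, n=100, k=8)
print(small)
print(large)
```

R-squared never gets worse when factors are added, so it favors the larger model here, while AIC and BIC charge for each extra coefficient and favor the smaller one.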
8. Greedy algorithm: At each step, it does the one thing that looks best
without taking future options into consideration. Good for initial analysis
1. Forward selection
2. Backwards elimination
3. Stepwise regression
9. Global variable selection approaches: 1. LASSO
2. Elastic Net
Slower, but tend to give better predictive models
10. LASSO: variable selection; global
- SCALE the data first (as with any constraint on the sum of coefficients)
- add a constraint to the standard regression equation
- minimize sum of squared errors, subject to sum of |coefficients| <= T
- T = limit or "budget" on how large the sum of the coefficients' absolute
values can get. The budget will be used on the most important coefficients
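
A minimal sketch of LASSO, assuming NumPy. Instead of the budget form it solves the equivalent penalized form, minimizing (1/(2n)) * SSE + lam * (sum of absolute coefficients) by coordinate descent. A larger `lam` corresponds to a smaller budget T. The names `lasso`, `soft_threshold`, `lam`, and `n_sweeps`, and the synthetic data, are hypothetical.

```python
import numpy as np

def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within lam of zero become 0."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def lasso(X, y, lam, n_sweeps=200):
    """Coordinate descent for (1/(2n)) * SSE + lam * sum(|coefficients|)."""
    # Scale the data: each factor gets mean 0 and standard deviation 1.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = y - y.mean()                     # centering y absorbs the intercept
    n, p = Xs.shape
    coefs = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            # Residual with factor j's current contribution added back in.
            partial = yc - Xs @ coefs + Xs[:, j] * coefs[j]
            rho = Xs[:, j] @ partial / n
            # Scaled columns have unit variance, so no further division.
            coefs[j] = soft_threshold(rho, lam)
    return coefs

# Synthetic data (an assumption): only factors 0 and 2 drive the response.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=200)
coefs = lasso(X, y, lam=0.5)
print(np.round(coefs, 3))
```

The coefficients of the unimportant factors are driven exactly to zero, which is how the budget ends up spent on the most important coefficients.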