Q8.1
Describe a situation or problem from your job, everyday life, current events, etc., for which a linear
regression model would be appropriate. List some (up to 5) predictors that you might use.
Solution 8.1
I would use a linear regression model to forecast how much money to budget for gas on a long car
trip. The predictor would be my car's recorded weekly mileage, and the response would be the money
I spend on gas each week. The model would take the following form:
‘Money spent on gas’ = a_0 + a_1 * ‘mileage’, where a_1 is the coefficient and a_0 is the intercept.
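A minimal sketch of how this model could be fitted in R. The weekly figures are made-up illustrative values, not real records, and the names trips, mileage and gas_cost are hypothetical.

```r
# Hypothetical weekly records (illustrative values only, not real data)
trips <- data.frame(
  mileage  = c(120, 200, 150, 310, 90, 260),  # miles driven per week
  gas_cost = c(18, 29, 22, 46, 14, 38)        # dollars spent on gas per week
)

# Fit gas_cost = a_0 + a_1 * mileage
fit <- lm(gas_cost ~ mileage, data = trips)
coef(fit)  # a_0 is "(Intercept)", a_1 is "mileage"

# Estimate the gas budget for a planned 500-mile trip.
# Note: 500 lies outside the observed mileage range, so this is an extrapolation.
budget <- predict(fit, newdata = data.frame(mileage = 500))
budget
```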
Q8.2
Using crime data from http://www.statsci.org/data/general/uscrime.txt (file uscrime.txt, description at
http://www.statsci.org/data/general/uscrime.html ), use regression (a useful R function is lm or glm) to
predict the observed crime rate in a city with the following data: M = 14.0 So = 0 Ed = 10.0 Po1 = 12.0
Po2 = 15.5 LF = 0.640 M.F = 94.0 Pop = 150 NW = 1.1 U1 = 0.120 U2 = 3.6 Wealth = 3200
Solution 8.2
The uscrime.txt file has 15 predictors (a.k.a. factors) and 47 data points. lm() is used to generate
linear regression models with different combinations of predictors.
Below is a summary of the lm models:
Model | R^2 | Adj R^2 | formula
===== | ====== | ======= | =======
model1 | 0.803 | 0.708 | (all predictors)
model2 | 0.766 | 0.731 | M + Ed + Po1 + U2 + Ineq + Prob
model3 | 0.7 | 0.672 | M + Ed + Po1 + Ineq
model4 | 0.666 | 0.642 | Ed + Po1 + Ineq
model5 | 0.738 | 0.706 | M + Ed + Po1 + Ineq + Prob
model6 | 0.73 | 0.697 | M + Ed + Po1 + U2 + Ineq
Since each model uses a different set of predictors, we should compare model quality using the
Adj. R^2 value instead of R^2, because R^2 never decreases when more predictors are added.
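As a check on the reported values, adjusted R^2 can be recomputed from R^2 with the standard formula Adj. R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1), where n is the number of data points and p the number of predictors. The helper name adj_r2 below is hypothetical; the R^2 inputs are taken from the sample outputs of ‘model1’ and ‘model2’.

```r
# Adjusted R^2 from R^2, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 47  # data points in uscrime.txt

adj_r2(0.8031, n, 15)  # model1, all 15 predictors: about 0.708
adj_r2(0.7659, n, 6)   # model2, 6 predictors: about 0.731
```

The penalty factor (n - 1) / (n - p - 1) is 46/31 for ‘model1’ but only 46/40 for ‘model2’, which is why ‘model2’ ends up with the higher Adj. R^2 despite its lower R^2.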
‘model1’, which uses all predictors, has an Adj. R^2 value of 0.708. Not all of its predictors are
significant. The predictors of ‘model1’ that are significant or marginally significant (p < 0.1), namely
M, Ed, Po1, U2, Ineq and Prob, are used to generate the rest of the lm models.
The predictors of ‘model2’ are M, Ed, Po1, U2, Ineq and Prob. It has the best Adj. R^2 (0.731) among all
models, and its output shows that all of its predictors are significant (p < 0.05).
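The comparison in the table could be reproduced along the following lines. This is a sketch rather than the exact script used for the results above: the function names fit_models and compare_adj_r2 are hypothetical, and the input is assumed to be a data frame with the columns of uscrime.txt, including the response Crime.

```r
# Fit the six candidate models listed in the summary table
fit_models <- function(data) {
  list(
    model1 = lm(Crime ~ ., data = data),  # all predictors
    model2 = lm(Crime ~ M + Ed + Po1 + U2 + Ineq + Prob, data = data),
    model3 = lm(Crime ~ M + Ed + Po1 + Ineq, data = data),
    model4 = lm(Crime ~ Ed + Po1 + Ineq, data = data),
    model5 = lm(Crime ~ M + Ed + Po1 + Ineq + Prob, data = data),
    model6 = lm(Crime ~ M + Ed + Po1 + U2 + Ineq, data = data)
  )
}

# Extract adjusted R^2 from each fitted model as a named numeric vector
compare_adj_r2 <- function(models) {
  sapply(models, function(m) summary(m)$adj.r.squared)
}
```

Assuming the file is loaded with read.table("http://www.statsci.org/data/general/uscrime.txt", header = TRUE) into data, calling compare_adj_r2(fit_models(data)) returns one Adj. R^2 value per model, and which.max() on that vector gives the best-scoring model.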
Sample output of ‘model1’
Call:
lm(formula = Crime ~ ., data = data)
Residuals:
Min 1Q Median 3Q Max
-395.74 -98.09 -6.69 112.99 512.67
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.984e+03 1.628e+03 -3.675 0.000893 ***
M 8.783e+01 4.171e+01 2.106 0.043443 *
So -3.803e+00 1.488e+02 -0.026 0.979765
Ed 1.883e+02 6.209e+01 3.033 0.004861 **
Po1 1.928e+02 1.061e+02 1.817 0.078892 .
Po2 -1.094e+02 1.175e+02 -0.931 0.358830
LF -6.638e+02 1.470e+03 -0.452 0.654654
M.F 1.741e+01 2.035e+01 0.855 0.398995
Pop -7.330e-01 1.290e+00 -0.568 0.573845
NW 4.204e+00 6.481e+00 0.649 0.521279
U1 -5.827e+03 4.210e+03 -1.384 0.176238
U2 1.678e+02 8.234e+01 2.038 0.050161 .
Wealth 9.617e-02 1.037e-01 0.928 0.360754
Ineq 7.067e+01 2.272e+01 3.111 0.003983 **
Prob -4.855e+03 2.272e+03 -2.137 0.040627 *
Time -3.479e+00 7.165e+00 -0.486 0.630708
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 209.1 on 31 degrees of freedom
Multiple R-squared: 0.8031, Adjusted R-squared: 0.7078
F-statistic: 8.429 on 15 and 31 DF, p-value: 3.539e-07
Sample output of ‘model2’
Call:
lm(formula = Crime ~ M + Ed + Po1 + U2 + Ineq + Prob, data = data)
Residuals:
Min 1Q Median 3Q Max
-470.68 -78.41 -19.68 133.12 556.23
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5040.50 899.84 -5.602 1.72e-06 ***
M 105.02 33.30 3.154 0.00305 **
Ed 196.47 44.75 4.390 8.07e-05 ***
Po1 115.02 13.75 8.363 2.56e-10 ***
U2 89.37 40.91 2.185 0.03483 *
Ineq 67.65 13.94 4.855 1.88e-05 ***
Prob -3801.84 1528.10 -2.488 0.01711 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 200.7 on 40 degrees of freedom
Multiple R-squared: 0.7659, Adjusted R-squared: 0.7307
F-statistic: 21.81 on 6 and 40 DF, p-value: 3.418e-11