Question 9.1
Using the same crime data set uscrime.txt as in Question 8.2, apply Principal Component Analysis
and then create a regression model using the first few principal components. Specify your new
model in terms of the original variables (not the principal components), and compare its quality to
that of your solution to Question 8.2. You can use the R function prcomp for PCA. (Note that to
first scale the data, you can include scale. = TRUE to scale as part of the PCA function. Don’t
forget that, to make a prediction for the new city, you’ll need to unscale the coefficients (i.e., do
the scaling calculation in reverse)!)
First, we load our data set and separate the response variable from the predictor
variables. We then run the prcomp() function and use the standard deviation of each
Principal Component (PC) to decide how many PCs to include in our regression model.
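The loading and PCA step described above can be sketched as follows. This is a minimal sketch, not the code behind the reported results: the data frame is simulated so the block runs on its own, and the predictor names A to E are placeholders. Only the response name Crime and the scale. = TRUE argument come from the write-up and the question text. The commented read.table() call assumes uscrime.txt is a delimited text file with a header row.

```r
# Simulated stand-in for the crime data (an assumption so the sketch runs alone).
# With the real file, something like the following should replace it:
#   crime <- read.table("uscrime.txt", header = TRUE)
set.seed(42)
n <- 60
crime <- data.frame(
  A = rnorm(n, mean = 14, sd = 1.2),
  B = rnorm(n, mean = 85, sd = 30),
  C = rnorm(n, mean = 500, sd = 100),
  D = rnorm(n, mean = 0.05, sd = 0.02)
)
crime$E     <- 0.9 * crime$B + rnorm(n, sd = 5)   # correlated with B on purpose
crime$Crime <- 900 + 5 * crime$B - 0.5 * crime$C + rnorm(n, sd = 100)

# Separate the response from the predictors
response   <- crime$Crime
predictors <- crime[, names(crime) != "Crime"]

# PCA on centered and scaled predictors
pca <- prcomp(predictors, scale. = TRUE)

# Standard deviation and (cumulative) proportion of variance for each PC
print(summary(pca))
```

Because the predictors are scaled, the PC variances (pca$sdev^2) sum to the number of predictors, which makes the standard deviations easy to compare across components.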
Based on the output, it is difficult to decide between 4 and 5 PCs. Plotting the
variances makes the decision easier to see. Although the variance still changes
somewhat between 4 and 6 PCs, we choose 4 PCs because the change around 3 PCs is more
pronounced and the curve begins to level out at 5 PCs and beyond.
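A variance plot of this kind can be produced roughly as follows. This is a sketch on simulated stand-in data (the predictor names are placeholders), so the shape of the curve will not match the one discussed above. It only shows how the per-component variances and a scree plot can be obtained from a prcomp() result.

```r
# Simulated stand-in for the predictors (an assumption, not the real data)
set.seed(42)
n <- 60
predictors <- data.frame(
  A = rnorm(n, mean = 14, sd = 1.2),
  B = rnorm(n, mean = 85, sd = 30),
  C = rnorm(n, mean = 500, sd = 100),
  D = rnorm(n, mean = 0.05, sd = 0.02)
)
predictors$E <- 0.9 * predictors$B + rnorm(n, sd = 5)

pca <- prcomp(predictors, scale. = TRUE)

# Variance of each PC, its share of the total, and the running total
variances <- pca$sdev^2
prop_var  <- variances / sum(variances)
cum_var   <- cumsum(prop_var)
print(round(data.frame(PC = seq_along(variances),
                       variance = variances,
                       proportion = prop_var,
                       cumulative = cum_var), 3))

# Scree plot: look for the point where the curve starts to level out
screeplot(pca, type = "lines", main = "Variance by principal component")
```

Reading the cumulative column alongside the plot gives a second view of the same choice: it shows how much of the total variance the first few PCs retain.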
Next, we combine the first 4 PCs with the response variable in a new data frame and run
the lm() function to fit our linear regression model. At this point the model is
expressed in terms of the first 4 PCs found by the prcomp() function, not the original
variables. To evaluate the quality of fit and compare it with the previous model from
Question 8.2, which was also based on the original variables, we now express the model
in terms of the original variables. As shown in the R code, we transform the regression
coefficients from our PCA model back into values corresponding to our original data set.
This study source was downloaded by 100000900412927 from CourseHero.com on 12-08-2025 11:05:15 GMT -06:00
https://www.coursehero.com/file/250489532/0-OM-HW-4pdf/
, Oscar M.
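The coefficient transformation can be sketched as below. This is an illustration under stated assumptions rather than the code used for the reported results: the data are simulated, the predictor names are placeholders, and k = 2 PCs are kept here (the write-up keeps 4 on the real data). The idea is that the PC scores are the scaled predictors multiplied by the rotation matrix, so multiplying the rotation matrix by the PC coefficients gives coefficients on the scaled predictors. Dividing by each predictor's scale and adjusting the intercept by the centers then undoes the scaling.

```r
# Simulated stand-in for the crime data (an assumption, not the real data)
set.seed(42)
n <- 60
crime <- data.frame(
  A = rnorm(n, mean = 14, sd = 1.2),
  B = rnorm(n, mean = 85, sd = 30),
  C = rnorm(n, mean = 500, sd = 100),
  D = rnorm(n, mean = 0.05, sd = 0.02)
)
crime$E     <- 0.9 * crime$B + rnorm(n, sd = 5)
crime$Crime <- 900 + 5 * crime$B - 0.5 * crime$C + rnorm(n, sd = 100)
predictors  <- crime[, names(crime) != "Crime"]

pca <- prcomp(predictors, scale. = TRUE)

# Regress the response on the first k PC scores
k        <- 2
pc_data  <- data.frame(pca$x[, 1:k, drop = FALSE], Crime = crime$Crime)
pc_model <- lm(Crime ~ ., data = pc_data)

# Step 1: coefficients on the scaled predictors = rotation %*% PC coefficients
beta_pc     <- coef(pc_model)[-1]
beta_scaled <- pca$rotation[, 1:k, drop = FALSE] %*% beta_pc

# Step 2: undo the scaling, since scaled x_j = (x_j - center_j) / scale_j
beta_orig      <- as.vector(beta_scaled) / pca$scale
intercept_orig <- coef(pc_model)[1] - sum(beta_orig * pca$center)

# Check: predictions from the original variables match the PC model's fit
pred_orig <- intercept_orig + as.matrix(predictors) %*% beta_orig
print(all.equal(as.vector(pred_orig), as.vector(fitted(pc_model))))
```

The final check is useful in practice. If the predictions computed from the unscaled coefficients do not match the fitted values of the PC model, the back-transformation contains an error.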
From this, we obtain an R-squared value of 0.309. Compared with the R-squared value of
0.803 found in Question 8.2, our PCA model appears less effective at predicting the
response variable, accounting for only about 31% of its variability. Because an
R-squared computed on the training data can be optimistic, we validate the model with
the same 10-fold cross-validation that we used on our previous model. This gives us a
cross-validated R-squared of 0.047. As in Question 8.2, the cross-validated R-squared
is much lower than the training value, indicating that overfitting is occurring.
Overfitting aside, our previous linear regression model on the original variables
performs better than our PCA model at predicting the response variable.
Model                                           R-squared   Cross-Validated R-squared
Linear Regression using PCA (4 PCs)             0.309       0.047
Linear Regression on original variables (8.2)   0.803       0.625
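A comparison of this kind can be computed along the following lines. This is a rough sketch only: cv_r2 is a hypothetical helper written for this illustration, the data are simulated, and 2 PCs are kept instead of 4, so the numbers it prints are unrelated to the values in the table. It also assumes the PCA is fit once on all rows before cross-validating the regression. Refitting the PCA inside each fold would be a stricter check.

```r
# Simulated stand-in for the crime data (an assumption, not the real data)
set.seed(42)
n <- 60
crime <- data.frame(
  A = rnorm(n, mean = 14, sd = 1.2),
  B = rnorm(n, mean = 85, sd = 30),
  C = rnorm(n, mean = 500, sd = 100),
  D = rnorm(n, mean = 0.05, sd = 0.02)
)
crime$E     <- 0.9 * crime$B + rnorm(n, sd = 5)
crime$Crime <- 900 + 5 * crime$B - 0.5 * crime$C + rnorm(n, sd = 100)
predictors  <- crime[, names(crime) != "Crime"]

pca     <- prcomp(predictors, scale. = TRUE)
pc_data <- data.frame(pca$x[, 1:2, drop = FALSE], Crime = crime$Crime)

# Hypothetical helper: k-fold cross-validated R-squared for lm(response ~ .)
cv_r2 <- function(data, response = "Crime", folds = 10) {
  y       <- data[[response]]
  fold_id <- sample(rep(seq_len(folds), length.out = nrow(data)))
  sse     <- 0
  for (f in seq_len(folds)) {
    held_out <- fold_id == f
    fit  <- lm(as.formula(paste(response, "~ .")), data = data[!held_out, ])
    pred <- predict(fit, newdata = data[held_out, ])
    sse  <- sse + sum((y[held_out] - pred)^2)
  }
  1 - sse / sum((y - mean(y))^2)   # 1 - SSE / SST on held-out predictions
}

# Training R-squared for lm(Crime ~ .) on a given data frame
r2_train <- function(data) summary(lm(Crime ~ ., data = data))$r.squared

results <- data.frame(
  model = c("PC regression (2 PCs)", "Regression on original variables"),
  r2    = c(r2_train(pc_data), r2_train(crime)),
  cv_r2 = c(cv_r2(pc_data), cv_r2(crime))
)
print(results)
```

On the training data, the regression on all original variables can never have a lower R-squared than a regression on a subset of the PCs, because the PC model is a restricted version of it. The cross-validated column is therefore the fairer basis for comparing the two models.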
--------------------------------------------------------------------------------------------------------------------
Question 10.1
Using the same crime data set uscrime.txt as in Questions 8.2 and 9.1, find the best model you
can using
(a) a regression tree model
In R, you can use the tree package or the rpart package, and the randomForest package. For each
model, describe one or two qualitative takeaways you get from analyzing the results (i.e., don’t
just stop when you have a good model, but interpret it too).
We begin by loading our data and fitting a regression tree model to predict the response
variable, Crime. By looking at our raw output and a plotted visual, we can see that our
model only uses four of the predictor variables provided. We also see that our model
comprises a total of seven nodes.
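Fitting and inspecting a regression tree can be sketched as follows. The question permits either the tree or the rpart package. This sketch uses rpart as an assumption, since the write-up does not say which package produced its tree, and the two can grow different trees on the same data. The data are simulated and the predictor names are placeholders, so the variables and node counts printed here will not match the ones reported above.

```r
library(rpart)

# Simulated stand-in for the crime data (an assumption, not the real data)
set.seed(42)
n <- 60
crime <- data.frame(
  A = rnorm(n, mean = 14, sd = 1.2),
  B = rnorm(n, mean = 85, sd = 30),
  C = rnorm(n, mean = 500, sd = 100),
  D = rnorm(n, mean = 0.05, sd = 0.02)
)
crime$E     <- 0.9 * crime$B + rnorm(n, sd = 5)
crime$Crime <- 900 + 5 * crime$B - 0.5 * crime$C + rnorm(n, sd = 100)

# Regression tree for a numeric response
tree_fit <- rpart(Crime ~ ., data = crime, method = "anova")
print(tree_fit)

# Which predictors the tree actually splits on, and how large it is
node_vars <- as.character(tree_fit$frame$var)
vars_used <- unique(node_vars[node_vars != "<leaf>"])
n_leaves  <- sum(node_vars == "<leaf>")
n_nodes   <- nrow(tree_fit$frame)
cat("Variables used:", paste(vars_used, collapse = ", "), "\n")
cat("Leaves:", n_leaves, " Total nodes:", n_nodes, "\n")

# Plot the tree (skipped if the fit is only a root node)
if (n_nodes > 1) {
  plot(tree_fit, margin = 0.1)
  text(tree_fit)
}
```

The sketch reports leaves and total nodes separately because a count such as "seven nodes" can mean either, depending on the package output being read.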