Total possible points: 85 (with 15 bonus points)
The correct answer for a multiple-choice question is highlighted in bold and yellow.
Multiple linear regression (35 points in total)
1. Multiple-choice questions (5 points in total, each worth 1 point)
1.1 _____ refers to the use of sample data to calculate a range of values that is believed to include the
unknown value of a population parameter.
a. Interval estimation
b. Hypothesis testing
c. Statistical inference
d. Point estimation
1.2 Which of the following inferences can be drawn from the residual scatter chart given below?
a. The residuals have a varying variance.
b. The model captures the relationship between the variables accurately.
c. The regression model follows the F probability distribution.
d. The residual distribution is consistently scattered about zero.
Explanation: Note that this is a plot of the residuals against x. The
dots are spread unevenly around the horizontal zero line
across different values of x, which means that the
variance of the residuals changes with x.
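The pattern described in the explanation can be reproduced with a small simulation. The following is a minimal sketch on hypothetical data (not the data behind the exam's chart): the noise is generated with a standard deviation that grows with x, a simple least-squares line is fitted, and the residual variance is compared between small and large values of x.

```python
import random
import statistics

random.seed(0)

# Hypothetical data: noise standard deviation grows with x (non-constant variance).
xs = [float(i) for i in range(1, 201)]
ys = [2.0 + 0.5 * x + random.gauss(0.0, 0.05 * x) for x in xs]

# Fit simple least squares y = b0 + b1 * x with the closed-form formulas.
x_bar = statistics.fmean(xs)
y_bar = statistics.fmean(ys)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
b0 = y_bar - b1 * x_bar
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# Compare the residual spread for the lower and upper halves of x.
half = len(xs) // 2
var_low = statistics.pvariance(residuals[:half])
var_high = statistics.pvariance(residuals[half:])
print(f"residual variance for small x: {var_low:.3f}")
print(f"residual variance for large x: {var_high:.3f}")
```

A clearly larger variance in one half than in the other is the numerical counterpart of the uneven scatter seen in a residual plot.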
1.3 The _____ is an indication of how frequently interval estimates based on samples of the same size
taken from the same population using identical sampling techniques will contain the true value of the
parameter we are estimating.
a. residual
b. tolerance factor
c. confidence level
d. accuracy level
1.4 _____ refers to the degree of correlation among independent variables in a regression model.
a. Multicollinearity
b. Tolerance
c. Rank
d. Confidence level
1.5 Assessing the regression model on data other than the sample data that was used to generate the
model is known as _____.
a. approximation
b. cross-validation
c. graphical validation
d. postulation
Explanation: In cross-validation, we partition a dataset into the training set and the validation set to train
and validate the model respectively.
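As an illustration of the partitioning described above, here is a minimal sketch of a single training/validation split. The data, the 70/30 split ratio, and the simple one-variable model are assumptions made for the example; they do not come from the exam.

```python
import math
import random

random.seed(1)

# Hypothetical data: y depends linearly on x plus random noise.
data = [(float(x), 1.0 + 0.3 * x + random.gauss(0.0, 1.0)) for x in range(100)]

# Partition into a training set (70%) and a validation set (30%).
random.shuffle(data)
cut = len(data) * 7 // 10
train, valid = data[:cut], data[cut:]

# Fit y = b0 + b1 * x on the training set only.
n = len(train)
x_bar = sum(x for x, _ in train) / n
y_bar = sum(y for _, y in train) / n
b1 = sum((x - x_bar) * (y - y_bar) for x, y in train) / sum(
    (x - x_bar) ** 2 for x, _ in train
)
b0 = y_bar - b1 * x_bar

# Assess the fitted model on data it has not seen during fitting.
rmse = math.sqrt(sum((y - (b0 + b1 * x)) ** 2 for x, y in valid) / len(valid))
print(f"validation RMSE: {rmse:.3f}")
```

The key point is that the coefficients are estimated from `train` alone, so the error on `valid` reflects how the model performs on data other than the sample used to generate it.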
Midterm Exam Solution - BUSAD 137
2. The Butler Trucking Company wants to develop a multiple linear regression model to describe and
predict a trucker’s travel time using a set of independent variables. The company has a sample
dataset of 300 different travel assignments with four variables (Miles, Gasoline
Consumption, Deliveries, and Time). A data analyst at the company is asked to do some analyses
based on the data. (30 points in total)
As a first attempt, the data analyst developed the following model, ran the regression in Excel with a
confidence level of 95%, and got the report.
y = β0 + β1 x1 + β2 x2 + β3 x3 + ε
where y = travel time (Time), x1 = distance traveled in miles (Miles), x2 = gasoline consumed in gallons
(Gasoline Consumption), and x3 = number of deliveries (Deliveries).
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.904654545
R Square            0.818399845
Adjusted R Square   0.816559303
Standard Error      0.828974689
Observations        300

ANOVA
            df   SS            MS            F             Significance F
Regression  3    916.6922855   305.5640952   444.6515194   2.7841E-109
Residual    296  203.4109145   0.687199035
Total       299  1120.1032

                      Coefficients  Standard Error  t Stat        P-value      Lower 95%     Upper 95%
Intercept             0.237718797   0.221644508     1.072522841   0.284358889  -0.198479972  0.673917567
Miles                 0.077781546   0.008464959     9.188649826   7.21336E-18  0.061122415   0.094440676
Gasoline Consumption  -0.118495153  0.090572642     -1.308288567  0.191790445  -0.296743085  0.059752779
Deliveries            0.690925596   0.029494272     23.42575518   2.25425E-69  0.632880552   0.74897064

2.1 Is the entire model statistically significant or not according to the report? What statistical test was
used for this? From which number in which table in the report can you draw this conclusion, and why? (6
points)
Yes, the model as a whole is highly statistically significant according to the F-test. The Significance F
value (the p-value of the F statistic, 2.78E-109) in the second (ANOVA) table of the regression report is
essentially zero, far below the 5% significance level (the regression was run at a 95% confidence level).
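For reference, the F statistic can be recomputed from the sums of squares and degrees of freedom in the ANOVA table. The sketch below only uses numbers copied from the report; the variable names are chosen for illustration.

```python
# Values copied from the ANOVA table in the report.
ss_regression, df_regression = 916.6922855, 3
ss_residual, df_residual = 203.4109145, 296

ms_regression = ss_regression / df_regression  # mean square due to regression
ms_residual = ss_residual / df_residual        # mean square error
f_statistic = ms_regression / ms_residual      # F = MSR / MSE

print(f"MSR = {ms_regression:.4f}")
print(f"MSE = {ms_residual:.6f}")
print(f"F   = {f_statistic:.4f}")
```

The result should agree with the F value of about 444.65 reported by Excel, up to rounding of the inputs.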
2.2 How much variability in percentage in the sample data is captured/explained by the model?
From which number in which table can you draw this conclusion? (4 points)
R Square (the coefficient of determination) in the first table (Regression Statistics) of the report
measures the variability captured by the model. Its value of 0.818 shows that approximately 82% of the
variability in travel time in the sample is explained by the model, which indicates a reasonably good fit.
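As a cross-check, R Square and Adjusted R Square can be recomputed from the ANOVA sums of squares with the standard formulas. This is a small sketch that uses only numbers from the report; the variable names are illustrative.

```python
# Values copied from the regression report.
ss_regression = 916.6922855
ss_total = 1120.1032
n_observations = 300
n_predictors = 3  # Miles, Gasoline Consumption, Deliveries

# R Square: share of total variability explained by the regression.
r_squared = ss_regression / ss_total

# Adjusted R Square: penalizes R Square for the number of predictors.
adjusted_r_squared = 1 - (1 - r_squared) * (n_observations - 1) / (
    n_observations - n_predictors - 1
)

print(f"R Square          = {r_squared:.6f}")
print(f"Adjusted R Square = {adjusted_r_squared:.6f}")
```

Both values should match the Regression Statistics table (about 0.8184 and 0.8166) up to rounding.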