ISYE 7406 HW3
1. Introduction
Accurate prediction of fuel efficiency plays an important role in automobile design, regulatory
compliance, and consumer decision-making. Vehicles with higher gas mileage reduce fuel costs
and environmental impact, making the identification of key efficiency-related factors both
economically and practically meaningful. This project investigates statistical learning methods for
predicting whether a vehicle achieves high or low fuel efficiency based on measurable engine and
design characteristics.

The analysis is based on the Auto MPG dataset from the UCI Machine Learning Repository, which
contains 392 vehicles described by variables such as cylinders, displacement, horsepower, weight,
acceleration, model year, and origin. To formulate the problem as a classification task, the
continuous miles-per-gallon (mpg) variable is transformed into a binary response variable,
“mpg01”, indicating whether a vehicle’s fuel efficiency is above or below the median level.
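
The median split described above can be sketched as follows. The report's own analysis is carried out in R; this Python snippet is only an illustration, the mpg values are made up, and the handling of values exactly equal to the median (assigned to class 0 here) is an assumption, since the report does not state it.

```python
from statistics import median

def binarize_by_median(values):
    """Return (cutoff, labels): label 1 if value > median, else 0.

    Assumption: ties with the median go to class 0.
    """
    cutoff = median(values)
    labels = [1 if v > cutoff else 0 for v in values]
    return cutoff, labels

# Hypothetical mpg readings, for illustration only (not taken from the dataset).
mpg = [18.0, 15.0, 36.0, 24.0, 31.0, 22.0]
cutoff, mpg01 = binarize_by_median(mpg)
print(cutoff, mpg01)
```

Splitting at the median yields two classes of roughly equal size, so a classifier cannot reach a low error rate simply by always predicting the majority class.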

Several predictors in this dataset exhibit moderate to strong correlation, particularly engine-
related variables such as displacement, horsepower, and weight. Such relationships may
influence model stability and predictive performance, especially for distance-based methods.
Accordingly, multiple classification approaches are considered, including Linear Discriminant
Analysis (LDA), Quadratic Discriminant Analysis (QDA), Naive Bayes, Logistic Regression, K-
Nearest Neighbors (KNN) and Principal Component Analysis (PCA) with KNN (PCA-KNN).

To obtain reliable performance estimates and reduce dependence on any single data partition,
repeated random train–test splits (Monte Carlo cross-validation) are conducted, and paired
t-tests are applied to assess differences in predictive accuracy between methods. The remainder
of this report presents the exploratory data analysis, outlines the modeling procedures, compares
classification results across methods, and concludes with practical considerations for the design
and selection of fuel-efficient vehicles.
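
As a rough illustration of the paired comparison mentioned above, the sketch below computes a paired t statistic from the test error rates of two classifiers evaluated on the same random splits. It is written in Python for illustration (the report itself works in R), the error rates are hypothetical, and the function name is not from the report.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(errors_a, errors_b):
    """Paired t statistic and degrees of freedom for per-split error rates.

    Both lists must come from the same sequence of train-test splits,
    so that element i of each list refers to the same partition.
    """
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    # Standard error of the mean difference (sample standard deviation / sqrt(n)).
    se = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / se, n - 1

# Hypothetical error rates on five shared splits (not results from the report).
errors_a = [0.10, 0.12, 0.08, 0.11, 0.09]
errors_b = [0.12, 0.15, 0.10, 0.12, 0.11]
t_stat, df = paired_t_statistic(errors_a, errors_b)
print(t_stat, df)
```

Pairing matters because both methods see the same splits: differencing removes the split-to-split variation that the two error rates share. The statistic is undefined when all differences are identical, since the standard error is then zero.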

2. Exploratory Data Analysis
The cleaned dataset (“Auto1”) contains 392 observations and includes the binary response
variable “mpg01” along with seven numerical predictors describing engine characteristics and
vehicle design. Since the objective is to identify factors associated with high versus low fuel
efficiency, exploratory analysis focuses on understanding the relationship between “mpg01” and
each explanatory variable through scatterplots, boxplots, and correlation analysis.

2.1 Scatter Plots
The Scatter Plot Matrix (Figure 1) illustrates the relationship between vehicle specifications and
fuel efficiency (mpg). A pronounced non-linear, negative relationship exists between mpg and
engine-related metrics, specifically displacement, horsepower, and weight: as these values
increase, fuel economy decreases, but at a diminishing rate. In contrast, acceleration and model
year exhibit moderate positive trends, indicating that newer vehicles and those with slower
acceleration profiles (often associated with smaller engines) tend to achieve higher fuel efficiency.

Figure 1. Scatter Plot Matrix

2.2 Boxplots
The boxplots (Figure 2) compare the distribution of each vehicle attribute across the two levels
of the binary variable mpg01. High-efficiency vehicles (category 1) have markedly lower
medians for cylinders, displacement, horsepower, and weight than their low-efficiency
counterparts. The year distribution shows that the high-efficiency group is shifted toward later
production dates, while the origin plot suggests a higher concentration of efficient vehicles within
specific manufacturing regions.
Figure 2. Boxplots

2.3 Correlation Analysis
The correlation matrix (Table 1) quantifies the strength of linear associations across all variables.
“mpg01” shows its strongest negative associations with cylinders (-0.76), weight (-0.76), and
displacement (-0.75). These figures point to engine size and vehicle mass as the variables most
strongly associated with the fuel-efficiency class. In addition, the high correlation between
displacement and cylinders (0.95) indicates that these features carry largely redundant
information, suggesting a need for dimensionality reduction or careful feature selection in
the modeling phase.
Table 1. The correlation matrix
              Mpg01  Cylinders  Displacement  Horsepower  Weight  Acceleration   Year  Origin
Mpg01          1.00      -0.76         -0.75       -0.67   -0.76          0.35   0.43    0.51
Cylinders     -0.76       1.00          0.95        0.84    0.90         -0.50  -0.35   -0.57
Displacement  -0.75       0.95          1.00        0.90    0.93         -0.54  -0.37   -0.61
Horsepower    -0.67       0.84          0.90        1.00    0.86         -0.69  -0.42   -0.46
Weight        -0.76       0.90          0.93        0.86    1.00         -0.42  -0.31   -0.59
Acceleration   0.35      -0.50         -0.54       -0.69   -0.42          1.00   0.29    0.21
Year           0.43      -0.35         -0.37       -0.42   -0.31          0.29   1.00    0.18
Origin         0.51      -0.57         -0.61       -0.46   -0.59          0.21   0.18    1.00
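
Each entry of the table is a Pearson correlation coefficient. A minimal Python sketch of that computation is given below, assuming Pearson correlation is what the report's R code produced (the R default); the six data points are invented for illustration and are not from the Auto MPG data.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sums of cross-products and squared deviations about the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values: a 0/1 efficiency label and vehicle weight in pounds.
mpg01 = [0, 0, 0, 1, 1, 1]
weight = [4300, 3900, 3500, 2600, 2200, 2000]
r = pearson(mpg01, weight)
print(round(r, 2))
```

When one of the two variables is binary, as with mpg01, the Pearson coefficient is the point-biserial correlation, so a strongly negative value means that the class-1 group has a clearly lower mean on that predictor.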

2.4 Pairs Plot
The Pairs Plot (Figure 3) provides a synthesized view of the dataset’s internal structure,
combining univariate density estimates with bivariate correlation coefficients. The diagonal
density plots reveal a notable right-skew in displacement and horsepower. The upper triangle
highlights severe multicollinearity among the predictors, particularly among cylinders,
displacement, and weight, whose pairwise correlation coefficients are 0.90 or higher. These high
values suggest that these variables convey redundant information, which may affect the stability
of subsequent multivariate models.
Figure 3. Pairs Plot

2.5 Key Determinants of Vehicle Efficiency
Overall, the exploratory analysis identifies cylinders, displacement, horsepower, weight, and
origin as the primary determinants of fuel efficiency. A robust inverse relationship exists between
engine scale and economy; specifically, increases in weight and displacement correspond to a
precipitous decline in mpg. Data distributions confirm that high-efficiency vehicles are
characterized by lower cylinder counts and reduced mass, whereas low-efficiency vehicles are
concentrated in the higher tiers of displacement and horsepower.

Beyond mechanical specifications, the origin variable is also an informative predictor (correlation
of 0.51 with “mpg01”), possibly reflecting the impact of regional manufacturing standards. However,
the analysis reveals substantial multicollinearity among cylinders, displacement, and weight, with
pairwise correlation coefficients of 0.90 or higher. While these five variables are essential for
understanding the data structure, their high degree of redundancy suggests that a refined approach
to feature selection is necessary to ensure the statistical stability of subsequent classification models.

3. Methodology
This study evaluates the predictive performance of several classification algorithms to determine
whether a vehicle achieves high or low gas mileage, represented by the binary variable “mpg01”.
Based on the exploratory data analysis, only the predictors demonstrating the strongest
association with the response variable are included in the modeling process to ensure
computational efficiency and reduce noise.

The “Auto2” dataset is partitioned into a training set and an independent test set. To ensure a
systematic and reproducible split, approximately 10% of the observations (n1 ≈ n/10) are
randomly sampled to form the test set, while the remaining 90% are used for model training.
A fixed seed (set.seed(123)) is set before sampling so that the same split is obtained on every
run. The response variable, “mpg01”, is explicitly converted into a factor within the training set
to facilitate classification modeling in R. This split is considered reasonable because it leaves a
large training set from which the algorithms can learn underlying patterns while maintaining a
separate hold-out set for estimating the test error rate.
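
The seeded 90/10 partition can be sketched as below. This is an illustrative Python version, not the report's R code: the helper name is hypothetical, rounding n/10 to the nearest integer is an assumption, and Python's generator with seed 123 will not reproduce the rows that R's set.seed(123) selects.

```python
import random

def split_indices(n, test_fraction=0.10, seed=123):
    """Return (train_idx, test_idx) for n observations.

    Assumption: the test size is n * test_fraction rounded to the nearest integer.
    """
    rng = random.Random(seed)  # local generator, so the split is reproducible
    n_test = round(n * test_fraction)
    test_idx = sorted(rng.sample(range(n), n_test))
    held_out = set(test_idx)
    train_idx = [i for i in range(n) if i not in held_out]
    return train_idx, test_idx

train_idx, test_idx = split_indices(392)
print(len(train_idx), len(test_idx))
```

Calling the function again with the same seed returns the same indices, which is the property the fixed seed is meant to guarantee.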

Six distinct classification strategies are implemented and compared:
3.1 Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis is implemented using the lda() function from the “MASS” package.
The method assumes that the predictors follow a multivariate normal distribution within each
class and share a common covariance matrix.

In this analysis, LDA estimates class-specific means for cylinders, displacement, horsepower,
weight, and origin, while pooling the covariance matrix across the two efficiency groups. The
resulting linear discriminant function defines a linear decision boundary. Class membership for
both training and test observations is determined by selecting the class with the highest posterior
probability returned by the predict() function.
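
For reference, the standard form of the LDA decision rule described above can be written as follows. The notation is an assumption introduced here and does not appear in the report: \mu_k is the mean vector of class k, \Sigma is the pooled covariance matrix, and \pi_k is the prior probability of class k.

```latex
% Linear discriminant score for class k
\delta_k(x) = x^{\top}\Sigma^{-1}\mu_k
              - \tfrac{1}{2}\,\mu_k^{\top}\Sigma^{-1}\mu_k
              + \log \pi_k, \qquad k \in \{0, 1\}

% Posterior probability of class k
\Pr(Y = k \mid X = x) =
  \frac{\exp\{\delta_k(x)\}}{\exp\{\delta_0(x)\} + \exp\{\delta_1(x)\}}

% Decision boundary: the set of x where the two scores are equal
x^{\top}\Sigma^{-1}(\mu_1 - \mu_0)
  - \tfrac{1}{2}\,(\mu_1 + \mu_0)^{\top}\Sigma^{-1}(\mu_1 - \mu_0)
  + \log\frac{\pi_1}{\pi_0} = 0
```

Because the covariance matrix is shared, the quadratic term in x cancels when the two scores are compared, which is why the boundary is linear in x. Choosing the class with the larger score is equivalent to choosing the class with the higher posterior probability.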
