ISYE 7406 – Homework 3: MPG
02/17/2026
Introduction
Fuel efficiency remains one of the most important characteristics of modern vehicles, affecting
consumer purchasing decisions, environmental impact, and manufacturing strategy. The Auto
MPG dataset provides historical automobile data including engine characteristics, vehicle
weight, and production year. The objective of this analysis is to classify vehicles as having high
or low gas mileage based on these attributes.
Specifically, we construct a binary response variable, mpg01, equal to 1 if a vehicle’s mpg
exceeds the median and 0 otherwise. Multiple classification methods are applied and compared
using cross-validation to determine which method performs best and to understand which vehicle
features most strongly influence fuel efficiency.
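The construction of mpg01 described above can be sketched in a few lines. This is a minimal illustration, not the code used for the analysis: the helper name make_mpg01 and the sample mpg values are hypothetical, and the strict "greater than" comparison follows the wording "exceeds the median".

```python
from statistics import median

def make_mpg01(mpg_values):
    """Return 1 where mpg is strictly above the sample median, else 0."""
    cutoff = median(mpg_values)
    return [1 if mpg > cutoff else 0 for mpg in mpg_values]

# Hypothetical mpg readings for illustration only (not taken from the dataset).
example_mpg = [18.0, 15.0, 36.0, 27.0, 22.0, 31.0]
example_labels = make_mpg01(example_mpg)
print(example_labels)
```

Because the cutoff is the sample median, the two classes come out roughly balanced, which makes misclassification error a reasonable performance measure.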
Exploratory Data Analysis (EDA)
The cleaned dataset contains 392 observations after removing missing values and excluding the
vehicle name column. The correlation matrix indicates strong relationships between mpg01 and
several predictors.
Variable        Correlation with mpg01
Cylinders       -0.76
Displacement    -0.75
Weight          -0.76
Horsepower      -0.67
Origin           0.51
Year             0.43
Acceleration     0.35
The strongest negative correlations are with weight, cylinders, and displacement, suggesting
that larger and heavier vehicles are substantially more likely to fall into the low-mpg category.
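For reference, the correlation between a numeric predictor and a 0/1 response is the ordinary Pearson coefficient (often called the point-biserial correlation in this setting). The sketch below shows the computation on a small hypothetical sample; the function name pearson_r and the toy weights are assumptions for illustration and do not reproduce the values in the table.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical vehicle weights (lb) and mpg01 labels, for illustration only.
toy_weight = [2000, 2200, 2500, 3500, 4000, 4300]
toy_mpg01 = [1, 1, 1, 0, 0, 0]

r = pearson_r(toy_weight, toy_mpg01)
print(round(r, 2))
```

A coefficient near -1 here means that the heavier toy vehicles are the ones labeled 0, which is the same direction of association the table reports for weight.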
Figure 1. Boxplot of Weight by mpg01
The boxplot of vehicle weight by mpg01 (Figure 1) shows clear separation between classes.
High-mpg vehicles (mpg01 = 1) are markedly lighter, with a noticeably lower median weight
and a tighter spread than low-mpg vehicles.
Figure 2. Scatterplot (MPG vs Weight)
The scatterplot of mpg versus weight (Figure 2) shows a strong nonlinear negative
association: as weight increases, mpg decreases sharply. This pattern is consistent with the
large negative correlation between weight and mpg01 (-0.76) reported in the table.
The correlation matrix reveals that mpg01 is most strongly associated with vehicle weight
(-0.76), cylinders (-0.76), and displacement (-0.75). These large negative correlations indicate that
heavier vehicles with larger engines are substantially more likely to fall below the median mpg
threshold. Horsepower also exhibits a strong negative relationship (-0.67), further reinforcing the
role of engine size and power in determining fuel efficiency. In contrast, origin (0.51) and model
year (0.43) show moderate positive associations, suggesting that fuel efficiency improved over
time and differs by region of manufacture.
These relationships are visually reinforced by Figure 1 and Figure 2. The boxplot of weight by
mpg01 shows pronounced class separation, with high-mpg vehicles exhibiting significantly
lower median weight and less dispersion. The scatterplot of mpg versus weight demonstrates a
clear nonlinear decreasing trend, indicating that fuel efficiency declines rapidly as vehicle mass
increases. Together, these findings justify selecting cylinders, displacement, horsepower, weight,
and origin as predictors for classification.
Methodology
To obtain robust performance estimates, repeated random splitting (Monte Carlo cross-
validation) was performed over 100 iterations. In each iteration, approximately 10% of the data
was randomly selected as a test set, with the remaining observations used for training.
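The splitting scheme can be sketched as a generator of index pairs. This is an illustrative outline only: the function name monte_carlo_splits, the seed, and the choice to round 10% of 392 to 39 test observations are assumptions, since the report says only "approximately 10%".

```python
import random

def monte_carlo_splits(n_obs, n_iter=100, test_frac=0.10, seed=7406):
    """Yield (train_idx, test_idx) pairs from repeated random splitting."""
    rng = random.Random(seed)          # fixed seed (assumed) for reproducibility
    n_test = round(n_obs * test_frac)  # 39 when n_obs = 392
    indices = list(range(n_obs))
    for _ in range(n_iter):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        # First n_test shuffled indices form the test set, the rest the training set.
        yield shuffled[n_test:], shuffled[:n_test]

splits = list(monte_carlo_splits(392))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Unlike k-fold cross-validation, the test sets drawn this way may overlap across iterations; averaging the error over many random splits reduces the dependence of the estimate on any single partition.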
The following classification methods were evaluated:
1. Linear Discriminant Analysis (LDA)
2. Quadratic Discriminant Analysis (QDA)
3. Naive Bayes
4. Logistic Regression
5. K-Nearest Neighbors (KNN, k = 3)
Performance was measured using misclassification error on both training and test sets.
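To make the error metric and the k = 3 nearest-neighbor rule concrete, the sketch below implements both in plain Python. It is a simplified illustration under stated assumptions: Euclidean distance on unscaled features is assumed (the report does not say whether predictors were standardized), and the toy data (weight in thousands of lb, cylinders) are hypothetical.

```python
from collections import Counter
from math import dist

def knn_predict(train_X, train_y, query, k=3):
    """Majority vote among the k training points closest to `query` (Euclidean)."""
    order = sorted(range(len(train_X)), key=lambda i: dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def misclassification_error(y_true, y_pred):
    """Fraction of observations whose predicted label differs from the true label."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)

# Hypothetical (weight in 1000 lb, cylinders) rows, for illustration only.
train_X = [(2.0, 4), (2.2, 4), (2.5, 4), (3.5, 6), (4.0, 8), (4.3, 8)]
train_y = [1, 1, 1, 0, 0, 0]
test_X = [(2.1, 4), (4.1, 8), (3.0, 6)]
test_y = [1, 0, 0]

train_preds = [knn_predict(train_X, train_y, x) for x in train_X]
test_preds = [knn_predict(train_X, train_y, x) for x in test_X]
train_error = misclassification_error(train_y, train_preds)
test_error = misclassification_error(test_y, test_preds)
print(train_error, test_error)
```

Note that the training error for KNN is computed with each point counted among its own neighbors, so it tends to be optimistic relative to the test error.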
Results
One-Split Model Performance
For a representative 90/10 train-test split, testing errors were:
Method         Test Error
LDA            0.128
QDA            0.103
Naive Bayes    0.103
Logistic       0.103
KNN (k=3)      0.103
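These error rates can be translated back into counts of misclassified vehicles. The short check below assumes the test set held 39 vehicles (10% of 392, rounded); the report does not state the exact size, so the counts are an inference rather than reported results.

```python
n_test = 39  # assumed: round(0.10 * 392); exact test-set size is not stated

reported_error = {
    "LDA": 0.128,
    "QDA": 0.103,
    "Naive Bayes": 0.103,
    "Logistic": 0.103,
    "KNN (k=3)": 0.103,
}

# Number of misclassified test vehicles implied by each reported error rate.
implied_counts = {m: round(e * n_test) for m, e in reported_error.items()}

# Check that the implied counts reproduce the reported rates to three decimals.
consistent = all(
    round(implied_counts[m] / n_test, 3) == e for m, e in reported_error.items()
)
print(implied_counts, consistent)
```

Under this assumption, LDA differs from the other four methods by a single misclassified vehicle, so one split cannot reliably rank the methods; this is why the averaged results over the 100 random splits are the more informative comparison.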