ISYE 7406 HW3
1. Introduction
Accurate prediction of fuel efficiency plays an important role in automobile design, regulatory
compliance, and consumer decision-making. Vehicles with higher gas mileage reduce fuel costs
and environmental impact, making the identification of key efficiency-related factors both
economically and practically meaningful. This project investigates statistical learning methods for
predicting whether a vehicle achieves high or low fuel efficiency based on measurable engine and
design characteristics.
The analysis is based on the Auto MPG dataset from the UCI Machine Learning Repository, which
contains 392 vehicles described by variables such as cylinders, displacement, horsepower, weight,
acceleration, model year, and origin. To formulate the problem as a classification task, the
continuous miles-per-gallon (mpg) variable is transformed into a binary response variable,
“mpg01”, indicating whether a vehicle’s fuel efficiency is above or below the median level.
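The median split described above can be sketched in a few lines of R. This is a minimal illustration, not the report's actual code: the built-in mtcars data stands in for the Auto data so that the example is self-contained, and the choice of a strict inequality (values equal to the median are coded 0) is an assumption.

```r
# Stand-in data (assumption): the built-in mtcars set replaces the Auto data.
cars <- mtcars

# mpg01 = 1 when mpg is strictly above the sample median, 0 otherwise.
# How ties at the median are handled is an assumption of this sketch.
mpg_median <- median(cars$mpg)
cars$mpg01 <- as.integer(cars$mpg > mpg_median)

# Class counts; a median split gives roughly balanced classes.
print(table(cars$mpg01))
```

Because the threshold is the sample median, the two classes are close to balanced by construction, so the error rate of a trivial classifier is about 50%.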
Several predictors in this dataset exhibit moderate to strong correlation, particularly engine-
related variables such as displacement, horsepower, and weight. Such relationships may
influence model stability and predictive performance, especially for distance-based methods.
Accordingly, multiple classification approaches are considered, including Linear Discriminant
Analysis (LDA), Quadratic Discriminant Analysis (QDA), Naive Bayes, Logistic Regression,
K-Nearest Neighbors (KNN), and Principal Component Analysis (PCA) with KNN (PCA-KNN).
To obtain performance estimates that do not depend on any single data partition, repeated
random train–test splits (Monte Carlo cross-validation) are conducted, and paired t-tests are
applied to assess whether differences in predictive accuracy between methods are statistically
meaningful. The remainder of this report
presents the exploratory data analysis, outlines the modeling procedures, compares classification
results across methods, and concludes with practical considerations for the design and selection
of fuel-efficient vehicles.
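The evaluation scheme mentioned above (repeated random splits followed by a paired t-test) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the report's code: the data are simulated, the variable names x1, x2, and y are hypothetical, the number of repetitions B = 100 is arbitrary, and the two models compared are simply LDA fits with different predictor sets.

```r
library(MASS)  # provides lda()

set.seed(123)

# Simulated two-class data (hypothetical stand-in for the Auto predictors).
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- factor(as.integer(x1 + 0.5 * x2 + rnorm(n) > 0))
dat <- data.frame(y = y, x1 = x1, x2 = x2)

B      <- 100              # number of random train-test splits (assumption)
n_test <- round(n / 10)    # about 10% of the data held out in each split
err_full  <- numeric(B)
err_small <- numeric(B)

for (b in seq_len(B)) {
  test_idx <- sample(n, n_test)
  train <- dat[-test_idx, ]
  test  <- dat[test_idx, ]

  # Two candidate models fitted on the same training set.
  fit_full  <- lda(y ~ x1 + x2, data = train)
  fit_small <- lda(y ~ x2, data = train)

  # Test error of each model on the same held-out set.
  err_full[b]  <- mean(predict(fit_full, test)$class != test$y)
  err_small[b] <- mean(predict(fit_small, test)$class != test$y)
}

# Average test error over the B splits.
print(c(full = mean(err_full), small = mean(err_small)))

# Paired comparison: both models are scored on identical test sets.
cmp <- t.test(err_full, err_small, paired = TRUE)
print(cmp)
```

The pairing matters because both error vectors come from the same B partitions; the test therefore works on the per-split differences, which removes the split-to-split variation shared by both models.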
2. Exploratory Data Analysis
The cleaned dataset (“Auto1”) contains 392 observations and includes the binary response
variable “mpg01” along with seven numerical predictors describing engine characteristics and
vehicle design. Since the objective is to identify factors associated with high versus low fuel
efficiency, exploratory analysis focuses on understanding the relationship between “mpg01” and
each explanatory variable through scatterplots, boxplots, and correlation analysis.
2.1 Scatter Plots
The scatter plot matrix (Figure 1) illustrates the relationship between vehicle specifications and
fuel efficiency (mpg). A pronounced non-linear, negative relationship exists between mpg and
the engine-related variables, specifically displacement, horsepower, and weight: as these values
increase, fuel economy decreases, but at a diminishing rate. In contrast, acceleration and model
year exhibit moderate positive trends, indicating that newer vehicles and those with slower
acceleration (often associated with smaller engines) tend to achieve higher fuel efficiency.
Figure 1. Scatter Plot Matrix
2.2 Boxplots
The boxplots (Figure 2) compare the distribution of each vehicle attribute across the two levels
of the binary variable mpg01. High-efficiency vehicles (category 1) have markedly lower
medians for cylinders, displacement, horsepower, and weight than their low-efficiency
counterparts. The year distribution shows that the high-efficiency group is shifted toward later
production dates, while the origin plot suggests a higher concentration of efficient vehicles within
specific manufacturing regions.
Figure 2. Boxplots
2.3 Correlation Analysis
The correlation matrix (Table 1) quantifies the strength of linear associations across all variables.
“Mpg01” has its strongest negative associations with cylinders (-0.76), weight (-0.76), and
displacement (-0.75), which points to engine size and vehicle mass as the most informative
variables for classifying fuel efficiency. In addition, the high correlation between displacement
and cylinders (0.95) indicates that these two features carry largely overlapping information,
suggesting a need for dimensionality reduction or careful feature selection in the modeling
phase.
Table 1. The correlation matrix
Variable      Mpg01  Cylinders  Displacement  Horsepower  Weight  Acceleration  Year  Origin
Mpg01          1.00      -0.76         -0.75       -0.67   -0.76          0.35  0.43    0.51
Cylinders     -0.76       1.00          0.95        0.84    0.90         -0.50 -0.35   -0.57
Displacement  -0.75       0.95          1.00        0.90    0.93         -0.54 -0.37   -0.61
Horsepower    -0.67       0.84          0.90        1.00    0.86         -0.69 -0.42   -0.46
Weight        -0.76       0.90          0.93        0.86    1.00         -0.42 -0.31   -0.59
Acceleration   0.35      -0.50         -0.54       -0.69   -0.42          1.00  0.29    0.21
Year           0.43      -0.35         -0.37       -0.42   -0.31          0.29  1.00    0.18
Origin         0.51      -0.57         -0.61       -0.46   -0.59          0.21  0.18    1.00
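A matrix like Table 1 can be produced with cor(). The sketch below is illustrative only: it uses the built-in mtcars data as a stand-in for the Auto data (an assumption), so the variable names (cyl, disp, hp, wt, qsec) and the resulting values differ from those in Table 1. It also flags predictor pairs whose absolute correlation is at least 0.90, which is the kind of redundancy discussed in this section.

```r
# Stand-in data (assumption): mtcars replaces the Auto data.
cars <- mtcars
cars$mpg01 <- as.integer(cars$mpg > median(cars$mpg))

# Pearson correlations between the binary response and the predictors.
vars    <- c("mpg01", "cyl", "disp", "hp", "wt", "qsec")
r       <- cor(cars[, vars])
cor_mat <- round(r, 2)
print(cor_mat)

# Flag predictor pairs with |r| >= 0.90 (upper triangle only, so each
# pair is listed once).
pred   <- setdiff(vars, "mpg01")
r_pred <- r[pred, pred]
high   <- which(abs(r_pred) >= 0.90 & upper.tri(r_pred), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = pred[high[, "row"]],
  var2 = pred[high[, "col"]],
  r    = round(r_pred[high], 2)
)
print(high_pairs)
```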
2.4 Pairs Plot
The Pairs Plot (Figure 3) provides a synthesized view of the dataset’s internal structure,
combining univariate density estimates with bivariate correlation coefficients. The diagonal
density plots reveal a notable right skew in displacement and horsepower. The upper triangle
highlights severe multicollinearity among the predictors, particularly among cylinders,
displacement, and weight, whose pairwise correlation coefficients are 0.90 or higher. These high
values suggest that the variables convey redundant information, which may affect the stability
of subsequent multivariate models.
Figure 3. Pairs Plot
2.5 Key Determinants of Vehicle Efficiency
Overall, the exploratory analysis identifies cylinders, displacement, horsepower, weight, and
origin as the primary determinants of fuel efficiency. A robust inverse relationship exists between
engine scale and economy; specifically, increases in weight and displacement correspond to a
precipitous decline in mpg. Data distributions confirm that high-efficiency vehicles are
characterized by lower cylinder counts and reduced mass, whereas low-efficiency vehicles are
concentrated in the higher tiers of displacement and horsepower.
Beyond mechanical specifications, the origin variable is a notable categorical predictor
(correlation of 0.51 with mpg01), likely reflecting regional manufacturing standards. However,
the analysis reveals substantial multicollinearity among cylinders, displacement, and weight, with
pairwise correlation coefficients of 0.90 or higher. While these five variables are essential for
understanding the data structure, their high degree of redundancy suggests that a refined
approach to feature selection is necessary to ensure the statistical stability of subsequent
classification models.
3. Methodology
This study evaluates the predictive performance of several classification algorithms to determine
whether a vehicle achieves high or low gas mileage, represented by the binary variable “mpg01”.
Based on the exploratory data analysis, only the predictors demonstrating the strongest
association with the response variable are included in the modeling process to ensure
computational efficiency and reduce noise.
The “Auto2” dataset is partitioned into a training set and an independent test set. To obtain a
reproducible split, approximately 10% of the observations (n1 = n/10) are randomly sampled to
form the test set, while the remaining 90% are used for model training. A fixed seed
(set.seed(123)) is set before sampling so that the same partition is obtained each time the code
is run. The response variable, “mpg01”, is explicitly converted into a factor within the training
set to facilitate classification modeling in R. This split is considered reasonable because it leaves
most of the data for training while keeping a separate hold-out set for estimating the test error
rate.
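The partition described above can be sketched as follows. The snippet is a minimal illustration rather than the report's code: mtcars stands in for the “Auto2” data (an assumption), the names dat, n1, test_idx, train, and test are hypothetical, and rounding n/10 to the nearest integer is also an assumption, since the report does not state how the test-set size is rounded.

```r
# Stand-in data (assumption): mtcars replaces the "Auto2" data.
dat <- mtcars
dat$mpg01 <- as.integer(dat$mpg > median(dat$mpg))

n  <- nrow(dat)
n1 <- round(n / 10)   # test-set size, about 10% (rounding is an assumption)

# Fixing the seed before sampling makes the partition reproducible.
set.seed(123)
test_idx <- sample(n, n1)

test  <- dat[test_idx, ]
train <- dat[-test_idx, ]

# Convert the response to a factor in the training set for classification.
train$mpg01 <- as.factor(train$mpg01)

print(c(train = nrow(train), test = nrow(test)))
```

Sampling indices without replacement and then using negative indexing guarantees that the training and test sets are disjoint and together cover every observation.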
Six distinct classification strategies are implemented and compared:
3.1 Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis is implemented using the lda() function from the “MASS” package.
The method assumes that the predictors follow a multivariate normal distribution within each
class and share a common covariance matrix.
In this analysis, LDA estimates class-specific means for cylinders, displacement, horsepower,
weight, and origin, while pooling the covariance matrix across the two efficiency groups. The
resulting linear discriminant function defines a linear decision boundary. Class membership for
both training and test observations is determined by selecting the class with the highest posterior
probability returned by the predict() function.
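The LDA workflow described above can be sketched with MASS::lda() and predict(). This is an illustrative sketch under stated assumptions, not the report's code: mtcars stands in for the Auto data, so the predictors are cyl, disp, hp, and wt (mtcars has no origin variable), and the object names are hypothetical.

```r
library(MASS)  # provides lda()

# Stand-in data (assumption): mtcars replaces the Auto data.
dat <- mtcars
dat$mpg01 <- factor(as.integer(dat$mpg > median(dat$mpg)))

# Hold out about 10% of the rows as a test set.
set.seed(123)
test_idx <- sample(nrow(dat), round(nrow(dat) / 10))
train <- dat[-test_idx, ]
test  <- dat[test_idx, ]

# Fit LDA: class-specific means with a covariance matrix pooled across classes.
fit <- lda(mpg01 ~ cyl + disp + hp + wt, data = train)
print(fit$means)   # estimated class means of each predictor

# predict() returns posterior probabilities and the predicted class,
# which is the class with the highest posterior probability.
pred_train <- predict(fit, train)
pred_test  <- predict(fit, test)

train_err <- mean(pred_train$class != train$mpg01)
test_err  <- mean(pred_test$class != test$mpg01)
print(c(train = train_err, test = test_err))
```

The element pred_test$posterior holds one column per class, so the predicted label can be checked directly against the column with the larger posterior probability.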