ISYE 7406 Homework 2
1. Introduction
Accurate estimation of body fat percentage plays an important role in health assessment, fitness
monitoring, and clinical decision-making. Direct measurement techniques are often costly or impractical,
which makes prediction based on easily obtainable physical measurements a meaningful and widely used
alternative. This project investigates statistical learning methods for predicting body fat percentage using
a multivariate dataset of anthropometric variables. The primary goal is to develop regression models that
achieve strong predictive accuracy while remaining interpretable, a balance that is particularly important
in biomedical and applied settings.
The dataset analyzed in this study includes body fat percentage, measured by the “Brozek” formula, as
the response variable, along with several continuous predictors describing physical characteristics. Like
many biomedical datasets, these predictors exhibit substantial correlation, which can undermine the
stability and generalizability of traditional regression models. This motivates the use of regularization-
based methods, which are designed to control model complexity, reduce variance, and mitigate
overfitting. To assess model robustness and predictive performance, the analysis employs Monte Carlo
cross-validation, allowing for systematic comparison across repeated train–test splits.
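The repeated train–test splitting described above can be sketched in base R as follows. This is only an illustrative sketch on simulated data: the data frame `dat`, the number of splits `B`, and the hold-out size `n_test` are assumptions, not the settings used in this report.

```r
set.seed(3)  # assumed seed, for reproducibility only

# Hypothetical data: one predictor and a noisy linear response.
n   <- 100
x   <- rnorm(n)
dat <- data.frame(x = x, y = 2 + 3 * x + rnorm(n))

B      <- 50   # number of random splits (assumed)
n_test <- 10   # hold-out size per split (assumed)
mse    <- numeric(B)

for (b in seq_len(B)) {
  # Draw a fresh random test set on every iteration.
  test_idx <- sample(seq_len(n), n_test)
  fit      <- lm(y ~ x, data = dat[-test_idx, ])
  pred     <- predict(fit, newdata = dat[test_idx, ])
  mse[b]   <- mean((dat$y[test_idx] - pred)^2)
}

# Average test error and its variability across splits.
c(mean = mean(mse), var = var(mse))
```

Reporting both the mean and the variance of the test error across splits is what allows competing models to be compared on stability as well as accuracy.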
The analysis proceeds in several stages. First, exploratory data analysis (EDA) is conducted to examine the
distributional properties of the variables, identify potential anomalies, and explore relationships between
predictors and the response. Subsequent sections focus on fitting and tuning multiple regression models,
followed by a comparative evaluation of their predictive performance. The report concludes with a
discussion of the results and their implications for practical prediction of body fat percentage.
2. Exploratory Data Analysis
Exploratory data analysis (EDA) was conducted on the training dataset to understand the structure and
key characteristics of the data and to guide subsequent modeling decisions. All analyses were performed
in R, with relevant code provided in the Appendix. The primary objectives of this stage are to examine
variable distributions, assess relationships among predictors, identify potential anomalies or outliers, and
evaluate whether preprocessing steps—such as standardization or variable exclusion—are warranted
prior to model fitting.
2.1 Data Structure and Summary Statistics
To facilitate reliable model assessment, the full dataset was randomly partitioned into training and testing
subsets. Ten percent of the observations were reserved for testing, while the remaining data were used
for model estimation. This separation ensures that predictive performance is evaluated on observations
that are completely independent of the fitting process.
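A minimal sketch of such a random 90/10 partition is shown below. The data frame `full_data` is simulated, and the total of 252 rows is an assumption chosen so that a 10% hold-out leaves the 227 training rows reported in this section; the seed is likewise arbitrary.

```r
set.seed(7406)  # assumed seed, for reproducibility only

# Hypothetical stand-in for the full data (two illustrative columns).
n <- 252
full_data <- data.frame(brozek = rnorm(n, mean = 19, sd = 7.5),
                        abdom  = rnorm(n, mean = 92, sd = 10))

# Reserve about 10% of the rows for testing.
n_test   <- round(0.10 * n)
test_idx <- sample(seq_len(n), size = n_test)

train <- full_data[-test_idx, ]
test  <- full_data[test_idx, ]

c(train = nrow(train), test = nrow(test))  # 227 training rows, 25 test rows
```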
The training sample consists of 227 observations, all corresponding to male participants between 22 and
81 years of age. The mean (median) body weight is 179.3 (176.0) pounds, and the mean (median) height
is 70.16 (70.00) inches, values that are broadly typical of an adult male population.
The response variable, “brozek”, which measures body fat percentage, has a mean of 18.99 and a median
of 19.10, indicating an approximately symmetric distribution with no strong skewness.
The training dataset consists of multiple continuous predictors representing anthropometric
measurements, including body circumferences, density, and fat-free mass. To characterize the scale and
variability of each variable, summary statistics (mean, minimum, and maximum) were computed. These
summaries reveal pronounced differences in measurement scale across predictors: for example, density
values are tightly clustered near one, whereas weight and fat-free mass can exceed 200. Such scale disparities
matter for penalized regression because the penalty acts on coefficient magnitudes, which depend on the
units of each predictor. This observation underscores the need to standardize predictors prior to fitting
regularized models. Detailed summary statistics are provided in Appendix Part (b).
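One way to standardize predictors is sketched below, with the centering and scaling constants taken from the training data only and then reused for the test data. The data frames `train_x` and `test_x` are simulated placeholders, not the report's data.

```r
set.seed(1)  # assumed seed, for reproducibility only

# Hypothetical predictors on very different scales.
train_x <- data.frame(density = rnorm(200, 1.055, 0.02),
                      weight  = rnorm(200, 179, 29))
test_x  <- data.frame(density = rnorm(20, 1.055, 0.02),
                      weight  = rnorm(20, 179, 29))

# Centre and scale using training statistics only.
train_scaled <- scale(train_x)
ctr <- attr(train_scaled, "scaled:center")
scl <- attr(train_scaled, "scaled:scale")

# Apply the same training means and standard deviations to the test predictors.
test_scaled <- scale(test_x, center = ctr, scale = scl)

round(colMeans(train_scaled), 10)      # approximately 0 for each column
round(apply(train_scaled, 2, sd), 10)  # 1 for each column
```

Reusing the training constants keeps the test observations out of the preprocessing step, so the test set stays independent of the fitting process.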
2.2 Key Findings and Modeling Implications
Further exploratory analysis was carried out using boxplots and a correlation heat map, which together
provide insight into distributional properties, outlier behavior, and the dependence structure among
variables.
2.2.1 Boxplot: Distributional Features and Outliers
The boxplots shown in Figure 1 summarize the marginal distributions of the response and predictor
variables and reveal several features relevant for model development. The most prominent is the
pronounced heterogeneity in predictor scales. Variables such as body density are narrowly distributed
around values close to one, whereas others, including weight and fat-free mass, can exceed 200 because
of differences in their underlying physical units. While such scale variation is common in physiological
data, it can materially influence model estimation, particularly for distance-based algorithms and
regularized regression methods. Accordingly, standardizing the predictors is a necessary preprocessing
step to ensure that no single variable dominates the fitting process solely due to its scale.
In contrast, the response variable, brozek (body fat percentage), displays a stable and approximately
symmetric distribution, with nearly identical mean (18.99) and median (19.10). This behavior supports
the use of linear modeling without requiring transformation of the outcome.
Exploratory boxplots also reveal the presence of extreme and potentially erroneous observations. Notably,
the minimum recorded height of 29.5 inches is implausible for adult subjects and likely reflects a data
entry error or a severe outlier. Similarly, several density measurements fall below 1.0, outside the typical
physiological range of approximately 1.01 to 1.10, suggesting either extreme body composition or
measurement inaccuracy. Weight also exhibits pronounced right skewness, with a maximum value of
363.1 compared to a third quartile of 197.5, indicating a small number of unusually large individuals. These
observations may exert disproportionate influence on regression estimates and therefore warrant careful
consideration in subsequent modeling and diagnostic analyses.
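The points that a boxplot draws individually can also be listed programmatically, which makes observations such as the 29.5-inch height easy to inspect. The sketch below uses a small hypothetical vector of heights; only the value 29.5 is taken from the text.

```r
# Hypothetical height values in inches; 29.5 mirrors the suspicious minimum noted above.
height <- c(29.5, 64.0, 66.5, 68.0, 69.0, 70.0, 70.5, 71.0, 72.0, 73.5, 75.0, 77.5)

# boxplot.stats() returns the points lying more than 1.5 * IQR beyond the hinges,
# which are the points a default boxplot plots as individual outliers.
flagged <- boxplot.stats(height)$out
flagged  # 29.5
```

Flagging is only a screening step: whether a flagged value is an entry error or a genuine extreme still has to be judged from context.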
Despite these anomalies, most extreme values remain within broadly plausible physiological limits. Rather
than excluding observations at this stage, the analysis proceeds with the full training dataset, relying on
regularization to stabilize coefficient estimates and on cross-validation to show whether these observations
degrade out-of-sample performance.
2.2.2 Heat Map of the Correlation Matrix: Multicollinearity and Predictor Relationships
The correlation heat map in Figure 2 reveals substantial multicollinearity among predictors, which poses
a key challenge for regression modeling. Several variables exhibit near-perfect correlations. In particular,
brozek and siri are essentially perfectly correlated (correlation = 1.00 to two decimals), as both are derived
from body density using alternative formulas. Consistent with this relationship, density shows an almost
perfect negative correlation with brozek (−0.99). These associations are mechanically induced by the Brozek formula,
Body Fat = (4.57 / Density − 4.142) × 100,
implying that increases in density necessarily correspond to decreases in estimated body fat percentage.
As a result, siri and density are excluded as predictors when modeling brozek: both determine the response
almost exactly, so including them would be redundant and would make the prediction task trivial.
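The mechanical link between density, brozek, and siri can be checked numerically. The sketch below evaluates the Brozek formula as stated above on a hypothetical grid of density values. The Siri formula is written in its commonly cited form, which is an assumption because this report does not state it.

```r
# Hypothetical density values spanning the range discussed in this section.
density <- seq(1.00, 1.10, by = 0.01)

# Brozek formula as stated in the text.
brozek <- (4.57 / density - 4.142) * 100

# Siri formula in its commonly cited form (an assumption; not stated in this report).
siri <- (4.95 / density - 4.50) * 100

cor(brozek, siri)     # both are linear in 1/density, so this is 1 up to rounding
cor(brozek, density)  # strongly negative, close to -1
```

For example, a density of 1.05 gives (4.57 / 1.05 − 4.142) × 100, or about 21.0 percent body fat under the Brozek formula.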
Strong correlations are also observed among other anthropometric variables. For example, weight is
highly correlated with BMI (adipos, 0.89) and hip circumference (0.94), while abdominal and chest
circumferences have a correlation of 0.92. Such dependence can inflate variance in ordinary least squares
estimates and lead to unstable coefficient estimates when many correlated predictors are included
simultaneously.
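The variance inflation described above can be quantified with variance inflation factors (VIFs), which are not reported in this section but follow directly from the predictor correlation matrix. The sketch below uses simulated predictors with hypothetical names and coefficients, chosen only so that two of them are strongly correlated.

```r
set.seed(2)  # assumed seed, for reproducibility only

# Hypothetical correlated predictors (not the report's data).
n      <- 200
weight <- rnorm(n, 179, 29)
hip    <- 60 + 0.22 * weight + rnorm(n, 0, 2.5)  # built to track weight closely
height <- rnorm(n, 70, 2.6)                       # generated independently
X <- cbind(weight = weight, hip = hip, height = height)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the others.
# It equals the j-th diagonal entry of the inverse predictor correlation matrix.
vif <- diag(solve(cor(X)))
round(vif, 2)
```

A VIF near 1 indicates a predictor that is nearly uncorrelated with the others, while large values mark predictors whose least squares coefficients are estimated with inflated variance.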
The heat map further indicates which predictors are most strongly associated with body fat percentage.
Abdominal circumference exhibits the strongest positive correlation with brozek (0.82), followed by BMI,
chest circumference, and hip circumference. These findings align with biological intuition, as measures of
central adiposity are closely linked to overall body fat. In contrast, height (−0.10) and age (0.26) show
relatively weak marginal linear relationships with body fat percentage and may therefore contribute
limited predictive power on their own in linear models.