ISYE7406 HW5
Introduction
Student placement prediction is a common application of statistical learning
methods, as it involves identifying factors that influence whether a student secures
employment. Understanding these factors can provide useful insights for both
students preparing for job placement and organizations aiming to improve
candidate selection.
In this study, I analyze a Student Placement Dataset consisting of 10,000
observations and 11 predictor variables describing students’ academic
performance, skills, and background characteristics. The response variable is
placement status, a binary outcome indicating whether a student is placed or not.
The primary objective of this report is to apply and compare a range of statistical
learning methods for predicting placement outcomes. These include baseline
approaches such as K-Nearest Neighbors (KNN) and Naïve Bayes, as well as more
advanced ensemble methods, namely Random Forest and Gradient Boosting
Machine (GBM).
All models are trained on a training set and evaluated on a held-out test set. Model
performance is compared in terms of test error (equivalently, classification
accuracy). Hyperparameters for each method are selected by cross-validation on
the training set only, so that the test set plays no role in tuning and the
comparison across methods remains fair.
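The tuning protocol described above can be illustrated with a minimal sketch. It is not the code used for this report: the data are synthetic (two made-up standardized features and a binary label), the helper names `knn_predict`, `error_rate`, and `cv_error` are hypothetical, and the candidate values of k are arbitrary. The point is only the order of operations: cross-validate on the training set, pick the hyperparameter, then touch the test set once.

```python
import random

def knn_predict(fit_X, fit_y, x, k):
    """Majority vote among the k nearest fitted points (squared Euclidean distance)."""
    nearest = sorted(
        range(len(fit_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(fit_X[i], x)),
    )[:k]
    votes = sum(fit_y[i] for i in nearest)
    return 1 if 2 * votes > k else 0

def error_rate(fit_X, fit_y, eval_X, eval_y, k):
    """Misclassification rate of k-NN fitted on (fit_X, fit_y), scored on (eval_X, eval_y)."""
    wrong = sum(knn_predict(fit_X, fit_y, x, k) != y for x, y in zip(eval_X, eval_y))
    return wrong / len(eval_y)

def cv_error(X, y, k, n_folds=5):
    """Average validation error over n_folds interleaved folds of the training data."""
    errors = []
    for f in range(n_folds):
        held_out = list(range(f, len(X), n_folds))
        held = set(held_out)
        fit_idx = [i for i in range(len(X)) if i not in held]
        errors.append(error_rate(
            [X[i] for i in fit_idx], [y[i] for i in fit_idx],
            [X[i] for i in held_out], [y[i] for i in held_out], k))
    return sum(errors) / n_folds

# Synthetic stand-in data (assumption): 1 = placed, 0 = not placed.
rng = random.Random(7406)
y = [rng.randint(0, 1) for _ in range(300)]
X = [(rng.gauss(label, 1.0), rng.gauss(label, 1.0)) for label in y]

# Hold out the last 60 observations; they are never used for tuning.
X_train, y_train = X[:240], y[:240]
X_test, y_test = X[240:], y[240:]

# Select k by cross-validation on the training set only (odd k avoids vote ties).
candidate_ks = [1, 5, 15]
cv_errors = {k: cv_error(X_train, y_train, k) for k in candidate_ks}
best_k = min(cv_errors, key=cv_errors.get)

# Evaluate once on the held-out test set with the selected k.
test_error = error_rate(X_train, y_train, X_test, y_test, best_k)
print(f"selected k = {best_k}, test error = {test_error:.3f}")
```

The same pattern applies to the other methods, with the number of neighbors replaced by whatever hyperparameters the method exposes.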
Through this analysis, I aim to identify which modeling approach achieves the
best predictive performance and to understand the strengths and limitations of
different methods when applied to this dataset.
Exploratory Data Analysis
The dataset used in this project was obtained from Kaggle:
https://www.kaggle.com/datasets/rakesh630/global-student-placement-2025-dataset/data
The dataset consists of 10,000 student records with 11 predictor variables,
capturing both institutional characteristics and individual-level attributes. These
variables include university-related information, such as country, college tier, and
university ranking band, as well as student-specific factors including academic
performance, internship experience, skill assessments, and specialization.
The dataset contains two outcome variables: placement status, a binary indicator of
whether a student was Placed or Not Placed, and salary. The salary variable was
excluded from the analysis because it is observed only for placed students and is
missing for all non-placed observations (3,848 cases). Because salary is a
post-placement outcome rather than a pre-placement characteristic, including it
would leak information about the response into the predictive models.
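To make the concern about salary concrete, the sketch below uses a handful of made-up records (the values and the column names `cgpa` and `salary` are assumptions for illustration, not rows from the Kaggle file). When salary is missing exactly for the non-placed students, a rule that only checks whether salary is present reproduces the response perfectly, which is why the variable has to be dropped before modeling.

```python
# Hypothetical records: salary is None exactly when the student was not placed.
records = [
    {"cgpa": 8.1, "salary": 52000.0, "placement_status": "Placed"},
    {"cgpa": 6.4, "salary": None,    "placement_status": "Not Placed"},
    {"cgpa": 7.3, "salary": 41000.0, "placement_status": "Placed"},
    {"cgpa": 8.6, "salary": None,    "placement_status": "Not Placed"},
    {"cgpa": 5.9, "salary": 38000.0, "placement_status": "Placed"},
]

def salary_rule(row):
    """Predict the response using nothing but whether salary was recorded."""
    return "Placed" if row["salary"] is not None else "Not Placed"

# Accuracy of the salary-only rule: perfect, because salary encodes the outcome.
leak_accuracy = sum(
    salary_rule(r) == r["placement_status"] for r in records
) / len(records)
print(f"accuracy of the salary-only rule: {leak_accuracy:.2f}")

# Drop both outcome variables so that only pre-placement predictors remain.
feature_rows = [
    {name: value for name, value in r.items()
     if name not in ("salary", "placement_status")}
    for r in records
]
labels = [r["placement_status"] for r in records]
```

A model given access to salary would look excellent on the test set while being useless for students whose placement outcome is not yet known.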
To facilitate analysis, the predictors were grouped into several categories.
Institutional characteristics include college tier and university ranking band, which
reflect the overall quality and reputation of the educational institution. Academic
metrics consist of CGPA and backlogs, capturing students’ academic performance.
Experience- and skill-related variables include internship count, internship quality
score, aptitude score, and communication score. Field of study and industry
alignment are represented by specialization and industry, while geographic
information is captured by country.
The dataset includes a mix of numerical variables, such as CGPA, aptitude score,
and communication score, and categorical variables, such as specialization, industry,
and college tier, providing a diverse set of features for modeling.
Figure 1. Summary Statistics of the Dataset
From the summary statistics, CGPA ranges from 4 to 10 with a mean of
approximately 7, and appears to be approximately normally distributed. In
contrast, both backlogs and internship count exhibit right-skewed distributions. This
is consistent with expectations: most students have relatively few backlogs
(median = 1), indicating that academic failure is uncommon. Similarly, internship
counts are generally low, suggesting that many students have limited practical
experience.
Figure 2. Histogram of Placement Status
The target variable, placement_status, exhibits a moderate class imbalance:
approximately 61.5% of students are placed, while 38.5% are not placed.
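The class proportions also fix a reference point for the model comparison. As a short worked calculation, assuming the 3,848 non-placed count reported for the missing salary values, a rule that always predicts "Placed" already attains about 61.5% accuracy on the full dataset, so a useful model needs a test error clearly below roughly 38.5%. The exact baseline on the test set depends on the class proportions in that split.

```python
n_total = 10_000
n_not_placed = 3_848                 # non-placed students (missing-salary count)
n_placed = n_total - n_not_placed    # 6,152 placed students

# Always predicting the majority class gives the majority-class proportion as accuracy.
baseline_accuracy = max(n_placed, n_not_placed) / n_total
baseline_error = 1 - baseline_accuracy

print(f"majority-class accuracy: {baseline_accuracy:.4f}")  # 0.6152
print(f"majority-class error:    {baseline_error:.4f}")     # 0.3848
```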
Figure 3. Boxplot and Density Plot of CGPA by Placement Status
The density plot indicates that placed students generally have higher CGPA values,
with a peak around 7.5, whereas non-placed students exhibit a peak closer to 6.5.
The boxplot further supports this pattern, showing that the median CGPA for placed
students is higher than that of non-placed students.
Both groups contain some outliers. For non-placed students, CGPA ranges from
approximately 4.0 to above 9.0, while for placed students the range extends from
about 4.7 to 10.0. These outliers are retained, as they appear to be legitimate
observations rather than errors: a high CGPA does not guarantee job placement, and
real-world hiring processes consider multiple factors beyond academic performance.