MACHINE LEARNING LECTURE NOTES
UNIT – IV
Model Validation in Classification: Cross Validation - Holdout Method, K-Fold, Stratified K-Fold,
Leave-One-Out Cross Validation. Bias-Variance Tradeoff, Regularization, Overfitting, Underfitting.
Ensemble Methods: Boosting, Bagging, Random Forest.
Cross-Validation in Machine Learning
Cross-validation is a technique for estimating how well a model performs by training it on a subset of the
input data and testing it on a previously unseen subset of that data. In other words, it is a technique to
check how well a statistical model generalizes to an independent dataset.
In machine learning, there is always a need to test the stability of the model. A good fit to the training
dataset alone does not tell us how the model will behave on new data. For this purpose, we reserve a
sample of the dataset that is not used for training and test the model on that sample before deployment.
This complete process comes under cross-validation. It differs from a single train-test split in that the
split is usually repeated, so that different parts of the data take turns as the validation set.
Hence the basic steps of cross-validation are:
o Reserve a subset of the dataset as a validation set.
o Train the model using the training dataset.
o Evaluate model performance using the validation set. If the model performs well on the
validation set, proceed to the next step; otherwise, check for issues.
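The three steps above can be sketched in plain Python. This is only an illustrative sketch: the toy data, the helper names (holdout_split, fit_centroids, predict, accuracy) and the 1-D nearest-centroid model are assumptions made for the example and do not come from these notes.

```python
import random

def holdout_split(data, val_fraction=0.25, seed=0):
    """Shuffle a copy of the data and reserve the last part for validation."""
    rows = list(data)
    random.Random(seed).shuffle(rows)
    n_val = max(1, int(len(rows) * val_fraction))
    return rows[:-n_val], rows[-n_val:]

def fit_centroids(train):
    """Toy model (assumed for illustration): mean feature value per class."""
    sums, counts = {}, {}
    for x, y in train:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(centroids, x):
    """Predict the class whose centroid is closest to x."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))

def accuracy(centroids, rows):
    """Fraction of rows that are classified correctly."""
    return sum(predict(centroids, x) == y for x, y in rows) / len(rows)

# Hypothetical toy data: class 0 for x in 0..9, class 1 for x in 10..19.
data = [(float(i), 0 if i < 10 else 1) for i in range(20)]

train, val = holdout_split(data)   # step 1: reserve a validation set
model = fit_centroids(train)       # step 2: train on the remaining data
val_acc = accuracy(model, val)     # step 3: evaluate on the held-out rows
print(f"train={len(train)} val={len(val)} validation accuracy={val_acc:.2f}")
```

The validation rows never reach fit_centroids, which is what makes the reported accuracy an estimate of performance on unseen data.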
The data needs to be split into:
Training data: used for model development
Validation data: used for validating the performance of the same model
BY
B SARITHA
Extended version of Cross validation
There is always a need to validate the stability of your machine learning model. You cannot just fit the
model to your training data and hope that it will work accurately on real data it has never seen
before. You need some assurance that your model has captured most of the patterns in the data
correctly and is not picking up too much of the noise, or in other words, that it is low on both bias and variance.
Validation
Validation is the process of deciding whether the numerical results quantifying hypothesized relationships
between variables are acceptable as descriptions of the data. Generally, an error estimate for the model
is made after training, better known as evaluation of residuals. In this process, we compute a numerical
estimate of the difference between the predicted and original responses, also called the training error.
However, this only tells us how well the model does on the data used to train it. It is still
possible that the model is underfitting or overfitting the data. So, the problem with this evaluation
technique is that it does not indicate how well the learner will generalize to an independent,
unseen data set. Estimating this generalization performance is the purpose of Cross Validation.
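A small sketch can show why the training error alone is misleading. It assumes a 1-nearest-neighbour classifier on hypothetical 1-D data with two deliberately flipped labels; the model, the data and the helper names are illustrative choices, not something prescribed by these notes.

```python
def nearest_label(train, x):
    """1-nearest-neighbour prediction for 1-D (x, label) pairs.

    On a distance tie, min() returns the first candidate in the list.
    """
    return min(train, key=lambda row: abs(row[0] - x))[1]

def error_rate(train, rows):
    """Fraction of rows whose predicted label differs from the true label."""
    return sum(nearest_label(train, x) != y for x, y in rows) / len(rows)

# Hypothetical data: label is 1 for x >= 10, with two labels flipped as noise.
data = [(float(i), int(i >= 10)) for i in range(20)]
data[4] = (4.0, 1)
data[13] = (13.0, 0)

train = data[0::2]   # even x values are used for training
val = data[1::2]     # odd x values are held out

train_error = error_rate(train, train)
val_error = error_rate(train, val)
print("training error:  ", train_error)
print("validation error:", val_error)
```

Each training point is its own nearest neighbour, so the training error is zero even though the model has memorized a noisy label. Only the held-out points reveal that it does not generalize perfectly.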
Methods used for Cross-Validation
Some common methods used for cross-validation are given below:
Leave-P-out cross-validation
Leave one out cross-validation
K-fold cross-validation
Stratified k-fold cross-validation
Holdout Method
Leave-P-out cross-validation
This approach holds p data points out of the training data: if there are n data points in the original sample,
then n-p points are used to train the model and the remaining p points are used as the validation set. This is
repeated for every combination in which the original sample can be split this way, and the error is then
averaged over all trials to give the overall effectiveness.
This method is exhaustive in the sense that it needs to train and validate the model for all C(n, p) possible
combinations, so even for moderately large p it can become computationally infeasible.
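The procedure and its cost can be sketched with the standard library. The example assumes a toy "model" that simply predicts the mean of the training values; that model and the helper names are assumptions for illustration only.

```python
from itertools import combinations
from math import comb

def leave_p_out_splits(n, p):
    """Yield (train_indices, validation_indices) for every way of holding out p of n points."""
    all_idx = set(range(n))
    for val in combinations(range(n), p):
        yield sorted(all_idx - set(val)), list(val)

def mean_predictor_error(values, p):
    """Average squared validation error of a toy model that predicts the training mean."""
    errors = []
    for train, val in leave_p_out_splits(len(values), p):
        mean = sum(values[i] for i in train) / len(train)
        errors.append(sum((values[i] - mean) ** 2 for i in val) / len(val))
    return sum(errors) / len(errors)

values = [2.0, 4.0, 6.0, 8.0]
splits = list(leave_p_out_splits(len(values), 2))
print(len(splits), "splits for n=4, p=2")   # C(4, 2) = 6
print("average error:", mean_predictor_error(values, 2))

# The number of models to train grows combinatorially with p.
for p in (1, 2, 5, 10):
    print(f"n=100, p={p}: {comb(100, p)} splits")
```

Even with only 100 data points, the printed counts show how quickly the number of required model fits grows as p increases.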
A particular case of this method is p = 1, known as Leave one out cross validation. This
method is generally preferred over the general case because it is far less computationally intensive:
the number of possible combinations equals the number of data points in the original sample, n.
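A minimal sketch of leave one out cross validation follows, assuming a simple least-squares line as the model and a handful of made-up (x, y) points. The helper names fit_line and loocv_squared_errors are hypothetical.

```python
def fit_line(points):
    """Ordinary least squares for y = a + b*x on a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def loocv_squared_errors(points):
    """Hold out each point once, fit on the remaining n - 1, and record its squared error."""
    errors = []
    for i, (x, y) in enumerate(points):
        intercept, slope = fit_line(points[:i] + points[i + 1:])
        errors.append((y - (intercept + slope * x)) ** 2)
    return errors

# Hypothetical, roughly linear data.
points = [(0.0, 1.1), (1.0, 2.9), (2.0, 5.2), (3.0, 7.1), (4.0, 8.8)]
errors = loocv_squared_errors(points)
print(len(errors), "models trained for n =", len(points))
print("LOOCV mean squared error:", sum(errors) / len(errors))
```

Exactly n models are trained, one per held-out point, which is why the cost grows only linearly with the size of the sample.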
Cross Validation is a very useful technique for assessing the effectiveness of your model, particularly in
cases where you need to mitigate overfitting. It is also useful for choosing the hyperparameters of your
model, that is, finding which values result in the lowest estimated test error. These are the basics you
need to get started with cross validation. All of these validation techniques are available in Scikit-Learn,
which gets you up and running with just a few lines of Python code.
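As a hedged illustration of hyperparameter selection with Scikit-Learn, the sketch below compares a few values of n_neighbors for a k-nearest-neighbours classifier using stratified 5-fold cross validation. The choice of the iris dataset, the classifier and the candidate values are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Use the same stratified folds for every candidate so the scores are comparable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

mean_scores = {}
for k in (1, 5, 15):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=cv)
    mean_scores[k] = scores.mean()
    print(f"n_neighbors={k}: mean accuracy {mean_scores[k]:.3f}")

# Pick the candidate with the highest mean cross-validated accuracy.
best_k = max(mean_scores, key=mean_scores.get)
print("selected n_neighbors:", best_k)
```

The same pattern should work with the other splitters in sklearn.model_selection, such as KFold, LeaveOneOut or LeavePOut, by passing a different object as cv.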