Cross-validation is a technique in machine learning used to assess the
generalization performance of a model. It divides the dataset into subsets,
trains the model on some of these subsets, and evaluates it on others. The
most common type is k-fold cross-validation, where the data is split into
k subsets or "folds."
Here’s how it works:
1. Split the dataset into k equally sized folds.
2. For each fold:
   - Train the model on the other k-1 folds.
   - Test the model on the held-out fold.
3. Average the performance over the k trials to estimate the model's accuracy.
This process ensures the performance estimate does not depend on one particular
train/test split, giving a more reliable measure of generalization and helping
to detect overfitting.
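The steps above can be sketched in plain NumPy. The `train_and_score` callback and the toy majority-class model below are illustrative stand-ins for whatever model you actually train:

```python
import numpy as np

def k_fold_cross_validation(X, y, k, train_and_score):
    """Shuffle the data, split it into k folds, and average the
    held-out score over the k train/test rounds."""
    indices = np.arange(len(X))
    rng = np.random.default_rng(0)  # fixed seed for reproducibility
    rng.shuffle(indices)
    folds = np.array_split(indices, k)

    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(train_and_score(X[train_idx], y[train_idx],
                                      X[test_idx], y[test_idx]))
    return float(np.mean(scores))

# Toy "model": always predict the majority class seen in training.
def majority_score(X_tr, y_tr, X_te, y_te):
    majority = np.bincount(y_tr).argmax()
    return float(np.mean(y_te == majority))

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 70 + [1] * 30)
avg = k_fold_cross_validation(X, y, k=5, train_and_score=majority_score)
print(avg)  # 0.7 — each sample is tested exactly once
```

Because every sample lands in exactly one test fold, the averaged score here equals the overall fraction of the majority class, as expected for this trivial model.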
Numerical Example 1: 5-Fold Cross-Validation
Suppose we have a dataset with 1000 data points and want to apply 5-fold
cross-validation.
1. Split the dataset into 5 equal parts (200 samples in each fold).
2. For each iteration:
   - Fold 1: Train on folds 2, 3, 4, 5; Test on fold 1.
   - Fold 2: Train on folds 1, 3, 4, 5; Test on fold 2.
   - Fold 3: Train on folds 1, 2, 4, 5; Test on fold 3.
   - Fold 4: Train on folds 1, 2, 3, 5; Test on fold 4.
   - Fold 5: Train on folds 1, 2, 3, 4; Test on fold 5.
3. The performance is averaged across these 5 tests to get an overall accuracy.
Example Performance:
Fold 1 accuracy: 85%
Fold 2 accuracy: 87%
Fold 3 accuracy: 84%
Fold 4 accuracy: 88%
Fold 5 accuracy: 86%
Average Accuracy = (85 + 87 + 84 + 88 + 86) / 5 = 86%
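The averaging step is just the arithmetic mean of the per-fold scores, which is easy to verify:

```python
# Per-fold accuracies (%) from the example above.
fold_accuracies = [85, 87, 84, 88, 86]
average_accuracy = sum(fold_accuracies) / len(fold_accuracies)
print(average_accuracy)  # 86.0
```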
Numerical Example 2: Leave-One-Out Cross-Validation (LOOCV)