Unit–V
Advanced learning: Sampling - Basic sampling methods - Monte Carlo - Reinforcement
Learning - K-Armed Bandit Elements - Model Based Learning - Value Iteration - Policy
Iteration - Temporal Difference Learning - Exploration Strategies - Deterministic and
Non-deterministic Rewards and Actions - Eligibility Traces - Generalization - Partially
Observable States - The Setting - Example - Semi-Supervised Learning - Computational
Learning Theory - Mistake bound analysis - Sample complexity analysis - VC dimension -
Occam learning - Accuracy and confidence boosting.
Sampling - Basic sampling methods
In machine learning, sampling methods play a critical role in model training and
evaluation, and they are integral to building robust, efficient, and accurate models.
The choice of sampling technique depends on the specific problem, the nature of the
data, and the goals of the analysis. Common sampling techniques include:
1. Random Sampling
Randomly selecting a subset of data from the entire dataset. This is commonly used to
create training and test sets or to perform cross-validation.
Advantages: Simple and effective for creating unbiased samples.
Disadvantages: May not always capture the full complexity of the data distribution.
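As a minimal sketch, such a random split can be done with scikit-learn's train_test_split; the synthetic dataset from make_classification and the 80/20 split ratio are illustrative assumptions, not part of the material above.
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Illustrative synthetic dataset: 1000 instances, 2 classes, 20 features.
X, y = make_classification(n_samples=1000, n_classes=2, random_state=42)

# Randomly hold out 20% of the data as a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)  # (800, 20) (200, 20)
```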
2. Stratified Sampling
Ensures that each class (or stratum) is represented proportionally in the sample. This is
particularly useful for imbalanced datasets where some classes are underrepresented.
Advantages: Maintains the class distribution in the sample, which helps in creating
more balanced training and test sets.
Disadvantages: Requires knowledge of class distributions and can be more complex to
implement.
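A minimal sketch of a stratified split, again on an assumed synthetic dataset; passing stratify=y is what preserves the class proportions in both splits.
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Illustrative imbalanced dataset: roughly 90% / 10% class split.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# stratify=y keeps the ~90/10 class ratio in both the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print(np.bincount(y_train) / len(y_train))  # ~[0.9, 0.1]
print(np.bincount(y_test) / len(y_test))    # ~[0.9, 0.1]
```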
3. Oversampling and Undersampling
Oversampling: Increasing the number of instances in the minority class by duplicating
or generating new instances (e.g., using SMOTE - Synthetic Minority Over-sampling
Technique).
Undersampling: Reducing the number of instances in the majority class to balance the
dataset.
Advantages: Helps address class imbalance, leading to better model performance on
minority classes.
Disadvantages: Oversampling can lead to overfitting, while undersampling may discard
useful data.
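A minimal sketch of both resampling directions, assuming the third-party imbalanced-learn package (imblearn) is installed to provide SMOTE and random undersampling; the 90/10 synthetic dataset is an illustrative assumption.
```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE              # assumes imbalanced-learn is installed
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print("original:", Counter(y))

# Oversampling: SMOTE synthesizes new minority-class instances.
X_over, y_over = SMOTE(random_state=42).fit_resample(X, y)
print("after SMOTE:", Counter(y_over))

# Undersampling: randomly drop majority-class instances instead.
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
print("after undersampling:", Counter(y_under))
```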
4. Cross-Validation
The dataset is split into multiple folds (e.g., 5 or 10), and the model is trained and
evaluated multiple times, each time using a different fold as the test set and the
remaining folds as the training set.
Advantages: Provides a more robust estimate of model performance by using multiple
training/testing splits.
Disadvantages: Computationally expensive and time-consuming, especially with large
datasets and complex models.
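A minimal 5-fold cross-validation sketch with scikit-learn; the LogisticRegression model and the synthetic dataset are illustrative choices, not prescribed by the material above.
```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=42)
model = LogisticRegression(max_iter=1000)

# 5-fold CV: the model is trained and evaluated 5 times,
# each fold serving once as the test set.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())
```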
5. Bootstrapping
Generating multiple datasets by sampling with replacement from the original dataset.
Each bootstrap sample is used to train the model, and predictions are aggregated.
Advantages: Useful for estimating the distribution of a statistic and assessing model
stability.
Disadvantages: Can lead to biased estimates if not managed properly; computationally
intensive.
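A minimal sketch of the train-and-aggregate use of bootstrapping described above, using plain NumPy resampling with decision trees as illustrative base models (in effect a small bagging ensemble); the 50 resampling rounds are an arbitrary choice.
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rng = np.random.default_rng(42)
n = len(X_train)
preds = []
for _ in range(50):                               # 50 bootstrap rounds (illustrative)
    idx = rng.choice(n, size=n, replace=True)     # sample n indices with replacement
    tree = DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx])
    preds.append(tree.predict(X_test))

# Aggregate predictions by majority vote across the bootstrap models.
vote = (np.mean(preds, axis=0) >= 0.5).astype(int)
print("bagged accuracy:", (vote == y_test).mean())
```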
6. Leave-One-Out Cross-Validation (LOOCV)
A special case of k-fold cross-validation where k equals the number of instances in
the dataset. Each instance is used once as the test set while the remaining instances
form the training set.
Advantages: Maximizes the training data for each model iteration, providing a thorough
evaluation.
Disadvantages: Computationally expensive for large datasets due to the need to train
the model as many times as there are instances.
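A minimal LOOCV sketch via scikit-learn's LeaveOneOut splitter; the 100-instance synthetic dataset is deliberately small, since LOOCV trains one model per instance.
```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small illustrative dataset: LOOCV fits 100 models here, one per instance.
X, y = make_classification(n_samples=100, random_state=42)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("LOOCV accuracy:", scores.mean())  # fraction of held-out points predicted correctly
```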