Unit 4: Classification and Clustering
Classification vs. Prediction
Both classification and prediction are data mining techniques used in
supervised learning, where a model is built using a known set of data
(training data). However, they serve slightly different purposes:
Feature            | Classification                                           | Prediction
Objective          | Assign data to predefined discrete categories or classes | Forecast or estimate continuous values
Output Type        | Categorical (e.g., Yes/No, A/B/C)                        | Numerical (e.g., sales figures, temperature)
Example            | Classifying emails as “spam” or “not spam”               | Predicting house prices based on features
Algorithms Used    | Decision Trees, Naive Bayes, SVM, k-NN, Neural Networks  | Linear Regression, Regression Trees, Neural Networks
Training Data      | Requires labeled data with known classes                 | Requires labeled data with known numeric outcomes
Evaluation Metrics | Accuracy, Precision, Recall, F1 Score                    | Mean Squared Error (MSE), Root Mean Squared Error (RMSE)
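The evaluation metrics in the comparison above can be computed by hand on a small example. The following is a minimal sketch in plain Python; the function names, the email labels, and the price values are all made up for illustration and are not taken from any real dataset or library.

```python
import math

def classification_metrics(y_true, y_pred, positive="spam"):
    """Accuracy, precision, recall and F1 for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

def regression_metrics(y_true, y_pred):
    """Mean squared error and its square root for numeric predictions."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return {"mse": mse, "rmse": math.sqrt(mse)}

# Hypothetical classifier output: categorical labels.
email_true = ["spam", "spam", "not spam", "not spam", "spam", "not spam"]
email_pred = ["spam", "not spam", "not spam", "spam", "spam", "not spam"]
clf_scores = classification_metrics(email_true, email_pred)

# Hypothetical regressor output: house prices in thousands.
price_true = [200.0, 250.0, 300.0, 350.0]
price_pred = [210.0, 240.0, 310.0, 330.0]
reg_scores = regression_metrics(price_true, price_pred)

print(clf_scores)
print(reg_scores)
```

Classification metrics count how often the predicted category matches the true one, while regression metrics measure how far the predicted number is from the true one. This is why the two tasks cannot share the same metrics.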
Supervised Learning
Supervised learning is a type of machine learning where the model is
trained on a labeled dataset — that is, each input data point is paired with
a known output (label). The goal is for the model to learn a mapping from
inputs to outputs so it can make accurate predictions on new, unseen data.
Key Components:
Component     | Description
Training Data | Data with input-output pairs (features + labels).
Model         | The algorithm that learns from the training data (e.g., Decision Tree).
Prediction    | The output the model gives for new inputs after training.
Feedback      | The model’s predictions are compared to actual outputs to improve learning.
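The four components can be seen working together in a small training loop. The sketch below uses a perceptron, chosen here only because its feedback step is easy to read (the notes name a Decision Tree as their example model). The features (count of suspicious words, count of links) and all data values are hypothetical.

```python
# Training data: hypothetical (suspicious_words, links) -> 1 = spam, 0 = not spam.
training_data = [
    ((0, 0), 0), ((1, 0), 0), ((0, 1), 0),
    ((3, 2), 1), ((2, 3), 1), ((4, 4), 1),
]

def predict(weights, bias, features):
    """Prediction: the model's output for one input."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else 0

def train_perceptron(data, epochs=20, learning_rate=1.0):
    """Model: a linear classifier adjusted whenever it makes a mistake."""
    weights = [0.0] * len(data[0][0])
    bias = 0.0
    mistakes_per_epoch = []
    for _ in range(epochs):
        mistakes = 0
        for features, label in data:
            # Feedback: compare the prediction with the known label.
            error = label - predict(weights, bias, features)
            if error != 0:
                mistakes += 1
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, features)]
                bias += learning_rate * error
        mistakes_per_epoch.append(mistakes)
        if mistakes == 0:  # every training example is now classified correctly
            break
    return weights, bias, mistakes_per_epoch

weights, bias, mistakes_per_epoch = train_perceptron(training_data)

# A new, unseen input (also hypothetical).
new_email = (5, 3)
new_prediction = predict(weights, bias, new_email)
print(mistakes_per_epoch, new_prediction)
```

The number of mistakes per pass over the data falls to zero on this toy set because the two classes are linearly separable. On real data the feedback loop usually reduces the error without eliminating it.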
Types of Supervised Learning:
1. Classification
o Predicts a category/class label
o Example: Email → Spam or Not Spam
2. Regression (Prediction)
o Predicts a continuous numeric value
o Example: Predicting house prices
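The difference between the two types lies in the kind of output, not necessarily in the algorithm. As a rough illustration, the sketch below applies the same k-nearest-neighbors idea to both tasks: a majority vote gives a class label, and an average gives a numeric value. The house data, the "budget"/"premium" labels, and the function names are hypothetical, and the features are left unscaled for brevity.

```python
from collections import Counter

# Hypothetical houses: (area in m^2, bedrooms), class label, price in thousands.
houses = [
    ((50, 1), "budget", 100.0),
    ((60, 2), "budget", 120.0),
    ((70, 2), "budget", 140.0),
    ((120, 3), "premium", 300.0),
    ((140, 4), "premium", 340.0),
    ((160, 4), "premium", 380.0),
]

def nearest_neighbors(query, k):
    """Return the k training rows closest to the query (squared Euclidean distance)."""
    def sq_dist(features):
        return sum((a - b) ** 2 for a, b in zip(features, query))
    return sorted(houses, key=lambda row: sq_dist(row[0]))[:k]

def classify(query, k=3):
    """Classification: majority vote over the neighbors' class labels."""
    labels = [label for _, label, _ in nearest_neighbors(query, k)]
    return Counter(labels).most_common(1)[0][0]

def regress(query, k=3):
    """Regression: mean of the neighbors' numeric targets."""
    prices = [price for _, _, price in nearest_neighbors(query, k)]
    return sum(prices) / len(prices)

new_house = (65, 2)
predicted_class = classify(new_house)
predicted_price = regress(new_house)
print(predicted_class, predicted_price)
```

For the new house, the three nearest neighbors are all labeled "budget", so the classifier returns that category, while the regressor returns the mean of their three prices.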
Popular Algorithms:
Classification: Decision Trees, Naive Bayes, Support Vector
Machine (SVM), k-Nearest Neighbors (k-NN), Logistic Regression
Regression: Linear Regression, Ridge Regression, Regression
Trees
Advantages:
High accuracy when sufficient labeled data is available
Easy to evaluate against known labels; simpler models (e.g., decision trees) are also easy to interpret
Models can be used for real-time decision making
Disadvantages:
Requires large amounts of labeled data
Can overfit if the model is too complex
Not suitable for discovering hidden patterns without labels
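Overfitting can be made visible on a toy problem. In the sketch below, a 1-nearest-neighbor classifier memorizes the training set, including two deliberately mislabeled points, while a 3-nearest-neighbor classifier smooths over them. The data is hypothetical, and the test points are intentionally placed near the noisy training points so that the effect shows up clearly.

```python
from collections import Counter

# Hypothetical 1-D data. True rule: class 1 if x >= 5, else 0.
# The training labels at x = 3 and x = 7 are deliberately wrong (label noise).
train = [(1, 0), (2, 0), (3, 1), (4, 0), (5, 1), (6, 1), (7, 0), (8, 1), (9, 1)]
# Test labels follow the true rule.
test = [(1.5, 0), (2.6, 0), (3.4, 0), (6.6, 1), (7.4, 1), (8.5, 1)]

def knn_predict(x, k):
    """Majority vote among the k training points closest to x."""
    neighbors = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def accuracy(dataset, k):
    """Fraction of points in the dataset that are classified correctly."""
    hits = sum(1 for x, label in dataset if knn_predict(x, k) == label)
    return hits / len(dataset)

train_acc_k1 = accuracy(train, k=1)  # k = 1 reproduces every training label
test_acc_k1 = accuracy(test, k=1)
train_acc_k3 = accuracy(train, k=3)
test_acc_k3 = accuracy(test, k=3)

print("k=1 train/test:", train_acc_k1, test_acc_k1)
print("k=3 train/test:", train_acc_k3, test_acc_k3)
```

With k = 1 the training accuracy is perfect, yet the test accuracy is poor, because the model has fitted the noise. With k = 3 the model fits the training labels less closely but generalizes better on this particular test set.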
Applications:
Spam detection
Medical diagnosis
Credit scoring
Sales forecasting
Image and speech recognition