Unit 2: Data Preprocessing
What is Data Preprocessing?
Data preprocessing is a crucial step in the data mining and machine
learning process that involves transforming raw data into a clean and
organized format suitable for analysis. Real-world data is often
incomplete, inconsistent, or noisy; preprocessing improves data quality,
which in turn supports better model performance and more accurate
results.
Tasks in Data Preprocessing
Data preprocessing consists of six main tasks, each aimed at improving
data quality and making the data suitable for analysis:
1. Data Cleaning
Handling missing values: Replace with mean/median/mode, use
interpolation, or remove records.
Removing duplicates: Identify and eliminate redundant data.
Correcting errors: Fix inconsistencies or typos in data entries.
Noise removal: Smooth noisy data using techniques like binning,
regression, or clustering.
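A minimal sketch of three of the cleaning steps above (mean imputation, duplicate removal, and smoothing by bin means), assuming the data fits in plain Python lists. The function names and sample values are illustrative and not part of the original notes.

```python
from statistics import mean

def impute_mean(values):
    """Replace None entries with the mean of the observed values.

    Assumes at least one value is present."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def drop_duplicates(records):
    """Remove repeated records while keeping first-seen order.

    Assumes records are hashable (e.g., tuples)."""
    seen = set()
    unique = []
    for record in records:
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique

def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into equal-frequency bins,
    and replace every value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed

ages = impute_mean([20, None, 40])
rows = drop_duplicates([("a", 1), ("b", 2), ("a", 1)])
prices = smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], bin_size=3)
```

Mean imputation is only one of the options listed; median or mode imputation would follow the same pattern with a different fill value.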
2. Data Integration
Combining data from multiple sources: Merge datasets from
different databases or formats.
Schema alignment: Match and unify different attribute names and
types.
Handling data conflicts: Resolve inconsistencies across data
sources.
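The integration steps above can be sketched as follows, assuming two hypothetical sources (a CRM export and a billing export) that describe the same customers under different attribute names. The conflict rule used here, keeping the primary source's value unless it is missing, is one possible policy chosen for illustration.

```python
# Two hypothetical sources with different schemas for the same entity.
crm_rows = [
    {"cust_id": 1, "name": "Ana", "city": "Pune"},
    {"cust_id": 2, "name": "Raj", "city": None},
]
billing_rows = [
    {"customer_id": 2, "city": "Delhi", "balance": 120.0},
    {"customer_id": 3, "city": "Goa", "balance": 75.5},
]

# Schema alignment: map source-specific attribute names to one unified name.
BILLING_RENAMES = {"customer_id": "cust_id"}

def align(row, renames):
    """Return a copy of the row with attribute names unified."""
    return {renames.get(key, key): value for key, value in row.items()}

def integrate(primary, secondary, key):
    """Merge rows that share a key.

    On conflict the primary value wins unless it is missing (None)."""
    merged = {row[key]: dict(row) for row in primary}
    for row in secondary:
        target = merged.setdefault(row[key], {})
        for field, value in row.items():
            if target.get(field) is None:
                target[field] = value
    return [merged[k] for k in sorted(merged)]

aligned_billing = [align(row, BILLING_RENAMES) for row in billing_rows]
customers = integrate(crm_rows, aligned_billing, key="cust_id")
```

After the merge, customer 2 takes its city from the billing source because the CRM value was missing, and customer 3 appears even though only one source knows about it.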
3. Data Transformation
Normalization/Scaling: Adjust data to a common scale (e.g.,
Min-Max scaling, Z-score normalization).
Encoding categorical data: Convert categories into numerical
values (e.g., one-hot encoding, label encoding).
Attribute/Feature construction: Create new relevant features from
existing data.
Aggregation: Summarize data (e.g., monthly sales from daily data).
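A minimal sketch of the transformations above using only the Python standard library. The helper names, the use of the population standard deviation for the Z-score, and the `YYYY-MM-DD` date format are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values to [0, 1]. Assumes at least two distinct values."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Centre on the mean and divide by the population standard deviation.

    Assumes the values are not all identical."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def label_encode(categories):
    """Map each category to an integer code (alphabetical order)."""
    codes = {c: i for i, c in enumerate(sorted(set(categories)))}
    return [codes[c] for c in categories]

def one_hot_encode(categories):
    """Return one 0/1 indicator per category level (alphabetical order)."""
    levels = sorted(set(categories))
    return [[1 if c == level else 0 for level in levels] for c in categories]

def monthly_totals(daily_sales):
    """Aggregate (date, amount) pairs into totals per 'YYYY-MM' month."""
    totals = defaultdict(float)
    for date, amount in daily_sales:
        totals[date[:7]] += amount
    return dict(totals)

scaled = min_max_scale([10, 20, 30])
standardized = z_score([2, 4, 4, 4, 5, 5, 7, 9])
colour_codes = label_encode(["red", "green", "red", "blue"])
colour_flags = one_hot_encode(["red", "green", "red", "blue"])
sales = monthly_totals([("2024-01-05", 100.0), ("2024-01-20", 50.0), ("2024-02-01", 30.0)])
```

Label encoding implies an ordering between categories, so one-hot encoding is usually the safer choice for nominal attributes such as colour.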
4. Data Reduction
Dimensionality reduction: Reduce the number of variables (e.g.,
PCA, feature selection).
Numerosity reduction: Replace the original data with a smaller
representation that preserves most of the information (e.g.,
histograms, clustering).
Sampling: Select a representative subset of data for faster
processing.
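The reduction ideas above can be sketched in plain Python. PCA is omitted because it needs linear-algebra support; a simple variance filter stands in here as one basic form of feature selection. All names, thresholds, and sample values are illustrative assumptions.

```python
import random
from statistics import pvariance

def sample_rows(rows, k, seed=0):
    """Simple random sampling without replacement (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(rows, k)

def drop_low_variance(columns, threshold):
    """Feature selection: keep only columns whose variance exceeds the threshold.

    columns maps an attribute name to its list of values."""
    return {name: vals for name, vals in columns.items()
            if pvariance(vals) > threshold}

def histogram(values, bin_width):
    """Numerosity reduction: store one count per equal-width bucket
    instead of every individual value."""
    counts = {}
    for v in values:
        start = (v // bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))

subset = sample_rows(list(range(100)), k=5)
selected = drop_low_variance(
    {"constant": [1, 1, 1, 1], "height": [150, 160, 170, 180]},
    threshold=0.0,
)
buckets = histogram([1, 3, 5, 12, 14, 27], bin_width=10)
```

The histogram keeps three counts in place of six values, which illustrates the trade-off: the representation is smaller, but the exact values inside each bucket are no longer recoverable.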
5. Data Discretization
Converting continuous data into intervals or categories (e.g., age
ranges: 0–18, 19–35, etc.).
Supervised or unsupervised binning methods can be used.
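A short sketch of two unsupervised discretization approaches: fixed age ranges like those in the example above, and equal-width binning. Only the 0–18 and 19–35 ranges come from the notes; the remaining ranges and all names are assumptions added for illustration.

```python
from bisect import bisect_left

# Inclusive upper bounds of each interval; the last label is open-ended.
# The 36-60 and 61+ ranges are assumed, not taken from the notes.
UPPER_BOUNDS = [18, 35, 60]
LABELS = ["0-18", "19-35", "36-60", "61+"]

def discretize_age(age):
    """Map a non-negative age to its interval label."""
    return LABELS[bisect_left(UPPER_BOUNDS, age)]

def equal_width_bins(values, n_bins):
    """Assign each value a bin index from 0 to n_bins - 1 using
    intervals of equal width. Assumes at least two distinct values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

age_groups = [discretize_age(a) for a in [0, 18, 19, 35, 36, 61]]
bin_ids = equal_width_bins([0, 5, 10, 15, 20], n_bins=2)
```

Supervised binning would instead choose the cut points using the class labels, which this sketch does not attempt.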
6. Data Binarization
Convert numerical or categorical data into binary form.
o Example: Convert “Gender” (Male/Female) into 0 and 1.
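Binarization can be sketched as a threshold test for numerical data and a lookup for a two-valued category. The threshold value and the assignment of 0 and 1 to each label are arbitrary choices made for this example.

```python
def binarize(values, threshold):
    """Return 1 for values above the threshold and 0 otherwise."""
    return [1 if v > threshold else 0 for v in values]

# Which label receives 0 and which receives 1 is an assumed convention.
GENDER_CODES = {"Male": 0, "Female": 1}

def binarize_gender(labels):
    """Map each label to its binary code."""
    return [GENDER_CODES[label] for label in labels]

passed = binarize([35, 50, 72, 49], threshold=49)
gender_bits = binarize_gender(["Male", "Female", "Female"])
```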
Reasons for Missing Values & Noisy Data
Missing values occur when no data value is stored for a variable in an
observation. Common reasons include:
1. Human Error
o Data entry mistakes or omissions by users or operators.