DATA MINING
Data Preprocessing

Contents
◼ Data Preprocessing: An Overview
◼ Data Quality
◼ Major Tasks in Data Preprocessing
◼ Data Cleaning
◼ Data Integration
◼ Data Reduction
◼ Data Transformation
◼ Data Discretization

Data Quality: Why Preprocess the Data?
◼ Measures for data quality: A multidimensional view
◼ Accuracy: correct or wrong, accurate or not
◼ Completeness: not recorded, unavailable, …
◼ Consistency: some modified but some not, …
◼ Timeliness: timely update?
◼ Believability: how far can the data be trusted to be correct?
◼ Interpretability: how easily the data can be understood?
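
Of the measures above, completeness is the easiest to quantify. The following is a minimal sketch, assuming that a missing value is stored as `None` and that completeness is taken to be the fraction of records with a recorded value for a field. The `records` data and the `completeness` helper are hypothetical and serve only as illustration.

```python
# Hypothetical records; None marks a value that was not recorded.
records = [
    {"name": "Ann", "age": 34, "city": "Oslo"},
    {"name": "Bob", "age": None, "city": "Oslo"},
    {"name": "Cy", "age": 51, "city": None},
    {"name": "Dee", "age": None, "city": "Rome"},
]


def completeness(rows, field):
    """Return the fraction of rows in which `field` has a non-None value."""
    if not rows:
        return 0.0
    recorded = sum(1 for row in rows if row.get(field) is not None)
    return recorded / len(rows)


# One completeness score per attribute (an assumed, simplified definition).
report = {field: completeness(records, field) for field in ("name", "age", "city")}
print(report)
```

A low score for an attribute suggests that it needs attention during data cleaning before it is used for mining.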

Major Tasks in Data Preprocessing
◼ Data cleaning
◼ Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
◼ Data integration
◼ Integration of multiple databases, data cubes, or files
◼ Data reduction
◼ Dimensionality reduction
◼ Numerosity reduction
◼ Data compression
◼ Data transformation and data discretization
◼ Normalization
◼ Concept hierarchy generation