CHAPTER-1
1.1 Introduction to Data Mining
1.2 Data Mining-Definition and Functionalities
1.3 Classification of Data mining systems
1.4 Data mining Architecture
1.5 Data Mining: KDD Process
1.6 Major Issues in Data Mining
1.7 Applications of Data Mining
Introduction to Data Mining
Data mining has been defined in several closely related ways:
• Extraction of implicit, previously unknown and potentially useful
information from data
• Exploration and analysis, by automatic or semi-automatic means, of
large quantities of data in order to discover meaningful patterns
• Extraction of interesting (non-trivial, implicit, previously unknown and
potentially useful) patterns or knowledge from huge amounts of data
Data mining is also known as knowledge discovery (mining) in databases
(KDD), knowledge extraction, data/pattern analysis, data archaeology, data
dredging, information harvesting, business intelligence, etc.
Because data is growing at a remarkable rate, there is a need to analyse
large, complex and information-rich data sets to uncover the information
hidden in them. Doing so may result in greater customer satisfaction and
higher turnover for the firm.
Why Do We Need Data Mining?
• Lots of data is being collected and warehoused
- Web data, e-commerce
- purchases at department/ grocery stores
- Bank/Credit Card transactions
• Computers have become cheaper and more powerful
• Competitive pressure is strong
- Provide better, customized services for an edge (e.g. in Customer
Relationship Management)
Data Mining Functionality
Concept description: Characterization and discrimination
- Generalize, summarize, and contrast data characteristics, e.g., dry
vs. wet regions
Association (correlation and causality):
- Multi-dimensional vs. single-dimensional association
- age(X, "20..29") ^ income(X, "20..29K") -> buys(X, "PC") [support
= 2%, confidence = 60%]
- contains(T, "computer") -> contains(T, "software") [support = 1%,
confidence = 75%]
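The two numbers attached to each rule can be computed directly from transaction data. Support is the fraction of all transactions that contain both sides of the rule, and confidence is the fraction of transactions containing the left-hand side that also contain the right-hand side. The following is a minimal sketch in Python; the transaction list and the helper names support and confidence are hypothetical and chosen only for illustration.

```python
# Hypothetical toy transactions; the item names are illustrative only.
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Fraction of transactions with the antecedent that also hold the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# Rule: contains(T, "computer") -> contains(T, "software")
rule_support = support({"computer", "software"}, transactions)
rule_confidence = confidence({"computer"}, {"software"}, transactions)
print(f"support = {rule_support:.0%}, confidence = {rule_confidence:.0%}")
```

On this toy data, two of the four transactions contain both items, and two of the three transactions containing "computer" also contain "software".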
Classification and Prediction:
- Finding models (functions) that describe and distinguish classes or
concepts for future prediction
- E.g., classify countries based on climate, or classify cars based on
gas mileage
- Presentation: decision-tree, classification rule, neural network
- Prediction: Predict some unknown or missing numerical values
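As an illustration of finding a model that distinguishes classes, the sketch below learns a single classification rule for the gas-mileage example: one cut-off on miles per gallon, which amounts to a decision tree with a single split. The training data, the class labels "low" and "high", and the function names are assumptions made for this example, not part of any particular system.

```python
# Hypothetical labelled training data: (miles per gallon, class label).
training = [
    (15, "low"), (18, "low"), (22, "low"),
    (30, "high"), (35, "high"), (41, "high"),
]

def fit_threshold(samples):
    """Pick the mpg cut-off that misclassifies the fewest training samples."""
    values = sorted(v for v, _ in samples)
    # Candidate cut-offs are the midpoints between consecutive sorted values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(cut):
        # A sample is wrong when "above the cut" disagrees with its label.
        return sum((v > cut) != (label == "high") for v, label in samples)

    return min(candidates, key=errors)

def classify(mpg, cut):
    """Apply the learned rule: mpg above the cut-off means class 'high'."""
    return "high" if mpg > cut else "low"

cut = fit_threshold(training)
print(cut, classify(28, cut))
```

The learned rule can then be applied to cars that were not in the training data, which is what makes the model useful for future prediction.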
Cluster analysis:
- Class label is unknown: Group data to form new classes, e.g.,
cluster houses to find distribution patterns
- Clustering based on the principle: maximizing the intra-class
similarity and minimizing the interclass similarity
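The clustering principle can be checked numerically for a given grouping. The sketch below uses distance as the inverse of similarity, so a good clustering shows small distances within a cluster and large distances between clusters. The house prices and the two helper functions are hypothetical, and the grouping is assumed to be given rather than discovered by an algorithm.

```python
from itertools import combinations, product
from statistics import mean

# Hypothetical house prices (in thousands), already grouped into two clusters.
cluster_a = [100, 110, 120]
cluster_b = [300, 310, 330]

def mean_intra_distance(cluster):
    """Average distance between all pairs of points inside one cluster."""
    return mean(abs(x - y) for x, y in combinations(cluster, 2))

def mean_inter_distance(c1, c2):
    """Average distance between points taken from two different clusters."""
    return mean(abs(x - y) for x, y in product(c1, c2))

intra_a = mean_intra_distance(cluster_a)
intra_b = mean_intra_distance(cluster_b)
inter = mean_inter_distance(cluster_a, cluster_b)

# Small intra-cluster distance = high intra-class similarity;
# large inter-cluster distance = low interclass similarity.
print(intra_a, intra_b, inter)
```

Here both intra-cluster averages are far smaller than the inter-cluster average, which is the pattern the principle asks for.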
Outlier analysis:
- Outlier: a data object that does not comply with the general
behaviour of the data