Summary Data Mining - All reading material

Pages
23
Uploaded on
19-10-2024
Written in
2023/2024

This is a summary of the book you need to study before the exam. All the material is in here.

Data mining reading material

Chapter 1: Introduction

Data mining is the process of automatically discovering useful information in large data repositories.
Data mining is an integral part of knowledge discovery in databases (KDD), which is the overall
process of converting raw data into useful information. This process consists of a series of steps, from
data preprocessing to postprocessing of data mining results. The purpose of preprocessing is to
transform the raw input data into an appropriate format for subsequent analysis. An example of
postprocessing is visualization, which allows analysts to explore the data and the data mining results
from a variety of viewpoints. Hypothesis testing methods can also be applied during postprocessing
to eliminate spurious data mining results.

Specific challenges that motivated the development of data mining

- Scalability
- High dimensionality
- Heterogeneous and complex data
- Data ownership and distribution
- Non-traditional analysis

Data mining researchers draw upon ideas, such as (1) sampling, estimation, and hypothesis testing
from statistics and (2) search algorithms, modelling techniques, and learning theories from artificial
intelligence, pattern recognition, and machine learning.

Data mining tasks are generally divided into two major categories:

- Predictive tasks
  o The objective of these tasks is to predict the value of a particular attribute based on
    the values of other attributes. The attribute to be predicted is commonly known as
    the target or dependent variable, while the attributes used for making the prediction
    are known as the explanatory or independent variables.
- Descriptive tasks
  o Here, the objective is to derive patterns (correlations, trends, clusters, trajectories,
    and anomalies) that summarize the underlying relationships in data. Descriptive data
    mining tasks are often exploratory in nature and frequently require postprocessing
    techniques to validate and explain the results.

Predictive modelling refers to the task of building a model for the target variable as a function of the
explanatory variables. There are two types of predictive modelling tasks:

- Classification – used for discrete target variables
- Regression – used for continuous target variables

The goal of both tasks is to learn a model that minimizes the error between the predicted and true
values of the target variable.
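
As a small illustration of what "error" can mean for each task, the sketch below uses the misclassification rate for a discrete target and the mean squared error for a continuous target. Both error measures, the function names, and the toy data are assumptions made for illustration; the text above does not prescribe a specific error function.

```python
def misclassification_rate(y_true, y_pred):
    """Fraction of discrete labels that were predicted incorrectly."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Classification: discrete target variable (made-up labels).
labels_true = ["spam", "ham", "ham", "spam"]
labels_pred = ["spam", "ham", "spam", "spam"]
clf_error = misclassification_rate(labels_true, labels_pred)  # 1 of 4 wrong

# Regression: continuous target variable (made-up values).
values_true = [2.0, 4.0, 6.0]
values_pred = [2.5, 4.0, 5.0]
reg_error = mean_squared_error(values_true, values_pred)

print(clf_error, reg_error)
```

In both cases a learning algorithm would try to choose the model that makes this number as small as possible on the data.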

Association analysis is used to discover patterns that describe strongly associated features in the
data.

Cluster analysis seeks to find groups of closely related observations so that observations that belong
to the same cluster are more similar to each other than observations that belong to other clusters.
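
The idea that members of one cluster are closer to each other than to members of other clusters can be checked on a toy example. The sketch below assumes one-dimensional observations, absolute difference as the distance, and cluster assignments that are already given; all of these are illustrative assumptions, not a clustering algorithm.

```python
from itertools import combinations


def mean_distance(pairs):
    """Average absolute difference over a collection of (a, b) pairs."""
    pairs = list(pairs)
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


# Toy 1-D observations with hypothetical cluster assignments.
cluster_a = [1.0, 1.5, 2.0]
cluster_b = [8.0, 8.5, 9.0]

# Distances between observations inside the same cluster.
within = mean_distance(
    list(combinations(cluster_a, 2)) + list(combinations(cluster_b, 2))
)

# Distances between observations from different clusters.
between = mean_distance((a, b) for a in cluster_a for b in cluster_b)

print(within, between)
```

For a good clustering, `within` should be clearly smaller than `between`, as it is for this toy data.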

Anomaly detection is the task of identifying observations whose characteristics are significantly
different from the rest of the data. Such observations are known as anomalies or outliers. The goal of
an anomaly detection algorithm is to discover the real anomalies and avoid falsely labelling normal
objects as anomalous.
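
One simple way to flag such observations is a z-score rule: mark a value as anomalous when it lies more than a chosen number of standard deviations from the mean. This is only a minimal sketch; the z-score rule, the threshold of 2.0, and the readings are assumptions for illustration and are not taken from the text above.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return the values whose z-score exceeds the (assumed) threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Made-up sensor readings with one obviously different value.
readings = [10.0, 10.2, 9.9, 10.1, 9.8, 10.0, 25.0]
outliers = zscore_outliers(readings)

print(outliers)
```

The threshold controls the trade-off mentioned above: a lower value finds more real anomalies but also labels more normal objects as anomalous.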

Chapter 2: Data

The Type of Data: Data sets differ in a number of ways. The type of data determines which tools and
techniques can be used to analyse the data.
The Quality of the Data: Data is often far from perfect. Data quality issues that often need to be
addressed include the presence of noise and outliers; missing, inconsistent, or duplicate data; and
data that is biased or, in some other way, unrepresentative of the phenomenon or population that
the data is supposed to describe.
Preprocessing Steps to Make the Data More Suitable for Data Mining: Often, the raw data must be
processed in order to make it suitable for analysis.
Analysing Data in Terms of Its Relationships: One approach to data analysis is to find relationships
among the data objects and then perform the remaining analysis using these relationships rather
than the data objects themselves.

2.1 Types of Data

A data set can often be viewed as a collection of data objects. In turn, data objects are described by a
number of attributes that capture the characteristics of an object.

An attribute is a property or characteristic of an object that can vary, either from one object to
another or from one time to another. A measurement scale is a rule (function) that associates a
numerical or symbolic value with an attribute of an object. Formally, the process of measurement is
the application of a measurement scale to associate a value with a particular attribute of a specific
object.

It is common to refer to the type of an attribute as the type of a measurement scale.

The following properties (operations) of numbers are typically used to describe attributes:

- Distinctness: = and ≠
- Order: <, ≤, >, and ≥
- Addition: + and −
- Multiplication: × and /

Given these properties, we can define four types of attributes:

- Categorical (qualitative)
  o Nominal
    ▪ The values of a nominal attribute are just different names.
    ▪ They provide only enough information to distinguish one object from another.
    ▪ Transformation: any one-to-one mapping.
  o Ordinal
    ▪ The values provide enough information to order objects.
    ▪ Transformation: an order-preserving change of values.
- Numeric (quantitative)
  o Interval
    ▪ The differences between values are meaningful, i.e., a unit of measurement
      exists, so addition and subtraction apply.
    ▪ Transformation: new_value = a × old_value + b, where a and b are constants.
  o Ratio
    ▪ Both differences and ratios are meaningful, so multiplication and division apply.
    ▪ Transformation: new_value = a × old_value.
Each attribute type possesses all of the properties and operations of the attribute types above it.
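
The difference between the interval and ratio transformations can be seen with two familiar conversions. The sketch below assumes temperature in Celsius as an interval attribute and length in metres as a ratio attribute; the function names and values are chosen for illustration only.

```python
def celsius_to_fahrenheit(c):
    """Interval transformation: new_value = a * old_value + b."""
    return 1.8 * c + 32.0


def metres_to_centimetres(m):
    """Ratio transformation: new_value = a * old_value."""
    return 100.0 * m


# Interval attribute: the ratio of two values changes under the transformation.
c1, c2 = 10.0, 20.0
f1, f2 = celsius_to_fahrenheit(c1), celsius_to_fahrenheit(c2)
ratio_celsius = c2 / c1       # 2.0
ratio_fahrenheit = f2 / f1    # no longer 2.0, because of the offset b

# Ratio attribute: the ratio of two values is preserved.
m1, m2 = 2.0, 4.0
ratio_metres = m2 / m1
ratio_centimetres = metres_to_centimetres(m2) / metres_to_centimetres(m1)

print(ratio_celsius, ratio_fahrenheit)
print(ratio_metres, ratio_centimetres)
```

Because the offset b changes ratios, a statement such as "20 °C is twice as warm as 10 °C" is not meaningful, while "4 m is twice as long as 2 m" is.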

An independent way of distinguishing between attributes is by the number of values they can take.

- Discrete – a discrete attribute has a finite or countably infinite set of values.
- Binary – binary attributes are a special case of discrete attributes and assume only two values,
e.g., true/false, yes/no, male/female, or 0/1.
- Continuous – a continuous attribute is one whose values are real numbers. Practically, real
values can be measured and represented only with limited precision.

Typically, nominal and ordinal attributes are binary or discrete, while interval and ratio attributes are
continuous. However, count attributes, which are discrete, are also ratio attributes.

For asymmetric attributes, only presence—a non-zero attribute value—is regarded as important.
Binary attributes where only non-zero values are important are called asymmetric binary attributes.
It is also possible to have discrete or continuous asymmetric features.
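
Why the asymmetry matters becomes visible when two objects are compared. The sketch below contrasts a measure that counts every matching position (including shared zeros) with one that only looks at positions where at least one object has a 1. The two measures, simple matching and Jaccard, and the toy "basket" vectors are assumptions for illustration; the text above does not name a specific measure.

```python
def simple_matching(x, y):
    """Share of positions where both vectors agree (0-0 and 1-1 both count)."""
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches / len(x)


def jaccard(x, y):
    """Share of 1-1 matches among positions where at least one vector has a 1."""
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    return both / either if either else 0.0


# Two hypothetical shopping baskets over ten products; 1 = product bought.
basket_1 = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
basket_2 = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

smc = simple_matching(basket_1, basket_2)  # 8 shared zeros out of 10 positions
jac = jaccard(basket_1, basket_2)          # no product in common

print(smc, jac)
```

The baskets share no product, yet the symmetric measure rates them as 80% similar because of the shared zeros. Treating the attributes as asymmetric avoids this.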

Types of data sets

For convenience, we have grouped the types of data sets into three groups: record data, graph-based
data, and ordered data.
Before providing details of specific kinds of data sets, we discuss three characteristics that apply to
many data sets and have a significant impact on the data mining techniques that are used:

- Dimensionality
o The number of attributes that the objects in the data set possess. Analysing data with
a small number of dimensions tends to be qualitatively different from analysing
moderate or high-dimensional data. Indeed, the difficulties associated with the
analysis of high-dimensional data are sometimes referred to as the curse of
dimensionality. Because of this, an important motivation in preprocessing the data is
dimensionality reduction.
- Distribution
o The frequency of occurrence of various values or sets of values for the attributes
comprising data objects. For example, suppose a categorical attribute is used as a
class variable, where one of the categories occurs 95% of the time, while the other
categories together occur only 5% of the time. This skewness in the distribution can
make classification difficult. A special case of skewed data is sparsity. For sparse
binary, count or continuous data, most attributes of an object have values of 0. In
many cases, fewer than 1% of the values are non-zero. In practical terms, sparsity is
an advantage because usually only the non-zero values need to be stored and
manipulated.
- Resolution
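
The skewness and sparsity points made under Distribution above can be sketched as follows. The class labels, the 95/5 split, and the dictionary-based sparse storage are illustrative assumptions; real libraries use more elaborate sparse formats.

```python
from collections import Counter

# Skewed class distribution: one category covers 95% of the objects.
labels = ["normal"] * 95 + ["fraud"] * 5
counts = Counter(labels)
majority_share = counts.most_common(1)[0][1] / len(labels)

# Sparse data: store only the non-zero values, keyed by attribute index.
dense = [0, 0, 3, 0, 0, 0, 0, 1, 0, 0]
sparse = {i: v for i, v in enumerate(dense) if v != 0}


def sparse_dot(u, v):
    """Dot product that only touches indices that are non-zero in both vectors."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter dictionary
    return sum(val * v[i] for i, val in u.items() if i in v)


print(majority_share)
print(sparse)
print(sparse_dot(sparse, {2: 2, 5: 4}))
```

A classifier that always predicts the majority class would already be right 95% of the time here, which is why skewed distributions make classification results hard to judge. The sparse representation shows the advantage mentioned above: only two of the ten values need to be stored and processed.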

donjaschipper Radboud Universiteit Nijmegen