Summary

Summary/Lecture notes Data mining for Business & Governance

Uploaded on

25-03-2023

Written in

2022/2023

Summary/Lecture notes for the course Data Mining for Business and Governance. Includes all lectures.

Lectures data mining

Lecture 1
Pattern classification
- In this problem, we have 3 numerical variables (features) to be
used to predict the outcome (decision class).
- It’s a multi-class problem since we have 3 possible outcomes.
The goal in pattern classification is to build a model able to generalize
well beyond the historical training data.

In this lecture we cover 3 main things:
1. How to deal with missing values
2. How to compute the correlation/association between two
features
3. Methods to encode categorical features and handle class imbalance

Missing values
Missing values might result from fields that are not always applicable, incomplete
measurements, or lost values.
Imputation strategies for missing values:
1. Simplest strategy → remove the feature containing missing values.
➢ Recommended when the majority of the instances (observations) have missing
values for that feature.
➢ However, there are situations in which we have only a few features or the
feature we want to remove is deemed relevant.
2. If we have scattered missing values and few features, we might want to remove the
instances having missing values.
3. Most popular → replacing the missing values for a given feature with a
representative value such as the mean, the median or the mode of that feature.
➢ However, we need to be aware that we are introducing noise.
4. Fancier strategies include estimating the missing values with a machine learning
model trained on the non-missing information.
5. Autoencoders are deep neural networks that involve two neural blocks named
encoder and decoder. The encoder reduces the problem dimensionality while the
decoder completes the pattern.
➢ They use unsupervised learning to adjust the weights that connect the neurons.
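
Strategy 3 above can be sketched as follows. This is a minimal illustration, assuming missing values are stored as None; the function name impute and the sample ages are hypothetical and not taken from the lecture.

```python
from statistics import mean, median, mode

def impute(values, strategy="mean"):
    """Replace None entries with the mean, median or mode of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a feature with no observed values")
    # Pick the representative value from the non-missing entries only.
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]

# Hypothetical feature with two missing values.
ages = [20, None, 30, 40, None, 30]
print(impute(ages, "mean"))
print(impute(ages, "median"))
print(impute(ages, "mode"))
```

Every missing entry receives the same value, which is why this strategy can introduce noise when many values are missing.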

Feature scaling
1. Normalization
➢ Different features might encode different measurements
and scales (the age and height of a person)
➢ Normalization allows encoding all numeric features in the
[0,1] scale
➢ We subtract the minimum from the value to be
transformed and divide the result by the feature range.
2. Standardization
➢ This transformation method is similar to the
normalization, but the transformed values might not be in
the [0,1] interval
➢ We subtract the mean from the value to be transformed
and divide the result by the standard deviation.
➢ Normalization and standardization might lead to different
scaling results.
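
The two transformations can be sketched as below. This is a minimal illustration; the function names and the sample heights are hypothetical, and the population standard deviation (pstdev) is an assumption, since the notes do not say whether the sample or population version is meant.

```python
from statistics import mean, pstdev

def normalize(values):
    """Min-max normalization: (x - min) / (max - min), giving values in [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("feature is constant, its range is zero")
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Standardization: (x - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("feature is constant, its standard deviation is zero")
    return [(v - mu) / sigma for v in values]

# Hypothetical heights in centimetres.
heights = [150, 160, 170, 180, 190]
print(normalize(heights))    # all values fall in [0, 1]
print(standardize(heights))  # values are centred on 0 and can be negative
```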

Normalization vs. standardization

- These feature scaling approaches might be affected by extreme values.

Feature interaction
1. Correlation between two numerical variables → Sometimes, we need to measure the
correlation between numerical features describing a certain problem domain.
➢ For example, what is the correlation between age and income in Sweden?

2. Pearson’s correlation → it is used when we want to determine the correlation
between two numerical variables given k observations.
➢ It is intended for numerical variables only and its value lies in [-1, 1]
➢ The order of variables does not matter since the coefficient is symmetric.

Example: correlation between age and glucose levels
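
Pearson's coefficient can be computed as the sum of the products of the deviations from the mean, divided by the product of the square roots of the summed squared deviations. The sketch below follows that definition; the function name and the age and glucose values are illustrative assumptions, not the numbers used in the lecture.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's correlation between two numerical variables with k observations."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    k = len(x)
    mean_x, mean_y = sum(x) / k, sum(y) / k
    # Numerator: how the two variables vary together around their means.
    co_variation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator: how much each variable varies on its own.
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return co_variation / (spread_x * spread_y)

# Illustrative values only.
age = [43, 21, 25, 42, 57, 59]
glucose = [99, 65, 79, 75, 87, 81]
print(round(pearson(age, glucose), 4))
```

Swapping the arguments gives the same result, which reflects the symmetry of the coefficient.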

The terminology can differ: we use correlation when working with numerical data, and
association when working with categorical data.

3. Association between two categorical variables → sometimes, we need to measure
the association degree between two categorical (ordinal or nominal) variables.
➢ For example, what is the association between gender and eye color?
4. The χ² (chi-squared) association measure → it is used when we want to measure the
association between two categorical variables given k observations.
➢ We should compare the frequencies of values appearing together with their
individual frequencies
➢ The first step in that regard would be to create a contingency table.
➢ Let us assume that a categorical variable X involves m possible categories while Y
involves n categories.
➢ The observed value gives how many times each combination of categories was found.
➢ The expected value of a combination is the product of the individual frequencies
of its two categories divided by the number of observations.
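
The steps above can be sketched as follows, using the standard χ² statistic: the sum over all cells of (observed - expected)² / expected. The function name and the tiny gender and eye colour samples are hypothetical and chosen only to make the two extreme cases easy to check by hand.

```python
from collections import Counter

def chi_squared(xs, ys):
    """Chi-squared statistic for two categorical variables with k paired observations."""
    k = len(xs)
    observed = Counter(zip(xs, ys))  # contingency table: count per combination
    x_totals = Counter(xs)           # individual frequencies of X's categories
    y_totals = Counter(ys)           # individual frequencies of Y's categories
    statistic = 0.0
    for x, x_count in x_totals.items():
        for y, y_count in y_totals.items():
            expected = x_count * y_count / k
            statistic += (observed[(x, y)] - expected) ** 2 / expected
    return statistic

# Hypothetical data: eye colour is unrelated to gender in the first sample
# and fully determined by gender in the second.
gender = ["F", "F", "M", "M"]
eyes_unrelated = ["blue", "brown", "blue", "brown"]
eyes_related = ["blue", "blue", "brown", "brown"]
print(chi_squared(gender, eyes_unrelated))
print(chi_squared(gender, eyes_related))
```

A value of 0 means the observed counts match the expected ones exactly; larger values indicate a stronger association.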

Association between gender and eye color

An example like this is very likely to appear in the exam.

Encoding strategies
Encoding categorical features → some machine learning and data mining algorithms or
platforms cannot operate with categorical features. Therefore, we need to encode these
features as numerical quantities.
1. Label encoding → consists of assigning integer numbers to each category. It only
makes sense if there is an ordinal relationship among the categories.
➢ E.g., weekdays, months, star-based hotel ratings, income
categories.
2. One-hot encoding → is used to encode nominal features that
lack an ordinal relationship. Each category of the categorical
feature is transformed into a binary feature in which a 1
marks the category.
➢ This strategy often increases the problem dimensionality
notably since each category becomes a separate binary feature.
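
Both encodings can be sketched in a few lines. The function names, the rating levels and the eye colours below are hypothetical; the category order for label encoding has to be supplied by hand because only we know the ordinal relationship.

```python
def label_encode(values, ordered_categories):
    """Map each category to its position in an explicitly ordered list."""
    index = {category: i for i, category in enumerate(ordered_categories)}
    return [index[v] for v in values]

def one_hot_encode(values):
    """Turn one nominal feature into one binary feature per category."""
    categories = sorted(set(values))
    rows = [[1 if v == category else 0 for category in categories] for v in values]
    return categories, rows

# Ordinal feature: the order low < medium < high is meaningful.
income = ["low", "high", "medium"]
print(label_encode(income, ["low", "medium", "high"]))

# Nominal feature: no order, so each category gets its own column.
eye_color = ["blue", "brown", "green", "blue"]
categories, rows = one_hot_encode(eye_color)
print(categories)
print(rows)
```

One nominal feature with three categories becomes three binary features, which illustrates how one-hot encoding increases the dimensionality.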

Class imbalance
Sometimes we have problems with many more instances belonging to one decision class than
to the other classes.
- In this example, we have more instances labelled with the
negative decision class than the positive one.
Classifiers are tempted to recognize the majority decision class only.

Simple strategies:
1. Undersampling → select some instances from the majority decision class,
provided we retain enough instances.
2. Oversampling → create new instances belonging to the minority class
(random copies of existing minority instances).
These strategies are applied to the data when building the model.
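
A minimal sketch of both strategies, assuming the instances and labels are stored in two parallel lists. The function names, the parameters n_keep and n_extra, and the toy data are hypothetical.

```python
import random

def undersample(instances, labels, majority, n_keep, seed=0):
    """Keep only n_keep randomly chosen instances of the majority class."""
    rng = random.Random(seed)
    majority_idx = [i for i, label in enumerate(labels) if label == majority]
    keep = set(rng.sample(majority_idx, n_keep))
    kept = [i for i, label in enumerate(labels) if label != majority or i in keep]
    return [instances[i] for i in kept], [labels[i] for i in kept]

def oversample(instances, labels, minority, n_extra, seed=0):
    """Append n_extra random copies of minority-class instances."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(labels) if label == minority]
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    new_instances = instances + [instances[i] for i in extra]
    new_labels = labels + [labels[i] for i in extra]
    return new_instances, new_labels

# Toy data: 8 negative instances and 2 positive ones.
instances = list(range(10))
labels = ["neg"] * 8 + ["pos"] * 2

_, under_labels = undersample(instances, labels, "neg", n_keep=2)
_, over_labels = oversample(instances, labels, "pos", n_extra=6)
print(under_labels.count("neg"), under_labels.count("pos"))
print(over_labels.count("neg"), over_labels.count("pos"))
```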

SMOTE → synthetic minority oversampling technique. It is a popular
strategy to deal with class imbalance.
- Creates synthetic instances in the neighborhoods of instances
belonging to the minority class.
- Caution is advised since the classifier is forced to learn from
artificial instances, which might induce noise.
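
The core idea of SMOTE can be sketched as follows: pick a minority instance, pick one of its nearest minority neighbours, and place a new point somewhere on the line segment between them. This is a simplified illustration, not a full implementation; the function name, the parameters and the sample points are hypothetical.

```python
import random
from math import dist

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority instances")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the chosen instance.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position between base (0) and neighbour (1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical minority-class instances with two numerical features.
minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0), (3.0, 3.0)]
new_points = smote(minority, n_new=5)
print(new_points)
```

Because every synthetic point is an interpolation, it never corresponds to a real observation, which is the source of the caution mentioned above.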

Lecture 2
Classification problem
In this problem, we have four categorical (ordinal and nominal) features to be
used to predict the outcome.

We have only two possible outcomes or decision classes (binary problem).
The goal in pattern classification is to build a model to generalize well beyond
the historical training data.

Rule-based learning: in this approach, the classification problem is modelled as
a set of rules involving features and their values in the antecedent of such rules
and decision classes in the consequent.
- Algorithm → decision trees are perhaps the most popular algorithm of this
paradigm.
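
A rule set of this kind can be represented directly as antecedent and consequent pairs. The sketch below is only an illustration: the feature names, values, decision classes and the default class are hypothetical and are not the ones used in the lecture.

```python
# Each rule pairs an antecedent (feature -> required value) with a consequent
# (the decision class). All names and values here are hypothetical.
RULES = [
    ({"outlook": "sunny", "humidity": "high"}, "no"),
    ({"outlook": "rainy", "windy": "yes"}, "no"),
]
DEFAULT_CLASS = "yes"  # assumption: class returned when no rule fires

def classify(instance, rules=RULES, default=DEFAULT_CLASS):
    """Return the consequent of the first rule whose antecedent matches."""
    for antecedent, consequent in rules:
        if all(instance.get(feature) == value for feature, value in antecedent.items()):
            return consequent
    return default

example = {"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": "no"}
print(classify(example))
```

A decision tree encodes the same kind of rules: each path from the root to a leaf is one antecedent, and the leaf holds the decision class.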

Written for

Institution: Tilburg University (UVT)
Study: Data Science & Society
Course: Data mining (880662M6)

Document information

Uploaded on: March 25, 2023
Number of pages: 38
Written in: 2022/2023
Type: SUMMARY

Subjects

data mining
data science
tilburg university

Get to know the seller

sophiedekkers54
