Summary

Summary/Lecture notes Data mining for Business & Governance

Pages
38
Uploaded on
25-03-2023
Written in
2022/2023

Summary/Lecture notes for the course Data Mining for Business and Governance. Includes all lectures.


Lectures data mining

Lecture 1
Pattern classification
- In this problem, we have 3 numerical variables (features) to be
used to predict the outcome (decision class).
- It is a multi-class problem since we have 3 possible outcomes.
The goal in pattern classification is to build a model able to generalize
well beyond the historical training data.

In this lecture we cover 3 main things:
1. How to deal with missing values
2. How to compute the correlation/association between two
features
3. Methods to encode categorical features and handle class imbalance

Missing values
Missing values might result from fields that are not always applicable, incomplete
measurements, or lost values.
Imputation strategies for missing values:
1. Simplest strategy → remove the feature containing missing values.
➢ Recommended when the majority of the instances (observations) have missing
values for that feature.
➢ However, there are situations in which we have only a few features or the feature
we want to remove is deemed relevant.
2. If we have scattered missing values and few features, we might want to remove the
instances having missing values.
3. Most popular → replacing the missing values for a given feature with a
representative value such as the mean, the median or the mode of that feature.
➢ However, we need to be aware that we are introducing noise.
4. Fancier strategies include estimating the missing values with a machine learning
model trained on the non-missing information.
5. Autoencoders are deep neural networks that involve two neural blocks named encoder
and decoder. The encoder reduces the problem dimensionality while the decoder
completes the pattern.
➢ They use unsupervised learning to adjust the weights that connect the neurons.
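
As a minimal sketch of strategy 3, assuming missing values are stored as None and using made-up feature values (not data from the lecture), mean, median and mode imputation could look like this in Python:

```python
from statistics import mean, median, mode

def impute(values, strategy="mean"):
    """Replace None entries with a representative value of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]

# Hypothetical feature columns with missing entries (None).
ages = [20, None, 30, 40, None]
colors = ["blue", "brown", None, "brown"]

ages_mean = impute(ages, "mean")        # None -> 30
ages_median = impute(ages, "median")    # None -> 30
colors_mode = impute(colors, "mode")    # None -> "brown"
```

The mode is the only one of the three that also works for categorical features. The helper name impute is hypothetical.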

Feature scaling
1. Normalization
➢ Different features might encode different measurements
and scales (the age and height of a person)
➢ Normalization allows encoding all numeric features in the
[0,1] scale
➢ We subtract the minimum from the value to be
transformed and divide the result by the feature range.
2. Standardization
➢ This transformation method is similar to
normalization, but the transformed values might not be in
the [0,1] interval
➢ We subtract the mean from the value to be transformed
and divide the result by the standard deviation.
➢ Normalization and standardization might lead to different
scaling results.

Normalization vs. standardization




- These feature scaling approaches might be affected by extreme values.
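
The two transformations described above can be sketched as follows. This is an illustration only: the heights are made up, and the use of the population standard deviation (pstdev) is an assumption, since the notes do not say which variant is meant.

```python
from statistics import mean, pstdev

def normalize(values):
    """Min-max normalization: (x - min) / (max - min), result in [0, 1]."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

def standardize(values):
    """Standardization: (x - mean) / standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(x - mu) / sigma for x in values]

heights = [150, 160, 170, 180, 190]  # hypothetical heights in cm

norm = normalize(heights)    # [0.0, 0.25, 0.5, 0.75, 1.0]
std = standardize(heights)   # centered on 0, not confined to [0, 1]
```

Both functions divide by zero for a constant feature, so such features would need separate handling. A single extreme value changes the minimum or maximum (and the mean and standard deviation), which is why both approaches are sensitive to outliers.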

Feature interaction
1. Correlation between two numerical variables → Sometimes, we need to measure the
correlation between numerical features describing a certain problem domain.
➢ For example, what is the correlation between gender and income in Sweden?




2. Pearson’s correlation → it is used when we want to determine the correlation
between two numerical variables given k observations.
➢ It is intended for numerical variables only and its value lies in [-1, 1]
➢ The order of variables does not matter since the coefficient is symmetric.

Example: correlation between age and glucose levels
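
A minimal sketch of Pearson's coefficient, computed as the covariance of the two variables divided by the product of their standard deviations. The age and glucose values below are made up for illustration and are not the lecture's data.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's r for two equally long lists of numbers."""
    k = len(x)
    mean_x, mean_y = sum(x) / k, sum(y) / k
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Made-up values for illustration only.
age = [25, 35, 45, 55, 65]
glucose = [80, 85, 95, 100, 110]

r = pearson(age, glucose)  # close to 1: strong positive correlation
```

Swapping the arguments gives the same value, which is the symmetry mentioned above.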

The terminology can be different. We use correlation when we are working with numerical
data, and we use association when we are working with categorical data.

3. Association between two categorical variables → sometimes, we need to measure
the association degree between two categorical (ordinal or nominal) variables.
➢ For example, what is the association between gender and eye color?
4. The χ² (chi-squared) association measure → it is used when we want to measure the association
between two categorical variables given k observations.
➢ We should compare the frequencies of values appearing together with their
individual frequencies
➢ The first step in that regard would be to create a contingency table.
➢ Let us assume that a categorical variable X involves m possible categories while Y
involves n categories.
➢ The observed value gives how many times each combination was found.
➢ The expected value of a combination is the product of the individual frequencies
of its two categories divided by the total number of observations.

Association between gender and eye color




Such an example is very likely to appear in the exam.
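
A worked sketch of the χ² computation, assuming the standard statistic (the sum over all cells of (observed - expected)² / expected). The counts in the contingency table are made up and are not the lecture's figures.

```python
def chi_squared(table):
    """Chi-squared statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count: product of the marginal totals over the grand total.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Made-up counts: rows = gender (m = 2), columns = eye color (n = 3).
observed = [
    [20, 30, 50],
    [30, 30, 40],
]
chi2 = chi_squared(observed)  # 28/9, roughly 3.11
```

Here every row has expected counts 25, 30 and 45, so the two off-target columns contribute 1 + 25/45 per row. A table whose rows are exact multiples of each other gives 0, meaning no association.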

Encoding strategies
Encoding categorical features → some machine learning and data mining algorithms or
platforms cannot operate with categorical features. Therefore, we need to encode these
features as numerical quantities.
1. Label encoding → consists of assigning integer numbers to each category. It only
makes sense if there is an ordinal relationship among the categories.
➢ E.g., weekdays, months, star-based hotel ratings, income
categories.
2. One-hot encoding → is used to encode nominal features that
lack an ordinal relationship. Each category of the categorical
feature is transformed into a binary feature such that a value of
one marks the category.
➢ This strategy often increases the problem dimensionality
notably since each categorical feature is encoded as a binary vector.
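
Both encodings can be sketched in a few lines. The feature values and helper names below are hypothetical, and the category order for label encoding is passed in explicitly because it cannot be inferred from the data.

```python
def label_encode(values, ordered_categories):
    """Map each category to its position in an explicit ordering."""
    index = {cat: i for i, cat in enumerate(ordered_categories)}
    return [index[v] for v in values]

def one_hot_encode(values):
    """Map each value to a binary vector with a single 1."""
    categories = sorted(set(values))
    return categories, [[int(v == cat) for cat in categories] for v in values]

# Hypothetical ordinal feature: income category.
income = ["low", "high", "medium", "low"]
income_codes = label_encode(income, ["low", "medium", "high"])  # [0, 2, 1, 0]

# Hypothetical nominal feature: eye color.
eyes = ["blue", "brown", "green", "brown"]
columns, eye_vectors = one_hot_encode(eyes)
# columns     -> ['blue', 'brown', 'green']
# eye_vectors -> [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
```

One feature with three categories becomes three binary features, which illustrates the growth in dimensionality.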

Class imbalance
Sometimes we have problems with many more instances belonging to one decision class than
to the other classes.
- In this example, we have more instances labelled with the
negative decision class than the positive one.
Classifiers are tempted to recognize the majority decision class only.

Simple strategies:
1. Undersampling → select some instances from the majority decision class,
provided we retain enough instances.
2. Oversampling → create new instances belonging to the
minority class (creating random copies).
These strategies are applied to the data when building the model.
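
A minimal sketch of random undersampling and oversampling. Balancing both classes to exactly the same size is an assumption made here for simplicity, and the instances and function names are hypothetical.

```python
import random

def undersample(majority, minority, rng):
    """Keep a random subset of the majority class, as large as the minority class."""
    return rng.sample(majority, len(minority)), minority

def oversample(majority, minority, rng):
    """Add random copies of minority instances until both classes are equal in size."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority, minority + extra

rng = random.Random(0)  # fixed seed so the sketch is reproducible
negatives = [("neg", i) for i in range(8)]   # hypothetical majority class
positives = [("pos", i) for i in range(2)]   # hypothetical minority class

under_neg, under_pos = undersample(negatives, positives, rng)
over_neg, over_pos = oversample(negatives, positives, rng)
```

Undersampling discards 6 of the 8 majority instances here, which shows why it only makes sense when enough instances remain.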


SMOTE → synthetic minority oversampling technique. It is a popular
strategy to deal with class imbalance.
- Creates synthetic instances in the neighborhoods of instances
belonging to the minority class.
- Caution is advised since the classifier is forced to learn from
artificial instances, which might induce noise.
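
The core idea of SMOTE can be sketched as interpolation between a minority instance and one of its nearest minority neighbours. This is a simplified illustration with made-up points, not a full implementation; the parameter names n_new and k are chosen here for the sketch.

```python
import random
from math import dist

def smote(minority, n_new, k, rng):
    """Create n_new synthetic points between minority points and their neighbours."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

rng = random.Random(42)
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (2.5, 2.5)]  # made-up points
new_points = smote(minority, n_new=5, k=2, rng=rng)
```

Every synthetic point lies on a line segment between two real minority instances, so it stays in their neighbourhood but is not a real observation.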


Lecture 2
Classification problem
In this problem, we have four categorical (ordinal and nominal) features to be
used to predict the outcome.

We have only two possible outcomes or decision classes (binary problem).
The goal in pattern classification is to build a model able to generalize well beyond
the historical training data.

Rule-based learning: in this approach, the classification problem is modelled as
a set of rules involving features and their values in the antecedent of such rules
and decision classes in the consequent.
- Algorithm → decision trees are perhaps the most popular algorithm of this
paradigm.
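
To make the antecedent/consequent structure concrete, here is a small sketch of a rule set applied to one instance. The feature names, values and rules are hypothetical and chosen only for illustration; the first-match policy and the default class are also assumptions.

```python
# Each rule: (antecedent as feature -> required value, consequent decision class).
# Feature names, values and rules are hypothetical, chosen only for illustration.
RULES = [
    ({"outlook": "sunny", "humidity": "high"}, "no"),
    ({"outlook": "rainy", "windy": "yes"}, "no"),
    ({"outlook": "overcast"}, "yes"),
]
DEFAULT_CLASS = "yes"  # assumed fallback when no rule matches

def classify(instance):
    """Return the consequent of the first rule whose antecedent matches."""
    for antecedent, decision in RULES:
        if all(instance.get(f) == v for f, v in antecedent.items()):
            return decision
    return DEFAULT_CLASS

day = {"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": "no"}
prediction = classify(day)  # "no"
```

A decision tree can be read as such a rule set: each path from the root to a leaf is one rule.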

sophiedekkers54 Tilburg University