Summary

Summary: Data Mining for Business and Governance

Uploaded on

22-03-2021

Written in

2020/2021

A detailed summary of the Data Mining course, covering all lecture slides and notes, including examples of the algorithms.

Summary of Data Mining.

Week 1.

Video 1: introduction.

Data mining is mainly related to:

- Knowledge discovery in databases.
- Artificial intelligence.
- Machine learning.
- Statistics.

Definition of data mining: data mining is the computational process of discovering patterns in large
datasets, involving methods at the intersection of AI, machine learning, statistics, and database systems.

Key aspects of data mining:

- Computation vs. large datasets → trade-off between processing time and memory.
- Computation enables the analysis of large datasets → computers serve as tools, which matters more as data keeps growing.
- Data mining often implies knowledge discovery from databases → from unstructured data to
structured knowledge.

Big data = dealing with volume, variety, or velocity.

Data mining also can be seen as applied machine learning: it requires skill and as with most skills you
get better with practice and experience.

What makes prediction possible? Associations between features and targets. → For numerical features: the correlation
coefficient. For categorical features: mutual information, i.e. the value of X1 contains information about the value of X2.
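
As a rough illustration of mutual information between two categorical features, the sketch below estimates it from counts. The feature names and toy values (`outlook`, `play`, `coin`) are hypothetical and not taken from the lecture.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in bits) between two equally long categorical sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        p_xy = c / n                # joint probability estimate
        p_x = count_x[x] / n        # marginal probability of x
        p_y = count_y[y] / n        # marginal probability of y
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi

# Hypothetical toy data (assumption, for illustration only).
outlook = ["sunny", "sunny", "rainy", "rainy"]
play = ["yes", "yes", "no", "no"]              # fully determined by outlook
coin = ["heads", "tails", "heads", "tails"]    # unrelated to outlook

mi_dependent = mutual_information(outlook, play)
mi_independent = mutual_information(outlook, coin)
print(mi_dependent, mi_independent)
```

When one feature fully determines the other, the mutual information equals the entropy of that feature; when the two are independent, it is zero.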

There are two main types of learning:

1. Supervised learning (classification/regression).
2. Unsupervised learning (clustering/dimensionality reduction).

Main difference: whether or not you use labels in your data. With unsupervised learning, you have no labels in your
data.

Supervised learning: training data = the portion of the data that the algorithm learns from. In SL each object is a pair
(input and output). The algorithm searches for a function that maps inputs to outputs and applies it to new data points to get their
output (this is done on the test part, which tells us how good the fit is).

Workflow:

- Collect data.
- Label examples.
- Choose representation.
- Train model(s).
- Evaluate.

1. Collect data: consider the reliability of measurement, privacy, and other regulations.
2. Label examples: follow annotation guidelines.
3. Representation: the features in your data can be numerical or categorical. Possibly
convert them to a feature vector → vector = a fixed-size list of numbers → some learning algorithms
require examples represented as vectors.
4. Train: minimize the difference between the target values (the dataset labels) and the
values the model maps your training examples to. Keep some examples for the final evaluation: the test set.
Use the rest for learning (training set) and tuning (validation set).
Model tuning is about finding the best values for the hyperparameters of the model. For
each hyperparameter setting:
• Apply the algorithm to the training set to learn.
• Check performance on the validation set.
Then choose the best-performing setting.
5. Evaluate: check the performance of the model on the test set. This is about the generalization of the
model: the goal is to estimate how well your model will do in the real world. Keep the evaluation
realistic.
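
The train/tune/evaluate steps above can be sketched roughly as follows. This is a minimal sketch under assumptions: the data is synthetic, the model is a small one-dimensional k-nearest-neighbours classifier, and the split sizes and candidate values of `k` are arbitrary choices, none of which come from the lecture.

```python
import random

def knn_predict(train, x, k):
    """Predict the majority label (0 or 1) among the k training points closest to x."""
    neighbours = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    votes = sum(label for _, label in neighbours)
    return 1 if votes * 2 > k else 0

def accuracy(train, data, k):
    """Fraction of examples in `data` that are classified correctly."""
    correct = sum(knn_predict(train, x, k) == y for x, y in data)
    return correct / len(data)

rng = random.Random(0)
# Hypothetical data: one numeric feature; label 1 if a noisy version of x exceeds 5.
data = []
for _ in range(300):
    x = rng.uniform(0, 10)
    y = 1 if x + rng.gauss(0, 1) > 5 else 0
    data.append((x, y))

# Keep some examples for the final evaluation; use the rest for learning and tuning.
rng.shuffle(data)
train, validation, test = data[:180], data[180:240], data[240:]

# Model tuning: try each hyperparameter value and score it on the validation set.
candidate_ks = [1, 3, 5, 9, 15]
validation_scores = {k: accuracy(train, validation, k) for k in candidate_ks}
best_k = max(validation_scores, key=validation_scores.get)

# Final evaluation: the test set is used once, with the chosen setting.
test_accuracy = accuracy(train, test, best_k)
print(best_k, validation_scores, test_accuracy)
```

The test set plays no role in choosing `k`, so the test accuracy stays an honest estimate of performance on new data.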

The correlation coefficient measures the linear relationship between two features. When the
correlation is 1, the points lie on a line rising from left to right; when the correlation is -1, they lie on a line falling from left to right.

The numerator of the correlation coefficient is the covariance; the denominator is the product of the standard deviations.
Covariance = to what extent the features change together. Dividing by the product of the standard deviations makes the correlation
independent of units.

Covariance is a measure of the joint variability of two variables (x, y). If these two variables are different
features, then the sign of the covariance shows the tendency of the linear relationship. Its magnitude
is not easy to interpret. The correlation coefficient is normalized and corresponds to the strength of
the linear relation.
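
To make the difference between covariance and correlation concrete, the sketch below computes both for the same feature expressed in two units. The height and weight values are hypothetical, and the population form (dividing by n) is an assumption; the correlation is the same with the sample form.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Population covariance: average product of deviations from the means."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def std(values):
    """Population standard deviation (square root of the variance)."""
    return math.sqrt(covariance(values, values))

def pearson_r(xs, ys):
    """Covariance divided by the product of the standard deviations."""
    return covariance(xs, ys) / (std(xs) * std(ys))

# Hypothetical measurements (assumption, for illustration only).
height_m = [1.60, 1.70, 1.75, 1.80, 1.90]
weight_kg = [55.0, 65.0, 72.0, 80.0, 88.0]
height_cm = [h * 100 for h in height_m]   # same feature, different unit

cov_m = covariance(height_m, weight_kg)
cov_cm = covariance(height_cm, weight_kg)
r_m = pearson_r(height_m, weight_kg)
r_cm = pearson_r(height_cm, weight_kg)
print(cov_m, cov_cm, r_m, r_cm)
```

Switching from metres to centimetres multiplies the covariance by 100 but leaves the correlation coefficient unchanged, which is why the covariance magnitude is hard to interpret on its own.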

Pearson’s r only measures linear tendency. If r = 0, that doesn’t mean the two features
are unrelated; it just means there is no linear relationship. Correlation does not imply causation, but it may still enable
prediction.
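
A small check of the claim that a correlation of 0 does not rule out a relationship: in the sketch below, y is fully determined by x (y = x squared), yet the Pearson coefficient is zero. The data points are chosen for illustration and are not from the lecture.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]   # perfectly predictable from x, but not linearly

r = pearson_r(xs, ys)
print(r)
```

The positive and negative halves of the parabola cancel out in the covariance, so the linear measure sees nothing even though the dependence is exact.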

Correlation vs. causation. Possible causal relationships between two events A and B that are measured by
correlated random variables:

- A causes B.
- B causes A.
- C causes both A and B.
- The correlation is a coincidence.
- A combination of the above.

The discovery of a correlation can suggest a causal relationship, but that is not necessarily the case.
Sometimes the causal relationship can only be discovered by an experimental study → change one
variable while keeping everything else constant and see whether the outcome changes → this is hard.

Linear regression is one type of regression. We have two different variables, and we want to come up with
a relationship that predicts one variable based on the other.

Regression analysis describes a relationship between random variables: the IV (independent variable, input)
and the DV (dependent variable, output). In a regression model, the relationship need not take the form of a linear
function. We focus on the linear case: f(x) = Ax + B.
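
One common way to find A and B is ordinary least squares. The notes do not name a fitting method at this point, so treat the choice, the function name `fit_line`, and the toy data below as assumptions.

```python
def fit_line(xs, ys):
    """Least-squares estimates of slope a and intercept b for f(x) = a*x + b."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

# Hypothetical data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

a, b = fit_line(xs, ys)

def f(x):
    """The fitted model: maps an input (IV) to a predicted output (DV)."""
    return a * x + b

print(a, b, f(6))
```

The slope is the covariance of x and y divided by the variance of x, which ties the regression line back to the covariance discussed earlier.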

Regression is an example of supervised learning; another example is classification. In classification the
data points belong to classes, and for any given data point you have to determine which
class it falls into. You assign a class to each data point, for example a positive and a negative class (1,
0). So, classification is about finding the class that each data point belongs to (e.g. naive Bayes, KNN).

Decision boundaries are the boundaries that separate the different classes in classification tasks. These
boundaries are not necessarily straight lines. A decision boundary can be seen as a model of the separation
between the two classes.
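
As a small sketch of classification with KNN, the code below assigns a class to new points by majority vote among the nearest training points. The training points, the choice k = 3, and the Euclidean distance are assumptions made for illustration.

```python
import math
from collections import Counter

def knn_classify(training, point, k=3):
    """Return the most common label among the k training points nearest to `point`."""
    by_distance = sorted(training, key=lambda item: math.dist(item[0], point))
    labels = [label for _, label in by_distance[:k]]
    return Counter(labels).most_common(1)[0][0]

# Hypothetical labeled data: ((feature 1, feature 2), class).
training = [
    ((1.0, 1.0), 0), ((1.5, 2.0), 0), ((2.0, 1.0), 0),
    ((6.0, 6.0), 1), ((6.5, 7.0), 1), ((7.0, 6.0), 1),
]

pred_low = knn_classify(training, (1.2, 1.4))    # near the class-0 points
pred_high = knn_classify(training, (6.4, 6.2))   # near the class-1 points
print(pred_low, pred_high)
```

Here the decision boundary is never written down explicitly: it is wherever the majority vote flips from one class to the other, and it does not have to be a straight line.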

An example of unsupervised learning could be dimensionality reduction, or clustering.

Dimensionality reduction = the process of reducing the number of random variables under consideration
by obtaining a set of principal variables. It is divided into:

- Feature selection = variable selection: the process of selecting relevant features for use in model
construction.
- Feature extraction = defining relevant features (derived from the existing ones).

Clustering = when we have a set of data points but do not know which ones belong together, we have to group
them without having labels.
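
To show what grouping without labels can look like, here is a minimal sketch using k-means on one-dimensional data. The notes do not name a clustering algorithm at this point, so k-means, the starting centers, and the values are all assumptions.

```python
def k_means_1d(values, centers, iterations=10):
    """Alternate between assigning points to the nearest center and updating centers."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # New center = mean of its cluster; an empty cluster keeps its old center.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical unlabeled data with two visible groups.
values = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]

centers, clusters = k_means_1d(values, centers=[0.0, 5.0])
print(centers, clusters)
```

No labels are used anywhere: the groups emerge only from the distances between the points.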

Video 1.

What is Data Science? Data science is a concept to unify statistics, data analysis and their related
methods in order to understand and analyze actual phenomena with data.

What makes a Data Scientist? Data scientists use their data and analytics ability to find and interpret
rich data sources, manage large amounts of data (…), create visualizations to aid in understanding data,
build mathematical models using the data, and present and communicate the data insights/findings.

There are many related fields.

They all have one thing in common: data-driven science.

What is data? You need to make data numerical and binary (1=yes, 0=no). Different units make it hard
to read the data.

Interpreting data.

For the child and weather example: can we think of rules for when it’s play time?

We want to know whether the kid wanted to play; this is what we
call the target. We used certain pieces of information (features)
to predict this target.

Formally:

- We have our data: X (with features outlook, temp., windy).
- Our data consists of smaller instances; ‘some instance’ is written as x.
- If we want to point at a particular instance (say our first row), we write x1. We can
see our model as a function f that, when given any instance x, gives us a prediction ^y.
- The application of the model to some instance in our data can be written as f(x).
- Our hope is that ^y is the same as our target y. y is given (the truth) and ^y is a prediction.

In realistic cases, it is very important to evaluate the models (algorithms) that we make:
we need to know how they perform, and whether they perform well on new data.

How do we know if our model performs well?

- Correct evaluation is incredibly important in data mining.
- We came up with some rules, but how do we know they generalize, i.e. whether the rules we learned
apply with the same success rate to data where we don’t know what the target is?

For the child play example: we got 5/6 correct, which means that the model has 83.3% accuracy. But
did we cover all the cases? What if we are presented with new conditions? Our rules are probably
too strict. Besides the training data (where we know the labels and from which we determined our rules), we
also need test data, unseen by us, to evaluate. We can use this test data to evaluate how our model
would perform on new data.
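
The accuracy figure above is simply the fraction of correct predictions. The sketch below reproduces the 5 out of 6 calculation; the label lists are hypothetical stand-ins, since the actual table from the lecture is not included in these notes.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true target."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical targets (y) and predictions (^y); exactly one of six is wrong.
y_true = ["yes", "no", "yes", "yes", "no", "yes"]
y_pred = ["yes", "no", "yes", "no", "no", "yes"]

acc = accuracy(y_true, y_pred)
print(f"{acc:.1%}")
```

Computed on the training data, this number says little about new conditions; the same function applied to held-out test data gives the more realistic estimate.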

Case: prediction of house prices:

- Would you be able to determine the price of a house? → You need expert knowledge, which
requires many observations to gain experience.
- Can you come up with a few features to predict the price of a house?
• Number of bedrooms;
• Big garden, yes or no;
• Good neighborhood.

How do we evaluate the house price prediction? In the previous example, we had a clear binary prediction: either
yes or no. Say we need more classes; we would still be predicting a nominal target (order does not
matter). What about a numeric target like a house price? We can no longer say that we got x% of the predictions
correct, and therefore we can’t use accuracy. Instead, we are interested in how far our
prediction was off from the actual value: this is called the error.
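
One simple way to summarize how far predictions are off is the mean absolute error. The notes only speak of "error" in general, so this particular measure and the house prices below are assumptions for illustration.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical house prices: actual values and model predictions.
actual = [250_000, 310_000, 180_000]
predicted = [240_000, 330_000, 185_000]

mae = mean_absolute_error(actual, predicted)
print(round(mae, 2))
```

Unlike accuracy, this value is expressed in the unit of the target (here a price), so it reads directly as "on average, the prediction is this far off".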

Written for

Institution: Tilburg University (UVT)
Study: Data Science & Society
Course: Data Mining For Business & Governance (880022M6)

Document information

Uploaded on: March 22, 2021
Number of pages: 69
Written in: 2020/2021
Type: SUMMARY

Subjects

data mining
algorithms
k-NN
cross validation
large data
association mining
text data
supervised learning
unsupervised learning
dimensionality reduction
regression
classification
