
Summary: Data Mining for Business and Governance

Pages: 69
Uploaded on: 22-03-2021
Written in: 2020/2021

A detailed summary of the Data Mining course, covering all lecture slides and notes, including examples of the algorithms.


Summary Data Mining.

Week 1.

Video 1: introduction.

Data mining has (main) relations to:

- Knowledge discovery in databases.
- Artificial intelligence.
- Machine learning.
- Statistics.

Definition of data mining: data mining is the computational process of discovering patterns in large
datasets, involving methods at the intersection of AI, machine learning, statistics, and database systems.

Key aspects of data mining:

- Computation vs. large datasets → trade-off between processing time and memory.
- Computation enables analysis of large datasets → computers serve as tools as data keeps growing.
- Data mining often implies knowledge discovery from databases → from unstructured data to
structured knowledge.

Big data = dealing with volume, variety, or velocity.

Data mining also can be seen as applied machine learning: it requires skill and as with most skills you
get better with practice and experience.

What makes prediction possible? Associations between features and targets → numerical: correlation
coefficient; categorical: mutual information (the value of X1 contains information about the value of X2).
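
As a rough illustration of the categorical case, the sketch below computes mutual information from simple counts. The feature names and values (outlook, play) are made up for this example and are not taken from the course material.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in bits) between two categorical sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        p_xy = c / n
        p_x = count_x[x] / n
        p_y = count_y[y] / n
        # Each observed (x, y) pair contributes p(x,y) * log2(p(x,y) / (p(x) p(y))).
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi

# Hypothetical toy data.
outlook = ["sunny", "sunny", "rainy", "rainy"]
play_dependent = ["yes", "yes", "no", "no"]    # fully determined by outlook
play_independent = ["yes", "no", "yes", "no"]  # unrelated to outlook

mi_dependent = mutual_information(outlook, play_dependent)      # 1 bit
mi_independent = mutual_information(outlook, play_independent)  # 0 bits
```

When one feature fully determines the other, knowing X1 removes all uncertainty about X2; when they are unrelated, the mutual information is zero.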

There are two main types of learning:

1. Supervised learning (classification/regression).
2. Unsupervised learning (clustering/dimensionality reduction).

Main difference: whether or not you use labels in your data. With unsupervised learning, you have no labels in your
data.

Supervised learning: training data = the portion of the data that you learn from. In SL each object is a pair
(input and output). The algorithm searches for a function that maps inputs to outputs and applies it to new data points to get the
output (this is done on the test part, which tells us how good the fit is).

Workflow:

- Collect data.
- Label examples.
- Choose representation.
- Train model(s).
- Evaluate.

1. Collect data: reliability of measurement/privacy and other regulations.
2. Label examples: annotation guidelines.
3. Representation: the features in your data can be numerical or categorical. Possibly
convert to a feature vector → vector = a fixed-size list of numbers → some learning algorithms
require examples represented as vectors.
4. Train: minimize the difference between the target values (the labels in the dataset) and the
values the model maps your training examples to. Keep some examples for final evaluation: test set.
Use the rest for learning (training set) and tuning (validation set).
Model tuning is about finding the best values for the hyperparameters of the model. For
each hyperparameter setting:
• Apply algorithm to training set to learn.
• Check performance on validation set.
• Find/choose best performing setting.
5. Evaluate: check the performance of the model on the test set. This is about generalization of the
model: the goal is to estimate how well your model will do in the real world. Keep evaluation
realistic.
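
The train/tune/evaluate steps can be sketched as follows. This is only a minimal illustration under stated assumptions: the data is synthetic, the model is a simple nearest-neighbour regressor, and the hyperparameter being tuned is the number of neighbours k. None of these choices come from the course material.

```python
import random

def knn_predict(train, x, k):
    """Predict y for x as the mean target of the k nearest training points."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mean_abs_error(train, data, k):
    """Average absolute difference between predictions and true targets."""
    return sum(abs(knn_predict(train, x, k) - y) for x, y in data) / len(data)

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical data: y = 2x plus noise.
data = []
for _ in range(100):
    x = rng.uniform(0, 10)
    data.append((x, 2 * x + rng.gauss(0, 1)))
rng.shuffle(data)

# Keep some examples aside for the final evaluation; use the rest to learn and tune.
test_set, validation_set, training_set = data[:20], data[20:40], data[40:]

# For each hyperparameter value: learn on the training set, check on the validation set.
candidate_ks = [1, 3, 5, 9]
validation_errors = {k: mean_abs_error(training_set, validation_set, k)
                     for k in candidate_ks}
best_k = min(validation_errors, key=validation_errors.get)

# Only the chosen setting is evaluated on the untouched test set.
test_error = mean_abs_error(training_set, test_set, best_k)
```

The test set is used exactly once, after tuning, so that the reported error is a realistic estimate of performance on new data.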

The correlation coefficient measures the linear relationship between features. When the
correlation is 1, the points lie on a line rising from left to right; when the correlation is -1, they lie on a line falling from left to right.

The numerator of the correlation coefficient = the covariance, the denominator = the product of the standard deviations.
Covariance = to what extent the features change together. Dividing by the product of the standard deviations makes correlations
independent of units.

Covariance is a measure of the joint variability of two variables (x, y). If these two variables represent different
features, then the sign of the covariance shows the tendency of the linear relationship. Its magnitude
is not easy to interpret. The correlation coefficient is normalized and corresponds to the strength of
the linear relation.
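
A small sketch of why the normalization matters: changing the unit of one feature changes the covariance but leaves the correlation coefficient untouched. The house sizes and prices below are invented for illustration.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Population covariance: average product of deviations from the means."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def std(values):
    """Standard deviation, i.e. the square root of a variable's covariance with itself."""
    return math.sqrt(covariance(values, values))

def correlation(xs, ys):
    """Covariance divided by the product of the standard deviations."""
    return covariance(xs, ys) / (std(xs) * std(ys))

# Hypothetical data: house size and price (in thousands).
size_m2 = [50, 70, 90, 110]
price = [150, 200, 260, 300]
size_cm2 = [s * 10_000 for s in size_m2]  # same sizes, different unit

cov_m2 = covariance(size_m2, price)
cov_cm2 = covariance(size_cm2, price)  # 10,000 times larger
r_m2 = correlation(size_m2, price)
r_cm2 = correlation(size_cm2, price)   # unchanged by the unit conversion
```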

Pearson’s R only measures linear tendency. If R = 0, that doesn’t mean these two features
are unrelated; it just means the relationship is not linear. Correlation does not imply causation, but it may still enable
prediction.
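
To make the R = 0 case concrete, here is a minimal sketch with a made-up dataset in which y is completely determined by x (y = x²), yet Pearson’s R is zero because the relationship is not linear.

```python
import math

def pearson_r(xs, ys):
    """Pearson's R: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    numerator = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    denominator = math.sqrt(sum((x - mx) ** 2 for x in xs) *
                            sum((y - my) ** 2 for y in ys))
    return numerator / denominator

xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]  # perfectly dependent on x, but symmetric around 0

r = pearson_r(xs, ys)  # 0.0
```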

Correlation vs. causation. Possible causal relationships between two events A and B measured by
correlated random variables:

- A causes B.
- B causes A.
- C causes both A and B.
- The correlation is a coincidence.
- Combination of the above.

Discovery of a correlation can suggest a causal relationship, but that is not necessarily the case.
Sometimes the causal relationship can only be discovered by an experimental study → change one
variable while keeping everything else constant, and see whether the outcome changes → this is hard.
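
The "C causes both A and B" case and the role of an experiment can be simulated roughly as below. Everything here is an assumption made for illustration: the variables are synthetic Gaussian noise, and the "experiment" simply means that we assign A ourselves, independently of C.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson's R for two equally long numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    numerator = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    denominator = math.sqrt(sum((x - mx) ** 2 for x in xs) *
                            sum((y - my) ** 2 for y in ys))
    return numerator / denominator

rng = random.Random(42)  # fixed seed for reproducibility
n = 5000

# Observational data: a hidden variable C drives both A and B; A has no effect on B.
c = [rng.gauss(0, 1) for _ in range(n)]
a_observed = [ci + rng.gauss(0, 0.5) for ci in c]
b = [ci + rng.gauss(0, 0.5) for ci in c]
r_observed = pearson_r(a_observed, b)  # strongly positive

# Experiment: A is set by us, independently of C, and B is generated as before.
a_assigned = [rng.gauss(0, 1) for _ in range(n)]
r_experiment = pearson_r(a_assigned, b)  # close to zero
```

In the observational data A and B are clearly correlated, but once A is assigned independently the correlation disappears, which shows that A was never causing B.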

Linear regression is one type of regression. We have two different variables, and we want to come up with
a relationship that predicts one variable based on the other.

Regression analysis describes a relationship between random variables: the IV (independent variable, input)
and the DV (dependent variable, output). In a regression model, the relationship need not take the form of a linear
function. We focus on the linear case: f(x) = Ax + B.
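
A minimal sketch of fitting f(x) = Ax + B to data. It assumes ordinary least squares as the fitting criterion, which is a common choice but is not stated in the notes; the data points are made up.

```python
def fit_line(xs, ys):
    """Least-squares estimates of A (slope) and B (intercept) for f(x) = A*x + B."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope: how x and y vary together, relative to how much x varies on its own.
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys)) /
         sum((x - mx) ** 2 for x in xs))
    # Intercept: chosen so that the line passes through the point of means.
    b = my - a * mx
    return a, b

# Hypothetical data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

a, b = fit_line(xs, ys)  # a = 2.0, b = 1.0
prediction = a * 5 + b   # f(5) = 11.0
```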

Regression is an example of supervised learning; another example is classification. Here we
check whether the data points fall into classes, and for any single data point we determine which
class it belongs to: you assign a class to data points. This could be a positive and a negative class (1,
0). So, classification is about finding classes and the data points that belong to them (e.g. naive Bayes, KNN).
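
As one possible illustration of assigning a class to a data point, the sketch below implements a very small KNN classifier. The points, labels, and the choice k = 3 are invented for this example.

```python
import math
from collections import Counter

def knn_classify(training, point, k=3):
    """Assign the majority class among the k training points closest to `point`."""
    nearest = sorted(training, key=lambda item: math.dist(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labeled data: (features, class) with classes 0 (negative) and 1 (positive).
training = [
    ((1.0, 1.0), 0), ((1.5, 2.0), 0), ((2.0, 1.0), 0),
    ((6.0, 6.0), 1), ((7.0, 5.5), 1), ((6.5, 7.0), 1),
]

label_a = knn_classify(training, (1.2, 1.5))  # 0: closest to the negative examples
label_b = knn_classify(training, (6.4, 6.1))  # 1: closest to the positive examples
```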

Decision boundaries are boundaries that distinguish different classes in classification tasks. The
boundaries are not necessarily straight lines. A decision boundary can be seen as a model of the separation
between the two classes.

An example of unsupervised learning could be dimensionality reduction, or clustering.

Dimension reduction = the process of reducing the number of random variables under consideration
by obtaining a set of principal variables. It divides into:

- Feature selection = variable selection: the process of selecting relevant features for use in model
construction.
- Feature extraction = defining relevant features.

Clustering = we have a set of data points without any given grouping, and we are supposed to group
similar points together without having labels.
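
A rough sketch of grouping unlabeled points, using k-means as one common clustering algorithm (the notes do not name a specific algorithm at this point). The one-dimensional points and the starting centers are made up.

```python
def k_means_1d(points, centers, iterations=10):
    """Very small 1-D k-means: alternate between assigning points and moving centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            # Assign each point to its nearest center.
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical unlabeled data with two visible groups.
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
centers, clusters = k_means_1d(points, centers=[0.0, 6.0])
# centers end up near 1.0 and 5.0; no labels were used at any point
```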

Video 1.

What is Data Science? Data science is a concept to unify statistics, data analysis and their related
methods in order to understand and analyze actual phenomena with data.

What makes a Data Scientist? Data scientists use their data and analytics ability to find and interpret
rich data sources, manage large amounts of data (…), create visualizations to aid in understanding data,
build mathematical models using the data, and present and communicate the data insights/findings.

There are many related fields. They all have one commonality: data-driven science.

What is data? You need to make data numerical and binary (1=yes, 0=no). Different units make it hard
to read the data.

Interpreting data.

For the child-and-weather example: can we think of rules for when it is play time?

We want to know if the kid wanted to play, and this is what we call a target. We have been using
certain pieces of information (features) to predict this target.




Formally:

- We have our data: X (with the features outlook, temp., windy).
- Our data consists of smaller instances; ‘some instance’ is written as x.
- If we want to specifically point at a particular instance (say our first row), we write x1.
- We can see our model as a function f that, when given any instance x, gives us a prediction ŷ.
- The application of the model to some instance in our data can be written as f(x).
- Our hope is that ŷ is the same as our target y: y is given (the truth) and ŷ is a prediction.

In realistic cases, it is very important to evaluate the models (algorithms) we make: we need to
know how they perform, and whether they perform well on new data.

How do we know if our model performs well?

- Correct evaluation is incredibly important in data mining.
- We came up with some rules, but how do we know they generalize, i.e. whether the rules we learned
apply with the same success rate to data where we don’t know what the target is?

For the child play example: we got 5/6 correct, which means that the model has 83.3% accuracy. But
did we cover all the cases, and what if we are presented with new conditions? Our rules are probably
too strict. Other than the training data (where we know the labels and from which we determined our rules), we
also need test data, unseen by us, to evaluate. We can use this test data to evaluate how our model
would perform on new data.
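
The notation and the accuracy calculation can be sketched as follows. The original table for the child play example is not available here, so the six instances, the targets, and the hand-written rule in f are all hypothetical; they are only chosen so that 5 of the 6 predictions are correct.

```python
def f(x):
    """Hypothetical hand-written rule: play when it is not rainy and not windy."""
    return "yes" if x["outlook"] != "rainy" and not x["windy"] else "no"

# X: the data, one instance x per row (made-up values).
X = [
    {"outlook": "sunny", "temp": "hot", "windy": False},
    {"outlook": "sunny", "temp": "mild", "windy": True},
    {"outlook": "overcast", "temp": "mild", "windy": False},
    {"outlook": "rainy", "temp": "cool", "windy": False},
    {"outlook": "rainy", "temp": "mild", "windy": True},
    {"outlook": "overcast", "temp": "hot", "windy": True},
]
# y: the targets (the truth) for each instance.
y = ["yes", "no", "yes", "no", "no", "yes"]

# y_hat: the predictions f(x) for every instance.
y_hat = [f(x) for x in X]

# Accuracy: the fraction of instances where the prediction equals the target.
accuracy = sum(pred == truth for pred, truth in zip(y_hat, y)) / len(y)  # 5/6
```

The rule fails on the last instance, which hints at the problem described above: rules written to fit the known rows can be too strict for conditions they were not designed for. Measuring the same accuracy on separate test data gives a more honest estimate.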

Case: prediction of house prices:

- Would you be able to determine the price of a house? → you need expert knowledge. This
requires many observations to gain experience.
- Can you come up with a few features to predict the price of a house?
• Number of bedrooms;
• Big garden, yes or no;
• Good neighborhood.

How do we evaluate the house price? In the previous example, we had a clear binary prediction: either
yes or no. Say we need more classes; we would still be predicting a nominal target (order does not
matter). What about a numeric target like house pricing? We can no longer say that we got x out of n
correct, and therefore we can’t use accuracy. Now we are more likely interested in how far our
prediction was off from the actual value: this is called error.
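
A minimal sketch of measuring error for a numeric target. The prices are invented, and averaging the absolute errors (mean absolute error) is just one common way to summarize them; the notes do not prescribe a particular error measure here.

```python
# Hypothetical actual and predicted house prices.
actual = [250_000, 310_000, 180_000]
predicted = [240_000, 330_000, 185_000]

# Error per house: how far the prediction was off from the actual value.
errors = [p - t for p, t in zip(predicted, actual)]  # [-10000, 20000, 5000]

# One possible summary: the mean absolute error.
mean_absolute_error = sum(abs(e) for e in errors) / len(errors)  # about 11666.67
```

Taking the absolute value prevents over- and underestimates from cancelling each other out.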

Seller: Robinvanheesch, Tilburg University