Class notes

Notes Advanced Data Analysis

Rating
-
Sold
2
Pages
131
Uploaded on
19-02-2024
Written in
2022/2023

This document consists of college notes from the theory lessons supplemented with the explanatory figures and additional information. Therefore, it contains all theory that should be studied for the exam except the practicals.


Content preview

Table of contents

Chapter 1: Introduction
1.1: Introduction
o Before we start
§ A few practical things
® Background
o A bit of context
§ Big data
® Definition of big data
® Big data is characterized by:
® Large scale data and AI brought a new data intensive research paradigm
§ What is data? Some definitions of what we are dealing with and how we can represent it?
® Data can be given by objects and attributes
a) Data object
b) Attribute
® Dataset types
a) Record:
b) Graph:
c) Ordered:
§ Data mining
® What is data mining?
® Examples: Is it data mining?
® Data mining challenges
® Major tasks of data mining (after preprocessing)
1) Supervised data mining
2) Unsupervised data mining
® Data mining is business
® Value of data
® Evolution

Chapter 2: Processing principles
2.1: Processing principles
o Introduction
§ What you usually have vs. what you want and need
® In reality you usually have ‘dirty data’
® Data that you actually want/need is:
o Pre-processing and transformation → to get more minable data that can be further used
§ Role of pre-processing and transformation
® Unstructured data
® Common data processing steps that each make data more ready for data mining
a) Feature extraction:
b) Attribute transformation = feature transformation
c) Discretization
d) Aggregation
e) Noise removal
f) Identifying outliers → outlier removal
g) Sampling
h) Handling duplicated data
i) Handling missing values
j) Dimensionality reduction
® Processing steps for specific data types: what types of features are we dealing with?
a) Image data:
b) Survey data
c) Sequence data
d) Text data
e) Omics data
f) Temporal

Chapter 3: Unsupervised clustering
3.1: Unsupervised clustering
o Introduction
§ Unsupervised vs. supervised
® Quick overview of the difference between supervised and unsupervised
§ Clustering
® What is clustering?
® Exists in different domains and has different names but it does something quite similar
® Natural grouping
§ Similarity
® What is similarity?
® Defining distance measures
® How do we measure similarity?
§ Dendrograms
® What is it?
® Example
® Use of dendrograms
§ Algorithms
o 2 types of clustering
§ Hierarchical clustering
® Principle:
® Heuristic search (= a more practical, feasible way to come up with the best dendrogram, without forgetting that there are multiple options out there)
→ Since we cannot test all possible trees, we have to do a heuristic search over the possible trees. We could do this bottom-up or top-down.
→ use a heuristic search → we cannot guarantee we get the optimal solution, but it is way faster than testing every option
® How to measure the distance between 2 clusters based on the distance function?
§ Partitional clustering
® What is it?
® How many clusters? → how to specify k?
® K-means steps (simple & efficient algorithm)
® Importance of choosing initial centroids
® Weakness of k-means

Chapter 4: Principal component analysis (PCA)
4.1: Principal component analysis (PCA)
o PCA as the backbone of modern data analysis
§ What is principal component analysis and why is it necessary?
® PCA is the first thing you do when you get a new dataset
® Reasons to do PCA:
® Multivariate data
§ Important concepts
® Basic variable statistics
a) Mean
b) Median
c) Range
d) Variance
e) Standard deviation
® Data transformation
2) Comparing variables
o How does PCA work?
§ Data projection
® Too many variables
® What’s data projection?
® Why use projections?
® Data visualization and simplification → data projection should capture as much of the information as possible
® Geometric interpretation of PCA
® PCA output: IMPORTANT for the exam to interpret output!
® PCA usage: scores and loadings
® PCA examples
§ t-SNE
® = alternative method for data projection
® How?
® Comparison PCA and t-SNE
® Perplexity
® Example: t-SNE for single cell RNAseq

Chapter 5: Supervised learning
5.1: Supervised learning
o Introduction
§ Classification problem = problem we have a lot of experience with
® Use features of an object to assign a hopefully correct label to an object
® Pigeon problems: training pigeons to classify paintings
® Grasshopper problem: Given a collection of annotated data. In this case 5 Katydids and 5 Grasshoppers, decide what type of insect the unlabeled example is (2 similar, but not identical animals)
o Regression vs. classification
§ General
® Differences
§ Classification
a) Simple linear classifier
® General: what is a simple linear classifier?
® Support vector machines (SVM)
® Decision value
® Predictive accuracy
® Confusion matrix = matrix that fits all of the samples with the classified label vs. the true label
® Thresholds and accuracy
® ROC and PR curves
b) Nearest neighbor classifier
® What is this type of classifier?

Chapter 6: Regression
6.1: Regression
o Regression = a supervised machine learning (ML) model and can be used to analyze multivariate data (in data science you often need to deal with regression problems BUT this is different from ‘normal’ statistics)
§ The regression problem
® Given a collection of annotated data (in this case a number of insects with their ages), you need to try to predict a variable about the data
§ Regression vs. classification
® Classification
® Regression
§ Types of regression
® Simple linear regression
® Multiple linear regression
® Non-linear regression
® Logistic regression
® Cox regression
® Regularized regression
§ Considerations that need to be made with regression
® Overfitting
- Intuitively we would say 9
a) K-fold cross validation
b) Leave-one-out cross validation (CV) = special case of K-fold cross validation when K = number of samples
® Speed and scalability
® Interpretability → model interpretability is really important and leads to model transparency
® Robustness

Chapter 7: Machine learning methods
7.1: Machine learning methods
o Supervised machine learning methods
§ Recap
® Supervised vs. unsupervised
§ Classification
® Classification
® Classification algorithms
a) Support vector machines
b) Decision trees
c) Random forest
d) Neural networks (NN) and deep learning
e) K-nearest neighbors



Chapter 1: Introduction

1.1: Introduction
• Introduction
o Before we start
§ A few practical things
® Background
¨ Background on bioinformatics, statistics, omics data analysis (NGS,
microarrays, …), data mining and machine learning
o A bit of context
§ Big data
® What is big data?
¨ Over the last 5 decades, the way we look at the human system has evolved:
from seeing the human body from multi-disciplinary perspectives to seeing the
human system as a complex interplay between genes, proteins, small
molecules, … that interact with each other in a very complex way and




Document information

Type
Class notes
Professor(s)
Kris Laukens
Contains
All classes

Seller
jentebeeldens1, Universiteit Antwerpen