Confidence Confident
Moment of lecture @February 14, 2024
Review @February 17, 2024
Materials advdatanalysis-01-introduction.pdf
Last Edited @March 12, 2025 10:32 AM
People have always studied the human body along disciplinary lines: each
discipline has its own perspective, but the disciplines understand each other.
That has not changed, but a new perspective is emerging: large-scale data, made
possible by newly available technologies. This perspective requires new
techniques, which are all grouped under ‘big data’ (nowadays: deep learning,
AI, etc.).
Big data refers to data for which conventional computing techniques are no
longer sufficient (for example, because of its size). Existing tools will have
to solve more complex problems, which means people need to use those tools more
smartly. Big data is also considered a disruptive trend in computer science.
Big data is characterized by 4 main aspects:
Volume: the amount of data you’re dealing with. Example: a genome printed on
paper would fill stacks and stacks of pages.
Moore's law is a prediction that the number of transistors on a
chip doubles every two years, making computers faster and cheaper.
Velocity: the speed at which data is produced and collected, and the fact that
it is produced continuously: machines generate data all the time. Data is
everywhere and it changes our world.
For example: a smartphone holds a massive amount of data at all times.
There is a need for new, effective, high-tech data transfer approaches.
The speed of data production increases faster than the staff can keep up with.
Introduction 1
Variety: in the life sciences there are many different data types and data
sets. A distinction is made between structured and unstructured data → 80% of
the data is unstructured. The life sciences have much more variability in the
data that is collected.
Examples: DNA sequencing, morphology, metabolic data, protein structures, etc.
The transcriptome is more variable than the genome.
Veracity: the data is never perfect (for example: noise, biases, missing data
points). This is especially problematic in the life sciences, where imperfect
data is present almost everywhere (it also occurs in other fields, but in the
life sciences it is always there).
⇒ Large-scale data and AI have brought a new, data-intensive research paradigm.
A lot of science nowadays starts from data, from which predictions and
hypotheses are made. In most cases the paradigm shifts during the research.
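As a rough illustration of the Moore's law remark under Volume, the doubling
rule can be turned into a growth factor. This is only a sketch: the function
name is made up for this example, and the two-year doubling period is simply the
figure quoted above, not a measured value.

```python
def growth_factor(years: float, doubling_period: float = 2.0) -> float:
    """Factor by which a quantity grows if it doubles every `doubling_period` years."""
    return 2.0 ** (years / doubling_period)


# Ten years at one doubling every two years means five doublings.
print(growth_factor(10))  # 32.0
```

Five doublings already give a 32-fold increase, which is why exponential growth
in data volume quickly outpaces tools that scale linearly.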
Terminology
Data = a collection of objects (also known as records, points, cases, etc.) and
their attributes. The objects could be the samples and the attributes the
measurements performed on them; an attribute is also called a feature or a
variable.
Attribute = a property or characteristic of an object. A collection of
attributes describes the object → more attributes means more knowledge about
the object.
Example: student = object; the attributes of a student are grades, student
number, etc.
It is typical to have the objects in rows and attributes in columns.
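The rows-and-columns layout can be sketched in plain Python. Everything in this
example is hypothetical: the column names, the student records and the helper
`attribute` are invented purely to illustrate the terminology.

```python
# Hypothetical records: each row is one object (a student),
# each column is one attribute. All values are invented.
columns = ["student_number", "name", "grade"]
rows = [
    (1001, "An", 14.5),
    (1002, "Bram", 11.0),
    (1003, "Chloe", 16.0),
]


def attribute(rows, columns, name):
    """Return the values of one attribute (one column) across all objects."""
    index = columns.index(name)
    return [row[index] for row in rows]


# One object = one row; one attribute = one column.
first_object = dict(zip(columns, rows[0]))
grades = attribute(rows, columns, "grade")
print(first_object)
print(grades)
```

Reading across a row gives everything known about one object; reading down a
column gives one attribute for all objects.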
An attribute value = a number or symbol assigned to an attribute. Examples of
how attribute values differ from attributes:
The same attribute can be mapped to different attribute values: height can be
measured in feet or meters.
Different attributes can be mapped to the same set of values: the attribute
values for ID and age are both integers.
→ The properties of the attribute values can still differ (an ID number has no
limit, but age does).
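Both points can be illustrated with a short sketch. The heights, IDs and ages
are invented, and `meters_to_feet` is a helper written for this example; the
only external fact used is that one foot is defined as exactly 0.3048 m.

```python
FEET_PER_METER = 1 / 0.3048  # one foot is defined as exactly 0.3048 m


def meters_to_feet(meters: float) -> float:
    """Convert a length in meters to feet."""
    return meters * FEET_PER_METER


# Same attribute (height), two different attribute values.
height_m = 1.80
height_ft = meters_to_feet(height_m)

# Different attributes (ID and age), same set of values (integers).
ids = [1001, 1002, 1003]  # invented
ages = [19, 21, 23]       # invented

mean_age = sum(ages) / len(ages)  # meaningful: an average age
mean_id = sum(ids) / len(ids)     # computable, but says nothing about the students
print(round(height_ft, 2), mean_age, mean_id)
```

The data type (integer) is the same for ID and age, but only for age does
arithmetic such as averaging carry any meaning.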
There are different types of attributes:
Nominal → only has the distinction mathematical property
Examples: ID numbers, eye color, zip codes
Ordinal → has both the distinction and order mathematical property
Examples: rankings (e.g., taste of potato chips on a scale from 1-10),
grades, height in {tall, medium, short}
Interval → has the distinction, order and addition mathematical properties
Examples: calendar dates, temperatures in Celsius or Fahrenheit.
Ratio → has all 4 mathematical properties
Examples: temperature in Kelvin, length, time, counts
⇒ The attribute types are distinguished by the mathematical properties they
have: distinction, order, addition and/or multiplication.
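The four attribute types and their properties can be written down as a small
lookup table. This is a sketch for illustration only: the `PROPERTIES` table and
the `supports` helper are hypothetical names, and the temperatures are invented.
The second half shows why Celsius is interval and Kelvin is ratio.

```python
# Properties each attribute type supports, following the list above.
PROPERTIES = {
    "nominal": {"distinction"},
    "ordinal": {"distinction", "order"},
    "interval": {"distinction", "order", "addition"},
    "ratio": {"distinction", "order", "addition", "multiplication"},
}


def supports(attribute_type: str, prop: str) -> bool:
    """Check whether an attribute type has a given mathematical property."""
    return prop in PROPERTIES[attribute_type]


# Differences are meaningful on both temperature scales and agree,
# but ratios are only meaningful in Kelvin.
c_low, c_high = 10.0, 20.0
k_low, k_high = c_low + 273.15, c_high + 273.15

celsius_ratio = c_high / c_low  # 2.0, yet 20 °C is not "twice as hot" as 10 °C
kelvin_ratio = k_high / k_low   # roughly 1.035

print(supports("interval", "multiplication"))  # False
print(celsius_ratio, round(kelvin_ratio, 3))
```

Because the zero point of the Celsius scale is arbitrary, multiplying or
dividing Celsius values gives numbers without physical meaning, while adding
and subtracting them is fine.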
You can also make a distinction between discrete and continuous attributes (a
discrete attribute typically takes integer values; a continuous attribute takes
real values, which means it can have a fractional part):
Discrete Attribute: has only a finite or countably infinite set of values.
Often represented as integer variables.
Examples: zip codes, counts, or the set of words in a collection of
documents
Continuous Attribute: has real numbers as attribute values. Practically, real
values can only be measured and represented using a finite number of
digits. Continuous attributes are typically represented as floating-point
variables.
Examples: temperature, height, or weight.
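A short sketch of both kinds of attribute, with an invented sentence as input.
The word counts are a discrete attribute; the floating-point sum shows what
"a finite number of digits" means in practice, which is why continuous values
are usually compared with a tolerance rather than with `==`.

```python
import math
from collections import Counter

# Discrete attribute: word counts are integers.
words = "the gene regulates the protein and the protein folds".split()
counts = Counter(words)

# Continuous attribute: real values are stored with a finite number of digits,
# so 0.1 + 0.2 is not stored as exactly 0.3.
total = 0.1 + 0.2
exactly_equal = (total == 0.3)           # False
close_enough = math.isclose(total, 0.3)  # True

print(counts["the"], counts["protein"])
print(exactly_equal, close_enough)
```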
Dataset types
There are 3 main types of datasets:
Record data
Graph data
Ordered data
Record data
= data that consists of a collection of records, each of which consists of a fixed
set of attributes.
Data Matrix