Lecture notes

Extensive summary of big data lecture and practical notes

Pages: 119
Uploaded on: 28-10-2025
Written in: 2025/2026

This document is a summary of the big data course. It includes the theory lecture notes as well as the practical notes; both will be tested on the exam. It is an extensive summary with examples and extra notes, and much of what was said in the lectures is written down verbatim or summarized clearly.


Lecture 1
Concept of big data depends on:
- What kind of data you’re dealing with
- What resources you have
- What you want to do with the data
⇒ No fixed definition, the concept changes over time
- In the past:
  - Storage was expensive
  - Only the most crucial data was preserved
  - Most companies only consulted historical data rather than analysing it

Storing the data:
- Recent trends:
  - Storage is (relatively) cheap and easy
  - Companies and governments preserve huge amounts of data
  - Easier
  - There is a lot more data being generated
    - Customer information, historical purchases, click logs, search histories, patient histories, financial transactions, GPS trajectories, usage logs, images/audio/video, sensor data ⇒ more data being collected
  - More and more companies and governments rely on data analysis
    - Recommender systems, next-event prediction (flood warnings), fraud detection, predictive maintenance (sensor data, output of machine), image recognition, COVID contact tracing
⇒ Issues:
- The quantity of data
- The speed with which you have to process the data, to produce outputs

Making data useful:
- However:
  - Data analysis is computationally intensive and expensive
  - Examples:
    - Online recommender systems: require instant results
    - Frequent pattern mining: time complexity is exponential in the number of distinct items, regardless of the number of transactions (e.g., market basket analysis)
    - Multi-label classification: exponential number of possible label combinations that could be assigned to a new sample
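
To make the two exponential examples above concrete, here is a minimal sketch that counts candidate itemsets and label combinations. The item names and function names are made up for illustration; the point is only that the counts grow as 2^n in the number of items or labels.

```python
from itertools import combinations

def count_candidate_itemsets(items):
    """Count every non-empty subset of `items` by explicit enumeration."""
    return sum(
        1
        for size in range(1, len(items) + 1)
        for _ in combinations(items, size)
    )

def count_label_assignments(num_labels):
    """Each label is either assigned or not, so there are 2**num_labels combinations."""
    return 2 ** num_labels

# Hypothetical item catalogue for a market basket analysis.
basket_items = ["bread", "milk", "eggs", "beer"]
print(count_candidate_itemsets(basket_items))  # 15, i.e. 2**4 - 1

# The count depends only on the number of distinct items.
for n in (10, 20, 30):
    print(n, 2 ** n - 1)
```

With only 30 distinct items there are already more than a billion candidate itemsets, before a single transaction has been looked at.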

So what is big data:
- Dependent on the use case
- Data becomes big data when it becomes too large or too complex to be analyzed with traditional data analysis software
  - Analysis becomes too slow or too unreliable
  - Systems become unresponsive (error messages, run out of hard disk space)
  - Day-to-day business is impacted

Three aspects of big data:
- Volume:
  - The actual quantity of data that is gathered (gigabytes, etc.) ⇒ how much data do you have?
  - Number of events logged, number of transactions (rows in the data), number of attributes (columns) describing each event/transaction
  - Can be an issue if there’s too much of it
- Variety:
  - The different types of data that are gathered
  - Some attributes may be numeric, others textual
  - Structured vs unstructured data
  - Irregular timing
    - Sensor data may come in at regular time intervals, while the accompanying log data is irregular
  - The variety of data makes the analysis more complex and challenging
- Velocity:
  - The speed at which new data is coming in and the speed at which data must be handled
  - The time intervals at which data comes in
    - If the data comes in at a higher speed than you can handle, then you have a problem
  - Two aspects:
    - How fast is new data coming in?
    - How fast do you need to handle the new data? (how fast do you need to produce output?)
  - May result in irrecoverable bottlenecks
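
The velocity problem can be illustrated with a small simulation of a processing queue. This is only a sketch under simplifying assumptions (constant arrival and processing rates, a hypothetical `backlog_over_time` helper); real workloads are burstier.

```python
def backlog_over_time(arrivals_per_tick, capacity_per_tick, ticks):
    """Return the queue length after each tick for constant arrival and processing rates."""
    backlog = 0
    history = []
    for _ in range(ticks):
        backlog += arrivals_per_tick                 # new data comes in
        backlog -= min(backlog, capacity_per_tick)   # handle as much as we can
        history.append(backlog)
    return history

# Processing keeps up with the incoming data: no backlog.
print(backlog_over_time(100, 120, 5))  # [0, 0, 0, 0, 0]

# Data arrives faster than it can be handled: the backlog grows without bound.
print(backlog_over_time(100, 80, 5))   # [20, 40, 60, 80, 100]
```

In the second case the queue never recovers, which is one way to read the "irrecoverable bottleneck" above.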

What can we do about it?
- Invest in hardware
  - Store more data ⇒ doesn’t necessarily help with sufficiently speeding up the computations
  - Process the data faster
    - Typically (sub)linearly faster, which doesn’t help much if an algorithm has exponential complexity
    - Exponential complexity = if you have to process 100 data items, it takes 2^100 time units
    - With more hardware (2 PCs instead of 1) ⇒ at best 2^100 / 2 = 2^99 time units ⇒ still far too many
    - Linearly reducing the runtime doesn’t help if the runtime is exponential
- Design intelligent algorithms to speed up the analysis
  - Specifically make use of available hardware resources
  - Provide good approximate results at a fraction of the cost/time
  - Take longer to build a model that can then be used on-the-fly (recommender systems, precomputed)
- We focus on the latter (designing intelligent algorithms)
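
The hardware argument above can be checked with a few lines of arithmetic. The sketch below assumes an idealised algorithm that needs exactly 2^n time units for n items and a perfect linear speedup from extra machines; both function names are invented for this illustration.

```python
import math

def runtime_units(num_items, num_machines=1):
    """Time units for a 2**n algorithm, assuming a perfect linear speedup."""
    return 2 ** num_items // num_machines

def extra_items_affordable(num_machines):
    """How many more items fit in the same time budget when the work is split perfectly."""
    return math.floor(math.log2(num_machines))

# Two machines only take one factor of two off the exponent.
print(runtime_units(100, 2) == 2 ** 99)  # True

# Even 1024 machines only buy room for 10 additional items.
print(extra_items_affordable(2))     # 1
print(extra_items_affordable(1024))  # 10
```

Doubling the hardware lets you process one more item in the same time, so hardware alone cannot keep up with exponential growth.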

Parallel computing
Goal: leveraging the full potential of your multicore, multiprocessor, multicomputer system
- If you have to process large amounts of data, it would be a shame not to use all n cores of a CPU
- If a single system does not suffice, how can you set up multiple computers so that they work together to solve a problem? For instance, you can rent a cluster of 100 instances using the cloud to do some computations that take 10 hours, but then what?

Goal of parallel processing is to reduce computation time (not to simplify the problem)
- Split the problem into smaller parts and assign these smaller parts to different processors/different machines
- Algorithms are typically designed to solve a problem in serial fashion
- To fully leverage the power of your multicore CPU you need to adapt your algorithm: split your problem into smaller parts that can be executed in parallel
- We can’t always expect to parallelize every part of the algorithm; however, in some cases it is almost trivial to split the entire problem into smaller parts that can run in parallel, i.e. the problem is embarrassingly parallel
  - If an algorithm is embarrassingly parallel, you can split it this way and achieve optimal runtime gains
  - In that case you can expect a linear speedup, i.e. executing two tasks in parallel on two cores should halve the running time
  - E.g., a task takes 4 hours; split it over 4 different machines ⇒ done in 1 hour ⇒ linear speedup

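Example: adding numbers in parallel

- Want to add 8 numbers ⇒ need 7 additions (7 units of time when done serially)
- Parallel:
  - 4 additions in the first step, 2 in the second, 1 in the third
  - The 7 additions are spread over four parallel processors ⇒ takes 3 units of time instead of 7 ⇒ speedup of 7/3 ≈ 2.3 with 4 processors ⇒ not a linear speedup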
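
The pairwise addition scheme can be sketched as follows. The code runs serially and only counts how many parallel steps would be needed if every addition within one level ran on its own processor; `pairwise_sum` is a name chosen for this illustration and it assumes a non-empty input.

```python
def pairwise_sum(values):
    """Add neighbours level by level; return (total, parallel_steps, additions)."""
    level = list(values)
    steps = 0
    additions = 0
    while len(level) > 1:
        next_level = []
        # The additions within one level are independent of each other, so
        # with enough processors they could all run at the same time.
        for i in range(0, len(level) - 1, 2):
            next_level.append(level[i] + level[i + 1])
            additions += 1
        if len(level) % 2 == 1:
            next_level.append(level[-1])  # odd element carries over unchanged
        level = next_level
        steps += 1
    return level[0], steps, additions

total, steps, additions = pairwise_sum([1, 2, 3, 4, 5, 6, 7, 8])
print(total, steps, additions)  # 36 3 7
print(additions / steps)        # 7 / 3, roughly 2.33: the speedup with 4 processors
```

The speedup stays below 4 because the second and third steps cannot keep all four processors busy.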

Parallel computation:
- Task parallelism: multiple tasks are applied to the same data in parallel
  - E.g., you want to run multiple analyses that are independent of each other ⇒ parallelise this: every machine gets the same data set, but each machine performs a different task on it
  - Imagine you have a book (the dataset).
    - Person A counts how many words are in the book.
    - Person B finds the most common word.
- Data parallelism: a calculation is performed in parallel on many different data chunks
  - E.g., you want to do a single task on your big data set: divide the data into chunks and give each machine a part of the data, so that the same task is processed on each machine on a different part of the data
  - Each machine (or core) gets just a portion of the dataset, and does the same kind of work.
  - Imagine you have a big stack of 1,000 books (the dataset).
    - Person A reads books 1–250 and counts the words.
    - Person B reads books 251–500 and counts the words.
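
The difference between the two forms can be sketched with Python's standard `concurrent.futures` module. The toy `books` list and the helper names are assumptions made for this example. Threads are used only to show the structure; for CPU-heavy work in CPython one would typically use processes or separate machines instead.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Toy dataset: each string stands in for one book.
books = ["the cat sat on the mat", "the dog ate the bone", "a bird sang", "the end"]

def count_words(texts):
    """Total number of words in the given texts."""
    return sum(len(text.split()) for text in texts)

def most_common_word(texts):
    """The single most frequent word across the given texts."""
    counts = Counter(word for text in texts for word in text.split())
    return counts.most_common(1)[0][0]

with ThreadPoolExecutor(max_workers=2) as pool:
    # Task parallelism: two different tasks, both on the full dataset.
    count_future = pool.submit(count_words, books)
    common_future = pool.submit(most_common_word, books)
    task_results = (count_future.result(), common_future.result())

    # Data parallelism: the same task, each worker on a different chunk.
    chunks = [books[:2], books[2:]]
    partial_counts = list(pool.map(count_words, chunks))

print(task_results)         # (16, 'the')
print(sum(partial_counts))  # 16
```

In the data-parallel case the partial results still have to be combined (here with `sum`), which is a step the task-parallel case does not need.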


Lecturer(s): Boris Čule and Stijn Rotman
Contains: all lectures
