Statistics: the branch of mathematics concerned with the use of data in the context of
uncertainty; it builds on probability theory
Computer science / computing science (CS): the study of fundamental concepts for understanding
and exploring the natural and artificial world in computational terms
Computational statistics: implementing statistical methods on computers, including methods that
were unthinkable before the computer age, and using computation to cope with analytically
intractable problems
Basics in probability & statistics:
Data = fundamental to research/ learning:
- Science/ learning relies on data
- Making sense of data & predicting future values is fundamental to statistics & data science
Experiment 1: Bernoulli trial:
- Bernoulli trial = experiment or observation that has exactly two possible outcomes, usually
success and fail (binary)
- Each trial has only two outcomes
- Has a fixed probability of success, say p (also written pi in these notes)
- Is independent of other trials
- What we want to learn: when tossing a coin, what is the probability of observing heads?
- Approach 1 = empirical approach
- Gather data
- Estimate the probability as the proportion of tosses that result in a hit (heads)
- Approach 2 = mathematical approach: can we set up a mathematical model for the experiment?
- That is, can we express the data-generating mechanism as a mathematical, or more
precisely statistical, model?
- To do this, carefully consider the data (“observations”) that arise and how they arise from
the experiment
- Observations: “hit” (=heads/ 1) or “fail” (=tails/0) ⇒ (the data)
- Experiment: toss a coin ⇒ (the generating mechanism)
- Note: the outcome of the experiment is uncertain (random experiment)
- A binary random variable follows a Bernoulli distribution
- X is a Bernoulli random variable with success probability pi: P(X=1) = pi and P(X=0) = 1 - pi
Experiment 1: comparison empirical to mathematical result:
- Probability of observing heads when tossing a coin
- Based on the actual experiment: P(X=1) = 0.4 (in-class result)
- Based on the mathematical model and assuming a fair coin: P(X=1) = 0.5
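The gap between an empirical estimate and the model value of 0.5 can be explored with a simulated coin. The following is a minimal Python sketch, not part of the in-class experiment: the function name estimate_heads_probability and the seed values are illustrative assumptions, and the coin is assumed fair unless p is changed.

```python
import random


def estimate_heads_probability(n_tosses, p=0.5, seed=None):
    """Simulate n_tosses Bernoulli trials and return the proportion of heads.

    p is the assumed success probability (0.5 for a fair coin).
    """
    rng = random.Random(seed)
    # Each toss is a hit (heads) when a uniform draw falls below p.
    heads = sum(1 for _ in range(n_tosses) if rng.random() < p)
    return heads / n_tosses


if __name__ == "__main__":
    # The estimate typically moves closer to p as the number of tosses grows.
    for n in (10, 100, 10_000):
        print(n, estimate_heads_probability(n, seed=1))
```

With only a handful of tosses, an estimate such as 0.4 is well within what a fair coin can produce; larger samples tend to land closer to 0.5.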
Experiment 2: Binomial trial:
- Binomial trial = a sequence of n independent Bernoulli trials, each with the same probability of
success p
- E.g., flipping a coin 10 times and counting how many heads you get
- What we want to learn:
- When tossing a coin five times, what is the probability of observing exactly four heads?
- Approach 1 = empirical approach: let’s gather data
- Toss a coin 5 times and record the number of heads
- Estimate the probability: what proportion of repetitions give exactly four heads?
- Based on the experiment, what is the estimated probability of observing 4 heads when
tossing a coin five times? P(X=4) = …
- How confident are you in this result? Why?
- Approach 2 = mathematical approach: can we set up a mathematical model for the experiment?
- That is, can we express the data-generating mechanism as a mathematical (more
precisely, statistical) model?
- To do this, carefully consider the data (“observations”) that arise and how they arise from
the experiment
- Observations: number of heads in 5 coin tosses (the data)
- Experiment: toss a coin 5 times in a row (the generating mechanism)
- Let’s look at the data generating mechanism in more detail
- Observed data = number of heads when tossing a coin 5 times:
- The outcome of this experiment is obtained by counting the number of hits
when repeating a Bernoulli trial n (here n = 5) times
- These Bernoulli trials are independent (meaning…)
- The outcome of such an experiment based on counting the successes
(ones) of n independent Bernoulli trials follows a Binomial distribution
Binomial distribution:
- Not one coin toss, but repeating the same experiment n times
- Each trial:
- Has two outcomes
- Has the same success probability pi
- Is independent of the others (trials are not correlated, previous trial does not affect next
trial)
- X = number of successes in n trials
- Binomial coefficient n! / (k!(n-k)!) = counts in how many different ways k successes can occur
among n trials
- For example, n = 3 and k = 2
- Success patterns: SSF, SFS, FSS
- There are 3 ways: 3! / (2!(3-2)!) = 3
- X ~ Bin(n, pi) = X follows a binomial distribution with n trials and success probability pi per trial
- P(X=k) = [n! / (k!(n-k)!)] * pi^k * (1-pi)^(n-k)
- If you toss a fair coin n = 5 times => X ~ Bin(5, 0.5)
- P(X=2) = probability of getting exactly 2 heads in 5 tosses
- P(X=2) = [5! / (2!3!)] * (0.5)^2 * (0.5)^3 = 10 * 0.03125 = 0.3125
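The binomial coefficient and the probability P(X=2) for X ~ Bin(5, 0.5) can be checked numerically. This is a small Python sketch using math.comb from the standard library; the helper name binomial_pmf is an illustrative choice, not something defined in the course.

```python
from math import comb


def binomial_pmf(k, n, pi):
    """Return P(X = k) for X ~ Bin(n, pi)."""
    # comb(n, k) is the binomial coefficient n! / (k! (n-k)!).
    return comb(n, k) * pi**k * (1 - pi) ** (n - k)


if __name__ == "__main__":
    print(comb(3, 2))               # 3 patterns: SSF, SFS, FSS
    print(binomial_pmf(2, 5, 0.5))  # 0.3125
```

Summing binomial_pmf(k, 5, 0.5) over k = 0, ..., 5 gives 1, which is a quick sanity check that the function describes a probability distribution.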
Experiment 2:
- When tossing a coin five times, what is the probability of observing exactly four heads?
- Based on the experiment: P(4H | 5 tosses) = …
- Now calculate the analytic (mathematical) solution
- Probability theory: P(4H | 5 tosses) = …
- How many repetitions of the experiment do you need for a “good” guess of the solution? How
practical is the (real) experiment procedure?
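The question of how many repetitions are needed can be explored by letting a computer repeat the five-toss experiment. The following Python sketch assumes a fair coin; the function names, seed, and repetition counts are illustrative assumptions and do not reproduce the in-class data.

```python
import random
from math import comb


def simulate_k_heads(n_repeats, n_tosses=5, k=4, pi=0.5, seed=None):
    """Repeat the n_tosses experiment n_repeats times.

    Returns the proportion of repetitions with exactly k heads.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_repeats):
        # Count the heads in one run of n_tosses independent Bernoulli trials.
        heads = sum(rng.random() < pi for _ in range(n_tosses))
        if heads == k:
            hits += 1
    return hits / n_repeats


def binomial_pmf(k, n, pi):
    """Analytic P(X = k) for X ~ Bin(n, pi)."""
    return comb(n, k) * pi**k * (1 - pi) ** (n - k)


if __name__ == "__main__":
    print("analytic:", binomial_pmf(4, 5, 0.5))
    # Estimates from few repetitions vary a lot; more repetitions stabilise them.
    for n_repeats in (10, 100, 10_000):
        print(n_repeats, simulate_k_heads(n_repeats, seed=1))
```

Running the loop with different seeds shows how much a small experiment can vary, which is one way to judge confidence in a classroom-sized result.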
Experiment 3: Hypergeometric trial
- What we want to learn: how likely is it that someone can tell by taste whether a sample of
chocolate is Belgian or not?
- Mathematical approach: let’s test the hypothesis of random guessing
- P(X correct | guessing) = … (in class activity)
- Not binomial, since each answer depends on the previous answers (the trials are not independent)
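Under pure guessing, the number of correct identifications in a tasting test of this kind can be modelled with a hypergeometric distribution. The Python sketch below uses a purely hypothetical setup, since the in-class design is not given here: 8 samples of which 4 are Belgian, and a taster who must label exactly 4 samples as Belgian. The function name hypergeom_pmf and all numbers are assumptions for illustration only.

```python
from math import comb


def hypergeom_pmf(k, N, K, n):
    """P(X = k) when drawing n items without replacement from N items,
    K of which are successes (here: Belgian samples)."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


if __name__ == "__main__":
    # Hypothetical numbers, NOT the in-class experiment:
    N, K, n = 8, 4, 4  # 8 samples, 4 Belgian, taster picks 4 as Belgian
    for k in range(n + 1):
        # Probability that exactly k of the picked samples are truly Belgian
        print(k, hypergeom_pmf(k, N, K, n))
```

Because the samples are picked without replacement, each pick changes what remains, which is why the binomial model with independent trials does not apply.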
Experiment 4: 3-door problem
- Approach 1 = empirical approach: gather data
- From the data we learn that there were almost twice as many winners among those who
switched
- Drawback: can be expensive to collect data
- Approach 2 = mathematical statistics:
- Using laws of probability (product rule):
- P(win | stay) = P(winning door at first) = ⅓
- P(win | switch) = P(non-winning door at first) * P(winning door at second) = ⅔ * 1/1 = ⅔
- Drawback: some problems (too) difficult to solve in an analytic way
- If it’s difficult to get real data ⇒ rely on simulation
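For the 3-door problem, a simulation can stand in for costly real data. The following Python sketch assumes the standard rules (the host always opens a non-winning door that the player did not pick); the function names play_game and win_rate and the number of games are illustrative assumptions.

```python
import random


def play_game(switch, rng):
    """Play one round of the 3-door game; return True if the player wins."""
    doors = [0, 1, 2]
    prize = rng.choice(doors)
    first_pick = rng.choice(doors)
    # The host opens a door that is neither the player's pick nor the prize.
    opened = rng.choice([d for d in doors if d != first_pick and d != prize])
    if switch:
        # Switch to the single remaining closed door.
        final_pick = next(d for d in doors if d != first_pick and d != opened)
    else:
        final_pick = first_pick
    return final_pick == prize


def win_rate(switch, n_games, seed=None):
    """Proportion of wins over n_games simulated rounds."""
    rng = random.Random(seed)
    return sum(play_game(switch, rng) for _ in range(n_games)) / n_games


if __name__ == "__main__":
    print("stay:  ", win_rate(False, 10_000, seed=1))
    print("switch:", win_rate(True, 10_000, seed=1))
```

The simulated win rates should land near ⅓ for staying and ⅔ for switching, in line with the product-rule calculation.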