Week 1: Introduction to statistics and optimisation
Statistics: tools to understand data
- Data collection
- Estimators
- Hypothesis testing
- Confidence intervals
Optimisation: tools to make decisions
- Linear programming (LP)
- Integer programming
- Dynamic programming
Probability and statistics
- A pail of black and white balls, draw a handful
- Probability: given information in the pail, what is in your hand?
- Statistics: given information in your hand, what is in the pail?
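As a rough illustration of the two directions, the sketch below assumes a hypothetical pail of 100 balls, 30 of them black, and a handful of 10. The pail contents, the handful size and the function name are made up for illustration; they are not from the lecture.

```python
import math
import random

PAIL_SIZE = 100      # hypothetical pail: 100 balls in total
BLACK_IN_PAIL = 30   # hypothetical: 30 of them are black
HANDFUL = 10         # hypothetical handful size


def prob_black_in_hand(k):
    """Probability direction: chance of exactly k black balls in the handful,
    given that we know what is in the pail (hypergeometric probability)."""
    white_in_pail = PAIL_SIZE - BLACK_IN_PAIL
    ways = math.comb(BLACK_IN_PAIL, k) * math.comb(white_in_pail, HANDFUL - k)
    return ways / math.comb(PAIL_SIZE, HANDFUL)


# Statistics direction: look at one handful and estimate the pail's black fraction.
rng = random.Random(0)
pail = ["black"] * BLACK_IN_PAIL + ["white"] * (PAIL_SIZE - BLACK_IN_PAIL)
hand = rng.sample(pail, HANDFUL)  # draw without replacement
estimated_black_fraction = hand.count("black") / HANDFUL

print("P(exactly 3 black in hand):", prob_black_in_hand(3))
print("Estimated black fraction in pail:", estimated_black_fraction)
```

The first part reasons from the pail to the hand; the second reasons from the hand back to the pail, which only works well if the handful is representative.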
Representative sample
- The sample should be representative of the population
What is statistics?
- Statistics is the scientific discipline of characterising information, usually represented
as data, and drawing inferences from it
- 2 branches of statistics
o Descriptive stats
o Inferential stats
Descriptive stats
- Collect data: e.g. transaction records
- Present data: tables and graphs
- Characterise data: sample mean
Inferential statistics
- Estimation: e.g. estimate the population mean weight using the sample mean weight
- Hypothesis testing
o Drawing conclusions and/or making decisions concerning a population based on
sample results
o Test the claim about the population mean weight
Sources of data
- Data distributed by an organisation or an individual
- A survey
- A designed experiment
- Transaction records
- Web crawler
Types of data
- Categorical: car type, status exterior
- Numerical
o Discrete: number of cars (countable)
o Continuous: fuel level (measured characteristics)
Present data
- Dot plot
- Box plot
- Histogram
- Scatter diagram
- Time series plot
Characterise data
- Sample: sizes of first 5 pairs of shoes sold today. 5 5 6 7 8
- Summarising the centre
o Mean: 6.2
o Median: 6
o Mode: 5
- Summarising the spread
o Sample variance: 1.7
o Yesterday's data: 2 3 8 9 9. Same mean (6.2), but sample variance = 11.7
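The summaries above can be reproduced with Python's standard statistics module. This is a minimal sketch; the variable names are illustrative, and the variance is assumed to be the sample variance (dividing by n - 1), which is what matches the 1.7 and 11.7 in the notes.

```python
import statistics

today = [5, 5, 6, 7, 8]        # sizes of the first 5 pairs of shoes sold today
yesterday = [2, 3, 8, 9, 9]    # yesterday's data

# Summarising the centre
today_mean = statistics.mean(today)
today_median = statistics.median(today)
today_mode = statistics.mode(today)

# Summarising the spread (sample variance, dividing by n - 1)
today_var = statistics.variance(today)
yesterday_var = statistics.variance(yesterday)

print("mean:", today_mean, "median:", today_median, "mode:", today_mode)
print("variance today:", today_var, "variance yesterday:", yesterday_var)
```

Both days have the same mean, so the variance is what distinguishes them: yesterday's sizes are much more spread out.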
Discrete distributions
- Bernoulli
o Outcome of one experiment with success probability p
o Flip a coin
- Binomial
o Number of successes in n independent trials with success probability p
o Flip a coin 100 times
- Poisson
o Number of event occurrences during a time period
o Orders received between 8am and 9am
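A small sketch of the probability mass functions behind these three distributions, using only the standard library. The function names and the order rate of 3 per hour are assumptions for illustration; Bernoulli is treated as the binomial with n = 1.

```python
import math


def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials, success probability p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def bernoulli_pmf(k, p):
    """Bernoulli is the binomial with a single trial (k is 0 or 1)."""
    return binomial_pmf(k, 1, p)


def poisson_pmf(k, rate):
    """P(exactly k events in a period when events occur at the given mean rate)."""
    return math.exp(-rate) * rate**k / math.factorial(k)


# Flip a fair coin once, then 100 times.
print("P(heads on one flip):", bernoulli_pmf(1, 0.5))
print("P(exactly 50 heads in 100 flips):", binomial_pmf(50, 100, 0.5))

# Hypothetical: on average 3 orders arrive between 8am and 9am.
print("P(no orders in the hour):", poisson_pmf(0, 3.0))
```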
Continuous distributions
- Uniform
o Parameter: range
- Exponential
o Parameter: rate
o The inter-arrival time between orders
- Normal
o Parameters: mean and variance
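One way to get a feel for these parameters is to draw from each distribution and check the sample mean. The sketch below uses Python's random module; all parameter values (range 0 to 10, rate 2, mean 6.2, variance 4) are arbitrary choices for illustration. Note that random.gauss takes the standard deviation, not the variance.

```python
import math
import random
import statistics

rng = random.Random(1)   # fixed seed so the run is repeatable
N = 50_000

# Uniform: parameter is the range, here [0, 10]; theoretical mean is 5
uniform_draws = [rng.uniform(0, 10) for _ in range(N)]

# Exponential: parameter is the rate, here 2 per unit time; theoretical mean is 1/2
exponential_draws = [rng.expovariate(2.0) for _ in range(N)]

# Normal: parameters are mean and variance, here 6.2 and 4 (so sd = 2)
normal_draws = [rng.gauss(6.2, math.sqrt(4.0)) for _ in range(N)]

uniform_mean = statistics.fmean(uniform_draws)
exponential_mean = statistics.fmean(exponential_draws)
normal_mean = statistics.fmean(normal_draws)
normal_var = statistics.variance(normal_draws)

print("uniform mean:", uniform_mean)
print("exponential mean:", exponential_mean)
print("normal mean and variance:", normal_mean, normal_var)
```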
Distributions from Normal
- Chi-squared distribution
o Sum of the squares of k independent standard normal random variables
- Student’s t-distribution
o Arises when estimating the mean of a normally distributed population in situations
where the sample size is small and the population standard deviation is unknown
- F-distribution
o Arises frequently in the analysis of variance
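The chi-squared definition above can be checked by simulation. This is a sketch only: the choice of k = 3, the number of draws and the helper name are assumptions. A chi-squared variable with k degrees of freedom has mean k and variance 2k, so the simulated values should land close to 3 and 6.

```python
import random
import statistics


def chi_squared_draw(k, rng):
    """One draw: the sum of squares of k independent standard normal variables."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))


rng = random.Random(2)   # fixed seed so the run is repeatable
K = 3                    # hypothetical degrees of freedom
draws = [chi_squared_draw(K, rng) for _ in range(50_000)]

sim_mean = statistics.fmean(draws)
sim_var = statistics.variance(draws)

print("simulated mean (theory: k):", sim_mean)
print("simulated variance (theory: 2k):", sim_var)
```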
Sampling distributions
- Sample statistic
o Mean, variance
- Sampling distribution
o The distribution of sample mean
- Central limit theorem
o As the sample size increases, the sampling distribution of the sample mean tends
toward a normal distribution
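o What are the mean and sd of that sampling distribution? Mean = population mean μ; sd = σ/√n (population sd divided by the square root of the sample size)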
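A simulation sketch of the central limit theorem. The population here is assumed to be exponential with rate 1 (mean 1, sd 1), chosen because it is clearly not normal; the sample size of 30 and the number of repeats are also arbitrary. The sample means should centre on the population mean, with sd close to the population sd divided by √n.

```python
import math
import random
import statistics

rng = random.Random(3)   # fixed seed so the run is repeatable
RATE = 1.0               # hypothetical population: exponential, mean 1, sd 1
SAMPLE_SIZE = 30
REPEATS = 20_000

# Draw many samples and record the mean of each one.
sample_means = []
for _ in range(REPEATS):
    sample = [rng.expovariate(RATE) for _ in range(SAMPLE_SIZE)]
    sample_means.append(statistics.fmean(sample))

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
theoretical_sd = (1 / RATE) / math.sqrt(SAMPLE_SIZE)

print("mean of sample means:", mean_of_means)
print("sd of sample means:", sd_of_means, "theory:", theoretical_sd)
```

Plotting a histogram of sample_means would show the bell shape, even though the individual observations are strongly skewed.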
Parameter estimation
- Parameter
o Success probability p in Bernoulli distribution
- Estimation
o The parameter is unknown and estimated based on the sample
- Point estimate
o Method of moments
o Maximum likelihood estimator
- Interval estimate
o Confidence interval
o Confidence level: the chance that the interval, computed from a random sample, contains the true parameter
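For the Bernoulli parameter p, the maximum likelihood estimate is the sample proportion of successes (the method of moments gives the same answer here). The sketch below checks this numerically with a grid search over the log-likelihood. The ten observations are made-up data for illustration.

```python
import math

# Hypothetical sample of 10 Bernoulli outcomes (1 = success, 0 = failure)
outcomes = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]

# Closed-form point estimate: the sample proportion
p_hat = sum(outcomes) / len(outcomes)


def log_likelihood(p, data):
    """Log-likelihood of success probability p for independent Bernoulli data."""
    successes = sum(data)
    failures = len(data) - successes
    return successes * math.log(p) + failures * math.log(1 - p)


# Numerical check: search p = 0.01, 0.02, ..., 0.99 for the largest log-likelihood
grid = [i / 100 for i in range(1, 100)]
p_grid = max(grid, key=lambda p: log_likelihood(p, outcomes))

print("sample proportion:", p_hat)
print("grid-search maximiser:", p_grid)
```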
Evaluating a point estimate
- Estimating a point is like shooting a target
o Unbiased and precise
o Unbiased but imprecise
o Precise but biased
o Biased and imprecise
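Bias can be seen in a simulation. The sketch below compares two estimators of a population variance: dividing the sum of squared deviations by n (biased) and by n - 1 (unbiased). The population (normal, variance 4), the sample size of 5 and the number of repeats are assumptions chosen for illustration. In theory the divide-by-n estimator averages (n - 1)/n × 4 = 3.2, while the divide-by-(n - 1) estimator averages 4.

```python
import math
import random
import statistics

rng = random.Random(4)   # fixed seed so the run is repeatable
TRUE_VARIANCE = 4.0      # hypothetical population variance
N = 5                    # small sample size makes the bias visible
REPEATS = 20_000

divide_by_n = []
divide_by_n_minus_1 = []
for _ in range(REPEATS):
    sample = [rng.gauss(0.0, math.sqrt(TRUE_VARIANCE)) for _ in range(N)]
    sample_mean = sum(sample) / N
    sum_sq = sum((x - sample_mean) ** 2 for x in sample)
    divide_by_n.append(sum_sq / N)                 # biased estimator
    divide_by_n_minus_1.append(sum_sq / (N - 1))   # unbiased estimator

avg_biased = statistics.fmean(divide_by_n)
avg_unbiased = statistics.fmean(divide_by_n_minus_1)

print("average of divide-by-n estimates:", avg_biased)
print("average of divide-by-(n-1) estimates:", avg_unbiased)
```

In the target picture, the first estimator's shots cluster around the wrong spot, while the second estimator's shots cluster around the bullseye.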
Confidence interval
- Shoe sizes in orders are normally distributed with variance 4
- 9 shoe orders with sizes: 5 8.5 12 15 7 9 7.5 6.5
- Construct 99% two-sided confidence interval
- Solution
o Sample mean = 9
o Population sd = 2
o n=9
o α/2 = 0.005
o Z(0.005) = 2.58
o 99% two-sided confidence interval = 9 ± 2.58 × 2/√9 = 9 ± 1.72, i.e. (7.28, 10.72)
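The calculation in the solution can be sketched in Python with statistics.NormalDist from the standard library. It uses the summary values from the solution (sample mean 9, population sd 2, n = 9) rather than the raw sizes; the function name is illustrative.

```python
import math
from statistics import NormalDist


def z_confidence_interval(sample_mean, population_sd, n, confidence):
    """Two-sided confidence interval for the mean when the population sd is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)          # about 2.58 for 99%
    half_width = z * population_sd / math.sqrt(n)
    return sample_mean - half_width, sample_mean + half_width


low, high = z_confidence_interval(sample_mean=9, population_sd=2, n=9, confidence=0.99)
half_width = (high - low) / 2

print("half-width:", round(half_width, 2))
print("interval:", round(low, 2), "to", round(high, 2))
```

Because the population variance is given, the normal (z) critical value is used; with an unknown variance and a small sample, the Student's t-distribution would be the usual choice instead.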
Claims
- Marketing manager says
o The email-targeted offer increases shoe sales
- How do we examine the claim?
o Hypothesis testing
o Null hypothesis H0
The email-targeted offer doesn't increase sales
o Alternative hypothesis H1
The email-targeted offer increases sales
o The idea: find significant evidence to reject H0
o The philosophy
“presumption of innocence”
“conviction by evidence”
Hypothesis testing
- Set up H0 and H1
- Collect a random sample and compute the appropriate sample test statistic
- Determine the sampling distribution under H0
- Calculate the p-value
o The probability of getting at least as extreme a sample result as the one actually
observed if H0 is true
- Compare the p-value with the significance level of the test and decide whether to reject H0
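The steps above can be sketched as a one-sided z-test. All numbers here are hypothetical and are not from the lecture: average daily sales of 100 pairs before the offer, a known population sd of 15, and a sample of 36 days after the offer with mean sales of 105. The test assumes the sample mean is approximately normal under H0; the function name is illustrative.

```python
import math
from statistics import NormalDist


def one_sided_z_test(sample_mean, null_mean, population_sd, n, alpha):
    """Test H0: mean <= null_mean against H1: mean > null_mean (known sd)."""
    standard_error = population_sd / math.sqrt(n)
    z = (sample_mean - null_mean) / standard_error   # test statistic
    # p-value: probability of a result at least this extreme if H0 is true
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value, p_value < alpha


# Hypothetical data: H0 says mean daily sales stay at 100 after the email offer.
z, p_value, reject_h0 = one_sided_z_test(
    sample_mean=105, null_mean=100, population_sd=15, n=36, alpha=0.05
)

print("z statistic:", z)
print("p-value:", p_value)
print("reject H0 at the 5% level:", reject_h0)
```

With these made-up numbers the p-value falls below the significance level, so H0 would be rejected; a sample mean closer to 100 would leave the "presumption of innocence" in place.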