First Edition
David M Diez
Postdoctoral Fellow
Department of Biostatistics
Harvard School of Public Health
Christopher D Barr
Assistant Professor
Department of Biostatistics
Harvard School of Public Health
Mine Çetinkaya-Rundel
Assistant Professor of the Practice
Department of Statistics
Duke University
Copyright © 2011. First Edition: July, 2011.
A PDF of this textbook (OpenIntro Statistics) is also available online for free and is released
by OpenIntro under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported
(CC BY-NC-ND 3.0) United States license at openintro.org. The editable source
of this book is also available under a Creative Commons Attribution-NonCommercial-ShareAlike
3.0 Unported (CC BY-NC-SA 3.0) license. Please see creativecommons.org
for details on these licenses.
ISBN: 9-781461-062615
Contents
1 Introduction to data
1.1 Case study
1.2 Data basics
1.2.1 Observations, variables, and cars
1.2.2 Data Matrices
1.2.3 Types of variables
1.2.4 Relationships among variables
1.2.5 Associated and independent variables
1.3 Examining numerical data
1.3.1 Scatterplots for paired data
1.3.2 Dot plots and the mean
1.3.3 Histograms and shape
1.3.4 Variance and standard deviation
1.3.5 Box plots, quartiles, and the median
1.3.6 Robust statistics
1.3.7 Transforming data (special topic)
1.4 Considering categorical data
1.4.1 Contingency tables
1.4.2 Bar plots and proportions
1.4.3 Segmented bar and mosaic plots
1.4.4 The only pie chart you will see in this book
1.4.5 Comparing numerical data across groups
1.5 Overview of data collection principles
1.5.1 Populations and samples
1.5.2 Anecdotal evidence
1.5.3 Sampling from a population
1.5.4 Explanatory and response variables
1.5.5 Introducing observational studies and experiments
1.6 Observational studies and sampling strategies
1.6.1 Observational studies
1.6.2 Three sampling methods (special topic)
1.7 Experiments
1.7.1 Principles of experimental design
1.7.2 Reducing bias in human experiments
1.8 Case study: efficacy of sulphinpyrazone (special topic)
1.8.1 Variability within data
1.8.2 Simulating the study
1.8.3 Checking for independence
1.9 Exercises
1.9.1 Case study
1.9.2 Data basics
1.9.3 Examining numerical data
1.9.4 Considering categorical data
1.9.5 Overview of data collection principles
1.9.6 Observational studies and sampling strategies
1.9.7 Experiments
1.9.8 Case study: efficacy of sulphinpyrazone
2 Probability (special topic)
2.1 Defining probability (special topic)
2.1.1 Probability
2.1.2 Disjoint or mutually exclusive outcomes
2.1.3 Probabilities when events are not disjoint
2.1.4 Probability distributions
2.1.5 Complement of an event
2.1.6 Independence
2.2 Continuous distributions (special topic)
2.2.1 From histograms to continuous distributions
2.2.2 Probabilities from continuous distributions
2.3 Conditional probability (special topic)
2.3.1 Marginal and joint probabilities
2.3.2 Defining conditional probability
2.3.3 Smallpox in Boston, 1721
2.3.4 General multiplication rule
2.3.5 Independence considerations in conditional probability
2.3.6 Tree diagrams
2.3.7 Bayes’ Theorem
2.4 Sampling from a small population (special topic)
2.5 Random variables (special topic)
2.5.1 Expectation
2.5.2 Variability in random variables
2.5.3 Linear combinations of random variables
2.5.4 Variability in linear combinations of random variables
2.6 Exercises
2.6.1 Defining probability
2.6.2 Continuous distributions
2.6.3 Conditional probability
2.6.4 Sampling from a small population
2.6.5 Random variables
3 Distributions of random variables
3.1 Normal distribution
3.1.1 Normal distribution model
3.1.2 Standardizing with Z scores
3.1.3 Normal probability table
3.1.4 Normal probability examples
3.1.5 68-95-99.7 rule
3.2 Evaluating the normal approximation