Descriptive Statistics

• Observation: One instance of the thing you are examining (e.g. a person)
• Variable: A characteristic that is measured about an observation (e.g. a person's income)
• Data Types:
    Type | Entities | Time
    Cross-sectional | Different (e.g. people, firms) | A single point or period
    Time-series | One (e.g. a company's stock price) | Observed over time
    Panel/Longitudinal | Different (e.g. people, firms) | Different periods of time
• Mean/Expected Value: =average(a2:a64)
    Sensitive to outliers
• Median: =median(a2:a64)
    Not sensitive to outliers
• Variance: =var(a2:a64)
    How spread out the observations are around the mean
• Standard deviation: =stdev(a2:a64)
    How spread out the data are around the mean
    Roughly the average distance between observations of a variable X and the mean of the variable
• Range: =max(a2:a64)-min(a2:a64)
• 25th percentile: =percentile.inc(a2:a64,0.25)
• 75th percentile: =percentile.inc(a2:a64,0.75)
• Distribution: Graphs the number or percentage of instances in which the variable takes each value
    x-axis: Variable being observed
    y-axis: Frequency (percentage) of each category of the variable
    Area beneath adds up to 1 (100%)
    o Symmetric: mean = median
    o Skewed to the left: Tail to left of peak is longer than tail to right of peak
      Generally mean < median
    o Skewed to the right: Tail to right of peak is longer than tail to left of peak
      Generally mean > median
    o Uniform: Flat; bars have the same height
    o Bimodal: Two peaks
    o Normal (Bell curve):
        o Approximately 68% of the observations are within 1 standard deviation of the mean
        o Approximately 95% of the observations are within 2 standard deviations of the mean
        o Approximately all (99.7%) of the observations are within 3 standard deviations of the mean

Statistical Inference

• Standard error: Standard deviation / √n
    The standard deviation of the mean value across different samples
    ○ Large sample size → The average __ varies less from sample to sample → Smaller standard error
    ○ Small sample size → The average __ varies more from sample to sample → Greater standard error
• Central limit theorem: For large enough samples, the sample means are approximately normally distributed
    ○ The average of the sample means is the population mean
    ○ The standard deviation of the sample means is the standard error
• Hypothesis testing:
    ○ Null hypothesis (H0): No difference between the population means of the two groups of interest
      H0: μ(___) = __ (population mean)
    ○ Alternative hypothesis (H1): Difference between the two population means
      H1: μ(___) ≠ __ (population mean)
    Method #1:
      t = (sample mean - hypothesized population mean) / standard error
      ○ If |t| > 2: Reject H0
      ○ If |t| ≤ 2: Do not reject H0
    Method #2:
      95% confidence interval: sample mean ± 2 × standard error
      ○ If the hypothesized population mean does not belong to the confidence interval: Reject H0
      ○ If the hypothesized population mean belongs to the confidence interval: Do not reject H0

Graphs & Charts

• Histogram: To see how a single variable is distributed, how concentrated or spread out the data are, and at what levels there are a lot of observations
• Pivot table: To create descriptive statistics when some or all of your data are categorical
• Bar graph: To convey different proportions or values across two or more categories
• Line graph: To show how a variable changed either over time or as another variable changes
• Scatterplot: To examine the relationship between two quantitative variables
• Tips for good graphs:
    ○ Change the chart label to be more descriptive
    ○ Label the axes
    ○ Change the number of bins to no more than 12
    ○ Choose good intervals, gap widths, bar widths
    ○ Order the bars by size
    ○ Display units
    ○ Delete legend
    ○ Delete grand total

Correlation

• Correlation coefficient: =correl(a1:a50,b1:b50)
    Measures how closely two variables are related
    1) The direction of association
        0: No correlation; no linear relationship
        1: Perfect positive correlation
        -1: Perfect negative correlation
    2) The strength of the association (Closer to 1 or -1 → Stronger correlation)
• Explanations for correlation between X and Y:
    1) X causes Y
    2) Y causes X (reverse causality)
    3) X causes Y and Y causes X (simultaneity)
    4) Z causes both X and Y (confounding factor)
    Correlation does not imply causation

Simple Regression

To examine the relationship between two or more variables and make predictions
• Y: Dependent/outcome variable
• X: Independent/explanatory variable
    Ŷ = b0 + b1X
    Y = b0 + b1X + e
• Residual/Error: The difference between the observed value of Y and the predicted value of Y
    e = Y - Ŷ
    e = Y - (b0 + b1X)
    ○ Positive e: The model under-predicts
    ○ Negative e: The model over-predicts
• Categorical variables:
    ○ Reference/comparison category: Category that equals 0
    ○ b1: The average difference in Y between __ and __
    ○ b0: The average Y of __ (category that equals 0)
    ○ b0 + b1: The average Y of __ (category that equals 1)
• Linear regression model for the whole population:
    Y = β0 + β1X + ε
    ○ β1: True/population coefficient
• Standard error of the coefficient SE(b1): An estimate of the dispersion (standard deviation) we would observe in the coefficient estimates if we were to select a series of random samples of the same size from the population
    ○ Small SE(b1): Estimate is likely close to the true population coefficient
    ○ Large SE(b1): Estimate is imprecise; cannot trust the estimated coefficient
    Determinants of the accuracy of our estimates:
    1) Sample size
        ○ Larger sample size → Smaller SE(b1)
        ○ Smaller sample size → Larger SE(b1)
    2) Fit of the regression
        ○ Better fitting regression → Points are close to regression line → Lower chance of getting a quirky sample → Smaller SE(b1)
    3) Variability in X
        ○ More variability in X → Variations in X can be used to explain variations in Y → Smaller SE(b1)
• 68% confidence interval for β1:
    b1 ± SE(b1)
    The population coefficient β1 belongs to (b1 - SE(b1), b1 + SE(b1)) 68% of the time
• 95% confidence interval for β1:
    b1 ± 2 × SE(b1)
    The population coefficient β1 belongs to (b1 - 2 × SE(b1), b1 + 2 × SE(b1)) 95% of the time
• Hypothesis testing:
    ○ Null hypothesis (H0): The variable X has no impact on Y
      H0: β1 = 0 (coefficient is zero)
    ○ Alternative hypothesis (H1): The variable X has an impact on Y
      H1: β1 ≠ 0 (coefficient is different from zero)
    Method #1 (t-stat):
      Tells how far from zero our sample coefficient is, in standard error units
      t = (b1 - hypothesized β1) / SE(b1)
      Excel: t = b1 / SE(b1)
      ○ If |t| > 2: Reject H0 (coefficient is statistically significant)
    Method #2 (p-value):
      The probability of seeing a coefficient at least this far from zero just by random chance if H0 were true
      ○ If p-value < 0.05: Reject H0 (coefficient is statistically significant)
    Method #3 (confidence interval):
      ○ If 95% confidence interval does not contain 0: Reject H0 (coefficient is statistically significant)

Multiple Regression

To separate the effects of different variables on the outcome variable, "keeping all other explanatory variables in the model constant"
• Categorical variables:
    ○ Reference/comparison category: Excluded category
    ○ n categories → n-1 indicator variables
    ○ b1: The average difference in Y between __ (indicator variable name) and __ (reference category)
• Selection bias: When the sample we are analyzing does not represent the entire population of interest
    ○ Self-selection bias: When participants choose whether or not to participate in a survey or research project and the group that chooses to participate is not equivalent to the group that opts out
    ○ Non-representative sample: When the sample does not serve as a typical example of the population
    ○ Survivorship bias: Concentrating only on the observations that made it past some selection process and overlooking those that didn't
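
The standard error and the two hypothesis-testing methods for a population mean can be sketched in a few lines of Python. This is only an illustration: the sheet itself uses Excel formulas, the function name `one_sample_test` and the sample values are made up for the example, and the cutoff of 2 is the sheet's rule of thumb rather than an exact critical value.

```python
import math
import statistics


def one_sample_test(sample, hypothesized_mean):
    """Apply the sheet's Method #1 (t-stat) and Method #2 (95% CI) to one sample."""
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)      # sample standard deviation, like =stdev()
    se = sd / math.sqrt(n)             # standard error = standard deviation / sqrt(n)

    # Method #1: distance from the hypothesized mean in standard-error units.
    t = (mean - hypothesized_mean) / se

    # Method #2: rule-of-thumb 95% confidence interval, mean +/- 2 x SE.
    ci_low, ci_high = mean - 2 * se, mean + 2 * se

    return {
        "mean": mean,
        "sd": sd,
        "se": se,
        "t": t,
        "ci": (ci_low, ci_high),
        "reject_by_t": abs(t) > 2,
        "reject_by_ci": not (ci_low <= hypothesized_mean <= ci_high),
    }


# Hypothetical data (not from the sheet): nine observations of one variable.
data = [4, 5, 5, 6, 7, 7, 8, 9, 12]
result = one_sample_test(data, hypothesized_mean=5)
print(result)
```

Both methods use the same standard error and the same cutoff of 2, so they reach the same decision: the hypothesized mean falls outside the interval exactly when |t| > 2.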
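
The simple regression quantities (b0, b1, residuals, SE(b1), the t-stat, and the 95% confidence interval for β1) can be sketched in plain Python. The sheet does not give a formula for SE(b1); the sketch assumes the standard ordinary-least-squares expression, SE(b1) = sqrt((Σe² / (n - 2)) / Σ(X - mean of X)²), which matches the three determinants the sheet lists: it shrinks with more observations, smaller residuals, and more variability in X. The function name `fit_simple_regression` and the data are hypothetical.

```python
import math


def fit_simple_regression(x, y):
    """Least-squares fit of Y = b0 + b1*X with the sheet's rule-of-thumb tests."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n

    # Variability in X, and co-movement of X and Y around their means.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar

    # Residual e = Y - (b0 + b1*X); positive e means the model under-predicts.
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

    # Assumed standard OLS formula for SE(b1); not stated in the sheet.
    residual_variance = sum(e ** 2 for e in residuals) / (n - 2)
    se_b1 = math.sqrt(residual_variance / sxx)

    t = b1 / se_b1                              # H0: beta1 = 0
    ci = (b1 - 2 * se_b1, b1 + 2 * se_b1)       # rule-of-thumb 95% interval

    return {
        "b0": b0,
        "b1": b1,
        "residuals": residuals,
        "se_b1": se_b1,
        "t": t,
        "ci": ci,
        "reject_by_t": abs(t) > 2,
        "reject_by_ci": not (ci[0] <= 0 <= ci[1]),
    }


# Hypothetical data (not from the sheet).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
fit = fit_simple_regression(x, y)
print(fit)
```

With so few observations the cutoff of 2 is only approximate; exact critical values from the t distribution would be larger for small samples.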