Bstats Notes
Business Statistics (University of Technology Sydney)
Downloaded by Shebnoor Ahmed ()
LECTURE 1: INTRODUCTION AND DESCRIPTIVE STATISTICS I
TYPES OF DATA
QUALITATIVE/CATEGORICAL
Mutually exclusive labels (one label cannot mean two things)
Usually not numbers; if numbers are used, they carry no mathematical meaning
- Nominal: ordering/ranking makes no sense, numerical labels are arbitrary
- Ordinal: ordering/ranking has meaning/can be interpreted, numerical labels respect
the ordering
QUANTITATIVE/NUMERICAL
Numbers used to record certain events; the numbers have mathematical meaning
- Interval: differences between values are meaningful, but ratios are not; zero has no natural meaning (e.g. temperature in degrees Celsius)
- Ratio: both differences and ratios of two quantities are meaningful; zero is meaningful (it means none of the quantity)
WORKING WITH CATEGORICAL DATA
Intuitive to tabulate and visualise; the technique is the frequency distribution
Frequency count: total number of occurrences of each category
Relative frequency: fraction/proportion of the total number of data items belonging to that category
Percent frequency: relative frequency x 100 (%)
In Excel, the COUNTIF function produces the frequency counts
To visualise: bar chart, sometimes loosely called a histogram (categories on x-axis, frequency/relative frequency/percent frequency on y-axis), or pie chart
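The three frequency measures above can be sketched in a few lines of Python. This is only an illustration: the `payments` data and the `frequency_table` helper are made up for the example and do not come from the lecture.

```python
from collections import Counter


def frequency_table(observations):
    """Return {category: (count, relative frequency, percent frequency)}."""
    counts = Counter(observations)  # frequency counts, similar to one COUNTIF per category
    total = len(observations)
    table = {}
    for category, count in counts.items():
        relative = count / total
        table[category] = (count, relative, relative * 100)
    return table


# Hypothetical sample: preferred payment method of 10 customers
payments = ["card", "cash", "card", "app", "card",
            "cash", "card", "app", "card", "cash"]
table = frequency_table(payments)

for category, (count, relative, percent) in table.items():
    print(f"{category:5s} count={count} relative={relative:.2f} percent={percent:.0f}%")
```

The relative frequencies always sum to 1 and the percent frequencies to 100, which is a quick check on any hand-built table.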
INTERMEZZO: THE LANGUAGE
Random variable (r.v.): a variable whose value is determined by a random outcome
- Usually denoted by capital letters
- Realisations/observations of an r.v. are denoted by lowercase letters
- e.g. N and n denote the size/number of observations: N refers to the population size, n denotes the sample size (number of data points collected in a sample)
Population: collection of people, objects or items of interest; the complete pool of realisations of a certain random variable
Sample: subset of a population; a random collection of a certain size drawn from the population
Probability distribution: describes how probability is spread over the values that a random variable may assume
DESCRIPTIVE STATISTIC: CENTRAL TENDENCY
A measure of central tendency yields information about the centre of a set of numbers (the distribution of an r.v.); it does not describe the span of the dataset or how far values are from the middle
It gives an idea of the typical, middle, or average value that an r.v. can take
Sometimes called measures of location
THREE MEASURES OF CENTRAL TENDENCY
Mode - most frequently occurring value in a set of data
- In case of a tie for the most frequently occurring value, two modes are listed
and the data is said to be bimodal
- Datasets with two or more modes are referred to as multimodal
- The mode is often used in determining sizes (e.g. which size is bought most often)
- Appropriate descriptive summary measure for categorical data
Median - middle value in an ordered array of numbers
- A way to locate the median is by finding the $\frac{n+1}{2}$-th term in the ordered array; if n is even, this position falls between two terms and the median is their average
- Large and small values do not inordinately influence the median, so it is the best measure of location for variables in which extreme but acceptable values can occur at just one end of the data
- Not all information from the dataset is used
- Data must be quantitative or able to be ranked
Mean - average of a set of numbers
- Sample mean is represented by $\bar{X}$
- Population mean is represented by $\mu$
- Data should be quantitative, as it needs to be summed
- Affected by all values: an advantage because it reflects all the data, but a disadvantage because extreme values pull the mean towards the extremes
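Either the population mean or the sample mean can be considered. If the r.v. is denoted by X:
- Population mean is denoted by $\mu$ or $E(X)$, computed by $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
- Sample mean is denoted by $\bar{X}$, computed by $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i$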
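As a rough illustration of the three measures, the sketch below uses Python's standard `statistics` module on a made-up sample (the values in `data` are hypothetical, not from the lecture). It also locates the median by hand via the (n + 1)/2-th term.

```python
import statistics

# Hypothetical sample of n = 7 observations
data = [3, 5, 5, 6, 8, 9, 13]

modes = statistics.multimode(data)   # list of all most frequently occurring values
median = statistics.median(data)     # middle value of the ordered data
mean = statistics.mean(data)         # sum of the values divided by n

# Locating the median by hand: the (n + 1) / 2-th term of the ordered array.
# With n = 7 this is the 4th term; for even n the position would be fractional.
ordered = sorted(data)
position = (len(ordered) + 1) / 2
median_by_position = ordered[int(position) - 1]  # convert 1-based position to 0-based index

print(modes, median, mean, median_by_position)
```

`multimode` returns a list, so a bimodal or multimodal dataset simply yields more than one entry.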
Outlier: an observation of the r.v. of interest whose value is far outside the range of the other realisations. It often biases impressions about the distribution of the r.v. in the dataset, so we may want to correct for such biases or simply remove the data point
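A small sketch of how a single outlier affects the mean but not the median. The sales figures are invented for the example.

```python
import statistics

# Hypothetical weekly sales figures
typical = [20, 22, 23, 25, 30]
# Same data, but the last value is replaced by an extreme observation
with_outlier = typical[:-1] + [300]

mean_before = statistics.mean(typical)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(typical)
median_after = statistics.median(with_outlier)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

In this example the mean more than triples while the median does not move, which is why the median is preferred when extreme values can occur at one end of the data.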
DESCRIPTIVE STATISTIC: VARIABILITY
Measures of variability yield information about how far a realisation of the r.v. is likely to be from the centre of its distribution; they describe the spread/dispersion of a dataset
They give an idea of fluctuation and volatility across realisations of the r.v.
The more variability in a dataset, the less typical the central values are of the whole set
Using measures of variability together with measures of central tendency gives a more complete numerical description of the data (a measure of variability is needed to complement the mean when describing data)
FIVE MEASURES OF VARIABILITY
Range - maximum value minus minimum value
- Crude measure of variability
- Advantage: ease of calculation; disadvantage: affected by extreme values (so its application as a measure of variability is limited)
Inter-quartile range - distance between the first and third quartiles, $IQR = Q_3 - Q_1$
- Essentially the range of the middle 50% of the data
- Useful when there is interest in values towards the middle rather than values in the extremes
Variance and standard deviation - one is obtained from the other, so they are presented together
- Both measure how spread out an r.v. is: the larger the value, the more spread out
- Both involve considering how far each data value is from the mean and describing this dispersion on average
- Subtracting the mean from each data value yields the deviation from the mean, $x - \mu$: negative deviations represent values below the mean, positive deviations represent values above the mean
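VARIANCE
- Average squared distance between data points and their mean
- The sum of squared deviations from the mean of a set of values is called the sum of squares of x: $SS_x = \sum (x_i - \mu)^2$
- For a population, $\sigma^2 = SS_x / N$; the sample variance divides by $n - 1$ instead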
STANDARD DEVIATION
- Square root of the variance; has the same units as the original data
- Estimate of the average distance of individual values from the mean
Coefficient of variation - standard deviation ÷ mean (often expressed as a percentage)
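The five measures can be computed together as in the sketch below. The `data` values are hypothetical. Two assumptions are worth flagging: quartile conventions differ between textbooks and software (here Python's default "exclusive" method in `statistics.quantiles` is used), and the notes do not say whether the population (divide by N) or sample (divide by n - 1) variance is intended, so both are shown.

```python
import statistics

# Hypothetical data set of 7 observations
data = [3, 5, 5, 6, 8, 9, 13]

# Range: maximum minus minimum
data_range = max(data) - min(data)

# Inter-quartile range: Q3 - Q1 (quartile convention is an assumption, see lead-in)
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Sum of squares of x: squared deviations from the mean, summed
mean = statistics.mean(data)
ss_x = sum((x - mean) ** 2 for x in data)

# Variance: average squared deviation (population) or SS_x / (n - 1) (sample)
pop_var = ss_x / len(data)
sample_var = ss_x / (len(data) - 1)

# Standard deviation: square root of the variance, in the units of the data
pop_sd = pop_var ** 0.5

# Coefficient of variation: standard deviation divided by the mean
cv = pop_sd / mean

print(data_range, iqr, ss_x, round(pop_var, 3), sample_var, round(pop_sd, 3), round(cv, 3))
```

The hand-computed values agree with `statistics.pvariance` and `statistics.variance`, which can be used directly once the distinction between the two is clear.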