Variable types
- Categorical / qualitative variable: when a variable consists of categories (names or labels) and
answers questions about how cases fall into those categories
- Numerical / quantitative variable: when a variable has measured numerical values with units and
the variable tells us about the quantity of what is measured
Categorical variables
- Arise from descriptive responses to questions like ‘what kind of advertising do you use?’
- May only have 2 possible values (like yes or no)
- May be a number like a zip code
- Is also called nominal, since the labels for the categories are merely distinguishing names
- A dummy variable is a 0-1 coded variable for a specific category. It is coded 1 for all observations
in that category and 0 for all observations that are not in that category
- Categorising a numerical variable as categorical is called binning (putting data into discrete bins)
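Dummy coding and binning can be sketched in a few lines of Python. The responses, ages and cut points below are made up for illustration and are not from any real data set.

```python
# Hypothetical responses to "what kind of advertising do you use?"
responses = ["online", "print", "online", "radio", "print", "online"]

# Dummy variable for the category "online": 1 if the observation is in
# that category, 0 if it is not.
online_dummy = [1 if r == "online" else 0 for r in responses]
print(online_dummy)  # [1, 0, 1, 0, 0, 1]

# Binning: turn a numerical variable (age in years) into categories.
ages = [17, 25, 34, 41, 58, 63]


def age_bin(age):
    """Return an age-group label; the cut points 30 and 50 are arbitrary."""
    if age < 30:
        return "under 30"
    if age < 50:
        return "30 to 49"
    return "50 and over"


age_groups = [age_bin(a) for a in ages]
print(age_groups)
```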
Numerical variables
- A numerical variable has units
- Is discrete if it results from a count, such as the number of children.
- A continuous variable is the result of an essentially continuous measurement, such as weight or
height
- Cross-sectional data are data on a cross section of a population at a distinct point in time. Time
series data are data collected over time
A categorical variable has 3 aspects
- The number of categories
- The category names
- The number of observations, or counts, in each category
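As a small sketch, the three aspects can be read off a frequency count. The answers below are invented, and `collections.Counter` is just one convenient way to tally them.

```python
from collections import Counter

# Hypothetical yes/no/unsure answers from six respondents
answers = ["yes", "no", "yes", "yes", "no", "unsure"]

counts = Counter(answers)          # count of observations per category
n_categories = len(counts)         # number of categories
category_names = sorted(counts)    # the category names

print(n_categories)    # 3
print(category_names)  # ['no', 'unsure', 'yes']
print(counts["yes"], counts["no"], counts["unsure"])  # 3 2 1
```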
Displaying quantitative variables
Histograms
- Similar to a bar chart with the bin counts used as the heights of the bars. There is no gap
between bars unless there are actual gaps in the data
- Decide how wide to make the bins
- Determine the count of each bin
- Decide where to place values that land on the endpoint of a bin
- Can choose to create a relative frequency histogram by displaying the % of cases in each bin
instead of the count. This makes the histogram more like a density function
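The histogram steps above can be sketched as follows. The data, the bin width of 5 and the rule that a value on an endpoint goes into the higher bin are all assumptions chosen for the example; other choices are equally valid.

```python
# Made-up data values
values = [1, 2, 2, 3, 5, 5, 6, 8, 9, 10]

start = 0        # left edge of the first bin
bin_width = 5    # chosen bin width
n_bins = 3       # bins are [0, 5), [5, 10), [10, 15)

# Count the values in each bin. A value on an endpoint (such as 5 or 10)
# is placed in the higher bin, because each bin includes its left edge only.
counts = [0] * n_bins
for v in values:
    index = (v - start) // bin_width
    counts[index] += 1

# Relative frequency histogram: % of cases in each bin instead of the count
percents = [100 * c / len(values) for c in counts]

print(counts)    # [4, 5, 1]
print(percents)  # [40.0, 50.0, 10.0]
```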
When describing a distribution, attention should be paid to
- The shape (including any outliers)
- The center
- The spread
Shape
- Modes: (1) peaks or humps seen in a histogram are called the modes of the distribution (2) a
distribution whose histogram has one main peak is called unimodal, two peaks bimodal, 3 or
more multimodal
- Symmetry: (1) a distribution is symmetric if the halves on either side of the center look, at least
approximately, like mirror images (2) the thinner ends of a distribution are called the tails. If one
tail stretches out further than the other, the distribution is called skewed to the side of the
longer tail.
- Outliers: (1) values that stand off away from the body of the distribution (2) can affect every
statistical method (3) can be the most informative part of the data (4) may be an error in the
data (5) should be discussed in any conclusions drawn about the data
- Characterising the shape of a distribution is often a judgement call
- An honest desire to understand what is happening in the data should guide decisions
Center
- To find the mean, add up all the values and divide by the number of values
- If a distribution is skewed, contains gaps, or contains outliers, then it might be better to use the
median: the value that splits the histogram into 2 equal areas
- The median is said to be resistant because it is barely affected by unusual observations or by the
shape of the distribution, so the median is a better choice for skewed data
- If a distribution is roughly symmetric, the mean and median are expected to be close
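A quick illustration of resistance, using Python's standard `statistics` module. The salary figures (in thousands) are invented, and the value 400 is added purely to act as an outlier.

```python
from statistics import mean, median

# Made-up salaries in thousands; roughly symmetric
salaries = [30, 32, 35, 36, 40]
# The same data with one unusual observation added
with_outlier = salaries + [400]

print(mean(salaries), median(salaries))          # 34.6 35
print(mean(with_outlier), median(with_outlier))  # 95.5 35.5
```

One extreme value moves the mean from 34.6 to 95.5, while the median only moves from 35 to 35.5.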
Spread
- The more the data vary, the less a measure of center can tell us
- Range: defined as the difference between the extremes = max – min
- Interquartile range (IQR) = Q3 – Q1
- Variance = average of the squared deviations of the values of the variable y from the mean (for
a sample, the sum of squared deviations is divided by n – 1 rather than n)
- Taking the square root of the variance gives us the standard deviation (SD)
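The measures of spread can be computed as below. This is a sketch on made-up data; note that textbooks and software use slightly different conventions for quartiles, and `statistics.quantiles` with its default settings is only one of them, so Q1 and Q3 may differ a little from a hand calculation.

```python
from statistics import pvariance, quantiles, stdev, variance

# Made-up data with mean 5
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)   # range = max - min

q1, q2, q3 = quantiles(data, n=4)    # quartiles (default "exclusive" method)
iqr = q3 - q1                        # IQR = Q3 - Q1

pop_var = pvariance(data)            # squared deviations averaged over n
sample_var = variance(data)          # squared deviations divided by n - 1
sd = stdev(data)                     # square root of the sample variance

print(data_range, iqr)               # 7 2.5
print(pop_var, round(sample_var, 3), round(sd, 3))
```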
Summary
- If the shape is skewed, the median and IQR should be reported
- If the shape is unimodal and symmetric, the mean and SD and possibly the median and IQR
should be reported
- If there are multiple modes, try to determine if the data can be split into separate groups
- If there are unusual observations, point them out and report the mean and SD with and without
the values
- Always pair the median with the IQR and the mean with the SD
5 number summary and boxplots
- The five number summary reports the median, the quartiles and the extremes
- Once we have the five number summary, we can display the information in a boxplot
Boxplot
- The central box shows the middle half of the data, between the quartiles; the height/length of
the box equals the IQR
- If the median is roughly centred between the quartiles, then the middle half of the data is
roughly symmetric. If it is not centred, the distribution is skewed
- The whiskers show skewness as well if they are not roughly the same length
- The outliers are displayed individually to keep them out of the way in judging skewness and to
display them for special attention
How to draw a boxplot
- Locate the median and quartiles on an axis and draw 3 short lines
- Connect the quartile lines to form a box
- Erect ‘fences’ around the main part of the data, placing the upper fence 1.5 IQRs above the
upper quartile and lower fence 1.5 IQRs below the lower quartile
- Draw whiskers from each end of the box up and down to the most extreme data values found
within the fences
- Add any outliers by displaying data values that lie beyond the fences with special symbols
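The numbers needed to draw a boxplot by the steps above can be sketched as follows. The data are invented, and the quartiles come from `statistics.quantiles` with its default method, which is an assumption; another quartile convention would shift the box and fences slightly.

```python
from statistics import quantiles

# Made-up data with one value far from the rest
data = sorted([1, 3, 4, 5, 5, 6, 7, 8, 9, 30])

# Five number summary: extremes, quartiles and median
q1, med, q3 = quantiles(data, n=4)
five_number = (min(data), q1, med, q3, max(data))

# Fences sit 1.5 IQRs beyond the quartiles
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers reach the most extreme data values found within the fences
inside = [x for x in data if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

# Values beyond the fences are displayed individually as outliers
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(five_number)                # (1, 3.75, 5.5, 8.25, 30)
print(lower_fence, upper_fence)   # -3.0 15.0
print(whisker_low, whisker_high)  # 1 9
print(outliers)                   # [30]
```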
7 number summary and boxplot
- The box is the same as before
- The whiskers end at the 5th and 95th percentiles
- Beyond those, only the min and max are plotted as points
- The average is included with an x
Population and parameters
- Models use mathematics to represent reality
- A population includes all of the entities of interest, whether they be people, households,
machines or whatever
- Key numbers in the population are called parameters
- A sample is a subset of the population, often randomly chosen
- Any summary formed from the data is a statistic
- Sometimes, especially when we match statistics with the parameters they estimate, we use the term
sample statistic. Some statistics are used as estimates for the population parameters
- A sample that estimates the corresponding parameters accurately is said to be representative
A data set is usually a rectangular array of data, with variables in columns and observations in rows
A variable is a characteristic of members of a population, such as height, gender or salary
An observation is a list of all variable values for a single member of a population
Common sampling designs
Simple random sample (SRS)
- A sample drawn so that every possible sample of the same size has an equal chance of being selected
- With this method, each combination of individuals has an equal chance of being selected as well
- A sampling frame is a list of individuals from which the sample can be drawn
- Once we have the sampling frame, we can assign a sequential number to each individual in the
sampling frame and draw random numbers to identify those to be sampled
- Sample-to-sample differences in the values for the variables we measured are called sampling
variability
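A minimal sketch of drawing simple random samples from a numbered sampling frame. The frame of 100 individuals, their measurements and the helper name `draw_srs` are all hypothetical; the seed is fixed only so the run is repeatable.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical sampling frame: individuals numbered 1 to 100
frame = list(range(1, 101))
# Made-up measurement for each individual (values between 50 and 59)
values = {i: 50 + (i % 10) for i in frame}


def draw_srs(rng, frame, n):
    """Draw n distinct individuals; every subset of size n is equally likely."""
    return rng.sample(frame, n)


# Draw several samples of size 10 and compare their means
sample_means = []
for _ in range(5):
    ids = draw_srs(rng, frame, 10)
    sample_means.append(sum(values[i] for i in ids) / len(ids))

# The means differ from sample to sample: this is sampling variability
print(sample_means)
```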
Stratified sampling (basically keeping the sample share of each group equal to its population share)
- Slice the population into homogeneous groups, called strata, use simple random
sampling within each stratum, and combine the results at the end
- Reduced sampling variability (since a rare stratum is never left out) is the most important benefit
of stratifying
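A sketch of stratified sampling with the same sampling fraction in every stratum. The population, the stratum labels "large" and "small", and the helper `stratified_sample` are assumptions made for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population of 100 firms: 10 "large" (rare stratum), 90 "small"
population = [(i, "large") for i in range(10)]
population += [(i, "small") for i in range(10, 100)]


def stratified_sample(rng, population, fraction):
    """Take a simple random sample of the given fraction within each stratum."""
    strata = {}
    for unit in population:
        strata.setdefault(unit[1], []).append(unit)

    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))  # at least 1 per stratum
        sample.extend(rng.sample(members, k))       # SRS within the stratum
    return sample                                   # combined at the end


sample = stratified_sample(rng, population, 0.2)
n_large = sum(1 for _, s in sample if s == "large")
n_small = sum(1 for _, s in sample if s == "small")
print(n_large, n_small)  # 2 18
```

The rare "large" stratum always contributes 2 units here, whereas a simple random sample of 20 could by chance contain none.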
Cluster and multistage sampling
- Splitting the population into parts, or clusters, that each represent the population, and performing a
census (complete count) within one or a few clusters chosen at random, is called cluster sampling
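Cluster sampling can be sketched in the same way. The branch names and values are invented; the point is only that whole clusters are chosen at random and then every member of a chosen cluster is measured.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical clusters: store branches, each with all of its customers' spend
clusters = {
    "branch_a": [12, 15, 11, 14],
    "branch_b": [13, 16, 12],
    "branch_c": [10, 15, 14, 13, 12],
    "branch_d": [11, 14, 15],
}

# Choose a few clusters at random
chosen = rng.sample(sorted(clusters), 2)

# Census within each chosen cluster: keep every observation in it
observations = [x for name in chosen for x in clusters[name]]

print(chosen)
print(observations)
```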