Analysing data using SPSS
(A practical guide for those unfortunate enough to have to actually do it.)
Andrew Garth, Sheffield Hallam University, 2008
Contents:
What this document covers... 2
Types of Data. 3
Structuring your data for use in SPSS 6
Part 1 - Creating descriptive statistics and graphs. 11
SPSS versions 11
Entering and saving Data. 11
Saving Your Work 12
Looking at the Data 15
Exploring the data. 16
More on drawing Boxplots 16
Using Descriptive Statistics 17
More on different types of data 19
The difference between Mean and Median 19
Standard Deviation (S.D.) what is it? 21
Histograms and the Normal Distribution 25
Bar charts. 30
Using Scatterplots to look for correlation 34
Line graphs. 36
Pie charts 40
Part 2 - Inferential Statistics. 43
From Sample to Population... 43
A Parametric test example 44
Using a Non-parametric Test 47
Observed Significance Level 49
Asymptotic significance (asymp. Sig.) 50
Exact significance (exact sig.) 50
Testing Paired Data 51
Correlation 53
Significance in perspective. 56
Looking for correlation is different from looking for increases or decreases 57
Correlation: descriptive and inferential statistics 57
What have we learned so far? 58
Test decision chart. 59
The Chi-Square Test. 62
Cross-tabulation 62
Some examples to get your teeth into. 68
Analysis of Variance - one-way ANOVA 71
Repeated measures ANOVA. 77
Making sense of the repeated measures ANOVA output. 78
Inter-Rater Agreement using the Intraclass Correlation Coefficient 82
Cronbach's Alpha 83
Inter rater agreement using Kappa 84
Calculating the sensitivity and specificity of a diagnostic test 86
Copying information from SPSS to other programs 87
More about parametric and nonparametric tests 89
Creating a new variable in SPSS based on an existing variable 91
Acknowledgements.
Thanks are due to Jo Tomalin whose original statistical resources using the Minitab software were invaluable
in developing this resource. Thanks also go to the numerous students and colleagues who have allowed the
use of their research data in the examples.
What this document covers...
This document is intended to help you draw conclusions from your data by statistically
analysing them using SPSS (Statistical Package for the Social Sciences). The contents are set
out in what seems a logical order to me; however, if you are in a rush, or you don't conform
to my old-fashioned linear learning model, then feel free to jump in at the middle and work
your way out! Most researchers will be working to a protocol that they set out well before
gathering their data. If this is the case then, theoretically, all you need to do is flip to the
pages containing the procedures you need and apply them. It is, however, my experience that
many researchers gather data and then are at a loss for a sensible method of analysis, so I'll
start by outlining the things that should guide the researcher to the appropriate analysis.
Q. How should I analyse my data?
A. It depends how you gathered them and what you are looking for.
There are four areas that will influence your choice of analysis:
1 The type of data you have gathered (i.e. Nominal/Ordinal/Interval/Ratio).
2 Are the data paired?
3 Are they parametric?
4 What are you looking for? Differences, correlation, etc.?
These terms will be defined as we go along, but also remember there is a glossary as well
as an index at the end of this document.
This may at first seem rather complex; however, as we go through some examples it should
become clearer.
I'll quickly go through these four to help start you thinking about your own data.
The type of data you gather is very important in letting you know what a sensible method
of analysis would be and, of course, if you don't use an appropriate method of analysis your
conclusions are unlikely to be valid. Consider a very simple example: if you want to find
out the average age of cars in the car park, how would you do this? What form of average
might you use? The three obvious ways of getting the average are to use the mean, median
or mode. Hopefully for the average age of car you would use the mean or median. How,
though, might we find the average colour of car in the car park? It would be rather hard to
find the mean! For this analysis we might be better using the mode; if you aren't sure why,
consult the glossary. You can see, then, even in this simple example, that different types of
data lend themselves to different types of analysis.
In the example above we had two variables, car age and car colour, and the data types were
different. The age of car was ratio data; we know this because it would be sensible to say
"one car is twice as old as another". The colour, however, isn't ratio data; it is categorical
(often called nominal by stats folk) data.
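As a rough illustration of the car park example, here is a minimal sketch in Python (used here only because it is easy to run; it is not part of SPSS). The ages and colours are made-up values for illustration, not real data.

```python
from statistics import mean, median, mode

# Hypothetical car park survey: made-up values for illustration only.
car_ages = [1, 2, 2, 3, 5, 8, 14]  # ratio data (age in years)
car_colours = ["red", "blue", "silver", "blue", "white", "blue", "red"]  # nominal data

# Ratio data: both the mean and the median are meaningful averages.
mean_age = mean(car_ages)
median_age = median(car_ages)

# Nominal data: there is no mean or median colour, only a most common one.
modal_colour = mode(car_colours)

print(f"Mean age: {mean_age} years, median age: {median_age} years")
print(f"Most common colour: {modal_colour}")
```

Notice that the one very old car pulls the mean above the median, which is one reason you might prefer the median for data like these.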
Types of Data.
The four types below are listed in order, from the most restricted in how they can be
analysed (nominal) to the least restricted (ratio).
Nominal Data: These are data which classify or categorise some attribute. They may be
coded as numbers, but the numbers have no real meaning; each is just a label, and the
categories have no default or natural order. Examples: town of residence, colour of car,
male or female (this last one is an example of a dichotomous variable; it can take two
mutually exclusive values).
Ordinal Data: These are data that can be put in an order, but don't have a numerical
meaning beyond the order. So, for instance, the difference between 2 and 4 in the example
of a Likert scale below might not be the same as the difference between 3 and 5. Examples:
Questionnaire responses coded: 1 = strongly disagree, 2 = disagree, 3 = indifferent,
4 = agree, 5 = strongly agree. Level of pain felt in joint rated on a scale from
0 (comfortable) to 10 (extremely painful).
Interval Data: These are numerical data where the distances between numbers have
meaning, but the zero has no real meaning. With interval data it is not meaningful to say
that one measurement is twice another, and such a statement might not still be true if the
units were changed. Example: Temperature measured in Centigrade; a cup of coffee at 80°C
isn't twice as hot as one at 40°C.
Ratio Data: These are numerical data where the distances between data and the zero point
have real meaning. With such data it is meaningful to say that one value is twice as much
as another, and this would still be true if the units were changed. Examples: Heights,
Weights, Salaries, Ages. If someone is twice as heavy as someone else in pounds, this will
still be true in kilograms.
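The difference between interval and ratio data can be checked with a little arithmetic. The short Python sketch below (the function names are my own, chosen for illustration) converts the coffee temperatures to Fahrenheit and a pair of weights to kilograms, then compares the ratios before and after the change of units.

```python
def celsius_to_fahrenheit(celsius):
    """Convert a temperature from Centigrade to Fahrenheit."""
    return celsius * 9 / 5 + 32


def pounds_to_kilograms(pounds):
    """Convert a weight from pounds to kilograms."""
    return pounds * 0.45359237


# Interval data: the "twice as hot" ratio does not survive a change of units.
ratio_centigrade = 80 / 40
ratio_fahrenheit = celsius_to_fahrenheit(80) / celsius_to_fahrenheit(40)

# Ratio data: "twice as heavy" is still true after a change of units.
ratio_pounds = 200 / 100
ratio_kilograms = pounds_to_kilograms(200) / pounds_to_kilograms(100)

print(f"Temperature ratio: {ratio_centigrade:.2f} in Centigrade, "
      f"{ratio_fahrenheit:.2f} in Fahrenheit")
print(f"Weight ratio: {ratio_pounds:.2f} in pounds, "
      f"{ratio_kilograms:.2f} in kilograms")
```

The temperature ratio changes because the zero on the Centigrade scale is arbitrary, whereas zero pounds and zero kilograms both mean "no weight at all".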
Typically only data from the last two types might be suitable for parametric methods,
although, as we'll see later, it isn't always a completely straightforward decision. When
documenting research it is reasonable to justify the choice of analysis, to prevent the reader
believing that the analysis that best supported the hypothesis was chosen rather than the
one most appropriate to the data. The important thing in this decision, as I hope we'll see,
is not to make unsupported assumptions about the data and apply methods that assume
"better" data than you have.
Are your data paired?
Paired data are often the result of before and after situations, e.g. before and after
treatment. In such a scenario each research subject would have a pair of measurements and
it might be that you look for a difference in these measurements to show an improvement
due to the treatment. In SPSS those data would be coded into two columns; each row would
hold the before and the after measurement for the same individual.
We might, for example, measure the balance performance of 10 subjects with a Balance
Performance Monitor (BPM) before and after taking a month-long course of exercise
designed to improve balance. Each subject would have a pair of balance readings. This
would be paired data. In this simple form we could do several things with the data: we
could find the average reading for the balance (means or medians), or we could graph the
data on a boxplot. The latter would be useful to show both level and spread, and would let
us get a feel for the data and see any outliers.
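To make the two-column layout concrete, here is a small Python sketch of the balance example. The readings are invented for illustration (they are not real BPM data), and the variable names are my own. Each row holds the before and after reading for one subject, just as each row would in the SPSS data editor.

```python
from statistics import mean, median

# Hypothetical balance readings for 10 subjects: made-up values.
before = [52, 48, 60, 55, 47, 63, 58, 50, 49, 61]
after = [56, 50, 59, 60, 52, 66, 57, 55, 53, 64]

# One row per subject, two columns: (before, after).
rows = list(zip(before, after))

# Because the data are paired we can work out each subject's own change.
differences = [after_score - before_score for before_score, after_score in rows]

print(f"Mean before: {mean(before)}, mean after: {mean(after)}")
print(f"Median before: {median(before)}, median after: {median(after)}")
print(f"Change for each subject: {differences}")
```

The last line is only possible because we know which after reading belongs to which before reading.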
In the example as stated above the data are paired: each subject has a pair of numbers.
What if you made your subjects do another month of exercise and measured their balance
again? Each subject would have three numbers. The data would still be paired but, rather
than stretch the English language by talking about a pair of three, we call this repeated
measures. These data would be stored in three columns in SPSS.
A word of warning: sometimes you might gather paired data (as above, before we
pretended there was a third column of data) but end up with independent groups. Say, for
example, you decided that the design above was flawed (which it is) because it doesn't take
into account the fact that people might simply get better at balancing on the balance
performance monitor due to having had their first go a month before, i.e. we might see an
increase in balance due to using the balance monitor! To counter this possible effect we
could recruit another group of similar subjects. These would be assessed on the BPM but
not undertake the exercise sessions; consequently we could assess the effect of
measurement without exercise on this control group. We then have a dilemma about how
to treat the two sets of data. We could analyse them separately and hope to find a
significant increase in balance in our treatment group but not in the non-exercise group. A
better method would be to calculate the change in balance for each individual and see if
there is a significant difference in that change between the groups. This latter method ends
with the analysis actually being carried out on non-paired data. (An alternative analysis
would be to use a two-factor mixed factorial ANOVA, but that sounds a bit too hard just
now! Maybe later.)
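The change-score idea can be sketched in a few lines of Python. The readings below are invented for illustration and the names (`exercise_group`, `control_group`, `change_scores`) are my own. The sketch only prepares the data; it does not carry out a significance test.

```python
from statistics import mean

# Hypothetical (before, after) balance readings: made-up values.
exercise_group = [(52, 56), (48, 50), (60, 59), (55, 60), (47, 52)]
control_group = [(51, 52), (49, 49), (62, 61), (54, 56), (46, 47)]


def change_scores(pairs):
    """Return after minus before for each subject."""
    return [after - before for before, after in pairs]


# Within each group the readings are paired, so we can find each change.
exercise_change = change_scores(exercise_group)
control_change = change_scores(control_group)

# The two lists of changes come from different people, so they are
# independent samples: there is no pairing between the lists.
print(f"Exercise group changes: {exercise_change}, mean {mean(exercise_change)}")
print(f"Control group changes:  {control_change}, mean {mean(control_change)}")
```

Each subject now contributes a single number, and the comparison is between two separate groups of people, which is why the final analysis is one for non-paired data.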
If you are not sure whether two columns of data are paired or not, consider whether
rearranging the order of one of the columns would affect your data. If it would, they are
paired. Paired data often occur in ‘before and after’ situations. They are also known as
‘related samples’. Non-paired data can also be referred to as ‘independent samples’.
Scatterplots (also called scattergrams) are only meaningful for paired data.
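The reordering check can be tried out directly. In this Python sketch (made-up readings, names of my own choosing) one column is put into a different order. The column averages stay the same, but the change worked out for each subject becomes nonsense, which tells us the original row order carried information and so the data are paired.

```python
from statistics import mean

# Hypothetical before and after readings for five subjects: made-up values.
before = [52, 48, 60, 55, 47]
after = [56, 50, 59, 60, 52]


def per_subject_changes(first, second):
    """Return second minus first, row by row."""
    return [s - f for f, s in zip(first, second)]


original_changes = per_subject_changes(before, after)

# Rearrange one column (here simply by sorting it) and recalculate.
reordered_after = sorted(after)
scrambled_changes = per_subject_changes(before, reordered_after)

print(f"Mean of 'after' column: {mean(after)} (reordered: {mean(reordered_after)})")
print(f"Original changes:  {original_changes}")
print(f"Scrambled changes: {scrambled_changes}")
```

For two independent groups there would be no row-by-row changes to spoil, so rearranging a column would make no difference.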
Parametric or Nonparametric data
Before choosing a statistical test to apply to your data you should address the issue of
whether your data are parametric or not. This is quite a subtle and convoluted decision, but
the guidelines here should help start you thinking. Remember, the important rule is not to
make unsupported assumptions about the data: don't just assume the data are parametric.
You can use academic precedent to share the blame ("Bloggs et al. 2001 used a t-test so I
will"), or you might test the data for normality (we'll try this later), or you might decide
that, given a small sample, it is sensible to opt for nonparametric methods to avoid making
assumptions.
• Ranks, scores, or categories are generally non-parametric data.
• Measurements that come from a population that is normally distributed can usually
be treated as parametric.