Analysis of Categorical Data
14.1 A Description of the Experiment
14.2 The Chi-Square Test
14.3 A Test of a Hypothesis Concerning Specified Cell Probabilities: A Goodness-of-Fit Test
14.4 Contingency Tables
14.5 r × c Tables with Fixed Row or Column Totals
14.6 Other Applications
14.7 Summary and Concluding Remarks
References and Further Readings
14.1 A Description of the Experiment
Many experiments result in measurements that are qualitative or categorical rather
than quantitative like many of the measurements discussed in previous chapters. In
these instances, a quality or characteristic is identified for each experimental unit.
Data associated with such measurements can be summarized by providing the count
of the number of measurements that fall into each of the distinct categories associated
with the variable. For example,
• Employees can be classified into one of five income brackets.
• Mice might react in one of three ways when subjected to a stimulus.
• Motor vehicles might fall into one of four vehicle types.
• Paintings could be classified into one of $k$ categories according to style and
period.
• The quality of surgical incisions could be most meaningfully identified as
excellent, very good, good, fair, or poor.
• Manufactured items are acceptable, seconds, or rejects.
All the preceding examples exhibit, to a reasonable degree of approximation, the
following characteristics, which define a multinomial experiment (see Section 5.9):
1. The experiment consists of $n$ identical trials.
2. The outcome of each trial falls into exactly one of $k$ distinct categories or cells.
3. The probability that the outcome of a single trial will fall in a particular cell,
cell $i$, is $p_i$, where $i = 1, 2, \ldots, k$, and remains the same from trial to trial.
Notice that
\[ p_1 + p_2 + p_3 + \cdots + p_k = 1. \]
4. The trials are independent.
5. We are interested in $n_1, n_2, n_3, \ldots, n_k$, where $n_i$ for $i = 1, 2, \ldots, k$ is equal
to the number of trials for which the outcome falls into cell $i$. Notice that
\[ n_1 + n_2 + n_3 + \cdots + n_k = n. \]
This experiment is analogous to tossing $n$ balls at $k$ boxes, where each ball must
fall into exactly one of the boxes. The probability that a ball will fall into a box varies
from box to box but remains the same for each box in repeated tosses. Finally, the
balls are tossed in such a way that the trials are independent. At the conclusion of
the experiment, we observe $n_1$ balls in the first box, $n_2$ in the second, $\ldots$, and $n_k$ in
the $k$th. The total number of balls is $n = n_1 + n_2 + n_3 + \cdots + n_k$.
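The balls-and-boxes description can be mimicked with a short simulation. The following is only a sketch: the function name `toss_balls`, the seed, and the four cell probabilities are illustrative assumptions, not part of the text.

```python
import random

def toss_balls(n, probs, seed=None):
    """Simulate n independent trials and return the cell counts n_1, ..., n_k."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("cell probabilities must sum to 1")
    rng = random.Random(seed)
    counts = [0] * len(probs)
    # Every ball lands in exactly one box; the box probabilities
    # stay the same from toss to toss and the tosses are independent.
    for cell in rng.choices(range(len(probs)), weights=probs, k=n):
        counts[cell] += 1
    return counts

# Hypothetical experiment: n = 100 balls, k = 4 boxes.
counts = toss_balls(100, [0.1, 0.2, 0.3, 0.4], seed=1)
print(counts, sum(counts))  # the counts always add up to n
```

Whatever the seed, the counts add up to $n$, which is the restriction $n_1 + n_2 + \cdots + n_k = n$ noted above.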
Notice the similarity between the binomial and the multinomial experiments and,
in particular, that the binomial experiment represents the special case for the
multinomial experiment when $k = 2$. The two-cell probabilities, $p$ and $q = 1 - p$, of the
binomial experiment are replaced by the $k$-cell probabilities, $p_1, p_2, \ldots, p_k$, of the
multinomial experiment. The objective of this chapter is to make inferences about the
cell probabilities $p_1, p_2, \ldots, p_k$. The inferences will be expressed in terms of
statistical tests of hypotheses concerning the specific numerical values of the cell
probabilities or their relationship one to another.
Because the calculation of multinomial probabilities is somewhat cumbersome, it
would be difficult to calculate the exact significance levels (probabilities of type I
errors) for hypotheses regarding the values of $p_1, p_2, \ldots, p_k$. Fortunately, we have
been relieved of this chore by the British statistician Karl Pearson, who proposed a
very useful test statistic for testing hypotheses concerning $p_1, p_2, \ldots, p_k$ and gave the
approximate sampling distribution of this statistic. We will outline the construction
of Pearson's test statistic in the following section.
14.2 The Chi-Square Test
Suppose that $n = 100$ balls were tossed at the cells (boxes) and that we knew that $p_1$
was equal to .1. How many balls would be expected to fall into cell 1? Referring to
Section 5.9, recall that $n_1$ has a (marginal) binomial distribution with parameters $n$
and $p_1$, and that
\[ E(n_1) = np_1 = (100)(.1) = 10. \]
In like manner, each of the $n_i$'s has a binomial distribution with parameters $n$ and $p_i$,
and the expected number falling into cell $i$ is
\[ E(n_i) = np_i, \qquad i = 1, 2, \ldots, k. \]
Now suppose that we hypothesize values for $p_1, p_2, \ldots, p_k$ and calculate the
expected value for each cell. Certainly if our hypothesis is true, the cell counts $n_i$
should not deviate greatly from their expected values $np_i$ for $i = 1, 2, \ldots, k$. Hence, it
would seem intuitively reasonable to use a test statistic involving the $k$ deviations,
\[ n_i - E(n_i) = n_i - np_i, \qquad \text{for } i = 1, 2, \ldots, k. \]
In 1900 Karl Pearson proposed the following test statistic, which is a function of the
squares of the deviations of the observed counts from their expected values, weighted
by the reciprocals of their expected values:
\[ X^2 = \sum_{i=1}^{k} \frac{[n_i - E(n_i)]^2}{E(n_i)} = \sum_{i=1}^{k} \frac{[n_i - np_i]^2}{np_i}. \]
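Pearson's statistic is straightforward to compute once the hypothesized cell probabilities are fixed. The sketch below is a minimal illustration; the function names and the observed counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def expected_counts(n, probs):
    """Expected cell counts E(n_i) = n * p_i under the hypothesized probabilities."""
    return [n * p for p in probs]

def pearson_x2(observed, probs):
    """Pearson's X^2: squared deviations weighted by reciprocal expected counts."""
    n = sum(observed)
    expected = expected_counts(n, probs)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical data: n = 100 trials, hypothesized p = (.1, .2, .3, .4).
observed = [12, 18, 30, 40]
probs = [0.1, 0.2, 0.3, 0.4]
print(expected_counts(100, probs))   # approximately 10, 20, 30, 40
print(pearson_x2(observed, probs))   # 4/10 + 4/20 + 0 + 0, about 0.6
```

Only the first two cells deviate from their expected counts, so they alone contribute to the sum.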
Although the mathematical proof is beyond the scope of this text, it can be shown
that when $n$ is large, $X^2$ has an approximate chi-square ($\chi^2$) probability
distribution. We can easily demonstrate this result for the case $k = 2$, as follows.
If $k = 2$, then $n_2 = n - n_1$ and $p_1 + p_2 = 1$. Thus,
\[
\begin{aligned}
X^2 &= \sum_{i=1}^{2} \frac{[n_i - E(n_i)]^2}{E(n_i)}
     = \frac{(n_1 - np_1)^2}{np_1} + \frac{(n_2 - np_2)^2}{np_2} \\
    &= \frac{(n_1 - np_1)^2}{np_1} + \frac{[(n - n_1) - n(1 - p_1)]^2}{n(1 - p_1)} \\
    &= \frac{(n_1 - np_1)^2}{np_1} + \frac{(-n_1 + np_1)^2}{n(1 - p_1)} \\
    &= (n_1 - np_1)^2 \left[ \frac{1}{np_1} + \frac{1}{n(1 - p_1)} \right]
     = \frac{(n_1 - np_1)^2}{np_1(1 - p_1)}.
\end{aligned}
\]
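The algebra for $k = 2$ can be spot-checked numerically. This sketch compares the two-term sum with the closed form $(n_1 - np_1)^2 / [np_1(1 - p_1)]$ over every possible value of $n_1$; the function names and the choice $n = 100$, $p_1 = .1$ are assumptions made for illustration.

```python
def x2_two_cells(n1, n, p1):
    """X^2 written as the sum over both cells, using n2 = n - n1 and p2 = 1 - p1."""
    n2, p2 = n - n1, 1 - p1
    return (n1 - n * p1) ** 2 / (n * p1) + (n2 - n * p2) ** 2 / (n * p2)

def x2_closed_form(n1, n, p1):
    """The simplified single-term expression derived for k = 2."""
    return (n1 - n * p1) ** 2 / (n * p1 * (1 - p1))

n, p1 = 100, 0.1
# Largest disagreement between the two expressions over n1 = 0, 1, ..., n.
worst = max(abs(x2_two_cells(n1, n, p1) - x2_closed_form(n1, n, p1))
            for n1 in range(n + 1))
print(worst)  # differs from zero only by floating-point rounding
```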
We have seen (Section 7.5) that for large $n$
\[ \frac{n_1 - np_1}{\sqrt{np_1(1 - p_1)}} \]
has approximately a standard normal distribution. Since the square of a standard
normal random variable has a $\chi^2$ distribution (see Example 6.11), for $k = 2$ and large
$n$, $X^2$ has an approximate $\chi^2$ distribution with 1 degree of freedom (df).
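A small Monte Carlo experiment gives a feel for how good the approximation is when $k = 2$. The sketch below estimates how often $X^2$ exceeds 3.841, the upper .05 point of the $\chi^2$ distribution with 1 df (that is, $1.96^2$). The settings $n = 100$, $p_1 = .5$, the number of replications, and the seed are arbitrary assumptions.

```python
import random

def simulate_exceedance_rate(n=100, p1=0.5, reps=20000, critical=3.841, seed=0):
    """Estimate P(X^2 > critical) for a two-cell (binomial) experiment."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(reps):
        # n1 counts the trials that fall into cell 1.
        n1 = sum(rng.random() < p1 for _ in range(n))
        x2 = (n1 - n * p1) ** 2 / (n * p1 * (1 - p1))
        if x2 > critical:
            exceed += 1
    return exceed / reps

print(simulate_exceedance_rate())
```

Because $n_1$ is discrete, the estimated rate should be close to .05 but need not match it exactly.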
Experience has shown that the cell counts $n_i$ should not be too small if the $\chi^2$
distribution is to provide an adequate approximation to the distribution of $X^2$. As a
rule of thumb, we will require that all expected cell counts are at least five, although
Cochran (1952) has noted that this value can be as low as one for some situations.
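The rule of thumb is easy to check before running a test. A minimal sketch, assuming a hypothetical helper `small_expected_cells` and made-up cell probabilities:

```python
def small_expected_cells(n, probs, minimum=5):
    """Return the (1-based) cells whose expected count n * p_i is below `minimum`."""
    return [i for i, p in enumerate(probs, start=1) if n * p < minimum]

# Hypothetical design: n = 40 trials with hypothesized p = (.05, .15, .80).
# The expected counts are about 2, 6, and 32, so only cell 1 is flagged.
print(small_expected_cells(40, [0.05, 0.15, 0.80]))
```

An empty list means every expected count meets the threshold. The `minimum` argument can be lowered for situations like those noted by Cochran (1952).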
You will recall the use of the $\chi^2$ probability distribution for testing a hypothesis
concerning a population variance $\sigma^2$ in Section 10.9. In particular, we have seen that
the shape of the $\chi^2$ distribution and the associated quantiles and tail areas differ
considerably depending on the number of degrees of freedom (see Table 6, Appendix 3).
Therefore, if we want to use $X^2$ as a test statistic, we must know the number of degrees
of freedom associated with the approximating $\chi^2$ distribution and whether to use
a one-tailed or two-tailed test in locating the rejection region for the test. The latter
problem may be solved directly. Because large differences between the observed and
expected cell counts contradict the null hypothesis, we will reject the null hypothesis
when $X^2$ is large and employ an upper-tailed statistical test.
The determination of the appropriate number of degrees of freedom to be employed
for the test can be a little tricky and therefore will be specified for the physical
applications described in the following sections. In addition, we will state the principle
involved (which is fundamental to the mathematical proof of the approximation) so
that you will understand why the number of degrees of freedom changes with various
applications. This principle states that the appropriate number of degrees of freedom
will equal the number of cells, $k$, less 1 df for each independent linear restriction
placed on the cell probabilities. For example, one linear restriction is always
present because the sum of the cell probabilities must equal 1; that is,
\[ p_1 + p_2 + p_3 + \cdots + p_k = 1. \]
Other restrictions will be introduced for some applications because of the necessity
for estimating unknown parameters required in the calculation of the expected cell
frequencies or because of the method used to collect the sample. When unknown
parameters must be estimated in order to compute $X^2$, a maximum-likelihood
estimator (MLE) should be employed. The number of degrees of freedom for the
approximating $\chi^2$ distribution is reduced by 1 for each parameter estimated. These
cases will arise as we consider various practical examples.
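The counting principle can be written as a one-line calculation. In this sketch the function name and the argument `extra_restrictions` are hypothetical. The argument counts every independent linear restriction beyond the one that is always present, such as each parameter estimated by maximum likelihood.

```python
def chi_square_df(k, extra_restrictions=0):
    """Degrees of freedom: k cells, less 1 for the sum-to-one restriction,
    less 1 for each additional independent linear restriction."""
    df = k - 1 - extra_restrictions
    if df < 1:
        raise ValueError("too many restrictions for the number of cells")
    return df

print(chi_square_df(5))                        # fully specified probabilities: 4 df
print(chi_square_df(5, extra_restrictions=1))  # one parameter estimated: 3 df
```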
14.3 A Test of a Hypothesis Concerning Specified Cell Probabilities: A Goodness-of-Fit Test
The simplest hypothesis concerning the cell probabilities is one that specifies
numerical values for each. In this case, we are testing
$H_0 : p_1 = p_{1,0},\ p_2 = p_{2,0},\ \ldots,\ p_k = p_{k,0}$, where
$p_{i,0}$ denotes a specified value for $p_i$. The alternative is the general one that states
that at least one of the equalities does not hold. Because the only
restriction on the cell probabilities is that $\sum_{i=1}^{k} p_i = 1$, the $X^2$ test statistic has
approximately a $\chi^2$ distribution with $k - 1$ df.
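For three cells the upper-tailed test can be carried out without tables, because the $\chi^2$ distribution with 2 df has the closed-form tail probability $P(\chi^2 > x) = e^{-x/2}$. The sketch below relies on that special case only. The function name and the counts are hypothetical and are not taken from the text.

```python
import math

def goodness_of_fit_three_cells(observed, null_probs, alpha=0.05):
    """Upper-tailed X^2 test of fully specified probabilities for k = 3 cells."""
    if len(observed) != 3 or len(null_probs) != 3:
        raise ValueError("this sketch handles exactly three cells (2 df)")
    n = sum(observed)
    x2 = sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, null_probs))
    # With 2 df the chi-square upper-tail area is exp(-x/2).
    p_value = math.exp(-x2 / 2)
    return x2, p_value, p_value < alpha

# Hypothetical counts with n = 120 and H0: p = (1/4, 1/2, 1/4).
# Expected counts are 30, 60, 30, so X^2 = 0 + 100/60 + 100/30 = 5.
x2, p_value, reject = goodness_of_fit_three_cells([30, 50, 40], [0.25, 0.5, 0.25])
print(x2, p_value, reject)
```

For other values of $k$ the tail area would come from a $\chi^2$ table with $k - 1$ df or from a statistical library.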
EXAMPLE 14.1  A group of rats, one by one, proceed down a ramp to one of three doors. We wish to
test the hypothesis that the rats have no preference concerning the choice of a door.
Thus, the appropriate null hypothesis is
\[ H_0 : p_1 = p_2 = p_3 = \frac{1}{3}, \]
where $p_i$ is the probability that a rat will choose door $i$, for $i = 1, 2,$ or $3$.
Suppose that the rats were sent down the ramp $n = 90$ times and that the three
observed cell frequencies were $n_1 = 23$, $n_2 = 36$, and $n_3 = 31$. The expected cell
frequencies are the same for each cell: $E(n_i) = np_i = (90)(1/3) = 30$. The observed