Statistical Methods
Fourth Edition
DONNA L. MOHR
University of North Florida, Emeritus
WILLIAM J. WILSON
University of North Florida, Emeritus
RUDOLF J. FREUND
Academic Press is an imprint of Elsevier
125 London Wall, London EC2Y 5AS, United Kingdom
525 B Street, Suite 1650, San Diego, CA 92101, United States
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom
Copyright © 2022 Elsevier Inc. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions
This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).
Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.
Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.
To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library
Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress
ISBN: 978-0-12-823043-5
For information on all Academic Press publications visit our website at https://www.elsevier.com/books-and-journals
Content Strategist: Katey Birtcher
Content Development Specialist: Alice Grant
Publishing Services Manager: Shereen Jameel
Project Manager: Rukmani Krishnan
Typeset by MPS Limited, Chennai, India
Printed in India
Last digit is the print number: 9 8 7 6 5 4 3 2 1
1. Data and Statistics
1.1 Introduction
1.1.1 Data Sources
1.1.2 Using the Computer
1.2 Observations and Variables
1.3 Types of Measurements for Variables
1.4 Distributions
1.4.1 Graphical Representation of Distributions
1.5 Numerical Descriptive Statistics
1.5.1 Location
1.5.2 Dispersion
1.5.3 Other Measures
1.5.4 Computing the Mean and Standard Deviation from a Frequency Distribution
1.5.5 Change of Scale
1.6 Exploratory Data Analysis
1.6.1 The Stem and Leaf Plot
1.6.2 The Box Plot
1.6.3 Examples of Exploratory Data Analysis
1.7 Bivariate Data
1.7.1 Categorical Variables
1.7.2 Categorical and Interval Variables
1.7.3 Interval Variables
1.8 Populations, Samples, and Statistical Inference: A Preview
Data Collection
2.2.1 Definitions and Concepts
2.2.2 System Reliability
2.2.3 Random Variables
2.3 Discrete Probability Distributions
2.3.1 Properties of Discrete Probability Distributions
2.3.2 Descriptive Measures for Probability Distributions
2.3.3 The Discrete Uniform Distribution
2.3.4 The Binomial Distribution
2.3.5 The Poisson Distribution
2.4 Continuous Probability Distributions
2.4.1 Characteristics of a Continuous Probability Distribution
2.4.2 The Continuous Uniform Distribution
2.4.3 The Normal Distribution
2.4.4 Calculating Probabilities Using the Table of the Normal Distribution
2.5 Sampling Distributions
2.5.1 Sampling Distribution of the Mean
2.5.2 Usefulness of the Sampling Distribution
2.5.3 Sampling Distribution of a Proportion
2.6 Other Sampling Distributions
2.6.1 The χ² Distribution
2.6.2 Distribution of the Sample Variance
2.6.3 The t Distribution
2.6.4 Using the t Distribution
2.6.5 The F Distribution
2.6.6 Using the F Distribution
2.6.7 Relationships among the Distributions
2.7 Chapter Summary
2.8 Chapter Exercises: Concept Questions, Practice Exercises, Exercises
3. Principles of Inference
3.1 Introduction
3.2 Hypothesis Testing
3.2.1 General Considerations
3.2.2 The Hypotheses
3.2.3 Rules for Making Decisions
3.2.4 Possible Errors in Hypothesis Testing
3.2.5 Probabilities of Making Errors
3.2.6 Choosing between α and β
3.2.7 Five-Step Procedure for Hypothesis Testing
3.2.8 Why Do We Focus on the Type I Error?
3.2.9 Choosing α
3.2.10 The Five Steps for Example 3.3
3.2.11 p Values
3.2.12 The Probability of a Type II Error
3.2.13 Power
3.2.14 Uniformly Most Powerful Tests
3.2.15 One-Tailed Hypothesis Tests
3.3 Estimation
3.3.1 Interpreting the Confidence Coefficient
3.3.2 Relationship between Hypothesis Testing and Confidence Intervals
3.4 Sample Size
3.5 Assumptions
3.5.1 Statistical Significance versus Practical Significance
3.6 Chapter Summary
3.7 Chapter Exercises: Concept Questions, Practice Exercises, Multiple Choice Questions, Exercises
4. Inferences on a Single Population
4.1 Introduction
4.2 Inferences on the Population Mean
4.2.1 Hypothesis Test on μ
4.2.2 Estimation of μ
4.2.3 Sample Size
4.2.4 Degrees of Freedom
4.3 Inferences on a Proportion for Large Samples
4.3.1 Hypothesis Test on p
4.3.2 Estimation of p
4.3.3 Sample Size
4.4 Inferences on the Variance of One Population
4.4.1 Hypothesis Test on σ²
4.4.2 Estimation of σ²
4.5 Assumptions
4.5.1 Required Assumptions and Sources of Violations
4.5.2 Detection of Violations
4.5.3 Tests for Normality
4.5.4 If Assumptions Fail
4.5.5 Alternate Methodology
4.6 Chapter Summary
4.7 Chapter Exercises: Concept Questions, Practice Exercises, Exercises, Projects
5. Inferences for Two Populations
5.1 Introduction
5.2 Inferences on the Difference between Means Using Independent Samples
5.2.1 Sampling Distribution of a Linear Function of Random Variables
5.2.2 The Sampling Distribution of the Difference between Two Means
5.2.3 Variances Known
5.2.4 Variances Unknown but Assumed Equal
5.2.5 The Pooled Variance Estimate
5.2.6 The “Pooled” t Test
5.2.7 Variances Unknown but Not Equal
5.2.8 Choosing between the Pooled and Unequal Variance t Tests
5.3 Inferences on Variances
5.4 Inferences on Means for Dependent Samples
5.5 Inferences on Proportions for Large Samples
5.5.1 Comparing Proportions Using Independent Samples
5.5.2 Comparing Proportions Using Paired Samples
5.6 Assumptions and Remedial Methods
5.7 Chapter Summary
5.8 Chapter Exercises: Concept Questions, Practice Exercises
6. Inferences for Two or More Means
6.1 Introduction
6.1.1 Using Statistical Software
6.2 The Analysis of Variance
6.2.1 Notation and Definitions
6.2.2 Heuristic Justification for the Analysis of Variance
6.2.3 Computational Formulas and the Partitioning of Sums of Squares
6.2.4 The Sum of Squares between Means
6.2.5 The Sum of Squares within Groups
6.2.6 The Ratio of Variances
6.2.7 Partitioning of the Sums of Squares
6.3 The Linear Model
6.3.1 The Linear Model for a Single Population
6.3.2 The Linear Model for Several Populations
6.3.3 The Analysis of Variance Model
6.3.4 Fixed and Random Effects Model
6.3.5 The Hypotheses
6.3.6 Expected Mean Squares
6.4 Assumptions
6.4.1 Assumptions and Detection of Violations
6.4.2 Formal Tests for the Assumption of Equal Variance
6.4.3 Remedial Measures
6.5 Specific Comparisons
6.5.1 Contrasts
6.5.2 Constructing a t Statistic for a Contrast
6.5.3 Planned Contrasts with No Pattern: Bonferroni’s Method
6.5.4 Planned Comparisons versus Control: Dunnett’s Method
6.5.5 Planned All Possible Pairwise Comparisons: Fisher’s LSD and Tukey’s HSD
6.5.6 Planned Orthogonal Contrasts
6.5.7 Unplanned Contrasts: Scheffé’s Method
6.5.8 Comments
6.6 Random Models
6.7 Analysis of Means
6.7.1 ANOM for Proportions
6.7.2 ANOM for Count Data
6.8 Chapter Summary
6.9 Chapter Exercises: Concept Questions, Practice Exercises, Exercises, Projects
7. Linear Regression
7.1 Introduction
7.2 The Regression Model
7.3 Estimation of Parameters β₀ and β₁
7.3.1 A Note on Least Squares
7.4 Estimation of σ² and the Partitioning of Sums of Squares
7.5 Inferences for Regression
7.5.1 The Analysis of Variance Test for β₁
7.5.2 The (Equivalent) t Test for β₁
7.5.3 Confidence Interval for β₁
7.5.4 Inferences on the Response Variable
7.6 Using Statistical Software
7.7 Correlation
7.8 Regression Diagnostics
7.9 Chapter Summary
7.10 Chapter Exercises: Concept Questions, Practice Exercises, Exercises, Projects
8. Multiple Regression
8.1 The Multiple Regression Model
8.1.1 The Partial Regression Coefficient
8.2 Estimation of Coefficients
8.2.1 Simple Linear Regression with Matrices
8.2.2 Estimating the Parameters of a Multiple Regression Model
8.2.3 Correcting for the Mean, an Alternative Calculating Method
8.3 Inferential Procedures
8.3.1 Estimation of σ² and the Partitioning of the Sums of Squares
8.3.2 The Coefficient of Variation
8.3.3 Inferences for Coefficients
8.3.4 Tests Normally Provided by Statistical Software
8.3.5 The Equivalent t Statistic for Individual Coefficients
8.3.6 Inferences on the Response Variable
8.4 Correlations
8.4.1 Multiple Correlation
8.4.2 How Useful Is the R² Statistic?
8.4.3 Partial Correlation
8.5 Using Statistical Software
8.6 Special Models
8.6.1 The Polynomial Model
8.6.2 The Multiplicative Model
8.6.3 Nonlinear Models
8.7 Multicollinearity
8.7.1 Redefining Variables
8.7.2 Other Methods
8.8 Variable Selection
8.8.1 Other Selection Procedures
8.9 Detection of Outliers, Row Diagnostics
8.10 Chapter Summary
8.11 Chapter Exercises: Concept Questions, Practice Exercises
9. Factorial Experiments
9.1 Introduction
9.2 Concepts and Definitions
9.3 The Two-Factor Factorial Experiment
9.3.1 The Linear Model
9.3.2 Notation
9.3.3 Computations for the Analysis of Variance
9.3.4 Between-Cells Analysis
9.3.5 The Factorial Analysis
9.3.6 Expected Mean Squares
9.3.7 Unbalanced Data
9.4 Specific Comparisons
9.4.1 Preplanned Contrasts
9.4.2 Basic Test Statistic for Contrasts
9.4.3 Multiple Comparisons
9.5 Quantitative Factors
9.5.1 Lack of Fit
9.6 No Replications
9.7 Three or More Factors
9.7.1 Additional Considerations
9.8 Chapter Summary
9.9 Chapter Exercises: Concept Questions, Practice Exercises
10. Design of Experiments
10.1 Introduction
10.2 The Randomized Block Design
10.2.1 The Linear Model
10.2.2 Relative Efficiency
10.2.3 Random Treatment Effects in the Randomized Block Design
10.3 Randomized Blocks with Sampling
10.4 Other Designs
10.4.1 Factorial Experiments in a Randomized Block Design
10.4.2 Nested Designs
10.5 Repeated Measures Designs
10.5.1 One Between-Subject and One Within-Subject Factor
10.5.2 Two Within-Subject Factors
10.5.3 Assumptions of the Repeated Measures Model
10.5.4 Split Plot Designs
10.5.5 Additional Topics
10.6 Chapter Summary
10.7 Chapter Exercises: Concept Questions, Practice Exercises, Exercises
11. Other Linear Models
11.1 Introduction
11.2 The Dummy Variable Model
11.2.1 Factor Effects Coding
11.2.2 Reference Cell Coding
11.2.3 Comparing Coding Schemes
11.3 Unbalanced Data
11.4 Statistical Software’s Implementation of the Dummy Variable Model
11.5 Models with Dummy and Interval Variables
11.5.1 Analysis of Covariance
11.5.2 Multiple Covariates
11.5.3 Unequal Slopes
11.5.4 Independence of Covariates and Factors
11.6 Extensions to Other Models
11.7 Estimating Linear Combinations of Regression Parameters
11.7.1 Covariance Matrices
11.7.2 Linear Combination of Regression Parameters
11.8 Weighted Least Squares
Correlated Errors
12. Categorical Data
12.1 Introduction
12.2 Hypothesis Tests for a Multinomial Population
12.3 Goodness of Fit Using the χ² Test
12.3.1 Test for a Discrete Distribution
12.3.2 Test for a Continuous Distribution
12.4 Contingency Tables
12.4.1 Computing the Test Statistic
12.4.2 Test for Homogeneity
12.4.3 Test for Independence
12.4.4 Measures of Dependence
12.4.5 Likelihood Ratio Test
12.4.6 Fisher’s Exact Test
12.5 Specific Comparisons in Contingency Tables
12.6 Chapter Summary
12.7 Chapter Exercises: Concept Questions, Practice Exercises
13. Special Types of Regression
13.1 Introduction
13.1.1 Maximum Likelihood and Least Squares
13.2 Logistic Regression
13.3 Poisson Regression
13.3.1 Choosing between Logistic and Poisson Regression
13.4 Nonlinear Least-Squares Regression
13.4.1 Sigmoidal Shapes (S Curves)
13.4.2 Symmetric Unimodal Shapes
13.5 Chapter Summary
13.6 Chapter Exercises: Concept Questions
14. Nonparametric Methods
14.1 Introduction
14.1.1 Ranks
14.1.2 Randomization Tests
14.1.3 Comparing Parametric and Nonparametric Procedures
14.2 One Sample
14.3 Two Independent Samples
14.4 More Than Two Samples
14.5 Randomized Block Design
14.6 Rank Correlation
14.7 The Bootstrap
14.8 Chapter Summary
14.9 Chapter Exercises: Concept Questions, Practice Exercises
APPENDIX A Tables of Distributions
A.1 Table of the Standard Normal Distribution
A.1A Table of Critical Values for the Standard Normal Distribution
A.2 Student’s t Distribution: Values exceeded by a given probability α
A.3 The χ² Distribution: Values exceeded by a given probability α
A.4 The F Distribution: 10% in the upper tail, P(F > c) = 0.10
A.4A The F Distribution: 5% in the upper tail, P(F > c) = 0.05
A.4B The F Distribution: 2.5% in the upper tail, P(F > c) = 0.025
A.4C The F Distribution: 1% in the upper tail, P(F > c) = 0.01
A.5 Critical Values for Dunnett’s Two-Sided Test of Treatments versus Control
A.6 Critical Values of the Studentized Range, for Tukey’s HSD
A.7 Critical Values for Use with the Analysis of Means (ANOM)
A.8 Critical Values for the Wilcoxon Signed Rank Test
A.9 Critical Values for the Mann-Whitney Rank Sums Test
APPENDIX B A Brief Introduction to Matrices
B.1 Matrix Algebra (online only)
B.2 Solving Linear Equations (online only)
C.1 Florida Lake Data
C.2 State Education Data
C.3 National Atmospheric Deposition Program (NADP) Data
C.4 Florida County Data
C.5 Cowpea Data
C.6 Jax House Prices Data
C.7 Gainesville, FL, Weather Data
C.8 General Social Survey (GSS) 2016 Data
Hints for Selected Exercises
Preface
The goal of Statistical Methods, Fourth Edition, is to introduce the student both to statistical reasoning and to the most commonly used statistical techniques. It is designed for undergraduates in statistics, engineering, the quantitative sciences, or mathematics, or for graduate students in a wide range of disciplines requiring statistical analysis of data. The text can be covered in a two-semester sequence, with the first semester corresponding to the foundational ideas in Chapters 1 through 7 and perhaps Chapter 12. Throughout the text, techniques have almost universal applicability. They may be illustrated with examples from agriculture or education, but the applications could just as easily have occurred in public administration or engineering.
Our ambition is that students who master this material will be able to select, implement, and interpret the most common types of analyses as they undertake research in their own disciplines. They should be able to read research articles and in most cases understand the descriptions of the statistical results and how the authors used them to reach their conclusions. They should understand the pitfalls of collecting statistical data and the roles played by the various mathematical assumptions.
Statistics can be studied at several levels. On one hand, students can learn by rote how to plug numbers into formulas, or more often now, into statistical software, and draw a number with a neat circle around it as the answer. This limited approach rarely leads to the kind of understanding that allows students to critically select methods and interpret results. On the other hand, there are numerous textbooks that provide introductions to the elegant mathematical backgrounds of the methods. Although this is a much deeper understanding than the first approach, its prerequisite mathematical understanding closes it to practitioners from many other disciplines.
In this text, we have tried to take a middle way. We present enough of the formulas to motivate the techniques, and illustrate their numerical application in small examples. However, the focus of the discussion is on the selection of the technique, the interpretation of the results, and a critique of the validity of the analysis. We urge the student (and instructor) to focus on these skills.
Guiding Principles
• No mathematics beyond algebra is required. However, mathematically oriented students may still find the material in this book challenging, especially if they also participate in courses in statistical theory.
• Formulas are presented primarily to show the how and why of a particular statistical analysis. For that reason, there are a minimal number of exercises that plug numbers into formulas.
• All examples are worked to a logical conclusion, including interpretation of results. Where computer printouts are used, results are discussed and explained. In general, the emphasis is on conclusions rather than mechanics.
• Throughout the book we stress that certain assumptions about the data must be fulfilled for the statistical analyses to be valid, and we emphasize that although the assumptions are often fulfilled, they should be routinely checked.
• Examples of the statistical techniques, as they are actually applied by researchers, are presented throughout the text, both in the chapter discussions and in the exercises.
• Students will have opportunities to work with data drawn from a variety of disciplines.
New to this Edition
• Streamlined Presentation. Numerous sections have been completely rewritten with the goal of a more concise description of the methods.
• Practice Problems for Every Chapter. Every chapter now includes Practice Exercises, with full solutions presented at the end of the text.
• Additional Data Sets for Projects. We have added three new data sets that instructors can use in preparing assignments, and we have updated the old data sets.
Using this Book
Organization
The organization of Statistical Methods, Fourth Edition, follows the classical order. The formulas in the book are generally the so-called definitional ones that emphasize concepts rather than computational efficiency. These formulas can be used for a few of the very simplest examples and problems, but we expect that virtually all exercises will be implemented on computers using special-purpose statistical software. The first seven chapters, which are normally covered in a first semester, include data description, probability and sampling distributions, the basics of inference for one and two sample situations, the analysis of variance, and one-variable regression. The second portion of the book starts with chapters on multiple regression, factorial experiments, experimental design, and an introduction to general linear models including the analysis of covariance. We have separated factorial experiments and design of experiments because they are different applications of the same numeric methods.
The last three chapters introduce topics in the analysis of categorical data, logistic and other special types of regression, and nonparametric statistics. These chapters provide a brief introduction to these important topics and are intended to round out the statistical education of those who will learn from this book.
Coverage
This book contains more material than can be covered in a two-semester course. We have purposely done this for two reasons:
• Because of the wide variety of audiences for statistical methods, not all instructors will want to cover the same material. For example, courses with heavy enrollments of students from the social and behavioral sciences will want to emphasize nonparametric methods and the analysis of categorical data with less emphasis on experimental design.
• Students who have taken statistical methods courses tend to keep their statistics books for future reference. We recognize that no single book will ever serve as a complete reference, but we hope that the broad coverage in this book will at least lead these students in the proper direction when the occasion demands.
Sequencing
For the most part, topics are arranged so that each new topic builds on previous topics; hence course sequencing should follow the book. There are, however, some exceptions that may appeal to some instructors:
• In some cases it may be preferable to present the material on categorical data at an early stage. Much of the material in Chapter 12 (Categorical Data) can be taught anytime after Chapter 5 (Inferences for Two Populations).
• Some instructors prefer to present nonparametric methods along with parametric methods. Again, any of the sections in Chapter 14 (Nonparametric Methods) may be extracted and presented along with their analogous parametric topic in earlier chapters.
Data Sets
Data files for all exercises and examples are available from the text Web site at https://www.elsevier.com/books-and-journals/book-companion/9780128230435 in ASCII (txt), EXCEL, and SAS format.
Appendix C fully describes eight data sets drawn from the geosciences, social sciences, and agricultural sciences that are suitable for a variety of small projects.
Computing
It is essential that students have access to statistical software. All the methods used in this text are common enough so that any multipurpose statistical software should suffice. (The single exception is the bootstrap, at the very end of the text.) For consistency and convenience, and because it is the most widely used single statistical computing package, we have relied heavily on the SAS System to illustrate examples in this text. However, we stress that the examples and exercises could as easily have been done in SPSS, Stata, R, Minitab, or any of a number of other software packages. As we demonstrate in a few cases, the various printouts contain enough common information that, with the aid of documentation, someone who can interpret results from one package should be able to do so from any other.
This text does not attempt to teach SAS or any other statistical software. Generic rather than software-specific instructions are the only directions given for performing the analyses. Most common statistical software has an increasing amount of independently published material available, either in traditional print or online. For those who wish to use the SAS System, sample programs for the examples within each chapter have been provided on the text Web site at https://www.elsevier.com/books-and-journals/book-companion/9780128230435. Students may find these of use as template programs that they can adapt for the exercises.
Acknowledgments
I was pleased when Rudy Freund and Bill Wilson invited me to help with the Third Edition, and honored to have the opportunity to become lead author on the Fourth Edition. Both experiences have left me with a tremendous respect for the erudition, time, and just plain hard work that Rudy and Bill put into writing the original text. My respect for them as statisticians, teachers, and mentors is unbounded. Sadly, Rudy Freund passed away in 2014. His reputation lives on with the numerous texts and research articles that he authored, and with the students that he inspired.
Donna Mohr, PhD
Emeritus Faculty of the University of North Florida
Data and Statistics
1.1 Introduction
To most people, the word statistics conjures up images of vast tables of numbers referring to stock prices, population, or baseball batting averages. Statistics, however, actually denotes a system for reasoning based on data. The collection of the data, its description through appropriate summaries, and the methods for drawing conclusions from it all form the discipline of statistics. It is the fundamental tool for data-driven reasoning. It is appropriate, then, to begin with a discussion of the characteristics of data. The purpose of this chapter is to
1. provide the definition of a set of data,
2. define the components of such a data set,
3. present tools that are used to describe a data set, and
4. briefly discuss methods of data collection.
Definition 1.1: A set of data is a collection of observed values representing one or more characteristics of some objects or units.
Example 1.1 GSS: A Typical Data Set
Every year, the National Opinion Research Center (NORC) publishes the results of a personal interview survey of U.S. households. This survey is called the General Social Survey (GSS) and is the basis for many studies conducted in the social sciences. In the 1996 GSS, a total of 2904 households were sampled and asked over 70 questions concerning lifestyles, incomes, religious and political beliefs, and opinions on various topics. Table 1.1 lists the data for a sample of 50 respondents on four of the questions asked. This table illustrates a typical midsized data set. Each of the rows corresponds to a particular respondent (labeled 1 through 50 in the first column). Each of the columns, starting with column two, contains the responses to one of the following four questions:
1. AGE: The respondent’s age in years
2. SEX: The respondent’s sex, coded 1 for male and 2 for female
3. HAPPY: The respondent’s general happiness, coded:
   1 for “Not too happy”
   2 for “Pretty happy”
   3 for “Very happy”
4. TVHOURS: The average number of hours the respondent watched TV during a day
This data set obviously contains a lot of information about this sample of 50 respondents. Unfortunately, this information is hard to interpret when the data are presented as shown in Table 1.1. There are just too many numbers to make any sense of the data, and we are only looking at 50 respondents! By summarizing some aspects of this data set, we can obtain much more usable information and perhaps even answer some specific questions. For example, what can we say about the overall frequency of the various levels of happiness? Do some respondents watch a lot of TV? Is there a relationship between the age of the respondent and his or her general happiness? Is there a relationship between the age of the respondent and the number of hours of TV watched?
We will return to this data set in Section 1.10 after we have explored some methods for making sense of data sets like this one. As we develop more sophisticated methods of analysis in later chapters, we will again refer to this data set.¹
¹ The GSS is discussed at the following Web site: http://www.gss.norc.org
Table 1.1 Sample of 50 responses to the 1996 GSS.
Respondent  AGE  SEX  HAPPY  TVHOURS
30          45   2    2      3
31          64   2    3      5
32          30   2    2      2
33          75   2    2      0
34          53   2    2      3
35          38   1    2      0
36          26   1    2      2
37          25   2    3      1
38          56   2    3      3
39          26   2    2      1
40          54   2    2      5
41          31   2    2      0
42          44   1    2      0
43          36   2    2      3
44          74   2    2      0
45          74   2    2      3
46          37   2    3      0
47          48   1    2      3
48          42   2    2      6
49          77   2    2      2
50          75   1    3      0
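As a rough illustration of the kind of summary the example asks for, the following Python sketch tabulates the HAPPY codes and lists heavy TV viewers. It uses only the rows for respondents 30 through 50 as they appear in Table 1.1, and the cutoff of 5 or more hours for "a lot of TV" is an assumption made for illustration, not a definition from the text.

```python
from collections import Counter

# Rows transcribed from Table 1.1 (respondents 30 through 50 only):
# (respondent, AGE, SEX, HAPPY, TVHOURS)
rows = [
    (30, 45, 2, 2, 3), (31, 64, 2, 3, 5), (32, 30, 2, 2, 2),
    (33, 75, 2, 2, 0), (34, 53, 2, 2, 3), (35, 38, 1, 2, 0),
    (36, 26, 1, 2, 2), (37, 25, 2, 3, 1), (38, 56, 2, 3, 3),
    (39, 26, 2, 2, 1), (40, 54, 2, 2, 5), (41, 31, 2, 2, 0),
    (42, 44, 1, 2, 0), (43, 36, 2, 2, 3), (44, 74, 2, 2, 0),
    (45, 74, 2, 2, 3), (46, 37, 2, 3, 0), (47, 48, 1, 2, 3),
    (48, 42, 2, 2, 6), (49, 77, 2, 2, 2), (50, 75, 1, 3, 0),
]

HAPPY_LABELS = {1: "Not too happy", 2: "Pretty happy", 3: "Very happy"}

# Frequency of each happiness level among the listed respondents.
happy_counts = Counter(row[3] for row in rows)
for code in sorted(HAPPY_LABELS):
    n = happy_counts.get(code, 0)
    print(f"{HAPPY_LABELS[code]:<14} {n:>3} {n / len(rows):6.1%}")

# Respondents at or above an assumed cutoff of 5 TV hours per day.
HEAVY_TV_CUTOFF = 5  # hypothetical threshold, chosen only for this sketch
heavy_viewers = [row[0] for row in rows if row[4] >= HEAVY_TV_CUTOFF]
print("Heavy viewers:", heavy_viewers)
```

Even this small tabulation is easier to read than the raw rows, which is the point the example makes about summarizing data.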
Definition 1.2: A population is a data set representing the entire entity of interest.
For example, the decennial census of the United States yields a data set containing information about all persons in the country at that time (theoretically all households correctly fill out the census forms). The number of persons per household as listed in the census data constitutes a population of family sizes in the United States.
Notice that the point of interest determines whether a data set is a population. Consider the reading comprehension scores of all third graders at a specific elementary school. This would be a population, if we were only interested in this particular school. If we intend to make statements about a broader group, then it is only a portion of the population.
As we shall see in discussions about statistical inference, it is important to define the population that we intend to study very carefully.
Definition 1.3: A sample is a data set consisting of a portion of a population. Normally a sample is obtained in such a way as to be representative of the population.
The Census Bureau conducts various activities during the years between each decennial census, such as the Current Population Survey. This survey samples a small number of scientifically chosen households to obtain information on changes in employment, living conditions, and other demographics. The data obtained constitute a sample from the population of all households in the country. Similarly, if four reading comprehension scores were selected for third graders at a specific school, then this would be a sample of size four from the population of all third graders.
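The distinction between a population and a sample can be sketched in a few lines of Python. The scores below are hypothetical, and simple random sampling is used here only as one common way of obtaining a sample; the text does not prescribe a particular sampling method at this point.

```python
import random

# Hypothetical reading comprehension scores for ALL third graders at one
# school. If this school is the entire entity of interest, this is a population.
population = [72, 85, 91, 64, 78, 88, 95, 70, 81, 77, 69, 90]

# Draw a sample of size four without replacement.
rng = random.Random(2022)  # fixed seed so the sketch is reproducible
sample = rng.sample(population, k=4)

population_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)

print("Sample of size four:", sample)
print(f"Population mean: {population_mean:.2f}")
print(f"Sample mean:     {sample_mean:.2f}")
```

The sample mean will generally differ from the population mean, and it changes with each new sample.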
1.1.1 Data Sources
Although the emphasis in this book is on the statistical analysis of data, we must emphasize that proper data collection is just as important as proper analysis. We touch briefly on issues of data collection in Section 1.9. There are many more detailed texts on this subject (for example, Scheaffer et al. 2012). Remember, even the most sophisticated analysis procedures cannot provide good results from bad data.
In general, data are obtained from two broad categories of sources:
• Primary data are collected as part of the study.
• Secondary data are obtained from published sources, such as journals, governmental publications, news media, or almanacs.
There are several ways of obtaining primary data. Data are often obtained from simple observation of a process, such as characteristics and prices of homes sold in a particular geographic location, quality of products coming off an assembly line, political opinions of registered voters in the state of Texas, or even counts made by a person standing on a street corner and recording how many cars pass each hour during the day.
This kind of study is called an observational study. Observational studies are often used to determine whether an association exists between two or more characteristics measured in the study. For example, a study to determine the relationship between high school student performance and the highest educational level of the student’s parents would be based on an examination of student performance and a history of the parents’ educational experiences. No cause-and-effect relationship could be determined, but a strong association might be the result of such a study. Note that an observational study does not involve any intervention by the researcher.
Often data used in studies involving statistics come from designed experiments. In a designed experiment researchers impose treatments and controls on the process and then observe the results and take measurements. Designed experiments can be used to help establish causation between two or more characteristics. For example, a study could be designed to determine if high school student performance is affected by a nutritious breakfast. This study may use as few as 25 typical urban high school students. The results of the study could potentially show that changes in breakfast cause changes in performance. The results observed in the sample would be generalized, or inferred, to the population of all urban high school students. Chapter 10 provides an introduction to experimental designs.
1.1.2 Using the Computer
Basic statistical analyses, including many of the applications in Chapters 1 through 7, can be carried out in spreadsheet software or even graphing calculators. More advanced graphics and analyses are best done with dedicated statistical software. Because of its commercial importance, we have largely used the SAS System in this text, but a number of other packages are available.
One common feature of almost every package is the way files containing the data are organized. A good rule of thumb is “one observation equals one row”; another is “one type of measurement (or variable) is one column.” Consider the data in Table 1.1. Arranged in a spreadsheet or a text file, the data would appear much as in that table, except that the right half of the table would be pasted below the left, to make 50 rows. Each row would correspond to a different respondent. Each column would correspond to a different item reported on that respondent.
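The row-and-column convention can be sketched with Python's standard csv module. The comma-delimited layout and the header name RESPONDENT are assumptions made for illustration; the three data rows are taken from Table 1.1.

```python
import csv
import io

# A small text file in "one observation per row, one variable per column"
# layout. The delimiter and header names are assumed for this sketch.
text = """RESPONDENT,AGE,SEX,HAPPY,TVHOURS
30,45,2,2,3
31,64,2,3,5
32,30,2,2,2
"""

reader = csv.DictReader(io.StringIO(text))

# Each row of the file becomes one observation: a mapping from
# variable name to that respondent's value.
observations = [
    {name: int(value) for name, value in row.items()} for row in reader
]
variables = reader.fieldnames  # the column headers, one per variable

print("Variables:", variables)
for obs in observations:
    print(obs)
```

Most statistical packages read files organized this way, although each has its own import commands.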
Although the input files have a certain similarity, each software package has its own style of output. Most will contain the same results, but these may be arranged and even labeled differently. The software’s documentation should fully explain the interpretation of the results.
1.2 Observations and Variables
A data set is composed of information from a set of units. Information from a unit is known as an observation. An observation consists of one or more pieces of information about the unit; these are called variables. Some examples:
• In a study of the effectiveness of a new headache remedy, the units are individual persons, of which 10 are given the new remedy and 10 are given an aspirin. The resulting data set has 20 observations and two variables: the medication used and a score indicating the severity of the headache.
• In a survey for determining TV viewing habits, the units are families. Usually there is one observation for each of thousands of families that have been contacted to participate in the survey. The variables describe the programs watched and the characteristics of the families.
• In a study to determine the effectiveness of a college admissions test (e.g., SAT), the units are the freshmen at a university. There is one observation per unit and the variables are the students’ scores on the test and their first year’s GPA.
Variables that yield nonnumerical information are called qualitative variables. Qualitative variables are often referred to as categorical variables. Those that yield numerical measurements are called quantitative variables. Quantitative variables can be further classified as discrete or continuous. The diagram below summarizes these definitions:
Variables
   Qualitative
   Quantitative
      Discrete
      Continuous
Definition 1.4: A discrete variable can assume only a countable number of values. Typically, discrete variables are frequencies of observations having specific characteristics, but not all discrete variables are frequencies.
Definition 1.5: A continuous variable is one that can take any one of an uncountable number of values in an interval. Continuous variables are usually measured on a scale and, although they may appear discrete due to imprecise measurement, they can conceptually take any value in an interval and cannot therefore be enumerated.
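The remark that a continuous variable may appear discrete because of imprecise measurement can be illustrated with a short Python sketch. The heights and the measurement range are hypothetical, and recording to the nearest centimeter is an assumed measurement precision.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Hypothetical "true" heights in centimeters: conceptually, any value in
# the interval from 150 to 190 is possible.
true_heights = [rng.uniform(150.0, 190.0) for _ in range(1000)]

# The same heights as they might be recorded, to the nearest centimeter.
recorded_heights = [round(h) for h in true_heights]

distinct_true = len(set(true_heights))
distinct_recorded = len(set(recorded_heights))

print("Distinct true values:    ", distinct_true)
print("Distinct recorded values:", distinct_recorded)
```

The recorded values fall on at most 41 whole numbers (150 through 190), so they look discrete even though the underlying variable is continuous.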
In the field of statistical quality control, the term variable data is used when referring to data obtained on a continuous variable and attribute data when referring to data obtained on a discrete variable (usually the number of defectives or nonconformities observed).
In the preceding examples, the names of the headache remedies and names of TV programs watched are qualitative (categorical) variables. The headache severity score is a discrete numeric variable, while the incomes of TV-watching families and the SAT and GPA scores are continuous quantitative variables.
We will use the data set in Example 1.2 to present greater detail on various concepts and definitions regarding observations and variables.
Example 1.2 Housing Prices
In the fall of 2001, John Mode was offered a new job in a midsized city in east Texas. Obviously, the availability and cost of housing will influence his decision to accept, so he and his wife Marsha go to the Internet, find www.realtor.com, and after a few clicks find some 500 single-family residences for sale in that area. In order to make the task of investigating the housing market more manageable, they