Handbook of Regression Analysis With Applications in R

WILEY SERIES IN PROBABILITY AND STATISTICS
Established by WALTER A. SHEWHART and SAMUEL S. WILKS

Editors
David J. Balding, Noel A. C. Cressie, Garrett M. Fitzmaurice, Harvey Goldstein, Geert Molenberghs, David W. Scott, Adrian F. M. Smith, and Ruey S. Tsay

Editors Emeriti
Vic Barnett, Ralph A. Bradley, J. Stuart Hunter, J. B. Kadane, David G. Kendall, and Jozef L. Teugels

A complete list of the titles in this series appears at the end of this volume.

Handbook of Regression Analysis With Applications in R
Second Edition

Samprit Chatterjee
New York University, New York, USA

Jeffrey S. Simonoff
New York University, New York, USA
This second edition first published 2020
© 2020 John Wiley & Sons, Inc.

Edition History
Wiley-Blackwell (1e, 2013)

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Samprit Chatterjee and Jeffrey S. Simonoff to be identified as the authors of this work has been asserted in accordance with law.

Registered Office
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

Editorial Office
111 River Street, Hoboken, NJ 07030, USA

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.
Limit of Liability/Disclaimer of Warranty
While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
Library of Congress Cataloging-in-Publication Data
Names: Chatterjee, Samprit, 1938- author. | Simonoff, Jeffrey S., author.
Title: Handbook of regression analysis with applications in R / Professor Samprit Chatterjee, New York University, Professor Jeffrey S. Simonoff, New York University.
Other titles: Handbook of regression analysis
Description: Second edition. | Hoboken, NJ : Wiley, 2020. | Series: Wiley series in probability and statistics | Revised edition of: Handbook of regression analysis. 2013. | Includes bibliographical references and index.
Identifiers: LCCN 2020006580 (print) | LCCN 2020006581 (ebook) | ISBN 9781119392378 (hardback) | ISBN 9781119392477 (adobe pdf) | ISBN 9781119392484 (epub)
Subjects: LCSH: Regression analysis--Handbooks, manuals, etc. | R (Computer program language)
Classification: LCC QA278.2 .C498 2020 (print) | LCC QA278.2 (ebook) | DDC 519.5/36--dc23
LC record available at https://lccn.loc.gov/2020006580
LC ebook record available at https://lccn.loc.gov/2020006581

Cover Design: Wiley
Cover Image: © Dmitriy Rybin/Shutterstock

Set in 10.82/12pt AGaramondPro by SPi Global, Chennai, India

Printed in the United States of America

10 9 8 7 6 5 4 3 2 1
Dedicated to everyone who labors in the field of statistics, whether they are students, teachers, researchers, or data analysts.
Preface to the Second Edition
Preface to the First Edition
Part I The Multiple Linear Regression Model

1 Multiple Linear Regression
1.1 Introduction
1.2 Concepts and Background Material
1.2.1 The Linear Regression Model
1.2.2 Estimation Using Least Squares
1.2.3 Assumptions
1.3 Methodology
1.3.1 Interpreting Regression Coefficients
1.3.2 Measuring the Strength of the Regression Relationship
1.3.3 Hypothesis Tests and Confidence Intervals for β
1.3.4 Fitted Values and Predictions
1.3.5 Checking Assumptions Using Residual Plots
1.4 Example—Estimating Home Prices
1.5 Summary

2 Model Building
2.1 Introduction
2.2 Concepts and Background Material
2.2.1 Using Hypothesis Tests to Compare Models
2.2.2 Collinearity
2.3 Methodology
2.3.1 Model Selection
2.3.2 Example—Estimating Home Prices (continued)
2.4 Indicator Variables and Modeling Interactions
2.4.1 Example—Electronic Voting and the 2004 Presidential Election
2.5 Summary
Part II Addressing Violations of Assumptions

3 Diagnostics for Unusual Observations
3.1 Introduction
3.2 Concepts and Background Material
3.3 Methodology
3.3.1 Residuals and Outliers
3.3.2 Leverage Points
3.3.3 Influential Points and Cook's Distance
3.4 Example—Estimating Home Prices (continued)
3.5 Summary

4 Transformations and Linearizable Models
4.1 Introduction
4.2 Concepts and Background Material: The Log-Log Model
4.3 Concepts and Background Material: Semilog Models
4.3.1 Logged Response Variable
4.3.2 Logged Predictor Variable
4.4 Example—Predicting Movie Grosses After One Week
4.5 Summary

5 Time Series Data and Autocorrelation
5.1 Introduction
5.2 Concepts and Background Material
5.3 Methodology: Identifying Autocorrelation
5.3.1 The Durbin-Watson Statistic
5.3.2 The Autocorrelation Function (ACF)
5.3.3 Residual Plots and the Runs Test
5.4 Methodology: Addressing Autocorrelation
5.4.1 Detrending and Deseasonalizing
5.4.2 Example—e-Commerce Retail Sales
5.4.3 Lagging and Differencing
5.4.4 Example—Stock Indexes
5.4.5 Generalized Least Squares (GLS): The Cochrane-Orcutt Procedure
5.4.6 Example—Time Intervals Between Old Faithful Geyser Eruptions
5.5 Summary
Part III Categorical Predictors

6 Analysis of Variance
6.1 Introduction
6.2 Concepts and Background Material
6.2.1 One-Way ANOVA
6.2.2 Two-Way ANOVA
6.3 Methodology
6.3.1 Codings for Categorical Predictors
6.3.2 Multiple Comparisons
6.3.3 Levene's Test and Weighted Least Squares
6.3.4 Membership in Multiple Groups
6.4 Example—DVD Sales of Movies
6.5 Higher-Way ANOVA
6.6 Summary

7 Analysis of Covariance
7.1 Introduction
7.2 Methodology
7.2.1 Constant Shift Models
7.2.2 Varying Slope Models
7.3 Example—International Grosses of Movies
7.4 Summary
Part IV Non-Gaussian Regression Models

8 Logistic Regression
8.1 Introduction
8.2 Concepts and Background Material
8.2.1 The Logit Response Function
8.2.2 Bernoulli and Binomial Random Variables
8.2.3 Prospective and Retrospective Designs
8.3 Methodology
8.3.1 Maximum Likelihood Estimation
8.3.2 Inference, Model Comparison, and Model Selection
8.3.3 Goodness-of-Fit
8.3.4 Measures of Association and Classification Accuracy
8.3.5 Diagnostics
8.4 Example—Smoking and Mortality
8.5 Example—Modeling Bankruptcy
8.6 Summary

9 Multinomial Regression
9.1 Introduction
9.2 Concepts and Background Material
9.2.1 Nominal Response Variable
9.2.2 Ordinal Response Variable
9.3 Methodology
9.3.1 Estimation
9.3.2 Inference, Model Comparisons, and Strength of Fit
9.3.3 Lack of Fit and Violations of Assumptions
9.4 Example—City Bond Ratings
9.5 Summary

10 Count Regression
10.1 Introduction
10.2 Concepts and Background Material
10.2.1 The Poisson Random Variable
10.2.2 Generalized Linear Models
10.3 Methodology
10.3.1 Estimation and Inference
10.3.2 Offsets
10.4 Overdispersion and Negative Binomial Regression
10.4.1 Quasi-likelihood
10.4.2 Negative Binomial Regression
10.5 Example—Unprovoked Shark Attacks in Florida
10.6 Other Count Regression Models
10.7 Poisson Regression and Weighted Least Squares
10.7.1 Example—International Grosses of Movies (continued)
10.8 Summary

11 Models for Time-to-Event (Survival) Data
11.1 Introduction
11.2 Concepts and Background Material
11.2.1 The Nature of Survival Data
11.2.2 Accelerated Failure Time Models
11.2.3 The Proportional Hazards Model
11.3 Methodology
11.3.1 The Kaplan-Meier Estimator and the Log-Rank Test
11.3.2 Parametric (Likelihood) Estimation
11.3.3 Semiparametric (Partial Likelihood) Estimation
11.3.4 The Buckley-James Estimator
11.4 Example—The Survival of Broadway Shows
11.5 Left-Truncated/Right-Censored Data and Time-Varying Covariates
11.5.1 Left-Truncated/Right-Censored Data
11.5.2 Example—The Survival of Broadway Shows (continued)
11.5.3 Time-Varying Covariates
11.5.4 Example—Female Heads of Government
11.6 Summary
Part V Other Regression Models

12 Nonlinear Regression
12.1 Introduction
12.2 Concepts and Background Material
12.3 Methodology
12.3.1 Nonlinear Least Squares Estimation
12.3.2 Inference for Nonlinear Regression Models
12.4 Example—Michaelis-Menten Enzyme Kinetics
12.5 Summary

13 Models for Longitudinal and Nested Data
13.1 Introduction
13.2 Concepts and Background Material
13.2.1 Nested Data and ANOVA
13.2.2 Longitudinal Data and Time Series
13.2.3 Fixed Effects Versus Random Effects
13.3 Methodology
13.3.1 The Linear Mixed Effects Model
13.3.2 The Generalized Linear Mixed Effects Model
13.3.3 Generalized Estimating Equations
13.3.4 Nonlinear Mixed Effects Models
13.4 Example—Tumor Growth in a Cancer Study
13.5 Example—Unprovoked Shark Attacks in the United States
13.6 Summary

14 Regularization Methods and Sparse Models
14.1 Introduction
14.2 Concepts and Background Material
14.2.1 The Bias–Variance Tradeoff
14.2.2 Large Numbers of Predictors and Sparsity
14.3 Methodology
14.3.1 Forward Stepwise Regression
14.3.2 Ridge Regression
14.3.3 The Lasso
14.3.4 Other Regularization Methods
14.3.5 Choosing the Regularization Parameter(s)
14.3.6 More Structured Regression Problems
14.3.7 Cautions About Regularization Methods
14.4 Example—Human Development Index
14.5 Summary
Part VI Nonparametric and Semiparametric Models

15 Smoothing and Additive Models
15.1 Introduction
15.2 Concepts and Background Material
15.2.1 The Bias–Variance Tradeoff
15.2.2 Smoothing and Local Regression
15.3 Methodology
15.3.1 Local Polynomial Regression
15.3.2 Choosing the Bandwidth
15.3.3 Smoothing Splines
15.3.4 Multiple Predictors, the Curse of Dimensionality, and Additive Models
15.4 Example—Prices of German Used Automobiles
15.5 Local and Penalized Likelihood Regression
15.5.1 Example—The Bechdel Rule and Hollywood Movies
15.6 Using Smoothing to Identify Interactions
15.6.1 Example—Estimating Home Prices (continued)
15.7 Summary

16 Tree-Based Models
16.1 Introduction
16.2 Concepts and Background Material
16.2.1 Recursive Partitioning
16.2.2 Types of Trees
16.3 Methodology
16.3.1 CART
16.3.2 Conditional Inference Trees
16.3.3 Ensemble Methods
16.4 Examples
16.4.1 Estimating Home Prices (continued)
16.4.2 Example—Courtesy in Airplane Travel
16.5 Trees for Other Types of Data
16.5.1 Trees for Nested and Longitudinal Data
16.5.2 Survival Trees
16.6 Summary

Bibliography
Index
Preface to the Second Edition

The years since the first edition of this book appeared have been fast-moving in the world of data analysis and statistics. Algorithmically-based methods operating under the banner of machine learning, artificial intelligence, or data science have come to the forefront of public perceptions about how to analyze data, and more than a few pundits have predicted the demise of classic statistical modeling.

To paraphrase Mark Twain, we believe that reports of the (impending) death of statistical modeling in general, and regression modeling in particular, are exaggerated. The great advantage that statistical models have over "black box" algorithms is that in addition to effective prediction, their transparency also provides guidance about the actual underlying process (which is crucial for decision making), and affords the possibilities of making inferences and distinguishing real effects from random variation based on those models. There have been laudable attempts to encourage making machine learning algorithms interpretable in the ways regression models are (Rudin, 2019), but we believe that models based on statistical considerations and principles will have a place in the analyst's toolkit for a long time to come.

Of course, part of that usefulness comes from the ability to generalize regression models to more complex situations, and that is the thrust of the changes in this new edition. One thing that hasn't changed is the philosophy behind the book, and our recommendations on how it can be best used, and we encourage the reader to refer to the preface to the first edition for guidance on those points. There have been small changes to the original chapters, and broad descriptions of those chapters can also be found in the preface to the first edition. The five new chapters (Chapters 11, 13, 14, 15, and 16, with the former Chapter 11 on nonlinear regression moving to Chapter 12) expand greatly on the power and applicability of regression models beyond what was discussed in the first edition. For this reason many more references are provided in these chapters than in the earlier ones, since some of the material in those chapters is less established and less well-known, with much of it still the subject of active research. In keeping with that, we do not spend much (or any) time on issues for which there still isn't necessarily a consensus in the statistical community, but point to books and monographs that can help the analyst get some perspective on that kind of material.
Chapter 11 discusses the modeling of time-to-event data, often referred to as survival data. The response variable measures the length of time until an event occurs, and a common complicator is that sometimes it is only known that a response value is greater than some number; that is, it is right-censored. This can naturally occur, for example, in a clinical trial in which subjects enter the study at varying times, and the event of interest has not occurred at the end of the trial. Analysis focuses on the survival function (the probability of surviving past a given time) and the hazard function (the instantaneous probability of the event occurring at a given time given survival to that time). Parametric models based on appropriate distributions like the Weibull or log-logistic can be fit that take censoring into account. Semiparametric models like the Cox proportional hazards model (the most commonly-used model) and the Buckley-James estimator are also available, which weaken distributional assumptions. Modeling can be adapted to situations where event times are truncated, and also when there are covariates that change over the life of the subject.
Chapter 13 extends applications to data with multiple observations for each subject consistent with some structure from the underlying process. Such data can take the form of nested or clustered data (such as students all in one classroom) or longitudinal data (where a variable is measured at multiple times for each subject). In this situation ignoring that structure results in an induced correlation that reflects unmodeled differences between classrooms and subjects, respectively. Mixed effects models generalize analysis of variance (ANOVA) models and time series models to this more complicated situation. Models with linear effects based on Gaussian distributions can be generalized to nonlinear models, and also can be generalized to non-Gaussian distributions through the use of generalized linear mixed effects models.
Modern data applications can involve very large (even massive) numbers of predictors, which can cause major problems for standard regression methods. Best subsets regression (discussed in Chapter 2) does not scale well to very large numbers of predictors, and Chapter 14 discusses approaches that can accomplish that. Forward stepwise regression, in which potential predictors are stepped in one at a time, is an alternative to best subsets that scales to massive data sets. A systematic approach to reducing the dimensionality of a chosen regression model is through the use of regularization, in which the usual estimation criterion is augmented with a penalty that encourages sparsity; the most commonly-used version of this is the lasso estimator, and it and its generalizations are discussed further.
Chapters 15 and 16 discuss methods that move away from specified relationships between the response and the predictor to nonparametric and semiparametric methods, in which the data are used to choose the form of the underlying relationship. In Chapter 15 linear or (specifically specified) nonlinear relationships are replaced with the notion of relationships taking the form of smooth curves and surfaces. Estimation at a particular location is based on local information; that is, the values of the response in a local neighborhood of that location. This can be done through local versions of weighted least squares (local polynomial estimation) or local regularization (smoothing splines). Such methods can also be used to help identify interactions between numerical predictors in linear regression modeling. Single predictor smoothing estimators can be generalized to multiple predictors through the use of additive functions of smooth curves. Chapter 16 focuses on an extremely flexible class of nonparametric regression estimators, tree-based methods. Trees are based on the notion of binary recursive partitioning. At each step a set of observations (a node) is either split into two parts (children nodes) on the basis of the values of a chosen variable, or is not split at all, based on encouraging homogeneity in the children nodes. This approach provides nonparametric alternatives to linear regression (regression trees), logistic and multinomial regression (classification trees), accelerated failure time and proportional hazards regression (survival trees) and mixed effects regression (longitudinal trees).
A final small change from the first edition to the second edition is in the title, as it now includes the phrase With Applications in R. This is not really a change, of course, as all of the analyses in the first edition were performed using the statistics package R. Code for the output and figures in the book can (still) be found at its associated website at http://people.stern.nyu.edu/jsimonof/RegressionHandbook/. As was the case in the first edition, even though analyses are performed in R, we still refer to general issues relevant to a data analyst in the use of statistical software even if those issues don't specifically apply to R.
We would like to once again thank our students and colleagues for their encouragement and support, and in particular students for the tough questions that have definitely affected our views on statistical modeling and by extension this book. We would like to thank Jon Gurstelle, and later Kathleen Santoloci and Mindy Okura-Marszycki, for approaching us with encouragement to undertake a second edition. We would like to thank Sarah Keegan for her patient support in bringing the book to fruition in her role as Project Editor. We would like to thank Roni Chambers for computing assistance, and Glenn Heller and Marc Scott for looking at earlier drafts of chapters. Finally, we would like to thank our families for their continuing love and support.
Samprit Chatterjee
Brooksville, Maine

Jeffrey S. Simonoff
New York, New York

October, 2019
Preface to the First Edition

How to Use This Book

This book is designed to be a practical guide to regression modeling. There is little theory here, and methodology appears in the service of the ultimate goal of analyzing real data using appropriate regression tools. As such, the target audience of the book includes anyone who is faced with regression data [that is, data where there is a response variable that is being modeled as a function of other variable(s)], and whose goal is to learn as much as possible from that data.

The book can be used as a text for an applied regression course (indeed, much of it is based on handouts that have been given to students in such a course), but that is not its primary purpose; rather, it is aimed much more broadly as a source of practical advice on how to address the problems that come up when dealing with regression data. While a text is usually organized in a way that makes the chapters interdependent, successively building on each other, that is not the case here. Indeed, we encourage readers to dip into different chapters for practical advice on specific topics as needed. The pace of the book is faster than might typically be the case for a text. The coverage, while at an applied level, does not shy away from sophisticated concepts. It is distinct from, for example, Chatterjee and Hadi (2012), while also having less theoretical focus than texts such as Greene (2011), Montgomery et al. (2012), or Sen and Srivastava (1990).

This, however, is not a cookbook that presents a mechanical approach to doing regression analysis. Data analysis is perhaps an art, and certainly a craft; we believe that the goal of any data analysis book should be to help analysts develop the skills and experience necessary to adjust to the inevitable twists and turns that come up when analyzing real data.

We assume that the reader possesses a nodding acquaintance with regression analysis. The reader should be familiar with the basic terminology and should have been exposed to basic regression techniques and concepts, at least at the level of simple (one-predictor) linear regression. We also assume that the user has access to a computer with an adequate regression package. The material presented here is not tied to any particular software. Almost all of the analyses described here can be performed by most standard packages, although the ease of doing this could vary. All of the analyses presented here were done using the free package R (R Development Core Team, 2017), which is available for many different operating system platforms (see http://www.R-project.org/ for more information). Code for the output and figures in the book can be found at its associated website at http://people.stern.nyu.edu/jsimonof/RegressionHandbook/.

Each chapter of the book is laid out in a similar way, with most having at least four sections of specific types. First is an introduction, where the general issues that will be discussed in that chapter are presented. A section on concepts and background material follows, where a discussion of the relationship of the chapter's material to the broader study of regression data is the focus. This section also provides any theoretical background for the material that is necessary. Sections on methodology follow, where the specific tools used in the chapter are discussed. This is where relevant algorithmic details are likely to appear. Finally, each chapter includes at least one analysis of real data using the methods discussed in the chapter (as well as appropriate material from earlier chapters), including both methodological and graphical analyses.
The book begins with discussion of the multiple regression model. Many regression textbooks start with discussion of simple regression before moving on to multiple regression. This is quite reasonable from a pedagogical point of view, since simple regression has the great advantage of being easy to understand graphically, but from a practical point of view simple regression is rarely the primary tool in analysis of real data. For that reason, we start with multiple regression, and note the simplifications that come from the special case of a single predictor. Chapter 1 describes the basics of the multiple regression model, including the assumptions being made, and both estimation and inference tools, while also giving an introduction to the use of residual plots to check assumptions.
Since it is unlikely that the first model examined will ultimately be the final preferred model, Chapter 2 focuses on the very important areas of model building and model selection. This includes addressing the issue of collinearity, as well as the use of both hypothesis tests and information measures to help choose among candidate models.
Chapters 3 through 5 study common violations of regression assumptions, and methods available to address those model violations. Chapter 3 focuses on unusual observations (outliers and leverage points), while Chapter 4 describes how transformations (especially the log transformation) can often address both nonlinearity and nonconstant variance violations. Chapter 5 is an introduction to time series regression, and the problems caused by autocorrelation. Time series analysis is a vast area of statistical methodology, so our goal in this chapter is only to provide a good practical introduction to that area in the context of regression analysis.
Chapters 6 and 7 focus on the situation where there are categorical variables among the predictors. Chapter 6 treats analysis of variance (ANOVA) models, which include only categorical predictors, while Chapter 7 looks at analysis of covariance (ANCOVA) models, which include both numerical and categorical predictors. The examination of interaction effects is a fundamental aspect of these models, as are questions related to simultaneous comparison of many groups to each other. Data of this type often exhibit nonconstant variance related to the different subgroups in the population, and the appropriate tool to address this issue, weighted least squares, is also a focus here.
Chapters 8 through 10 examine the situation where the nature of the response variable is such that Gaussian-based least squares regression is no longer appropriate. Chapter 8 focuses on logistic regression, designed for binary response data and based on the binomial random variable. While there are many parallels between logistic regression analysis and least squares regression analysis, there are also issues that come up in logistic regression that require special care. Chapter 9 uses the multinomial random variable to generalize the models of Chapter 8 to allow for multiple categories in the response variable, outlining models designed for response variables that either do or do not have ordered categories. Chapter 10 focuses on response data in the form of counts, where distributions like the Poisson and negative binomial play a central role. The connection between all these models through the generalized linear model framework is also exploited in this chapter.
The final chapter focuses on situations where linearity does not hold, and a nonlinear relationship is necessary. Although these models are based on least squares, from both an algorithmic and inferential point of view there are strong connections with the models of Chapters 8 through 10, which we highlight.
This Handbook can be used in several different ways. First, a reader may use the book to find information on a specific topic. An analyst might want additional information on, for example, logistic regression or autocorrelation. The chapters on these (and other) topics provide the reader with this subject matter information. As noted above, the chapters also include at least one analysis of a data set, a clarification of computer output, and reference to sources where additional material can be found. The chapters in the book are to a large extent self-contained and can be consulted independently of other chapters.
The book can also be used as a template for what we view as a reasonable approach to data analysis in general. This is based on the cyclical paradigm of model formulation, model fitting, model evaluation, and model updating leading back to model (re)formulation. Statistical significance of test statistics does not necessarily mean that an adequate model has been obtained. Further analysis needs to be performed before the fitted model can be regarded as an acceptable description of the data, and this book concentrates on this important aspect of regression methodology. Detection of deficiencies of fit is based on both testing and graphical methods, and both approaches are highlighted here.
This preface is intended to indicate ways in which the Handbook can be used. Our hope is that it will be a useful guide for data analysts, and will help contribute to effective analyses. We would like to thank our students and colleagues for their encouragement and support. We hope we have provided them with a book of which they would approve. We would like to thank Steve Quigley, Jackie Palmieri, and Amy Hendrickson for their help in bringing this manuscript to print. We would also like to thank our families for their love and support.
Samprit Chatterjee
Brooksville, Maine

Jeffrey S. Simonoff
New York, New York

August, 2012
Part One
The Multiple Linear Regression Model

Chapter One
Multiple Linear Regression

1.1 Introduction
1.2 Concepts and Background Material
1.2.1 The Linear Regression Model
1.2.2 Estimation Using Least Squares
1.2.3 Assumptions
1.3 Methodology
1.3.1 Interpreting Regression Coefficients
1.3.2 Measuring the Strength of the Regression Relationship
1.3.3 Hypothesis Tests and Confidence Intervals for β
1.3.4 Fitted Values and Predictions
1.3.5 Checking Assumptions Using Residual Plots
1.4 Example—Estimating Home Prices
1.5 Summary
1.1 Introduction

This is a book about regression modeling, but when we refer to regression models, what do we mean? The regression framework can be characterized in the following way:

1. We have one particular variable that we are interested in understanding or modeling, such as sales of a particular product, sale price of a home, or
voting preference of a particular voter. This variable is called the target, response, or dependent variable, and is usually represented by y.

2. We have a set of p other variables that we think might be useful in predicting or modeling the target variable (the price of the product, the competitor's price, and so on; or the lot size, number of bedrooms, number of bathrooms of the home, and so on; or the gender, age, income, party membership of the voter, and so on). These are called the predicting, or independent variables, and are usually represented by x1, x2, etc.

Typically, a regression analysis is used for one (or more) of three purposes:

1. modeling the relationship between x and y;
2. prediction of the target variable (forecasting);
3. and testing of hypotheses.

In this chapter, we introduce the basic multiple linear regression model, and discuss how this model can be used for these three purposes. Specifically, we discuss the interpretations of the estimates of different regression parameters, the assumptions underlying the model, measures of the strength of the relationship between the target and predictor variables, the construction of tests of hypotheses and intervals related to regression parameters, and the checking of assumptions using diagnostic plots.
1.2 Concepts and Background Material

1.2.1 The Linear Regression Model

The data consist of n observations, which are sets of observed values {x1i, x2i, ..., xpi, yi} that represent a random sample from a larger population. It is assumed that these observations satisfy a linear relationship,

yi = β0 + β1x1i + · · · + βpxpi + εi,    (1.1)

where the β coefficients are unknown parameters, and the εi are random error terms. By a linear model, it is meant that the model is linear in the parameters; a quadratic model,

yi = β0 + β1xi + β2xi² + εi,

paradoxically enough, is a linear model, since x and x² are just versions of x1 and x2.
Itisimportanttorecognizethatthis,oranystatisticalmodel,isnot viewedasa true representationofreality;rather,thegoalisthatthemodel bea useful representationofreality.Amodelcanbeusedtoexplorethe relationshipsbetweenvariablesandmakeaccurateforecastsbasedonthose relationshipsevenifitisnotthe“truth.”Further,anystatisticalmodelis onlytemporary,representingaprovisionalversionofviewsabouttherandom processbeingstudied.Modelscan,andshould,change,basedonanalysisusing
FIGURE1.1:Thesimplelinearregressionmodel.Thesolidlinecorresponds tothetrueregressionline,andthedottedlinescorrespondtotherandom errors εi . thecurrentmodel,selectionamongseveralcandidatemodels,theacquisition ofnewdata,newunderstandingoftheunderlyingrandomprocess,andso on.Further,itisoftenthecasethatthereareseveraldifferentmodelsthat arereasonablerepresentationsofreality.Havingsaidthis,wewillsometimes refertothe“true”model,butthisshouldbeunderstoodasreferringtothe underlyingformofthecurrentlyhypothesizedrepresentationoftheregression relationship.
The special case of (1.1) with p = 1 corresponds to the simple regression model, and is consistent with the representation in Figure 1.1. The solid line is the true regression line, the expected value of y given the value of x. The dotted lines are the random errors ε_i that account for the lack of a perfect association between the predictor and the target variables.
1.2.2 ESTIMATION USING LEAST SQUARES

The true regression function represents the expected relationship between the target and the predictor variables, which is unknown. A primary goal of a regression analysis is to estimate this relationship, or equivalently, to estimate the unknown parameters β. This requires a data-based rule, or criterion, that will give a reasonable estimate. The standard approach is least squares regression, where the estimates are chosen to minimize

$$ \sum_{i=1}^{n} \left[ y_i - \left( \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi} \right) \right]^2. \quad (1.2) $$
FIGURE 1.2: Least squares estimation for the simple linear regression model, using the same data as in Figure 1.1. The gray line corresponds to the true regression line, the solid black line corresponds to the fitted least squares line (designed to estimate the gray line), and the lengths of the dotted lines correspond to the residuals. The sum of squared values of the lengths of the dotted lines is minimized by the solid black line.

Figure 1.2 gives a graphical representation of least squares that is based on Figure 1.1. Now the true regression line is represented by the gray line, and the solid black line is the estimated regression line, designed to estimate the (unknown) gray line as closely as possible. For any choice of estimated parameters β̂, the estimated expected response value given the observed predictor values equals

$$ \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{1i} + \cdots + \hat{\beta}_p x_{pi}, $$
and is called the fitted value. The difference between the observed value y_i and the fitted value ŷ_i is called the residual, the set of which is represented by the signed lengths of the dotted lines in Figure 1.2. The least squares regression line minimizes the sum of squares of the lengths of the dotted lines; that is, the ordinary least squares (OLS) estimates minimize the sum of squares of the residuals.
In higher dimensions (p > 1), the true and estimated regression relationships correspond to planes (p = 2) or hyperplanes (p ≥ 3), but otherwise the principles are the same. Figure 1.3 illustrates the case with two predictors. The length of each vertical line corresponds to a residual (solid lines refer to positive residuals, while dashed lines refer to negative residuals), and the (least squares) plane that goes through the observations is chosen to minimize the sum of squares of the residuals.

FIGURE 1.3: Least squares estimation for the multiple linear regression model with two predictors. The plane corresponds to the fitted least squares relationship, and the lengths of the vertical lines correspond to the residuals. The sum of squared values of the lengths of the vertical lines is minimized by the plane.
The linear regression model can be written compactly using matrix notation. Define the matrix X and the vectors y, β, and ε as

$$ X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{p1} \\ 1 & x_{12} & \cdots & x_{p2} \\ \vdots & \vdots & & \vdots \\ 1 & x_{1n} & \cdots & x_{pn} \end{pmatrix}, \qquad y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \qquad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}, \qquad \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}. $$

The regression model (1.1) is then

$$ y = X\beta + \varepsilon. $$

The normal equations [which determine the minimizer of (1.2)] can be shown (using multivariate calculus) to be

$$ X'X\hat{\beta} = X'y, $$

which implies that the least squares estimates satisfy

$$ \hat{\beta} = (X'X)^{-1}X'y. $$

The fitted values are then

$$ \hat{y} = X\hat{\beta} = X(X'X)^{-1}X'y \equiv Hy, $$

where H = X(X'X)^{-1}X' is the so-called "hat" matrix (since it takes y to ŷ). The residuals e = y − ŷ thus satisfy

$$ e = y - \hat{y} = y - X(X'X)^{-1}X'y, $$

or

$$ e = (I - H)y. $$
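The matrix expressions above can be checked numerically. The following is a minimal sketch in Python with NumPy (the book itself works in R); the data are synthetic and the coefficients and seed are invented purely for illustration:

```python
import numpy as np

# Solve the normal equations X'X beta_hat = X'y for a small synthetic data
# set, and form the hat matrix H = X (X'X)^{-1} X' that maps y to y_hat.
rng = np.random.default_rng(0)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # column of 1s for beta_0
beta_true = np.array([1.0, 2.0, -0.5])                       # hypothetical truth
y = X @ beta_true + rng.normal(scale=0.3, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y       # least squares estimates
H = X @ XtX_inv @ X.T              # "hat" matrix
y_hat = H @ y                      # fitted values, identical to X @ beta_hat
e = y - y_hat                      # residuals, e = (I - H) y
```

One can confirm directly that Hy equals Xβ̂, that e = (I − H)y, and that H is idempotent (HH = H), all of which follow from the algebra above.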
1.2.3 ASSUMPTIONS

The least squares criterion will not necessarily yield sensible results unless certain assumptions hold. One is given in (1.1): the linear model should be appropriate. In addition, the following assumptions are needed to justify using least squares regression.
1. The expected value of the errors is zero (E(ε_i) = 0 for all i). That is, it cannot be true that for certain observations the model is systematically too low, while for others it is systematically too high. A violation of this assumption will lead to difficulties in estimating β_0. More importantly, this reflects that the model does not include a necessary systematic component, which has instead been absorbed into the error terms.
2. The variance of the errors is constant (V(ε_i) = σ² for all i). That is, it cannot be true that the strength of the model is greater for some parts of the population (smaller σ) and less for other parts (larger σ). This assumption of constant variance is called homoscedasticity, and its violation (nonconstant variance) is called heteroscedasticity. A violation of this assumption means that the least squares estimates are not as efficient as they could be in estimating the true parameters, and better estimates are available. More importantly, it also results in poorly calibrated confidence and (especially) prediction intervals.
3. The errors are uncorrelated with each other. That is, it cannot be true that knowing that the model underpredicts y (for example) for one particular observation says anything at all about what it does for any other observation. This violation most often occurs in data that are ordered in time (time series data), where errors that are near each other in time are often similar to each other (such time-related correlation is called autocorrelation). Violation of this assumption means that the least squares estimates are not as efficient as they could be in estimating the true parameters, and more importantly, its presence can lead to very misleading assessments of the strength of the regression.
4. The errors are normally distributed. This is needed if we want to construct any confidence or prediction intervals, or hypothesis tests, which we usually do. If this assumption is violated, hypothesis tests and confidence and prediction intervals can be very misleading.
Since violation of these assumptions can potentially lead to completely misleading results, a fundamental part of any regression analysis is to check them using various plots, tests, and diagnostics.
1.3 Methodology

1.3.1 INTERPRETING REGRESSION COEFFICIENTS

The least squares regression coefficients have very specific meanings. They are often misinterpreted, so it is important to be clear on what they mean (and do not mean). Consider first the intercept, β̂_0:

β̂_0: The estimated expected value of the target variable when the predictors are all equal to zero.
Note that this might not have any physical interpretation, since a zero value for the predictor(s) might be impossible, or might never come close to occurring in the observed data. In that situation, it is pointless to try to interpret this value. If all of the predictors are centered to have zero mean, then β̂_0 necessarily equals Ȳ, the sample mean of the target values. Note that if there is any particular value for each predictor that is meaningful in some sense, if each variable is centered around its particular value, then the intercept is an estimate of E(y) when the predictors all have those meaningful values.
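The claim about centering can be verified directly. Below is an illustrative Python/NumPy sketch (the book works in R; all numbers here are made up): after centering each predictor to mean zero, the fitted intercept equals the sample mean of y.

```python
import numpy as np

# With centered predictors, the column of ones is orthogonal to every
# predictor column, so the normal equations give beta_hat_0 = mean(y).
rng = np.random.default_rng(1)
n = 40
x1 = rng.normal(5.0, 2.0, n)
x2 = rng.normal(-3.0, 1.0, n)
y = 2.0 + 0.8 * x1 - 1.5 * x2 + rng.normal(scale=0.5, size=n)

Xc = np.column_stack([np.ones(n), x1 - x1.mean(), x2 - x2.mean()])
beta_hat, *_ = np.linalg.lstsq(Xc, y, rcond=None)
# beta_hat[0] equals y.mean() (up to floating point)
```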
The estimated coefficient for the jth predictor (j = 1, ..., p) is interpreted in the following way:

β̂_j: The estimated expected change in the target variable associated with a one unit change in the jth predicting variable, holding all else in the model fixed.
There are several noteworthy aspects to this interpretation. First, note the word associated: we cannot say that a change in the target variable is caused by a change in the predictor, only that they are associated with each other. That is, correlation does not imply causation.
Another key point is the phrase "holding all else in the model fixed," the implications of which are often ignored. Consider the following hypothetical example. A random sample of college students at a particular university is taken in order to understand the relationship between college grade point average (GPA) and other variables. A model is built with college GPA as a function of high school GPA and the standardized Scholastic Aptitude Test (SAT), with resultant least squares fit

$$ \widehat{\text{College GPA}} = 1.3 + 0.7 \times \text{High School GPA} - 0.0001 \times \text{SAT}. $$
It is tempting to say (and many people would say) that the coefficient for SAT score has the "wrong sign," because it says that higher values of SAT are associated with lower values of college GPA. This is not correct. The problem is that it is likely in this context that what an analyst would find intuitive is the marginal relationship between college GPA and SAT score alone (ignoring all else), one that we would indeed expect to be a direct (positive) one. The regression coefficient does not say anything about that marginal relationship. Rather, it refers to the conditional (sometimes called partial) relationship that takes the high school GPA as fixed, which is apparently that higher values of SAT are associated with lower values of college GPA, holding high school GPA fixed. High school GPA and SAT are no doubt related to each other, and it is quite likely that this relationship between the predictors would complicate any understanding of, or intuition about, the conditional relationship between college GPA and SAT score. Multiple regression coefficients should not be interpreted marginally; if you really are interested in the relationship between the target and a single predictor alone, you should simply do a regression of the target on only that variable. This does not mean that multiple regression coefficients are uninterpretable, only that care is necessary when interpreting them.
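The distinction between the marginal and conditional relationships can be reproduced with simulated data. The Python/NumPy sketch below (the coefficients only loosely echo the GPA example and are hypothetical, not the book's data) constructs two strongly correlated predictors so that the conditional coefficient of one predictor has the opposite sign from its marginal correlation with the target:

```python
import numpy as np

# When two predictors are strongly correlated, the multiple regression
# (conditional) coefficient of one can be negative even though its
# marginal correlation with the target is positive.
rng = np.random.default_rng(2)
n = 500
hs_gpa = rng.normal(3.0, 0.4, n)
sat = 200 * hs_gpa + rng.normal(scale=30, size=n)  # SAT tracks high school GPA
col_gpa = 1.3 + 0.7 * hs_gpa - 0.001 * sat + rng.normal(scale=0.1, size=n)

# Marginal association of college GPA with SAT alone: positive.
marginal_sign = np.sign(np.corrcoef(sat, col_gpa)[0, 1])

# Conditional coefficient of SAT, holding high school GPA fixed: negative.
X = np.column_stack([np.ones(n), hs_gpa, sat])
beta_hat, *_ = np.linalg.lstsq(X, col_gpa, rcond=None)
conditional_sign = np.sign(beta_hat[2])
```

Here the "wrong sign" on SAT is exactly right for the conditional question being asked, while a simple regression of college GPA on SAT alone would show the intuitive positive slope.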
Another common use of multiple regression that depends on this conditional interpretation of the coefficients is to explicitly include "control" variables in a model in order to try to account for their effects statistically. This is particularly important in observational data (data that are not the result of a designed experiment), since in that case, the effects of other variables cannot be ignored as a result of random assignment in the experiment. For observational data it is not possible to physically intervene in the experiment to "hold other variables fixed," but the multiple regression framework effectively allows this to be done statistically.
Having said this, we must recognize that in many situations, it is impossible from a practical point of view to change one predictor while holding all else fixed. Thus, while we would like to interpret a coefficient as accounting for the presence of other predictors in a physical sense, it is important (when dealing with observational data in particular) to remember that linear regression is at best only an approximation to the actual underlying random process.
1.3.2 MEASURING THE STRENGTH OF THE REGRESSION RELATIONSHIP

The least squares estimates possess an important property:

$$ \sum_{i=1}^{n} (y_i - \bar{Y})^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \sum_{i=1}^{n} (\hat{y}_i - \bar{Y})^2. $$

This formula says that the variability in the target variable (the left side of the equation, termed the corrected total sum of squares) can be split into two mutually exclusive parts: the variability left over after doing the regression (the first term on the right side, the residual sum of squares), and the variability accounted for by doing the regression (the second term, the regression sum of squares). This immediately suggests the usefulness of R² as a measure of the strength of the regression relationship, where

$$ R^2 = \frac{\sum_i (\hat{y}_i - \bar{Y})^2}{\sum_i (y_i - \bar{Y})^2} \equiv \frac{\text{Regression SS}}{\text{Corrected total SS}} = 1 - \frac{\text{Residual SS}}{\text{Corrected total SS}}. $$
The R² value (also called the coefficient of determination) estimates the population proportion of variability in y accounted for by the best linear combination of the predictors. Values closer to 1 indicate a good deal of predictive power of the predictors for the target variable, while values closer to 0 indicate little predictive power. An equivalent representation of R² is

$$ R^2 = \operatorname{corr}(y_i, \hat{y}_i)^2, $$

where

$$ \operatorname{corr}(y_i, \hat{y}_i) = \frac{\sum_i (y_i - \bar{Y})(\hat{y}_i - \bar{\hat{Y}})}{\sqrt{\sum_i (y_i - \bar{Y})^2 \sum_i (\hat{y}_i - \bar{\hat{Y}})^2}} $$

is the sample correlation coefficient between y and ŷ (this correlation is called the multiple correlation coefficient). That is, R² is a direct measure of how similar the observed and fitted target values are.
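The equivalence of the two representations of R² (the sum-of-squares ratio and the squared correlation between observed and fitted values) can be illustrated numerically. A Python/NumPy sketch on synthetic data (the book works in R; all numbers are invented):

```python
import numpy as np

# Fit a small multiple regression, then compute R^2 two ways:
# as Regression SS / Corrected total SS, and as corr(y, y_hat)^2.
rng = np.random.default_rng(3)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([0.5, 1.0, -2.0]) + rng.normal(scale=0.7, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat
ybar = y.mean()

r2_ss = np.sum((y_hat - ybar) ** 2) / np.sum((y - ybar) ** 2)
r2_corr = np.corrcoef(y, y_hat)[0, 1] ** 2
# r2_ss and r2_corr agree (up to floating point)
```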
It can be shown that R² is biased upwards as an estimate of the population proportion of variability accounted for by the regression. The adjusted R² corrects this bias, and equals

$$ R_a^2 = R^2 - \frac{p}{n - p - 1}(1 - R^2). \quad (1.7) $$

It is apparent from (1.7) that unless p is large relative to n − p − 1 (that is, unless the number of predictors is large relative to the sample size), R² and R²_a will be close to each other, and the choice of which to use is a minor concern. What is perhaps more interesting is the nature of R²_a as providing an explicit tradeoff between the strength of the fit (the first term, with larger R² corresponding to stronger fit and larger R²_a) and the complexity of the model (the second term, with larger p corresponding to more complexity and smaller R²_a). This tradeoff of fidelity to the data versus simplicity will be important in the discussion of model selection in Section 2.3.1.
The only parameter left unaccounted for in the estimation scheme is the variance of the errors σ². An unbiased estimate is provided by the residual mean square,

$$ \hat{\sigma}^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - p - 1}. \quad (1.8) $$
This estimate has a direct, but often underappreciated, use in assessing the practical importance of the model. Does knowing x₁, ..., x_p really say anything of value about y? This isn't a question that can be answered completely statistically; it requires knowledge and understanding of the data and the underlying random process (that is, it requires context). Recall that the model assumes that the errors are normally distributed with standard deviation σ. This means that, roughly speaking, 95% of the time an observed y value falls within ±2σ of the expected response

$$ E(y) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p. $$

E(y) can be estimated for any given set of x values using

$$ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_p x_p, $$

while the square root of the residual mean square (1.8), termed the standard error of the estimate, provides an estimate of σ that can be used in constructing this rough prediction interval ŷ ± 2σ̂.
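As a sketch of this rough interval, the following Python/NumPy code (synthetic data, invented coefficients; the book itself works in R) estimates σ by the standard error of the estimate and forms ŷ ± 2σ̂ at a hypothetical new predictor value:

```python
import numpy as np

# Estimate sigma by the square root of the residual mean square, eq. (1.8),
# then form the rough 95% interval y_hat +/- 2 * sigma_hat.
rng = np.random.default_rng(4)
n, p = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 0.5, -1.0]) + rng.normal(scale=2.0, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
sigma_hat = np.sqrt(np.sum(resid ** 2) / (n - p - 1))  # standard error of the estimate

# Rough interval for a new observation at the (hypothetical) point x = (1, 0, 0):
x_new = np.array([1.0, 0.0, 0.0])
center = x_new @ beta_hat
lower, upper = center - 2 * sigma_hat, center + 2 * sigma_hat
```

With normal errors, roughly 95% of the observed y values should fall within ±2σ̂ of their fitted values, which can be checked on the residuals themselves.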
1.3.3 HYPOTHESIS TESTS AND CONFIDENCE INTERVALS FOR β

There are two types of hypothesis tests of immediate interest related to the regression coefficients.
1. Do any of the predictors provide predictive power for the target variable? This is a test of the overall significance of the regression,

$$ H_0: \beta_1 = \cdots = \beta_p = 0 \quad \text{versus} \quad H_a: \text{at least one } \beta_j \neq 0, \; j = 1, \ldots, p. $$

The test of these hypotheses is the F-test,

$$ F = \frac{\text{Regression MS}}{\text{Residual MS}} \equiv \frac{\text{Regression SS}/p}{\text{Residual SS}/(n - p - 1)}. $$

This is referenced against a null F-distribution on (p, n − p − 1) degrees of freedom.
2. Given the other variables in the model, does a particular predictor provide additional predictive power? This corresponds to a test of the significance of an individual coefficient,

$$ H_0: \beta_j = 0 \quad \text{versus} \quad H_a: \beta_j \neq 0. $$

This is tested using a t-test,

$$ t = \frac{\hat{\beta}_j}{\widehat{\text{s.e.}}(\hat{\beta}_j)}, $$

which is compared to a t-distribution on n − p − 1 degrees of freedom. Other values of β_j can be specified in the null hypothesis (say β_{j0}), with the t-statistic becoming

$$ t = \frac{\hat{\beta}_j - \beta_{j0}}{\widehat{\text{s.e.}}(\hat{\beta}_j)}. \quad (1.9) $$
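The overall F-statistic can be computed directly from the sums of squares. The Python/NumPy sketch below (synthetic data, invented coefficients; the book's own examples use R) also checks the standard algebraic identity F = (R²/p) / ((1 − R²)/(n − p − 1)), which follows from dividing both sums of squares by the corrected total SS:

```python
import numpy as np

# Compute the overall F-statistic as Regression MS / Residual MS,
# and verify its equivalent expression in terms of R^2.
rng = np.random.default_rng(5)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([0.0, 1.0, 0.5, -0.5]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat
ybar = y.mean()

reg_ss = np.sum((y_hat - ybar) ** 2)   # regression sum of squares
res_ss = np.sum((y - y_hat) ** 2)      # residual sum of squares
F = (reg_ss / p) / (res_ss / (n - p - 1))

r2 = reg_ss / (reg_ss + res_ss)        # total SS = reg SS + res SS (with intercept)
F_from_r2 = (r2 / p) / ((1 - r2) / (n - p - 1))
```

In practice the observed F would be referenced against the F-distribution on (p, n − p − 1) degrees of freedom to obtain a p-value.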