
Handbook of Regression Analysis With Applications in R

WILEY SERIES IN PROBABILITY AND STATISTICS

Established by WALTER A. SHEWHART and SAMUEL S. WILKS

Editors

David J. Balding, Noel A. C. Cressie, Garrett M. Fitzmaurice, Harvey Goldstein, Geert Molenberghs, David W. Scott, Adrian F. M. Smith, and Ruey S. Tsay

Editors Emeriti

Vic Barnett, Ralph A. Bradley, J. Stuart Hunter, J. B. Kadane, David G. Kendall, and Jozef L. Teugels

A complete list of the titles in this series appears at the end of this volume.

Handbook of Regression Analysis With Applications in R

Second Edition

Samprit Chatterjee

New York University, New York, USA

Jeffrey S. Simonoff

New York University, New York, USA

This second edition first published 2020

© 2020 John Wiley & Sons, Inc.

Edition History

Wiley-Blackwell (1e, 2013)

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions

The right of Samprit Chatterjee and Jeffrey S. Simonoff to be identified as the authors of this work has been asserted in accordance with law.

Registered Office

John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

Editorial Office

111 River Street, Hoboken, NJ 07030, USA

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.

Limit of Liability/Disclaimer of Warranty

While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Cataloging-in-Publication Data

Names: Chatterjee, Samprit, 1938- author. | Simonoff, Jeffrey S., author.

Title: Handbook of regression analysis with applications in R / Professor Samprit Chatterjee, New York University, Professor Jeffrey S. Simonoff, New York University.

Other titles: Handbook of regression analysis

Description: Second edition. | Hoboken, NJ : Wiley, 2020. | Series: Wiley series in probability and statistics | Revised edition of: Handbook of regression analysis. 2013. | Includes bibliographical references and index.

Identifiers: LCCN 2020006580 (print) | LCCN 2020006581 (ebook) | ISBN 9781119392378 (hardback) | ISBN 9781119392477 (adobe pdf) | ISBN 9781119392484 (epub)

Subjects: LCSH: Regression analysis--Handbooks, manuals, etc. | R (Computer program language)

Classification: LCC QA278.2 .C49 2020 (print) | LCC QA278.2 (ebook) | DDC 519.5/36--dc23

LC record available at https://lccn.loc.gov/2020006580

LC ebook record available at https://lccn.loc.gov/2020006581

Cover Design: Wiley

Cover Image: © Dmitriy Rybin/Shutterstock

Set in 10.82/12pt AGaramondPro by SPi Global, Chennai, India

Printed in the United States of America

10 9 8 7 6 5 4 3 2 1

Dedicated to everyone who labors in the field of statistics, whether they are students, teachers, researchers, or data analysts.

Preface to the Second Edition

Preface to the First Edition

Part I The Multiple Linear Regression Model

1 Multiple Linear Regression
1.1 Introduction
1.2 Concepts and Background Material
1.2.1 The Linear Regression Model
1.2.2 Estimation Using Least Squares
1.2.3 Assumptions
1.3 Methodology
1.3.1 Interpreting Regression Coefficients
1.3.2 Measuring the Strength of the Regression Relationship
1.3.3 Hypothesis Tests and Confidence Intervals for β
1.3.4 Fitted Values and Predictions
1.3.5 Checking Assumptions Using Residual Plots
1.4 Example—Estimating Home Prices
1.5 Summary

2 Model Building
2.1 Introduction
2.2 Concepts and Background Material
2.2.1 Using Hypothesis Tests to Compare Models
2.2.2 Collinearity
2.3 Methodology
2.3.1 Model Selection
2.3.2 Example—Estimating Home Prices (continued)
2.4 Indicator Variables and Modeling Interactions
2.4.1 Example—Electronic Voting and the 2004 Presidential Election
2.5 Summary

3 Diagnostics for Unusual Observations
3.1 Introduction
3.2 Concepts and Background Material
3.3 Methodology
3.3.1 Residuals and Outliers
3.3.2 Leverage Points
3.3.3 Influential Points and Cook's Distance
3.4 Example—Estimating Home Prices (continued)
3.5 Summary

4 Transformations and Linearizable Models
4.1 Introduction
4.2 Concepts and Background Material: The Log-Log Model
4.3 Concepts and Background Material: Semilog Models
4.3.1 Logged Response Variable
4.3.2 Logged Predictor Variable
4.4 Example—Predicting Movie Grosses After One Week
4.5 Summary

5 Time Series Data and Autocorrelation
5.1 Introduction
5.2 Concepts and Background Material
5.3 Methodology: Identifying Autocorrelation
5.3.1 The Durbin-Watson Statistic
5.3.2 The Autocorrelation Function (ACF)
5.3.3 Residual Plots and the Runs Test
5.4 Methodology: Addressing Autocorrelation
5.4.1 Detrending and Deseasonalizing
5.4.2 Example—e-Commerce Retail Sales
5.4.3 Lagging and Differencing
5.4.4 Example—Stock Indexes
5.4.5 Generalized Least Squares (GLS): The Cochrane-Orcutt Procedure
5.4.6 Example—Time Intervals Between Old Faithful Geyser Eruptions
5.5 Summary

Part III Categorical Predictors

6 Analysis of Variance
6.1 Introduction
6.2 Concepts and Background Material
6.2.1 One-Way ANOVA
6.2.2 Two-Way ANOVA
6.3 Methodology
6.3.1 Codings for Categorical Predictors
6.3.2 Multiple Comparisons
6.3.3 Levene's Test and Weighted Least Squares
6.3.4 Membership in Multiple Groups
6.4 Example—DVD Sales of Movies
6.5 Higher-Way ANOVA
6.6 Summary

7 Analysis of Covariance
7.1 Introduction
7.2 Methodology
7.2.1 Constant Shift Models
7.2.2 Varying Slope Models
7.3 Example—International Grosses of Movies
7.4 Summary

Part IV Non-Gaussian Regression Models

8 Logistic Regression
8.1 Introduction
8.2 Concepts and Background Material
8.2.1 The Logit Response Function
8.2.2 Bernoulli and Binomial Random Variables
8.2.3 Prospective and Retrospective Designs
8.3 Methodology
8.3.1 Maximum Likelihood Estimation
8.3.2 Inference, Model Comparison, and Model Selection
8.3.3 Goodness-of-Fit
8.3.4 Measures of Association and Classification Accuracy
8.3.5 Diagnostics
8.4 Example—Smoking and Mortality
8.5 Example—Modeling Bankruptcy
8.6 Summary

9 Multinomial Regression
9.1 Introduction
9.2 Concepts and Background Material
9.2.1 Nominal Response Variable
9.2.2 Ordinal Response Variable
9.3 Methodology
9.3.1 Estimation
9.3.2 Inference, Model Comparisons, and Strength of Fit
9.3.3 Lack of Fit and Violations of Assumptions
9.4 Example—City Bond Ratings
9.5 Summary

10 Count Regression
10.1 Introduction
10.2 Concepts and Background Material
10.2.1 The Poisson Random Variable
10.2.2 Generalized Linear Models
10.3 Methodology
10.3.1 Estimation and Inference
10.3.2 Offsets
10.4 Overdispersion and Negative Binomial Regression
10.4.1 Quasi-likelihood
10.4.2 Negative Binomial Regression
10.5 Example—Unprovoked Shark Attacks in Florida
10.6 Other Count Regression Models
10.7 Poisson Regression and Weighted Least Squares
10.7.1 Example—International Grosses of Movies (continued)
10.8 Summary

11 Models for Time-to-Event (Survival) Data
11.1 Introduction
11.2 Concepts and Background Material
11.2.1 The Nature of Survival Data
11.2.2 Accelerated Failure Time Models
11.2.3 The Proportional Hazards Model
11.3 Methodology
11.3.1 The Kaplan-Meier Estimator and the Log-Rank Test
11.3.2 Parametric (Likelihood) Estimation
11.3.3 Semiparametric (Partial Likelihood) Estimation
11.3.4 The Buckley-James Estimator
11.4 Example—The Survival of Broadway Shows (continued)
11.5 Left-Truncated/Right-Censored Data and Time-Varying Covariates
11.5.1 Left-Truncated/Right-Censored Data
11.5.2 Example—The Survival of Broadway Shows (continued)
11.5.3 Time-Varying Covariates
11.5.4 Example—Female Heads of Government
11.6 Summary

Part V Other Regression Models

12 Nonlinear Regression
12.1 Introduction
12.2 Concepts and Background Material
12.3 Methodology
12.3.1 Nonlinear Least Squares Estimation
12.3.2 Inference for Nonlinear Regression Models
12.4 Example—Michaelis-Menten Enzyme Kinetics
12.5 Summary

13 Models for Longitudinal and Nested Data
13.1 Introduction
13.2 Concepts and Background Material
13.2.1 Nested Data and ANOVA
13.2.2 Longitudinal Data and Time Series
13.2.3 Fixed Effects Versus Random Effects
13.3 Methodology
13.3.1 The Linear Mixed Effects Model
13.3.2 The Generalized Linear Mixed Effects Model
13.3.3 Generalized Estimating Equations
13.3.4 Nonlinear Mixed Effects Models
13.4 Example—Tumor Growth in a Cancer Study
13.5 Example—Unprovoked Shark Attacks in the United States
13.6 Summary

14 Regularization Methods and Sparse Models
14.1 Introduction
14.2 Concepts and Background Material
14.2.1 The Bias–Variance Tradeoff
14.2.2 Large Numbers of Predictors and Sparsity
14.3 Methodology
14.3.1 Forward Stepwise Regression
14.3.2 Ridge Regression
14.3.3 The Lasso
14.3.4 Other Regularization Methods
14.3.5 Choosing the Regularization Parameter(s)
14.3.6 More Structured Regression Problems
14.3.7 Cautions About Regularization Methods
14.4 Example—Human Development Index
14.5 Summary

Part VI Nonparametric and Semiparametric Models

15 Smoothing and Additive Models
15.1 Introduction
15.2 Concepts and Background Material
15.2.1 The Bias–Variance Tradeoff
15.2.2 Smoothing and Local Regression
15.3 Methodology
15.3.1 Local Polynomial Regression
15.3.2 Choosing the Bandwidth
15.3.3 Smoothing Splines
15.3.4 Multiple Predictors, the Curse of Dimensionality, and Additive Models
15.4 Example—Prices of German Used Automobiles
15.5 Local and Penalized Likelihood Regression
15.5.1 Example—The Bechdel Rule and Hollywood Movies
15.6 Using Smoothing to Identify Interactions
15.6.1 Example—Estimating Home Prices (continued)
15.7 Summary

16 Tree-Based Models
16.1 Introduction
16.2 Concepts and Background Material
16.2.1 Recursive Partitioning
16.2.2 Types of Trees
16.3 Methodology
16.3.1 CART
16.3.2 Conditional Inference Trees
16.3.3 Ensemble Methods
16.4 Examples
16.4.1 Estimating Home Prices (continued)
16.4.2 Example—Courtesy in Airplane Travel
16.5 Trees for Other Types of Data
16.5.1 Trees for Nested and Longitudinal Data
16.5.2 Survival Trees
16.6 Summary

Bibliography

Index

Preface to the Second Edition

The years since the first edition of this book appeared have been fast-moving in the world of data analysis and statistics. Algorithmically-based methods operating under the banner of machine learning, artificial intelligence, or data science have come to the forefront of public perceptions about how to analyze data, and more than a few pundits have predicted the demise of classic statistical modeling.

To paraphrase Mark Twain, we believe that reports of the (impending) death of statistical modeling in general, and regression modeling in particular, are exaggerated. The great advantage that statistical models have over "black box" algorithms is that in addition to effective prediction, their transparency also provides guidance about the actual underlying process (which is crucial for decision making), and affords the possibilities of making inferences and distinguishing real effects from random variation based on those models. There have been laudable attempts to encourage making machine learning algorithms interpretable in the ways regression models are (Rudin, 2019), but we believe that models based on statistical considerations and principles will have a place in the analyst's toolkit for a long time to come.

Of course, part of that usefulness comes from the ability to generalize regression models to more complex situations, and that is the thrust of the changes in this new edition. One thing that hasn't changed is the philosophy behind the book, and our recommendations on how it can be best used, and we encourage the reader to refer to the preface to the first edition for guidance on those points. There have been small changes to the original chapters, and broad descriptions of those chapters can also be found in the preface to the first edition. The five new chapters (Chapters 11, 13, 14, 15, and 16, with the former Chapter 11 on nonlinear regression moving to Chapter 12) expand greatly on the power and applicability of regression models beyond what was discussed in the first edition. For this reason many more references are provided in these chapters than in the earlier ones, since some of the material in those chapters is less established and less well-known, with much of it still the subject of active research. In keeping with that, we do not spend much (or any) time on issues for which there still isn't necessarily a consensus in the statistical community, but point to books and monographs that can help the analyst get some perspective on that kind of material.

Chapter 11 discusses the modeling of time-to-event data, often referred to as survival data. The response variable measures the length of time until an event occurs, and a common complicator is that sometimes it is only known that a response value is greater than some number; that is, it is right-censored. This can naturally occur, for example, in a clinical trial in which subjects enter the study at varying times, and the event of interest has not occurred at the end of the trial. Analysis focuses on the survival function (the probability of surviving past a given time) and the hazard function (the instantaneous probability of the event occurring at a given time given survival to that time). Parametric models based on appropriate distributions like the Weibull or log-logistic can be fit that take censoring into account. Semiparametric models like the Cox proportional hazards model (the most commonly-used model) and the Buckley-James estimator are also available, which weaken distributional assumptions. Modeling can be adapted to situations where event times are truncated, and also when there are covariates that change over the life of the subject.

Chapter 13 extends applications to data with multiple observations for each subject consistent with some structure from the underlying process. Such data can take the form of nested or clustered data (such as students all in one classroom) or longitudinal data (where a variable is measured at multiple times for each subject). In this situation ignoring that structure results in an induced correlation that reflects unmodeled differences between classrooms and subjects, respectively. Mixed effects models generalize analysis of variance (ANOVA) models and time series models to this more complicated situation. Models with linear effects based on Gaussian distributions can be generalized to nonlinear models, and also can be generalized to non-Gaussian distributions through the use of generalized linear mixed effects models.

Modern data applications can involve very large (even massive) numbers of predictors, which can cause major problems for standard regression methods. Best subsets regression (discussed in Chapter 2) does not scale well to very large numbers of predictors, and Chapter 14 discusses approaches that can accomplish that. Forward stepwise regression, in which potential predictors are stepped in one at a time, is an alternative to best subsets that scales to massive data sets. A systematic approach to reducing the dimensionality of a chosen regression model is through the use of regularization, in which the usual estimation criterion is augmented with a penalty that encourages sparsity; the most commonly-used version of this is the lasso estimator, and it and its generalizations are discussed further.

Chapters 15 and 16 discuss methods that move away from specified relationships between the response and the predictor to nonparametric and semiparametric methods, in which the data are used to choose the form of the underlying relationship. In Chapter 15 linear or (specifically specified) nonlinear relationships are replaced with the notion of relationships taking the form of smooth curves and surfaces. Estimation at a particular location is based on local information; that is, the values of the response in a local neighborhood of that location. This can be done through local versions of weighted least squares (local polynomial estimation) or local regularization (smoothing splines). Such methods can also be used to help identify interactions between numerical predictors in linear regression modeling. Single predictor smoothing estimators can be generalized to multiple predictors through the use of additive functions of smooth curves. Chapter 16 focuses on an extremely flexible class of nonparametric regression estimators, tree-based methods. Trees are based on the notion of binary recursive partitioning. At each step a set of observations (a node) is either split into two parts (children nodes) on the basis of the values of a chosen variable, or is not split at all, based on encouraging homogeneity in the children nodes. This approach provides nonparametric alternatives to linear regression (regression trees), logistic and multinomial regression (classification trees), accelerated failure time and proportional hazards regression (survival trees) and mixed effects regression (longitudinal trees).

A final small change from the first edition to the second edition is in the title, as it now includes the phrase With Applications in R. This is not really a change, of course, as all of the analyses in the first edition were performed using the statistics package R. Code for the output and figures in the book can (still) be found at its associated website at http://people.stern.nyu.edu/jsimonof/RegressionHandbook/. As was the case in the first edition, even though analyses are performed in R, we still refer to general issues relevant to a data analyst in the use of statistical software even if those issues don't specifically apply to R.

We would like to once again thank our students and colleagues for their encouragement and support, and in particular students for the tough questions that have definitely affected our views on statistical modeling and by extension this book. We would like to thank Jon Gurstelle, and later Kathleen Santoloci and Mindy Okura-Marszycki, for approaching us with encouragement to undertake a second edition. We would like to thank Sarah Keegan for her patient support in bringing the book to fruition in her role as Project Editor. We would like to thank Roni Chambers for computing assistance, and Glenn Heller and Marc Scott for looking at earlier drafts of chapters. Finally, we would like to thank our families for their continuing love and support.

SAMPRIT CHATTERJEE
Brooksville, Maine

JEFFREY S. SIMONOFF

October, 2019

Preface to the First Edition

How to Use This Book

This book is designed to be a practical guide to regression modeling. There is little theory here, and methodology appears in the service of the ultimate goal of analyzing real data using appropriate regression tools. As such, the target audience of the book includes anyone who is faced with regression data [that is, data where there is a response variable that is being modeled as a function of other variable(s)], and whose goal is to learn as much as possible from that data.

The book can be used as a text for an applied regression course (indeed, much of it is based on handouts that have been given to students in such a course), but that is not its primary purpose; rather, it is aimed much more broadly as a source of practical advice on how to address the problems that come up when dealing with regression data. While a text is usually organized in a way that makes the chapters interdependent, successively building on each other, that is not the case here. Indeed, we encourage readers to dip into different chapters for practical advice on specific topics as needed. The pace of the book is faster than might typically be the case for a text. The coverage, while at an applied level, does not shy away from sophisticated concepts. It is distinct from, for example, Chatterjee and Hadi (2012), while also having less theoretical focus than texts such as Greene (2011), Montgomery et al. (2012), or Sen and Srivastava (1990).

This, however, is not a cookbook that presents a mechanical approach to doing regression analysis. Data analysis is perhaps an art, and certainly a craft; we believe that the goal of any data analysis book should be to help analysts develop the skills and experience necessary to adjust to the inevitable twists and turns that come up when analyzing real data.

We assume that the reader possesses a nodding acquaintance with regression analysis. The reader should be familiar with the basic terminology and should have been exposed to basic regression techniques and concepts, at least at the level of simple (one-predictor) linear regression. We also assume that the user has access to a computer with an adequate regression package. The material presented here is not tied to any particular software. Almost all of the analyses described here can be performed by most standard packages, although the ease of doing this could vary. All of the analyses presented here were done using the free package R (R Development Core Team, 2017), which is available for many different operating system platforms (see http://www.R-project.org/ for more information). Code for the output and figures in the book can be found at its associated website at http://people.stern.nyu.edu/jsimonof/RegressionHandbook/.

Each chapter of the book is laid out in a similar way, with most having at least four sections of specific types. First is an introduction, where the general issues that will be discussed in that chapter are presented. A section on concepts and background material follows, where a discussion of the relationship of the chapter's material to the broader study of regression data is the focus. This section also provides any theoretical background for the material that is necessary. Sections on methodology follow, where the specific tools used in the chapter are discussed. This is where relevant algorithmic details are likely to appear. Finally, each chapter includes at least one analysis of real data using the methods discussed in the chapter (as well as appropriate material from earlier chapters), including both methodological and graphical analyses.

The book begins with discussion of the multiple regression model. Many regression textbooks start with discussion of simple regression before moving on to multiple regression. This is quite reasonable from a pedagogical point of view, since simple regression has the great advantage of being easy to understand graphically, but from a practical point of view simple regression is rarely the primary tool in analysis of real data. For that reason, we start with multiple regression, and note the simplifications that come from the special case of a single predictor. Chapter 1 describes the basics of the multiple regression model, including the assumptions being made, and both estimation and inference tools, while also giving an introduction to the use of residual plots to check assumptions.

Since it is unlikely that the first model examined will ultimately be the final preferred model, Chapter 2 focuses on the very important areas of model building and model selection. This includes addressing the issue of collinearity, as well as the use of both hypothesis tests and information measures to help choose among candidate models.

Chapters 3 through 5 study common violations of regression assumptions, and methods available to address those model violations. Chapter 3 focuses on unusual observations (outliers and leverage points), while Chapter 4 describes how transformations (especially the log transformation) can often address both nonlinearity and nonconstant variance violations. Chapter 5 is an introduction to time series regression, and the problems caused by autocorrelation. Time series analysis is a vast area of statistical methodology, so our goal in this chapter is only to provide a good practical introduction to that area in the context of regression analysis.

Chapters 6 and 7 focus on the situation where there are categorical variables among the predictors. Chapter 6 treats analysis of variance (ANOVA) models, which include only categorical predictors, while Chapter 7 looks at analysis of covariance (ANCOVA) models, which include both numerical and categorical predictors. The examination of interaction effects is a fundamental aspect of these models, as are questions related to simultaneous comparison of many groups to each other. Data of this type often exhibit nonconstant variance related to the different subgroups in the population, and the appropriate tool to address this issue, weighted least squares, is also a focus here.

Chapters 8 through 10 examine the situation where the nature of the response variable is such that Gaussian-based least squares regression is no longer appropriate. Chapter 8 focuses on logistic regression, designed for binary response data and based on the binomial random variable. While there are many parallels between logistic regression analysis and least squares regression analysis, there are also issues that come up in logistic regression that require special care. Chapter 9 uses the multinomial random variable to generalize the models of Chapter 8 to allow for multiple categories in the response variable, outlining models designed for response variables that either do or do not have ordered categories. Chapter 10 focuses on response data in the form of counts, where distributions like the Poisson and negative binomial play a central role. The connection between all these models through the generalized linear model framework is also exploited in this chapter.

The final chapter focuses on situations where linearity does not hold, and a nonlinear relationship is necessary. Although these models are based on least squares, from both an algorithmic and inferential point of view there are strong connections with the models of Chapters 8 through 10, which we highlight.

This Handbook can be used in several different ways. First, a reader may use the book to find information on a specific topic. An analyst might want additional information on, for example, logistic regression or autocorrelation. The chapters on these (and other) topics provide the reader with this subject matter information. As noted above, the chapters also include at least one analysis of a data set, a clarification of computer output, and reference to sources where additional material can be found. The chapters in the book are to a large extent self-contained and can be consulted independently of other chapters.

The book can also be used as a template for what we view as a reasonable approach to data analysis in general. This is based on the cyclical paradigm of model formulation, model fitting, model evaluation, and model updating leading back to model (re)formulation. Statistical significance of test statistics does not necessarily mean that an adequate model has been obtained. Further analysis needs to be performed before the fitted model can be regarded as an acceptable description of the data, and this book concentrates on this important aspect of regression methodology. Detection of deficiencies of fit is based on both testing and graphical methods, and both approaches are highlighted here.

This preface is intended to indicate ways in which the Handbook can be used. Our hope is that it will be a useful guide for data analysts, and will help contribute to effective analyses. We would like to thank our students and colleagues for their encouragement and support. We hope we have provided them with a book of which they would approve. We would like to thank Steve Quigley, Jackie Palmieri, and Amy Hendrickson for their help in bringing this manuscript to print. We would also like to thank our families for their love and support.

SAMPRIT CHATTERJEE
Brooksville, Maine

JEFFREY S. SIMONOFF
New York, New York

August, 2012

Part One

The Multiple Linear Regression Model

Chapter One

Multiple Linear Regression

1.1 Introduction
1.2 Concepts and Background Material
1.2.1 The Linear Regression Model
1.2.2 Estimation Using Least Squares
1.2.3 Assumptions
1.3 Methodology
1.3.1 Interpreting Regression Coefficients
1.3.2 Measuring the Strength of the Regression Relationship
1.3.3 Hypothesis Tests and Confidence Intervals for β
1.3.4 Fitted Values and Predictions
1.3.5 Checking Assumptions Using Residual Plots
1.4 Example—Estimating Home Prices
1.5 Summary

1.1 Introduction

This is a book about regression modeling, but when we refer to regression models, what do we mean? The regression framework can be characterized in the following way:

1. We have one particular variable that we are interested in understanding or modeling, such as sales of a particular product, sale price of a home, or voting preference of a particular voter. This variable is called the target, response, or dependent variable, and is usually represented by y.

2. We have a set of p other variables that we think might be useful in predicting or modeling the target variable (the price of the product, the competitor's price, and so on; or the lot size, number of bedrooms, number of bathrooms of the home, and so on; or the gender, age, income, party membership of the voter, and so on). These are called the predicting, or independent variables, and are usually represented by x1, x2, etc.

Typically, a regression analysis is used for one (or more) of three purposes:

1. modeling the relationship between x and y;

2. prediction of the target variable (forecasting);

3. testing of hypotheses.

In this chapter, we introduce the basic multiple linear regression model, and discuss how this model can be used for these three purposes. Specifically, we discuss the interpretations of the estimates of different regression parameters, the assumptions underlying the model, measures of the strength of the relationship between the target and predictor variables, the construction of tests of hypotheses and intervals related to regression parameters, and the checking of assumptions using diagnostic plots.

1.2 Concepts and Background Material

1.2.1 THE LINEAR REGRESSION MODEL

The data consist of n observations, which are sets of observed values {x1i, x2i, ..., xpi, yi} that represent a random sample from a larger population. It is assumed that these observations satisfy a linear relationship,

yi = β0 + β1x1i + β2x2i + ... + βpxpi + εi,    (1.1)

where the β coefficients are unknown parameters, and the εi are random error terms. By a linear model, it is meant that the model is linear in the parameters; a quadratic model,

yi = β0 + β1xi + β2xi² + εi,

paradoxically enough, is a linear model, since x and x² are just versions of x1 and x2.
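The linear-in-the-parameters point can be seen directly in R, the package used throughout the book. The following sketch is ours, not from the book; the simulated data and variable names are hypothetical. A quadratic relationship is fit with the same `lm()` least squares machinery as a straight line, because the model is still linear in β.

```r
# Simulate data from a quadratic (but linear-in-parameters) model
set.seed(1)
n <- 100
x <- runif(n, 0, 10)
y <- 2 + 1.5 * x - 0.1 * x^2 + rnorm(n, sd = 1)

fit.linear    <- lm(y ~ x)           # y_i = beta0 + beta1 x_i + eps_i
fit.quadratic <- lm(y ~ x + I(x^2))  # x and x^2 act as two predictors x1, x2

coef(fit.quadratic)  # estimates of beta0, beta1, beta2
```

The `I(x^2)` wrapper tells the formula interface to treat the squared term literally as an additional predictor, exactly the "x and x² are just versions of x1 and x2" renaming described above.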

It is important to recognize that this, or any statistical model, is not viewed as a true representation of reality; rather, the goal is that the model be a useful representation of reality. A model can be used to explore the relationships between variables and make accurate forecasts based on those relationships even if it is not the "truth." Further, any statistical model is only temporary, representing a provisional version of views about the random process being studied. Models can, and should, change, based on analysis using the current model, selection among several candidate models, the acquisition of new data, new understanding of the underlying random process, and so on. Further, it is often the case that there are several different models that are reasonable representations of reality. Having said this, we will sometimes refer to the "true" model, but this should be understood as referring to the underlying form of the currently hypothesized representation of the regression relationship.

FIGURE 1.1: The simple linear regression model. The solid line corresponds to the true regression line, and the dotted lines correspond to the random errors εi.

The special case of (1.1) with p = 1 corresponds to the simple regression model, and is consistent with the representation in Figure 1.1. The solid line is the true regression line, the expected value of y given the value of x. The dotted lines are the random errors εi that account for the lack of a perfect association between the predictor and the target variables.

1.2.2 ESTIMATION USING LEAST SQUARES

The true regression function represents the expected relationship between the target and the predictor variables, which is unknown. A primary goal of a regression analysis is to estimate this relationship, or equivalently, to estimate the unknown parameters β. This requires a data-based rule, or criterion, that will give a reasonable estimate. The standard approach is least squares regression, where the estimates are chosen to minimize

Σi [yi − (β0 + β1x1i + ··· + βpxpi)]².    (1.2)

Figure 1.2 gives a graphical representation of least squares that is based on Figure 1.1. Now the true regression line is represented by the gray line, and the solid black line is the estimated regression line, designed to estimate the (unknown) gray line as closely as possible.

FIGURE 1.2: Least squares estimation for the simple linear regression model, using the same data as in Figure 1.1. The gray line corresponds to the true regression line, the solid black line corresponds to the fitted least squares line (designed to estimate the gray line), and the lengths of the dotted lines correspond to the residuals. The sum of squared values of the lengths of the dotted lines is minimized by the solid black line.

For any choice of estimated parameters β̂, the estimated expected response value given the observed predictor values equals

ŷi = β̂0 + β̂1x1i + ··· + β̂pxpi,

and is called the fitted value. The difference between the observed value yi and the fitted value ŷi is called the residual, the set of which is represented by the signed lengths of the dotted lines in Figure 1.2. The least squares regression line minimizes the sum of squares of the lengths of the dotted lines; that is, the ordinary least squares (OLS) estimates minimize the sum of squares of the residuals.

In higher dimensions (p > 1), the true and estimated regression relationships correspond to planes (p = 2) or hyperplanes (p ≥ 3), but otherwise the principles are the same. Figure 1.3 illustrates the case with two predictors. The length of each vertical line corresponds to a residual (solid lines refer to positive residuals, while dashed lines refer to negative residuals), and the (least squares) plane that goes through the observations is chosen to minimize the sum of squares of the residuals.

FIGURE 1.3: Least squares estimation for the multiple linear regression model with two predictors. The plane corresponds to the fitted least squares relationship, and the lengths of the vertical lines correspond to the residuals. The sum of squared values of the lengths of the vertical lines is minimized by the plane.

The linear regression model can be written compactly using matrix notation. Define the target vector y = (y1, ..., yn)′, the n × (p + 1) design matrix X whose ith row is (1, x1i, ..., xpi), the coefficient vector β = (β0, β1, ..., βp)′, and the error vector ε = (ε1, ..., εn)′.

The regression model (1.1) is then

y = Xβ + ε.

The normal equations [which determine the minimizer of (1.2)] can be shown (using multivariate calculus) to be

X′Xβ̂ = X′y,

which implies that the least squares estimates satisfy

β̂ = (X′X)⁻¹X′y.

The fitted values are then

ŷ = Xβ̂ = X(X′X)⁻¹X′y = Hy,

where H = X(X′X)⁻¹X′ is the so-called "hat" matrix (since it takes y to ŷ).

The residuals e = y − ŷ thus satisfy

e = y − ŷ = y − Hy,

or

e = (I − H)y.
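These matrix formulas are easy to verify numerically. The sketch below (Python/NumPy on simulated data; the sample sizes and coefficient values are illustrative, not from the text) computes β̂ from the normal equations, forms the hat matrix H, and checks that the residuals satisfy e = (I − H)y and are orthogonal to the columns of X:

```python
import numpy as np

# Simulated data: n = 50 observations, p = 2 predictors (values illustrative).
rng = np.random.default_rng(0)
n, p = 50, 2
x = rng.normal(size=(n, p))
y = 1.0 + 2.0 * x[:, 0] - 1.0 * x[:, 1] + rng.normal(scale=0.5, size=n)

# Design matrix X: a leading column of ones for the intercept, then the predictors.
X = np.column_stack([np.ones(n), x])

# Least squares estimates from the normal equations X'X beta_hat = X'y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hat matrix H = X (X'X)^{-1} X'; fitted values y_hat = H y; residuals e.
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y
e = y - y_hat

# The residuals can equivalently be written e = (I - H) y.
assert np.allclose(e, (np.eye(n) - H) @ y)
# The normal equations force the residuals to be orthogonal to each column of X.
assert np.allclose(X.T @ e, 0.0, atol=1e-8)
```

Solving the normal equations with `np.linalg.solve` (rather than explicitly inverting X′X) is the numerically preferable way to compute β̂.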

1.2.3 ASSUMPTIONS

The least squares criterion will not necessarily yield sensible results unless certain assumptions hold. One is given in (1.1): the linear model should be appropriate. In addition, the following assumptions are needed to justify using least squares regression.

1. The expected value of the errors is zero (E(εi) = 0 for all i). That is, it cannot be true that for certain observations the model is systematically too low, while for others it is systematically too high. A violation of this assumption will lead to difficulties in estimating β0. More importantly, this reflects that the model does not include a necessary systematic component, which has instead been absorbed into the error terms.

2. The variance of the errors is constant (V(εi) = σ² for all i). That is, it cannot be true that the strength of the model is greater for some parts of the population (smaller σ) and less for other parts (larger σ). This assumption of constant variance is called homoscedasticity, and its violation (nonconstant variance) is called heteroscedasticity. A violation of this assumption means that the least squares estimates are not as efficient as they could be in estimating the true parameters, and better estimates are available. More importantly, it also results in poorly calibrated confidence and (especially) prediction intervals.

3. The errors are uncorrelated with each other. That is, it cannot be true that knowing that the model underpredicts y (for example) for one particular observation says anything at all about what it does for any other observation. This violation most often occurs in data that are ordered in time (time series data), where errors that are near each other in time are often similar to each other (such time-related correlation is called autocorrelation). Violation of this assumption means that the least squares estimates are not as efficient as they could be in estimating the true parameters, and more importantly, its presence can lead to very misleading assessments of the strength of the regression.

4. The errors are normally distributed. This is needed if we want to construct any confidence or prediction intervals, or hypothesis tests, which we usually do. If this assumption is violated, hypothesis tests and confidence and prediction intervals can be very misleading.

Since violation of these assumptions can potentially lead to completely misleading results, a fundamental part of any regression analysis is to check them using various plots, tests, and diagnostics.
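As a minimal numerical illustration (a sketch on simulated data in which the assumptions hold by construction; in practice these checks are made with the diagnostic plots and tests discussed later), the code below computes rough quantities corresponding to assumptions 1 to 3:

```python
import numpy as np

# Simulated data satisfying the assumptions by construction.
rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0, 10, size=n)
y = 3.0 + 0.8 * x + rng.normal(scale=1.0, size=n)

X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta_hat

# Note: the residual mean is forced to (essentially) zero by least squares
# whenever an intercept is included, so on its own it cannot detect a
# violation of assumption 1; the next two quantities are informative checks.
mean_resid = e.mean()

# Assumption 3: lag-1 correlation of the residuals should be near zero.
lag1_corr = np.corrcoef(e[:-1], e[1:])[0, 1]

# Assumption 2: residual spread should be similar across the predictor's
# range under homoscedasticity (ratio near 1).
spread_ratio = e[x < 5].std() / e[x >= 5].std()
```

With heteroscedastic or autocorrelated errors, `spread_ratio` and `lag1_corr` would drift away from 1 and 0, respectively.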

1.3 Methodology

1.3.1 INTERPRETING REGRESSION COEFFICIENTS

The least squares regression coefficients have very specific meanings. They are often misinterpreted, so it is important to be clear on what they mean (and do not mean). Consider first the intercept, β̂0.

β̂0: The estimated expected value of the target variable when the predictors are all equal to zero.

Note that this might not have any physical interpretation, since a zero value for the predictor(s) might be impossible, or might never come close to occurring in the observed data. In that situation, it is pointless to try to interpret this value. If all of the predictors are centered to have zero mean, then β̂0 necessarily equals Ȳ, the sample mean of the target values. Note that if there is a particular value for each predictor that is meaningful in some sense, and each variable is centered around its particular value, then the intercept is an estimate of E(y) when the predictors all take those meaningful values.
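The effect of centering is easy to verify numerically. The sketch below (simulated data; all names and values are illustrative) shows that centering every predictor makes the fitted intercept equal the sample mean of y, while leaving the slope estimates unchanged:

```python
import numpy as np

# Simulated data with two predictors whose means are far from zero.
rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(5.0, 2.0, size=n)
x2 = rng.normal(-3.0, 1.0, size=n)
y = 2.0 + 0.5 * x1 - 1.5 * x2 + rng.normal(size=n)

# Centered predictors: subtract each predictor's sample mean.
xc1, xc2 = x1 - x1.mean(), x2 - x2.mean()
Xc = np.column_stack([np.ones(n), xc1, xc2])
beta_c = np.linalg.lstsq(Xc, y, rcond=None)[0]

# Uncentered fit, for comparison.
Xr = np.column_stack([np.ones(n), x1, x2])
beta_r = np.linalg.lstsq(Xr, y, rcond=None)[0]

# With centered predictors the intercept equals the sample mean of y,
# and the slope estimates are unchanged by centering.
assert np.isclose(beta_c[0], y.mean())
assert np.allclose(beta_c[1:], beta_r[1:])
```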

The estimated coefficient for the jth predictor (j = 1, ..., p) is interpreted in the following way:

β̂j: The estimated expected change in the target variable associated with a one unit change in the jth predicting variable, holding all else in the model fixed.

There are several noteworthy aspects to this interpretation. First, note the word associated: we cannot say that a change in the target variable is caused by a change in the predictor, only that they are associated with each other. That is, correlation does not imply causation.

Another key point is the phrase "holding all else in the model fixed," the implications of which are often ignored. Consider the following hypothetical example. A random sample of college students at a particular university is taken in order to understand the relationship between college grade point average (GPA) and other variables. A model is built with college GPA as a function of high school GPA and the standardized Scholastic Aptitude Test (SAT), with resultant least squares fit

College GPA = 1.3 + 0.7 × High School GPA − 0.0001 × SAT.

It is tempting to say (and many people would say) that the coefficient for SAT score has the "wrong sign," because it says that higher values of SAT are associated with lower values of college GPA. This is not correct. The problem is that it is likely in this context that what an analyst would find intuitive is the marginal relationship between college GPA and SAT score alone (ignoring all else), one that we would indeed expect to be a direct (positive) one. The regression coefficient does not say anything about that marginal relationship. Rather, it refers to the conditional (sometimes called partial) relationship that takes the high school GPA as fixed, which is apparently that higher values of SAT are associated with lower values of college GPA, holding high school GPA fixed. High school GPA and SAT are no doubt related to each other, and it is quite likely that this relationship between the predictors would complicate any understanding of, or intuition about, the conditional relationship between college GPA and SAT score. Multiple regression coefficients should not be interpreted marginally; if you really are interested in the relationship between the target and a single predictor alone, you should simply do a regression of the target on only that variable. This does not mean that multiple regression coefficients are uninterpretable, only that care is necessary when interpreting them.

Another common use of multiple regression that depends on this conditional interpretation of the coefficients is to explicitly include "control" variables in a model in order to try to account for their effect statistically. This is particularly important in observational data (data that are not the result of a designed experiment), since in that case, the effects of other variables cannot be ignored as a result of random assignment in the experiment. For observational data it is not possible to physically intervene in the experiment to "hold other variables fixed," but the multiple regression framework effectively allows this to be done statistically.

Having said this, we must recognize that in many situations, it is impossible from a practical point of view to change one predictor while holding all else fixed. Thus, while we would like to interpret a coefficient as accounting for the presence of other predictors in a physical sense, it is important (when dealing with observational data in particular) to remember that linear regression is at best only an approximation to the actual underlying random process.
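The marginal-versus-conditional distinction can be reproduced on simulated data. The sketch below uses hypothetical numbers loosely echoing the GPA/SAT example above (the coefficients and scales are invented for illustration, not taken from the text): an SAT-like predictor is built to be strongly correlated with a high-school-GPA-like predictor, and its coefficient is positive in the marginal regression but negative in the multiple regression:

```python
import numpy as np

# Hypothetical data loosely inspired by the GPA/SAT example (invented values).
rng = np.random.default_rng(3)
n = 500
hs_gpa = rng.normal(3.0, 0.4, size=n)
sat = 200.0 * hs_gpa + rng.normal(scale=30.0, size=n)  # SAT tracks HS GPA closely
college_gpa = (1.3 + 0.7 * hs_gpa - 0.002 * sat
               + rng.normal(scale=0.2, size=n))

def ols(predictors, y):
    """Least squares fit with an intercept; returns all coefficients."""
    X = np.column_stack([np.ones(len(y)), predictors])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Marginal regression: college GPA on SAT alone.
marginal_sat = ols(sat.reshape(-1, 1), college_gpa)[1]
# Conditional (multiple) regression: college GPA on HS GPA and SAT.
conditional_sat = ols(np.column_stack([hs_gpa, sat]), college_gpa)[2]

# The marginal slope is positive while the conditional coefficient is
# negative, because SAT and HS GPA are strongly correlated.
assert marginal_sat > 0 and conditional_sat < 0
```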

1.3.2 MEASURING THE STRENGTH OF THE REGRESSION RELATIONSHIP

The least squares estimates possess an important property:

Σi (yi − Ȳ)² = Σi (yi − ŷi)² + Σi (ŷi − Ȳ)².

This formula says that the variability in the target variable (the left side of the equation, termed the corrected total sum of squares) can be split into two mutually exclusive parts: the variability left over after doing the regression (the first term on the right side, the residual sum of squares), and the variability accounted for by doing the regression (the second term, the regression sum of squares). This immediately suggests the usefulness of R² as a measure of the strength of the regression relationship, where

R² = Σi (ŷi − Ȳ)² / Σi (yi − Ȳ)² ≡ Regression SS / Corrected total SS = 1 − Residual SS / Corrected total SS.

The R² value (also called the coefficient of determination) estimates the population proportion of variability in y accounted for by the best linear combination of the predictors. Values closer to 1 indicate a good deal of predictive power of the predictors for the target variable, while values closer to 0 indicate little predictive power. An equivalent representation of R² is

R² = corr(yi, ŷi)²,

where

corr(yi, ŷi) = Σi (yi − Ȳ)(ŷi − Ȳ) / √[Σi (yi − Ȳ)² · Σi (ŷi − Ȳ)²]

is the sample correlation coefficient between y and ŷ (this correlation is called the multiple correlation coefficient; note that when an intercept is included, the fitted values have the same sample mean Ȳ as the observed values). That is, R² is a direct measure of how similar the observed and fitted target values are.

It can be shown that R² is biased upwards as an estimate of the population proportion of variability accounted for by the regression. The adjusted R² corrects this bias, and equals

R²a = R² − [p / (n − p − 1)] (1 − R²).    (1.7)

It is apparent from (1.7) that unless p is large relative to n − p − 1 (that is, unless the number of predictors is large relative to the sample size), R² and R²a will be close to each other, and the choice of which to use is a minor concern. What is perhaps more interesting is the nature of R²a as providing an explicit tradeoff between the strength of the fit (the first term, with larger R² corresponding to stronger fit and larger R²a) and the complexity of the model (the second term, with larger p corresponding to more complexity and smaller R²a). This tradeoff of fidelity to the data versus simplicity will be important in the discussion of model selection in Section 2.3.1.
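These identities are easy to verify numerically. The sketch below (simulated data; values illustrative) computes R² via the sum-of-squares decomposition, via the squared multiple correlation, and the adjusted form R²a = R² − [p/(n − p − 1)](1 − R²) discussed above:

```python
import numpy as np

# Simulated data: n = 80 observations, p = 3 predictors (values illustrative).
rng = np.random.default_rng(4)
n, p = 80, 3
X = rng.normal(size=(n, p))
y = 1.0 + X @ np.array([1.0, -0.5, 0.25]) + rng.normal(scale=0.7, size=n)

Xd = np.column_stack([np.ones(n), X])
beta_hat = np.linalg.lstsq(Xd, y, rcond=None)[0]
y_hat = Xd @ beta_hat

ss_total = ((y - y.mean()) ** 2).sum()    # corrected total SS
ss_resid = ((y - y_hat) ** 2).sum()       # residual SS
ss_reg = ((y_hat - y.mean()) ** 2).sum()  # regression SS

r2 = 1 - ss_resid / ss_total
r2_via_corr = np.corrcoef(y, y_hat)[0, 1] ** 2  # squared multiple correlation
r2_adj = r2 - (p / (n - p - 1)) * (1 - r2)

# The two routes to R^2 agree, and the adjustment shrinks R^2.
assert np.isclose(r2, ss_reg / ss_total)
assert np.isclose(r2, r2_via_corr)
assert r2_adj < r2
```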

The only parameter left unaccounted for in the estimation scheme is the variance of the errors σ². An unbiased estimate is provided by the residual mean square

σ̂² = Σi (yi − ŷi)² / (n − p − 1).    (1.8)

This estimate has a direct, but often underappreciated, use in assessing the practical importance of the model. Does knowing x1, ..., xp really say anything of value about y? This isn't a question that can be answered completely statistically; it requires knowledge and understanding of the data and the underlying random process (that is, it requires context). Recall that the model assumes that the errors are normally distributed with standard deviation σ. This means that, roughly speaking, 95% of the time an observed y value falls within ±2σ of the expected response

E(y) = β0 + β1x1 + ··· + βpxp.

E(y) can be estimated for any given set of x values using

ŷ = β̂0 + β̂1x1 + ··· + β̂pxp,

while the square root of the residual mean square (1.8), termed the standard error of the estimate, provides an estimate of σ that can be used in constructing this rough prediction interval ŷ ± 2σ̂.
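A quick simulation illustrates both points (a sketch; the true σ is known here by construction): the square root of the residual mean square recovers σ, and roughly 95% of observations fall within ±2σ̂ of the fitted expected response:

```python
import numpy as np

# Simulated data with known error standard deviation sigma = 1.5.
rng = np.random.default_rng(5)
n, p = 1000, 2
X = rng.normal(size=(n, p))
sigma = 1.5
y = 2.0 + X @ np.array([1.0, -1.0]) + rng.normal(scale=sigma, size=n)

Xd = np.column_stack([np.ones(n), X])
beta_hat = np.linalg.lstsq(Xd, y, rcond=None)[0]
resid = y - Xd @ beta_hat

# Standard error of the estimate: square root of the residual mean square (1.8).
sigma_hat = np.sqrt((resid ** 2).sum() / (n - p - 1))

# Fraction of observations inside the rough +/- 2*sigma_hat band.
coverage = np.mean(np.abs(resid) <= 2 * sigma_hat)
```

With normal errors, `sigma_hat` should land near 1.5 and `coverage` near 0.95.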

1.3.3 HYPOTHESIS TESTS AND CONFIDENCE INTERVALS FOR β

There are two types of hypothesis tests of immediate interest related to the regression coefficients.

1. Do any of the predictors provide predictive power for the target variable? This is a test of the overall significance of the regression,

H0: β1 = ··· = βp = 0 versus Ha: at least one βj ≠ 0, j = 1, ..., p.

The test of these hypotheses is the F-test,

F = Regression MS / Residual MS ≡ [Regression SS / p] / [Residual SS / (n − p − 1)].

This is referenced against a null F-distribution on (p, n − p − 1) degrees of freedom.

2. Given the other variables in the model, does a particular predictor provide additional predictive power? This corresponds to a test of the significance of an individual coefficient,

H0: βj = 0 versus Ha: βj ≠ 0.

This is tested using a t-test,

t = β̂j / s.e.(β̂j),

which is compared to a t-distribution on n − p − 1 degrees of freedom. Other values of βj can be specified in the null hypothesis (say βj0), with the t-statistic becoming

t = (β̂j − βj0) / s.e.(β̂j).    (1.9)
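Both statistics can be computed directly from the sums of squares. The sketch below (simulated data; values illustrative) does so for a simple regression, where the two tests coincide: the overall F-statistic equals the square of the t-statistic for the single slope, using s.e.(β̂j) = σ̂ √[(X′X)⁻¹]jj:

```python
import numpy as np

# Simulated simple regression (p = 1) with a clearly nonzero slope.
rng = np.random.default_rng(6)
n, p = 60, 1
x = rng.normal(size=n)
y = 0.5 + 1.2 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ beta_hat
resid = y - y_hat

# Overall F-test from the sums of squares.
ss_reg = ((y_hat - y.mean()) ** 2).sum()
ss_resid = (resid ** 2).sum()
F = (ss_reg / p) / (ss_resid / (n - p - 1))

# t-statistic for the slope: estimate divided by its standard error,
# se(beta_hat_j) = sigma_hat * sqrt([(X'X)^{-1}]_{jj}).
sigma_hat2 = ss_resid / (n - p - 1)
cov_beta = sigma_hat2 * np.linalg.inv(X.T @ X)
t = beta_hat[1] / np.sqrt(cov_beta[1, 1])

# For a single predictor, t^2 = F exactly.
assert np.isclose(t ** 2, F)
```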
