An Introduction to Statistical Learning with Applications in R ebook - The ebook is now available, by Education Libraries


disciplines who wish to use statistical learning tools to analyze their data. It can be used as a textbook for a course spanning one or two semesters.

We would like to thank several readers for valuable comments on preliminary drafts of this book: Pallavi Basu, Alexandra Chouldechova, Patrick Danaher, Will Fithian, Luella Fu, Sam Gross, Max Grazier G’Sell, Courtney Paulson, Xinghao Qiao, Elisa Sheng, Noah Simon, Kean Ming Tan, and Xin Lu Tan.

It’s tough to make predictions, especially about the future.

-Yogi Berra

Los Angeles, USA: Gareth James
Seattle, USA: Daniela Witten
Palo Alto, USA: Trevor Hastie
Palo Alto, USA: Robert Tibshirani

2.1.1 Why Estimate f?
2.1.2 How Do We Estimate f?
2.1.3 The Trade-Off Between Prediction Accuracy and Model Interpretability
2.1.4 Supervised Versus Unsupervised Learning
2.1.5 Regression Versus Classification Problems
2.2 Assessing Model Accuracy
2.2.1 Measuring the Quality of Fit
2.2.2 The Bias-Variance Trade-Off
2.2.3 The Classification Setting
2.3 Lab: Introduction to R
2.3.1 Basic Commands
2.3.2 Graphics
2.3.3 Indexing Data
2.3.4 Loading Data
2.3.5 Additional Graphical and Numerical Summaries
2.4 Exercises

3 Linear Regression
3.1 Simple Linear Regression
3.1.1 Estimating the Coefficients
3.1.2 Assessing the Accuracy of the Coefficient Estimates
3.1.3 Assessing the Accuracy of the Model
3.2 Multiple Linear Regression
3.2.1 Estimating the Regression Coefficients
3.2.2 Some Important Questions
3.3 Other Considerations in the Regression Model
3.3.1 Qualitative Predictors
3.3.2 Extensions of the Linear Model
3.3.3 Potential Problems
3.4 The Marketing Plan
3.5 Comparison of Linear Regression with K-Nearest Neighbors
3.6 Lab: Linear Regression
3.6.1 Libraries
3.6.2 Simple Linear Regression
3.6.3 Multiple Linear Regression
3.6.4 Interaction Terms
3.6.5 Non-linear Transformations of the Predictors
3.6.6 Qualitative Predictors
3.6.7 Writing Functions
3.7 Exercises

4 Classification
4.1 An Overview of Classification
4.2 Why Not Linear Regression?
4.3 Logistic Regression
4.3.1 The Logistic Model
4.3.2 Estimating the Regression Coefficients
4.3.3 Making Predictions
4.3.4 Multiple Logistic Regression
4.3.5 Logistic Regression for >2 Response Classes
4.4 Linear Discriminant Analysis
4.4.1 Using Bayes’ Theorem for Classification
4.4.2 Linear Discriminant Analysis for p = 1
4.4.3 Linear Discriminant Analysis for p > 1
4.4.4 Quadratic Discriminant Analysis
4.5 A Comparison of Classification Methods
4.6 Lab: Logistic Regression, LDA, QDA, and KNN
4.6.1 The Stock Market Data
4.6.2 Logistic Regression
4.6.3 Linear Discriminant Analysis
4.6.4 Quadratic Discriminant Analysis
4.6.5 K-Nearest Neighbors
4.6.6 An Application to Caravan Insurance Data
4.7 Exercises

5 Resampling Methods
5.1 Cross-Validation
5.1.1 The Validation Set Approach
5.1.2 Leave-One-Out Cross-Validation
5.1.3 k-Fold Cross-Validation
5.1.4 Bias-Variance Trade-Off for k-Fold Cross-Validation
5.1.5 Cross-Validation on Classification Problems
5.2 The Bootstrap
5.3 Lab: Cross-Validation and the Bootstrap
5.3.1 The Validation Set Approach
5.3.2 Leave-One-Out Cross-Validation
5.3.3 k-Fold Cross-Validation
5.3.4 The Bootstrap
5.4 Exercises

6 Linear Model Selection and Regularization
6.1 Subset Selection
6.1.1 Best Subset Selection
6.1.2 Stepwise Selection
6.1.3 Choosing the Optimal Model
6.2 Shrinkage Methods
6.2.1 Ridge Regression
6.2.2 The Lasso
6.2.3 Selecting the Tuning Parameter
6.3 Dimension Reduction Methods
6.3.1 Principal Components Regression
6.3.2 Partial Least Squares
6.4 Considerations in High Dimensions
6.4.1 High-Dimensional Data
6.4.2 What Goes Wrong in High Dimensions?
6.4.3 Regression in High Dimensions
6.4.4 Interpreting Results in High Dimensions
6.5 Lab 1: Subset Selection Methods
6.5.1 Best Subset Selection
6.5.2 Forward and Backward Stepwise Selection
6.5.3 Choosing Among Models Using the Validation Set Approach and Cross-Validation

10.5 Lab 2: Clustering
10.5.1 K-Means Clustering
10.5.2 Hierarchical Clustering
10.6 Lab 3: NCI60 Data Example
10.6.1 PCA on the NCI60 Data
10.6.2 Clustering the Observations of the NCI60 Data
10.7 Exercises

1 Introduction

An Overview of Statistical Learning

Statistical learning refers to a vast set of tools for understanding data. These tools can be classified as supervised or unsupervised. Broadly speaking, supervised statistical learning involves building a statistical model for predicting, or estimating, an output based on one or more inputs. Problems of this nature occur in fields as diverse as business, medicine, astrophysics, and public policy. With unsupervised statistical learning, there are inputs but no supervising output; nevertheless we can learn relationships and structure from such data. To provide an illustration of some applications of statistical learning, we briefly discuss three real-world data sets that are considered in this book.

Wage Data

In this application (which we refer to as the Wage data set throughout this book), we examine a number of factors that relate to wages for a group of males from the Atlantic region of the United States. In particular, we wish to understand the association between an employee’s age and education, as well as the calendar year, on his wage. Consider, for example, the left-hand panel of Figure 1.1, which displays wage versus age for each of the individuals in the data set. There is evidence that wage increases with age but then decreases again after approximately age 60. The blue line, which provides an estimate of the average wage for a given age, makes this trend clearer.

G. James et al., An Introduction to Statistical Learning: with Applications in R, Springer Texts in Statistics, DOI 10.1007/978-1-4614-7138-7_1,

FIGURE 1.3. We fit a quadratic discriminant analysis model to the subset of the Smarket data corresponding to the 2001–2004 time period, and predicted the probability of a stock market decrease using the 2005 data. On average, the predicted probability of decrease is higher for the days in which the market does decrease. Based on these results, we are able to correctly predict the direction of movement in the market 60% of the time.

Gene Expression Data

The previous two applications illustrate data sets with both input and output variables. However, another important class of problems involves situations in which we only observe input variables, with no corresponding output. For example, in a marketing setting, we might have demographic information for a number of current or potential customers. We may wish to understand which types of customers are similar to each other by grouping individuals according to their observed characteristics. This is known as a clustering problem. Unlike in the previous examples, here we are not trying to predict an output variable.

We devote Chapter 10 to a discussion of statistical learning methods for problems in which no natural output variable is available. We consider the NCI60 data set, which consists of 6,830 gene expression measurements for each of 64 cancer cell lines. Instead of predicting a particular output variable, we are interested in determining whether there are groups, or clusters, among the cell lines based on their gene expression measurements. This is a difficult question to address, in part because there are thousands of gene expression measurements per cell line, making it hard to visualize the data.

The left-hand panel of Figure 1.4 addresses this problem by representing each of the 64 cell lines using just two numbers, $Z_1$ and $Z_2$. These are the first two principal components of the data, which summarize the 6,830 expression measurements for each cell line down to two numbers or dimensions. While it is likely that this dimension reduction has resulted in

learning was starting to explode. ESL provided one of the first accessible and comprehensive introductions to the topic.

Since ESL was first published, the field of statistical learning has continued to flourish. The field’s expansion has taken two forms. The most obvious growth has involved the development of new and improved statistical learning approaches aimed at answering a range of scientific questions across a number of fields. However, the field of statistical learning has also expanded its audience. In the 1990s, increases in computational power generated a surge of interest in the field from non-statisticians who were eager to use cutting-edge statistical tools to analyze their data. Unfortunately, the highly technical nature of these approaches meant that the user community remained primarily restricted to experts in statistics, computer science, and related fields with the training (and time) to understand and implement them.

In recent years, new and improved software packages have significantly eased the implementation burden for many statistical learning methods. At the same time, there has been growing recognition across a number of fields, from business to health care to genetics to the social sciences and beyond, that statistical learning is a powerful tool with important practical applications. As a result, the field has moved from one of primarily academic interest to a mainstream discipline, with an enormous potential audience. This trend will surely continue with the increasing availability of enormous quantities of data and the software to analyze it.

The purpose of An Introduction to Statistical Learning (ISL) is to facilitate the transition of statistical learning from an academic to a mainstream field. ISL is not intended to replace ESL, which is a far more comprehensive text both in terms of the number of approaches considered and the depth to which they are explored. We consider ESL to be an important companion for professionals (with graduate degrees in statistics, machine learning, or related fields) who need to understand the technical details behind statistical learning approaches. However, the community of users of statistical learning techniques has expanded to include individuals with a wider range of interests and backgrounds. Therefore, we believe that there is now a place for a less technical and more accessible version of ESL.

In teaching these topics over the years, we have discovered that they are of interest to master’s and PhD students in fields as disparate as business administration, biology, and computer science, as well as to quantitatively-oriented upper-division undergraduates. It is important for this diverse group to be able to understand the models, intuitions, and strengths and weaknesses of the various approaches. But for this audience, many of the technical details behind statistical learning methods, such as optimization algorithms and theoretical properties, are not of primary interest. We believe that these students do not need a deep understanding of these aspects in order to become informed users of the various methodologies, and in order to contribute to their chosen fields through the use of statistical learning tools.

ISL is based on the following four premises.

1. Many statistical learning methods are relevant and useful in a wide range of academic and non-academic disciplines, beyond just the statistical sciences. We believe that many contemporary statistical learning procedures should, and will, become as widely available and used as is currently the case for classical methods such as linear regression. As a result, rather than attempting to consider every possible approach (an impossible task), we have concentrated on presenting the methods that we believe are most widely applicable.

2. Statistical learning should not be viewed as a series of black boxes. No single approach will perform well in all possible applications. Without understanding all of the cogs inside the box, or the interaction between those cogs, it is impossible to select the best box. Hence, we have attempted to carefully describe the model, intuition, assumptions, and trade-offs behind each of the methods that we consider.

3. While it is important to know what job is performed by each cog, it is not necessary to have the skills to construct the machine inside the box! Thus, we have minimized discussion of technical details related to fitting procedures and theoretical properties. We assume that the reader is comfortable with basic mathematical concepts, but we do not assume a graduate degree in the mathematical sciences. For instance, we have almost completely avoided the use of matrix algebra, and it is possible to understand the entire book without a detailed knowledge of matrices and vectors.

4. We presume that the reader is interested in applying statistical learning methods to real-world problems. In order to facilitate this, as well as to motivate the techniques discussed, we have devoted a section within each chapter to R computer labs. In each lab, we walk the reader through a realistic application of the methods considered in that chapter. When we have taught this material in our courses, we have allocated roughly one-third of classroom time to working through the labs, and we have found them to be extremely useful. Many of the less computationally-oriented students who were initially intimidated by R’s command level interface got the hang of things over the course of the quarter or semester. We have used R because it is freely available and is powerful enough to implement all of the methods discussed in the book. It also has optional packages that can be downloaded to implement literally thousands of additional methods. Most importantly, R is the language of choice for academic statisticians, and new approaches often become available in R years before they are implemented in commercial packages. However, the labs in ISL are self-contained, and can be skipped if the reader wishes to use a different software package or does not wish to apply the methods discussed to real-world problems.

Who Should Read This Book?

This book is intended for anyone who is interested in using modern statistical methods for modeling and prediction from data. This group includes scientists, engineers, data analysts, or quants, but also less technical individuals with degrees in non-quantitative fields such as the social sciences or business. We expect that the reader will have had at least one elementary course in statistics. Background in linear regression is also useful, though not required, since we review the key concepts behind linear regression in Chapter 3. The mathematical level of this book is modest, and a detailed knowledge of matrix operations is not required. This book provides an introduction to the statistical programming language R. Previous exposure to a programming language, such as MATLAB or Python, is useful but not required.

We have successfully taught material at this level to master’s and PhD students in business, computer science, biology, earth sciences, psychology, and many other areas of the physical and social sciences. This book could also be appropriate for advanced undergraduates who have already taken a course on linear regression. In the context of a more mathematically rigorous course in which ESL serves as the primary textbook, ISL could be used as a supplementary text for teaching computational aspects of the various approaches.

Notation and Simple Matrix Algebra

Choosing notation for a textbook is always a difficult task. For the most part we adopt the same notational conventions as ESL.

We will use n to represent the number of distinct data points, or observations, in our sample. We will let p denote the number of variables that are available for use in making predictions. For example, the Wage data set consists of 12 variables for 3,000 people, so we have n = 3,000 observations and p = 12 variables (such as year, age, wage, and more). Note that throughout this book, we indicate variable names using colored font: VariableName. In some examples, p might be quite large, such as on the order of thousands or even millions; this situation arises quite often, for example, in the analysis of modern biological data or web-based advertising data.
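To make the roles of n and p concrete, here is a minimal Python sketch. The values below are made up for illustration and are not taken from the Wage data set; the names `variable_names` and `observations` are our own assumptions.

```python
# Hypothetical toy data: each inner list is one observation (a row),
# and each position within it is one variable (a column).
variable_names = ["year", "age", "wage"]
observations = [
    [2006, 18, 75.0],
    [2004, 24, 70.5],
    [2003, 45, 131.0],
    [2003, 43, 154.7],
]

n = len(observations)      # number of observations (rows)
p = len(variable_names)    # number of variables (columns)

# Every observation must supply a value for each of the p variables.
assert all(len(row) == p for row in observations)

print(f"n = {n}, p = {p}")
```

With four rows and three named variables, this toy sample has n = 4 and p = 3.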

of A and B is denoted AB. The (i, j)th element of AB is computed by multiplying each element of the ith row of A by the corresponding element of the jth column of B. That is, $(\mathbf{AB})_{ij} = \sum_{k=1}^{d} a_{ik} b_{kj}$. As an example, consider

Note that this operation produces an r × s matrix. It is only possible to compute AB if the number of columns of A is the same as the number of rows of B.
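The element-wise rule for the product can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the helper `matmul` and the matrices below are hypothetical and are not the example from the text.

```python
def matmul(A, B):
    """Multiply an r x d matrix A by a d x s matrix B (lists of rows)."""
    r, d = len(A), len(A[0])
    if len(B) != d:
        raise ValueError("number of columns of A must equal number of rows of B")
    s = len(B[0])
    # (AB)_ij is the sum over k of a_ik * b_kj.
    return [
        [sum(A[i][k] * B[k][j] for k in range(d)) for j in range(s)]
        for i in range(r)
    ]


# Illustrative matrices (assumed values).
A = [[1, 2, 3],
     [4, 5, 6]]    # 2 x 3, so r = 2 and d = 3
B = [[1, 0],
     [0, 1],
     [2, 2]]       # 3 x 2, so d = 3 and s = 2

AB = matmul(A, B)  # the result is r x s = 2 x 2
print(AB)
```

Swapping in a B with a different number of rows raises the `ValueError`, which mirrors the requirement that the columns of A match the rows of B.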

Organization of This Book

Chapter 2 introduces the basic terminology and concepts behind statistical learning. This chapter also presents the K-nearest neighbor classifier, a very simple method that works surprisingly well on many problems. Chapters 3 and 4 cover classical linear methods for regression and classification. In particular, Chapter 3 reviews linear regression, the fundamental starting point for all regression methods. In Chapter 4 we discuss two of the most important classical classification methods, logistic regression and linear discriminant analysis.

A central problem in all statistical learning situations involves choosing the best method for a given application. Hence, in Chapter 5 we introduce cross-validation and the bootstrap, which can be used to estimate the accuracy of a number of different methods in order to choose the best one.

Much of the recent research in statistical learning has concentrated on non-linear methods. However, linear methods often have advantages over their non-linear competitors in terms of interpretability and sometimes also accuracy. Hence, in Chapter 6 we consider a host of linear methods, both classical and more modern, which offer potential improvements over standard linear regression. These include stepwise selection, ridge regression, principal components regression, partial least squares, and the lasso.

The remaining chapters move into the world of non-linear statistical learning. We first introduce in Chapter 7 a number of non-linear methods that work well for problems with a single input variable. We then show how these methods can be used to fit non-linear additive models for which there is more than one input. In Chapter 8, we investigate tree-based methods, including bagging, boosting, and random forests. Support vector machines, a set of approaches for performing both linear and non-linear classification,

Name       Description
Auto       Gas mileage, horsepower, and other information for cars.
Boston     Housing values and other information about Boston suburbs.
Caravan    Information about individuals offered caravan insurance.
Carseats   Information about car seat sales in 400 stores.
College    Demographic characteristics, tuition, and more for USA colleges.
Default    Customer default records for a credit card company.
Hitters    Records and salaries for baseball players.
Khan       Gene expression measurements for four cancer types.
NCI60      Gene expression measurements for 64 cancer cell lines.
OJ         Sales information for Citrus Hill and Minute Maid orange juice.
Portfolio  Past values of financial assets, for use in portfolio allocation.
Smarket    Daily percentage returns for S&P 500 over a 5-year period.
USArrests  Crime statistics per 100,000 residents in 50 states of USA.
Wage       Income survey data for males in central Atlantic region of USA.
Weekly     1,089 weekly stock market returns for 21 years.

TABLE 1.1. A list of data sets needed to perform the labs and exercises in this textbook. All data sets are available in the ISLR library, with the exception of Boston (part of MASS) and USArrests (part of the base R distribution).

It contains a number of resources, including the R package associated with this book, and some additional data sets.

Acknowledgements

A few of the plots in this book were taken from ESL: Figures 6.7, 8.3, and 10.12. All other plots are new to this book.

2 Statistical Learning

2.1 What Is Statistical Learning?

In order to motivate our study of statistical learning, we begin with a simple example. Suppose that we are statistical consultants hired by a client to provide advice on how to improve sales of a particular product. The Advertising data set consists of the sales of that product in 200 different markets, along with advertising budgets for the product in each of those markets for three different media: TV, radio, and newspaper. The data are displayed in Figure 2.1. It is not possible for our client to directly increase sales of the product. On the other hand, they can control the advertising expenditure in each of the three media. Therefore, if we determine that there is an association between advertising and sales, then we can instruct our client to adjust advertising budgets, thereby indirectly increasing sales. In other words, our goal is to develop an accurate model that can be used to predict sales on the basis of the three media budgets.

In this setting, the advertising budgets are input variables while sales is an output variable. The input variables are typically denoted using the symbol X, with a subscript to distinguish them. So $X_1$ might be the TV budget, $X_2$ the radio budget, and $X_3$ the newspaper budget. The inputs go by different names, such as predictors, independent variables, features, or sometimes just variables. The output variable (in this case, sales) is often called the response or dependent variable, and is typically denoted using the symbol Y. Throughout this book, we will use all of these terms interchangeably.

G. James et al., An Introduction to Statistical Learning: with Applications in R, Springer Texts in Statistics, DOI 10.1007/978-1-4614-7138-7_2,

FIGURE 2.1. The Advertising data set. The plot displays sales, in thousands of units, as a function of TV, radio, and newspaper budgets, in thousands of dollars, for 200 different markets. In each plot we show the simple least squares fit of sales to that variable, as described in Chapter 3. In other words, each blue line represents a simple model that can be used to predict sales using TV, radio, and newspaper, respectively.
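A simple least squares fit of a response to a single predictor can be sketched as follows. This is a minimal illustration using the standard closed-form solution for a straight line; the function names and the budget and sales values are hypothetical and are not the Advertising data.

```python
def simple_least_squares(x, y):
    """Return (intercept, slope) of the line minimizing the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope = sum of cross-deviations / sum of squared deviations of x.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Made-up budgets (thousands of dollars) and sales (thousands of units).
tv_budget = [10.0, 50.0, 100.0, 150.0, 200.0]
sales = [5.0, 7.0, 9.5, 12.0, 14.5]

b0, b1 = simple_least_squares(tv_budget, sales)


def predict(budget):
    """Predicted sales for a given budget under the fitted line."""
    return b0 + b1 * budget


print(f"fitted line: sales = {b0:.3f} + {b1:.4f} * budget")
```

Each blue line in the figure plays the role of `predict` here: one fitted line per medium, each using that medium's budget as the only predictor.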

More generally, suppose that we observe a quantitative response $Y$ and $p$ different predictors, $X_1, X_2, \ldots, X_p$. We assume that there is some relationship between $Y$ and $X = (X_1, X_2, \ldots, X_p)$, which can be written in the very general form

$$Y = f(X) + \epsilon.$$

Here $f$ is some fixed but unknown function of $X_1, \ldots, X_p$, and $\epsilon$ is a random error term, which is independent of $X$ and has mean zero. In this formulation, $f$ represents the systematic information that $X$ provides about $Y$.

As another example, consider the left-hand panel of Figure 2.2, a plot of income versus years of education for 30 individuals in the Income data set. The plot suggests that one might be able to predict income using years of education. However, the function $f$ that connects the input variable to the output variable is in general unknown. In this situation one must estimate $f$ based on the observed points. Since Income is a simulated data set, $f$ is known and is shown by the blue curve in the right-hand panel of Figure 2.2. The vertical lines represent the error terms $\epsilon$. We note that some of the 30 observations lie above the blue curve and some lie below it; overall, the errors have approximately mean zero.
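The idea of a simulated data set with a known f can be sketched in Python. Everything specific below is an assumption made for illustration: the linear form of `f`, the range of the input, and the noise standard deviation are our own choices and do not describe how the Income data were generated.

```python
import random


def f(years_of_education):
    """Hypothetical 'true' function; for real data f would be unknown."""
    return 20.0 + 3.0 * years_of_education


random.seed(0)  # fixed seed so the sketch is reproducible

n = 30
x = [random.uniform(10, 22) for _ in range(n)]
eps = [random.gauss(0, 5) for _ in range(n)]    # mean-zero errors, drawn independently of x
y = [f(xi) + ei for xi, ei in zip(x, eps)]      # Y = f(X) + error

# Because f is known here, each error can be recovered as y - f(x).
errors = [yi - f(xi) for xi, yi in zip(x, y)]
mean_error = sum(errors) / n
above = sum(e > 0 for e in errors)

print(f"mean error = {mean_error:.2f}; {above} points above f, {n - above} below")
```

The recovered errors correspond to the vertical lines in the figure: some are positive, some negative, and their average should be close to zero for a sample of this size.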

In general, the function $f$ may involve more than one input variable. In Figure 2.3 we plot income as a function of years of education and seniority. Here $f$ is a two-dimensional surface that must be estimated based on the observed data.