Statistical Thinking from Scratch
A Primer for Scientists
M. D. EDGE

Great Clarendon Street, Oxford, OX2 6DP, United Kingdom

Oxford University Press is a department of the University of Oxford. It furthers the University’s objective of excellence in research, scholarship, and education by publishing worldwide. Oxford is a registered trade mark of Oxford University Press in the UK and in certain other countries

© M. D. Edge 2019

The moral rights of the author have been asserted

First Edition published in 2019
Impression: 1

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press, or as expressly permitted by law, by licence or under terms agreed with the appropriate reprographics rights organization. Enquiries concerning reproduction outside the scope of the above should be sent to the Rights Department, Oxford University Press, at the address above

You must not circulate this work in any other form and you must impose this same condition on any acquirer

Published in the United States of America by Oxford University Press
198 Madison Avenue, New York, NY 10016, United States of America

British Library Cataloguing in Publication Data
Data available

Library of Congress Control Number: 2019934651
ISBN 978–0–19–882762–7 (hbk.)
ISBN 978–0–19–882763–4 (pbk.)
DOI: 10.1093/oso/9780198827627.001.0001

Printed and bound by CPI Group (UK) Ltd, Croydon, CR0 4YY

Links to third party websites are provided by Oxford in good faith and for information only. Oxford disclaims any responsibility for the materials contained in any third party website referenced in this work.
3.1De
5.1 Expected values and the law of large numbers
5.3 Joint distributions, covariance, and correlation
5.4 [Optional section] Conditional distribution, expectation, and variance
5.5 The central limit theorem
5.6 A probabilistic model for simple linear regression
6.6 [Optional section] Statistical decision theory and risk
7.1 Standard error
7.2 Confidence intervals
7.3 Frequentist inference I: null hypotheses, test statistics, and p values
7.4 Frequentist inference II: alternative hypotheses and the rejection framework
7.5 [Optional section] Connecting hypothesis tests and confidence intervals
7.6 NHST and the abuse of tests
7.6.1 Lack of replication
7.6.4 Identification of scientific hypothesis with a statistical hypothesis
7.6.5 Neglect of other goals, such as estimation and prediction
7.6.6 A degraded intellectual culture
7.6.7 Evaluating significance tests in light of NHST
7.7 Frequentist inference III: power
7.8 Putting it together: What happens when the sample size increases?
7.9 Chapter summary
7.10 Further reading
8 Semiparametric estimation and inference
8.1 Semiparametric point estimation using the method of moments
8.1.1 Plug-in estimators
8.1.2 The method of moments
8.2 Semiparametric interval estimation using the bootstrap
Box 8-1: Bootstrap pivotal intervals
8.3 Semiparametric hypothesis testing using permutation tests
8.4 Conclusion
9 Parametric estimation and inference
Box 9-1: Logarithms
9.1 Parametric estimation using maximum likelihood
9.1.1 Maximum-likelihood estimation for simple linear regression
9.2 Parametric interval estimation: the direct approach and Fisher information
9.2.1 The direct approach
9.2.2 [Optional subsection] The Fisher information approach
9.3 Parametric hypothesis testing using the Wald test
9.3.1 The Wald test
9.4 [Optional section] Parametric hypothesis testing using the likelihood-ratio test
9.5 Chapter summary
10 Bayesian estimation and inference
10.1 How to choose a prior distribution?
10.2 The unscaled posterior, conjugacy, and sampling from the posterior
10.3 Bayesian point estimation using Bayes estimators
10.4 Bayesian interval estimation using credible intervals
10.5 [Optional section] Bayesian “hypothesis testing” using Bayes factors
10.6 Conclusion: Bayesian vs. frequentist methods
10.7 Chapter summary
10.8 Further reading
B.2 Data structures and data extraction
B.2.3 Data frames
B.3
B.3.1 Getting help
B.3.2 Data creation
B.3.3 Variable information
B.3.4 Math
B.3.5 Plotting
B.3.6 Optimization and model
B.3.7 Distributions
B.3.8 Programming
B.3.9 Data input and output (I/O)
Acknowledgments

When one thinks about probability and random processes, one’s mind sometimes wanders toward the contingencies inherent in life. Almost any event in one’s biography might have happened in a slightly different way, and the unrealized outcomes might have branched into a much-altered story. To give one example out of many from my own life, I took a class in college that shaped my career trajectory, without which this book would not have been written. The person who told me the class would be offered was a friend of my roommate’s who happened to visit the day before the application was due, a person I had not met before and haven’t seen since. Had my roommate’s friend not visited, I would have gone on to live another life that would have followed an autumn term in which I took a different class. Immediately my roommate, his friend, the class instructors, and all the people who set them on a course in which they would interact with me are implicated in the writing of this book. There are thousands of other such events that influenced this book in shallow or deep ways, and each of those events was itself conditional on a chain of contingencies. Taking this view, it is not mystical to say that billions of people, including the living and their ancestors, are connected to this book through a web of interactions. The purpose of an acknowledgments section, then, is not to separate contributors from non-contributors, but to produce a list of the few people whose roles appear most salient according to the faulty memory of the author.
My editors at Oxford University Press, Ian Sherman and Bethany Kershaw, have guided me through the process of improving and finalizing this book. I have always felt that the book has been safe in their hands. Mac Clarke’s perceptive copyedits rescued me in more than a few places. The SPi typesetting team ably handled production, managed by Saranya Jayakumar. Alex Walker designed the cover, which has won me many unearned compliments.
Tim Campellone, Paul Connor, Audun Dahl, Carolyn Fredericks, Arbel Harpak, Aaron Hirsh, Susan Johnston, Emily Josephs, Jaehee Kim, Joe Larkin, Jeff Long, Sandy Lwi, Koshlan Mayer-Blackwell, Tom Pamukcu, Ben Roberts, Maimon Rose, William Ryan, Rav Suri, Mike Whitlock, Rasmus Winther, Ye Xia, Alisa Zhao, several anonymous reviewers, and my Psychology 102 students at UC Berkeley all read sections of the book and provided a combination of encouragement and constructive comments. The book is vastly improved because of their input. Arbel Harpak, Aaron Hirsh, and Sandy Lwi warrant special mention for their detailed comments on large portions of the text. My wonderful students also deserve celebration for their bravery in being the first people to rely on this book as a text, as does Rav Suri for being the first instructor (after me) to adopt it for a course.
This book would have remained merely an idea for a book if it were not for Aaron Hirsh and Ben Roberts. Aaron and Ben, along with my wife, Isabel Edge, convinced me that I might be able to write this book and that it was not entirely ridiculous to try. They were also both instrumental in shaping the content and framing, and they guided me as I navigated the possibilities for publication. Along similar lines, Melanie Mitchell helped me arrive at the scheme that kept me writing consistently for years: requiring 1,000 words of new text from myself each week, with the penalty for dereliction a $5 donation to a pro-astrology outfit.
I would not have had the idea to write this book if I had not been able to train as both an empirical researcher and as a developer of statistical methods. My mentors in these fields, including Tricia Clement, Graham Coop, Emily Drabant, Russ Fernald, James Gross, Sheri Johnson, Viveka Ramel, and Noah Rosenberg, made this possible. My interests were seeded by wonderful teachers, including Cari Kaufman, whose introductory class in mathematical statistics gave me a feeling of empowerment that I have wanted to share ever since, Ron Peet, who did the most to shape my interest in math and gave me my first exposure to statistics, and Bin Yu, who taught me more than I thought possible in one semester about working with data. Gary “GH” Hanel taught me calculus, and I have cribbed his sticky phrases conveying the rules for differentiating and integrating polynomials (“out in front and down by one,” and “up and under”) after finding that they (and several of his other memory aids) always stay with me, no matter how much calculus I forget and relearn.
Finally, my family has supported me during the process of writing (and of preparing to write). My parents, Chloe and Don, have always supported me in my goals for learning and growth. Isabel, my wife, has helped me believe that my efforts are worthwhile and has been a listener and counselor over years of writing. Maceo, our three-year-old, has not provided any input that I have incorporated into the text, but we like him very much nonetheless.
Prelude

Practitioners of every empirical discipline must learn to analyze data. Most students’ first, and sometimes only, training in data analysis comes from a course offered by their home department. In such courses, the first few weeks are typically spent teaching skills for reading data displays and summarizing data. The remainder of the course is spent discussing a sequence of statistical tests that are relevant for the field in which practitioners are being educated: a course in a psychology department might focus on t-tests and analysis of variance (ANOVA); an economics course might develop linear regression and some extensions aimed at causal inference; future physicians might learn about survival analysis and Cox models. There are at least three advantages to this approach. First, given that students may take only one course in data analysis, it is reasonable to teach the skills they need to be functional data analysts as quickly as possible. Second, courses that focus on procedures useful in the students’ major area of study allow instructors to pick relevant examples, encouraging interest. Third, courses like these can teach data analysis while requiring no mathematics beyond arithmetic.
At the same time, the introduction of test after test in the second part of the course comes with major drawbacks. First, as instructors frequently hear from their students, test-after-test litanies can be difficult to understand. The material that conceptually unites the procedures has been squeezed into a short time. As a result, from the students’ view, each procedure is a subject unto itself, and it is difficult to develop an integrated view of statistical thinking. Second, for motivated students, standard introductory sequences can give the impression that, though data may be exciting, statistics is uninteresting. For the student, it can seem as though to master statistics is to memorize a vast tree of assumptions and hypotheses, allowing one to draw the appropriate test from a deck when certain conditions are met. Students who learn this style of data analysis cannot be blamed for failing to see that the discipline of statistics is stimulating or even that it is intellectually rooted. Third, the ability to apply a few well-chosen procedures may allow the student to become a functional researcher, but it is an insufficient foundation for future growth as a data analyst. We have taught a set of recipes (and versatile, germane ones), but we have not trained a chef. When new procedures arise, it will be no easier for our student to learn them than it was for him to learn his first set of procedures. That is to say it will require real labor, and success will depend on an exposition written by someone who can translate statistical writing into the language of his field.
Most university statistics departments train future statisticians differently. First, they require their students to take substantial college-level math before beginning their courses in statistics. At minimum, calculus is required, usually with multivariable calculus and linear algebra, and perhaps a course in real analysis. After meeting the mathematical requirements, future statisticians take an entire course, or two, dedicated strictly to probability theory, followed by a course in mathematical statistics. After at least a year of university-level mathematical preparation and a year of statistics courses, the future statistician has never been asked to apply, and perhaps never even heard of, procedures the future psychologist, for example, has applied midway through an introductory course.
At this stage in her development, the well-trained statistics student may not have applied three-way ANOVA, for example, but she deeply understands the techniques she does know, and she sees the interest and coherence of statistics as a discipline. Moreover, should she need to use a three-way ANOVA, she will be able to learn it quickly with little or no outside assistance.
How can the budding researcher (pressed for time, perhaps minimally trained in mathematics, and needing to apply and interpret a variety of statistical techniques) gain something of the comfortable understanding and versatility of the statistician? This book proposes that the research worker ought to learn at least one procedure in depth, “from scratch.” This exercise will impart an idea of how statistical procedures are designed, a flavor for the philosophical positions one implicitly assumes when applying statistics in research, and a clearer sense of the strengths and weaknesses of statistical techniques.
Though it cannot turn a non-statistician into a statistician, this book will provide a glimpse of the conceptual framework in which statisticians are trained, adding depth and interest to whatever techniques the reader already knows how to apply. It is perhaps most naturally used as a main or supplementary text in an advanced introductory course (for beginning graduate students or advanced undergraduates, for example) or for a second course in data analysis. I assume the reader already has an interest in understanding the reasoning underlying statistical procedures, a sense of the importance of learning from data, and some familiarity with basic data displays and descriptive statistics. Prior exposure to calculus and programming is helpful but not required; the most relevant concepts are introduced briefly in Chapter 2 and in Appendices A and B. Probability theory is taught as needed and not assumed. In some departments, the book would be suitable for a first course, but the mathematical demands are high enough that instructors may find they prefer to use the book with determined students who are already invested in empirical research. Another possible adjustment is to split the course into two terms, with the Interlude serving as a prelude to the second course, supplementing with data examples from the students’ field. The book is also useful as a self-study guide for working researchers seeking to improve their understanding of the techniques they use daily, or for professionals who must interpret results from research studies as part of their work.
There are many excellent statistics textbooks available for non-statisticians, so any new book must make plain how it differs from others. This book has a set of features that are not universal and that in combination might be unique.
First, this book focuses on instruction in exactly one statistical procedure, simple linear regression. The idea is that by learning one procedure from scratch, considering the entire conceptual framework underlying estimation and inference in this one context, one can gain tools, understanding, and intuition that will apply to other contexts. In an era of big data, we are keeping the data small (two variables, and in the dataset we use most often, only 11 observations) and thinking hard about it. In saying that we work, “from scratch,” I mean that we attempt to take little for granted, exploring as many fundamental questions as possible with a combination of math, simulations, thought experiments, and examples. I chose simple linear regression as the procedure to analyze both because it is mathematically simple and because many of the most widely applied statistical techniques (including t-tests, multiple regression, and ANOVA, as well as machine-learning methods like lasso and ridge regression) can be viewed as special cases or generalizations of simple linear regression. A few of these generalizations are sketched or exemplified in the final chapter.
A second feature is the mathematical level of the book, which is gentler than most texts intended for statisticians but more demanding than most introductory statistics texts for non-statisticians. One goal is to serve as a bridge for students who have realized that they need to learn to read mathematical content in statistics. Learning to read such content unlocks access to advanced textbooks and courses, and it also makes it easier to talk with statisticians when advice is needed. A second reason for including as much math as I have is to increase the reader’s interest in statistics: many of the richest ideas in statistics are expressed mathematically, and if one does not engage with some math, it is too easy for a statistics course to become a list of prescriptions. The main mathematical requirement is comfort with algebra (or, at least, neglected algebra skills that can be dusted off and put back into service). The book requires some familiarity with the main ideas of calculus, which are introduced briefly in Appendix A, but it does not require much facility with calculus problems. The mathematical demands increase as the book progresses (up to about Chapter 9), on the theory that the reader gains skills and confidence with each chapter. Nearly all equations are accompanied by explanatory prose.
Third, some of the problems in this book are an integral part of the text. The majority of the included problems are intended to be earnestly attempted by the reader; exceptions are marked as optional. Every solution is included, either in the back of the book or at the book’s GitHub repository, github.com/mdedge/stfs/, or companion site, www.oup.co.uk/companion/edge.[1] The problems are interspersed through the text itself and are part of the exposition, providing practice, proofs of key principles, important corollaries, and previews of topics that come later. Many of the problems are difficult, and students should not feel they are failing if they find them tough; the process of making an attempt and then studying the solutions is more important than getting the correct answers. A good approach for students is to spend roughly 75% of their reading time attempting the exercises, referring to the solution for next steps when stuck for more than a few minutes.
Fourth, several of the problems are computational exercises in the free statistical software package R. Many of these problems involve the analysis of simulated data. There are two reasons for this. First, R is the statistical language of choice among statisticians. As of this writing, it is the most versatile statistical software available, and aspiring data analysts ought to gain some comfort in it. Secondly, it is possible to answer difficult statistical questions in R using simulations. When one is practicing statistics, one often encounters questions that are not readily answered by the mathematics or texts with which one is familiar. When this happens, simulation often provides a serviceable resolution. Readers of this book will use simulation in ways that suggest useful answers. All R code to conduct the demonstrations in the text, complete the exercises, and make the figures is available at the book’s GitHub repository, github.com/mdedge/stfs/, and there is also an R package including functions specific to the book (installation instructions come at the end of Chapter 2, in Exercise Set 2-2, Problem 3).
The chapters are intended to be read in sequence. Jointly, they introduce two important uses of statistics: estimation, which is to use data to guess the values of parameters that describe a data-generating process, and inference, which is to test hypotheses about the processes that might have generated an observed dataset. (A third key application, prediction, is addressed briefly in the Postlude.) Both estimation and inference rely on the idea that the observed data are representative of many other possible datasets that might have been generated by the same process. Statisticians formalize this idea using probability theory, and we will study probability before turning to estimation and inference.

[1] Instructors can obtain separate questions I have used for homework and exams when teaching from this book. Email the author to request additional questions.
Chapter 1 presents some motivating questions. Chapter 2 is a tutorial on the statistical software package R. (Additional background material is in Appendices A and B, with Appendix A devoted to calculus and Appendix B to computer programming and R.) Chapter 3 introduces the idea of summarizing data by drawing a line. Chapters 4 and 5 cover probability. An Interlude chapter marks the traditional divide between probability theory and statistics. Chapter 6 covers estimation, immediately putting the fourth and fifth chapters to work. Chapter 7 covers inference: what kinds of statements can we make about the world given a model (Chapters 4 and 5) and a sample? Chapters 8, 9, and 10 describe three broad approaches to estimation and inference. These three perspectives share goals and share the framework of probability theory, but they make different assumptions. In the Postlude, I discuss some extensions of simple linear regression and point out a few possible directions for future learning.
CHAPTER 1
Encountering data

Key terms: Computation, Data, Statistics, Simple linear regression, R.

If we take in our hand any volume, ... let us ask, Does it contain any abstract reasoning concerning quantity or number? No. Does it contain any experimental reasoning concerning matter of fact and existence? No. Commit it then to the flames: for it can contain nothing but sophistry and illusion.

David Hume, An Enquiry Concerning Human Understanding (1748)
In the passage quoted here, Hume claims that there are two types of arguments we should consider accepting: mathematical reasoning and empirical reasoning, that is, reasoning based on observations about the world. The specifics of Hume’s claims are beyond our scope, but users of statistics are safe from followers of Hume’s dictum. This book will contain some reasoning concerning number, and it will contain examples of “experimental” reasoning about observed facts. Thus, a good Humean need not commit it to the flames.
Hume’s statement gives us a way of thinking about the subject of statistics. We want to make claims about the world on the basis of observations. For example, we might want to know whether a corollary of the wave theory of light matches the results of an experiment. We might ask whether a new therapy for diabetes is effective. We might want to know whether people with college degrees earn more than their peers who do not graduate. Whether we are pursuing physical science, biological science, social science, engineering, medicine, or business, we constantly need answers to questions with empirical content.
But what is Hume’s “reasoning concerning matter of fact”? Collecting data about the world is one thing, but using those data to make conclusions is another. Consider Figure 1-1, which purports to show data on fertilizer consumption and cereal yields in 11 sub-Saharan African countries.[1] In many sub-Saharan African countries, soil nitrogen limits agricultural yield. One way to increase soil nitrogen is to apply fertilizer. Each of the 11 countries in the dataset is represented by a point on the plot. For each point, the x coordinate (that is, the position on the horizontal axis) indicates the country’s fertilizer consumption in a given year. The y coordinate (the position on the vertical axis) represents the country’s yield of cereal grains in that same year. Suppose that on the basis of Figure 1-1, I claim that there is a robust relationship between fertilizer consumption and cereal yield among countries similar to the 11 countries we have sampled. (Note that I have not made claims about any causal relationships that might explain the relationship; I have merely posited that the relationship “exists” in some sense.)

[1] These data are fake (more on their source at the end of the chapter), but the question is real. The data in Figure 1-1 do loosely resemble actual data from some sub-Saharan African countries with low grain yields from 2008 to 2010. For example, in 2010, farmers in Mozambique consumed about 9 kg/hectare of fertilizer, and the cereal yield was about 945 kg/hectare. Actual data are available from the World Bank.
Suppose you disagree. You might counter that the plot isn’t impressive: there aren’t many data, the relationship between the variables strikes you as weak, and you don’t know the source of the data. These counters are at least potentially legitimate. You and I are at an impasse, Dear Reader: I have made a claim based on data, and you have looked at the same data and made a different claim. Without methods for reasoning about data, it is unclear how to make further progress regarding our disagreement.
How can we develop concepts for reasoning from data? The discipline of statistics provides one answer to this question. Statistics takes Hume’s other candidate for non-illusory knowledge, mathematical reasoning, and builds a mathematical framework in which we can set the data.[2] Once we have framed the problem of reasoning about data mathematically, we can make claims by adopting assumptions and then using mathematical reasoning to proceed. The status of the claims we make will usually hinge on the appropriateness of the assumptions we use to get started. As you read about statistical approaches for reasoning about data, consider whether and under what circumstances they are adequate. We will revisit this question in various forms.
In the Prelude, I promised that this book would be only lightly mathematical, yet here, I have proposed that statistics is a way to harness mathematical thinking to reason about data. How will this book help readers strengthen their statistical understanding without engaging in heavy mathematics?
Statistics took shape as a discipline before modern computers were available, with many of the ideas most important to this book appearing in the late nineteenth and early twentieth centuries. Many of the most important statisticians of this era were well-trained mathematicians. With limited computing power but ample mathematical training, they approached the development of their subject mathematically. Today, advances in computing allow those of us with limited mathematical training to answer questions that would be difficult even for a seasoned mathematician to approach directly. We will use computation to answer statistical questions that will not yield to elementary math.

[2] This is not to suggest that strong mathematical reasoning skills are the sole or even most important qualification of a data analyst. Facility with data and computers, subject-area knowledge, scientific acumen, and common sense are all important.

Figure 1-1 Fertilizer consumption and cereal yield in 11 sub-Saharan African countries. (Horizontal axis: fertilizer consumption, kg/hectare; vertical axis: cereal yield.)
1.1 Things to come

Before we begin in earnest, let’s take a moment to anticipate the major topics we’ll consider in the rest of the book, motivated by the data in Figure 1-1. We will be focused on understanding simple linear regression, which entails identifying a line that “fits” the data, passing through the cloud of data points in the figure. Linear regression (including both simple linear regression and its generalization, multiple regression) is perhaps the most widely used method in applied statistics, especially when its special cases are considered, including t-tests, correlation analysis, and analysis of variance (ANOVA). It only takes a few commands to run a simple linear regression analysis in the statistical software R. (A tutorial in R is coming in the next chapter.) The data are stored in an R object called anscombe, and to fit the linear regression model, we run

mod.fit <- lm(y1 ~ x1, data = anscombe)

As will be discussed later, the lm() function fits a linear regression model to a dataset. By “fitting” a regression model, we find a line that “best” fits the data shown in Figure 1-1. You can see the data from Figure 1-1 with the “best fit” line drawn in by typing

plot(anscombe$x1, anscombe$y1)
abline(mod.fit)
The plot with the line drawn in (plus a few improvements to labeling and aesthetics[3]) is shown in Figure 1-2. The plot() function produces a scatterplot, that is, a plot with points located to indicate values of the attributes represented by the x and y axes. The abline() function draws the line implied by the linear regression model. The sense in which this line can be described as the “best” fit is the subject of Chapter 3. As we will see, there are actually many different lines that could be described as “best.” The line in Figure 1-2 is best according to a criterion that has a long history in statistics.

[3] Code for generating all the book’s R figures is available at github.com/mdedge/stfs.

Figure 1-2 The agriculture data from Figure 1-1, with the line of “best” fit from the simple linear regression model.
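The criterion that lm() uses is least squares: of all possible lines, it returns the one with the smallest sum of squared vertical distances between the points and the line. The following is a minimal sketch of that idea, not code from the book. The helper sum_sq_resid() and the competing line are made up for illustration.

```r
# Fit the simple linear regression from the text (anscombe ships with R)
mod.fit <- lm(y1 ~ x1, data = anscombe)

# Hypothetical helper: sum of squared vertical distances between the
# points (x, y) and the line with intercept a and slope b
sum_sq_resid <- function(a, b, x, y) {
  sum((y - (a + b * x))^2)
}

a.hat <- unname(coef(mod.fit)[1])  # fitted intercept
b.hat <- unname(coef(mod.fit)[2])  # fitted slope

# Value of the criterion for the fitted line
ss.fit <- sum_sq_resid(a.hat, b.hat, anscombe$x1, anscombe$y1)

# Value of the criterion for an arbitrary competing line
# (slightly lower intercept, slightly steeper slope)
ss.alt <- sum_sq_resid(a.hat - 0.5, b.hat + 0.05, anscombe$x1, anscombe$y1)

ss.fit < ss.alt  # TRUE: the competing line fits worse by this criterion
```

Any other choice of competing line would give the same verdict under this criterion. A different criterion (for example, the sum of absolute rather than squared distances) could pick a different "best" line.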
The line that’s drawn in Figure 1-2 has an equation, meaning that it can be described as y = a + bx, where a and b are constant numbers. In words, to find the y coordinate of the line at any value x, one starts with a and then adds the product of b and x. The values of a and b for the pictured line are given in the output of the summary() function:

summary(mod.fit)

The output is
Call:
lm(formula = y1 ~ x1, data = anscombe)

Residuals:
     Min       1Q   Median       3Q      Max
-1.92127 -0.45577 -0.04136  0.70941  1.83882

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.0001     1.1247   2.667  0.02573 *
x1            0.5001     0.1179   4.241  0.00217 **

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.237 on 9 degrees of freedom
Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295
F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217
The key part of the output is the regression table, which appears under the heading “Coefficients.” In the first column of the table, labeled “Estimate,” we see the numbers 3.0001 and 0.5001, or roughly 3 and 0.5. These are the values of the intercept, a, and slope, b, of the line in Figure 1-2. The word “estimate” suggests a way of viewing the line in Figure 1-2 that is different from the one suggested earlier. I initially suggested that the line in Figure 1-2 is the one that “best” fits the data in some sense; in other words, that it is a description of the sample. That’s true. The word “estimate” suggests that it is also a guess about some unknown quantity, perhaps about a property of a larger population or process that the sample is supposed to reflect. If we assume that data are generated by a particular process, then we can make claims about the types of data that might result. That is the topic of Chapters 4 and 5, on probability theory. And once we have decided that we really do want to make estimates (that is, to use data to learn the parameters of an assumed underlying process), how should we design procedures for estimation? What properties should these procedures have? That is the subject of Chapter 6.
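To connect the “Estimate” column to the equation y = a + bx, here is a short sketch that pulls the two numbers out of the fitted model and evaluates the line by hand. The value x.new is an arbitrary choice made for illustration, and the variable names are not from the book.

```r
# Refit the model from the text so this example stands alone
mod.fit <- lm(y1 ~ x1, data = anscombe)

# Intercept (a) and slope (b) of the fitted line
a <- unname(coef(mod.fit)["(Intercept)"])
b <- unname(coef(mod.fit)["x1"])

# Height of the line at an arbitrary, illustrative x value
x.new <- 10
y.line <- a + b * x.new

# The same quantity computed by R's predict() function
y.pred <- unname(predict(mod.fit, newdata = data.frame(x1 = x.new)))

c(by.hand = y.line, by.predict = y.pred)  # the two values agree
```

Computing a + b * x by hand and calling predict() are two routes to the same point on the line, which is one way to check that the intercept and slope have been read off correctly.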
Moving rightward in the table, we see a column labeled “Std. Error,” an abbreviation for “standard error.” The standard error is an attempt to quantify the precision of an estimate. In a specific sense developed later, the standard error responds to the question, “If we were to sample another dataset from the same population as this, by about how much might we expect the estimate to vary?” So the 0.1179 in the table suggests that if we drew another sample of 11 points generated by the same underlying process (in the example, perhaps agricultural data from 11 other sub-Saharan African countries), we should not be surprised if the slope of the best-fit line differs by ~0.12 from the current estimate. The attempt to identify the precision of estimates is called “interval estimation,” and it is one of the topics of Chapter 7.
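The “sample another dataset” question can be explored by simulation. The sketch below rests on assumptions that are mine rather than the text’s: the x values are held fixed at those in anscombe, the underlying line is taken to have intercept 3 and slope 0.5, and the scatter around it is normally distributed with standard deviation 1.237 (the residual standard error from the output above). The seed and the number of simulated datasets are arbitrary.

```r
set.seed(1)  # arbitrary seed, so that the simulation is reproducible

x <- anscombe$x1  # keep the 11 observed x values fixed
n.sims <- 2000    # number of simulated datasets (arbitrary)
slopes <- numeric(n.sims)

for (i in seq_len(n.sims)) {
  # Assumed process: a line with intercept 3 and slope 0.5, plus
  # normally distributed noise with standard deviation 1.237
  y.sim <- 3 + 0.5 * x + rnorm(length(x), mean = 0, sd = 1.237)
  # Fit a line to the simulated dataset and store its slope
  slopes[i] <- coef(lm(y.sim ~ x))[2]
}

# Spread of the slope estimates across simulated datasets; under these
# assumptions it should land close to the reported standard error
sd(slopes)
```

Under these assumptions, the standard deviation of the simulated slopes should come out near the 0.1179 in the regression table. Whether the assumptions themselves are reasonable is a separate question, and it is the kind of question the later chapters take up.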
The final column of the regression table is labeled “Pr(>|t|)”; these numbers are called “p values.” Their interpretation is subtle and often botched. Loosely, p values measure the plausibility of the data under a specific hypothesis. The hypotheses being tested here are that the data were actually generated by a process described by a line with intercept 0 (first row) or slope 0 (second row). Low p values, like the ones in the table, suggest that (a) the hypotheses are false, or (b) some other assumption entailed by the hypothesis test is wrong, or (c) an unlikely event occurred. Hypothesis testing is the other subject, besides interval estimation, of Chapter 7.
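The last two columns of the table can be rebuilt from the first two, which shows how they fit together. In the sketch below, each t value is the estimate divided by its standard error (the hypothesized value being 0), and each p value is the two-sided tail probability of a t distribution with 9 degrees of freedom (11 observations minus 2 estimated coefficients). This mirrors how summary() fills in the table for lm() fits; the variable names are illustrative.

```r
# Refit the model from the text so this example stands alone
mod.fit <- lm(y1 ~ x1, data = anscombe)

# The regression table as a numeric matrix
coefs <- summary(mod.fit)$coefficients

est <- coefs[, "Estimate"]
se  <- coefs[, "Std. Error"]

# t value: how many standard errors the estimate lies from 0
t.value <- est / se

# Residual degrees of freedom: 11 observations - 2 coefficients
df <- df.residual(mod.fit)

# Two-sided p value from the t distribution
p.value <- 2 * pt(-abs(t.value), df = df)

# Compare with the columns that summary() reports
cbind(t.value, reported.t = coefs[, "t value"],
      p.value, reported.p = coefs[, "Pr(>|t|)"])
```

Being able to reproduce the numbers is not the same as knowing what they mean; the interpretation depends on assumptions that this calculation takes for granted.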
I have alluded to “underlying” assumptions, and the goal of much of the rest of the book is to illustrate the ways in which such modeling assumptions are involved in statistical analysis. Depending on the assumptions that the data analyst can justify, different sets of statistical procedures become available. In Chapters 8, 9, and 10, we apply different sets of assumptions to the dataset to arrive at different procedures for point estimation, interval estimation, and hypothesis testing. The assumptions underlying the standard regression table produced by lm() are the same as those used in Chapter 9. In the Postlude, we consider ways in which the principles developed in the book apply to statistical analyses of other sorts of datasets, ones that are not a natural fit for simple linear regression analysis.
I’d like to close this chapter with an argument for undergoing such an extended meditation on simple linear regression. After all, empirical researchers are busy, and it’s possible to teach heuristic interpretations of statistics like those in the regression table. Such teaching is quick, and it allows researchers to get to serviceable answers, at least in the easiest cases. Why spend so much time on statistical thinking? Why not outsource the theory to professional statisticians?
The answer is part honey, part vinegar. On the positive side, data analysis is much more fun and interesting when the analyst has a genuine sense of what she is doing. With some understanding of statistical theory, it’s possible to relate scientific claims to their empirical basis, connecting the data to the mathematical framework that justifies the claim. That breeds confidence in researchers, and it also allows for creativity. If you know how the machine works, you can take it apart and repurpose it.
In contrast, rote approaches that rely on heuristics alone can be unsatisfying, anxiety-inducing, creatively stifling, and/or genuinely dangerous. Here is a cautionary anecdote. Suppose you fit a linear model, as we did before, to a new dataset. Whereas before we fit a simple linear regression to the variables y1 and x1 in the anscombe dataset, we now work with variables y3 and x3:
mod.fit2 <- lm(y3 ~ x3, data = anscombe)
summary(mod.fit2)
The resulting regression table includes
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.0025     1.1245   2.670  0.02562 *
x3            0.4997     0.1179   4.239  0.00218 **
Each entry in the table is nearly identical to its counterpart in the table for the earlier model. For a rote data analyst relying on just the model summary, the interpretation of these models would therefore be the same. But look at the data underlying this analysis, shown in Figure 1-3. Whereas the data in Figure 1-2 seem to be randomly scattered around the line, points in Figure 1-3 appear to follow a much more systematic pattern, with the exception of one point that departs from it. Whereas the line in Figure 1-2 seems like an appropriate summary of the data, Figure 1-3 suggests a need for serious inspection. What is with that outlying point? Why are the others arranged in a perfectly straight line? These questions are urgent, and the regression table can’t be interpreted without knowing their answers.4 In other words, relying strictly on an automatic response to the regression table will lead to foolish conclusions.
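The contrast between near-identical tables and very different data can be checked directly. The following is a minimal sketch using the built-in anscombe data; the object names fit1, fit3, and coefs are chosen here for illustration and are not from the text.

```r
# Fit the two regressions discussed in the text.
fit1 <- lm(y1 ~ x1, data = anscombe)
fit3 <- lm(y3 ~ x3, data = anscombe)

# Collect intercepts and slopes side by side; to two decimals they match.
coefs <- rbind(y1_on_x1 = unname(coef(fit1)),
               y3_on_x3 = unname(coef(fit3)))
colnames(coefs) <- c("intercept", "slope")
round(coefs, 2)

# Plot both datasets with their fitted lines; the plots tell them apart.
par(mfrow = c(1, 2))
plot(y1 ~ x1, data = anscombe, main = "y1 vs. x1")
abline(fit1)
plot(y3 ~ x3, data = anscombe, main = "y3 vs. x3")
abline(fit3)

# The observation with the largest residual in the second fit
# is the outlying point.
outlier.row <- unname(which.max(abs(resid(fit3))))
anscombe[outlier.row, c("x3", "y3")]
```

Plotting the data before reading the table is a cheap habit, and here it is the only step that exposes the problem.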
In the remainder of the book, we will provide a basis for a complete interpretation of the numbers in the regression table, of the questions they are trying to answer, and of the ways in which their interpretation depends on the assumptions we are willing to make.
4 These data are drawn from a famous paper by Francis Anscombe, including four fake datasets that give exactly the same regression results but whose plots suggest wildly different interpretations. The paper is “Graphs in statistical analysis” from 1973 in American Statistician.
Figure 1-3 The data underlying the analysis of the variables y3 and x3 in the anscombe dataset.
R and exploratory data analysis
Key terms: Data frame, Exploratory data analysis, for() loop, R, R function, Sample mean, Scatterplot, Vector
I program my home computer
Beam myself into the future.
Kraftwerk, “Home Computer”
Virtually all statistical computations performed for research purposes are carried out using statistical software. In this book, we will use R, which is a programming language designed for statistics and data analysis.
For many students, R is more difficult to learn than most proprietary statistical software. R uses a command-line interface, which means that the user has to write and enter commands rather than use a mouse to select analyses and options from menus.1 Once you become comfortable with R, you will find that it is more powerful than the options that are easier to learn.
Why use R if it is more difficult than some of the alternatives? There are several reasons:
(1) Community: R is the computing lingua franca of professional and academic statisticians and of many data analysts from other fields. There is an active community generating new content and answers to questions.
(2) Adaptability: R users can write packages that add to R’s capabilities. There are thousands of packages available to handle specialized data analysis tasks, to customize graphical displays, to interface with other programs and software, or simply to speed up or ease typical programming tasks. Some of them are wonderful. The adaptability of R means that new statistical techniques are available in R years or even decades before they become available in proprietary packages.
(3) Flexibility: Suppose you want to perform a statistical procedure but modify it slightly. In a proprietary language, this is often difficult: the code used to run the procedure is kept hidden, and the only parameters you are allowed to change are the ones included as options and shown to the user. In R, it is much easier to make changes.
1 There are some graphical user interfaces (GUIs) for R available, such as R Commander. Another way to make R more user-friendly is to use an integrated development environment (IDE), such as RStudio. In this book, I will assume that you are using R without a GUI or IDE, but you are welcome to use one, and RStudio is recommended.
(4) Performance: Compared with many proprietary packages, R is faster and can work with larger datasets. R is not the fastest language available, but it is usually more than fast enough, and when you really need speed, you can program R to interface with faster languages like C++.
(5) Ease of integrating simulation and analysis: In this book, we answer several questions about probability and statistics by simulation. That is, we ask questions like, “What would happen if we used such-and-such estimator on data drawn from such-and-such distribution?” Sometimes, it is possible to answer such questions mathematically, but it is often easier to simulate data of the type we are interested in, apply the technique we are interested in, and see what happens. R provides a framework for carrying out that procedure.
(6) Price: R is free software under the GNU General Public License. The most obvious advantage is that it won’t cost you anything, whereas some of the proprietary packages cost hundreds or thousands of dollars per year for a license. More importantly, R’s community (point 1) is broader than it would otherwise be because R is free.
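To make point 5 concrete, here is a minimal sketch of the simulate, apply, and inspect pattern. The choices are illustrative assumptions and are not taken from the text: the estimator is the sample mean, the data are standard normal, and the names n.sims, samp.size, and samp.means are invented for this example.

```r
set.seed(1)           # make the simulation reproducible

n.sims <- 1000        # number of simulated datasets (illustrative choice)
samp.size <- 25       # observations per dataset (illustrative choice)
samp.means <- numeric(n.sims)   # storage for one estimate per dataset

for (i in 1:n.sims) {
  sim.data <- rnorm(samp.size, mean = 0, sd = 1)  # simulate data
  samp.means[i] <- mean(sim.data)                 # apply the estimator
}

# See what happens: the estimates center near the true mean, 0, and
# their spread is near the theoretical value 1/sqrt(25) = 0.2.
mean(samp.means)
sd(samp.means)
hist(samp.means, xlab = "Sample mean", main = "")
```

Here the mathematical answer is known, which makes it a useful check on the simulation. The same loop works unchanged when the estimator or distribution is one for which the mathematics is hard.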
I hope that you are convinced that learning R is a good idea for anyone with an interest in statistics or data analysis.2 In addition, if you have never programmed before, I would suggest that learning one’s first programming language is one of the more rewarding intellectual experiences one can have. To program a computer successfully is to be logical, explicit, and correct; few other pursuits force us to take on these qualities in the same way and give us such clear feedback when we have fallen short.
We will focus on a subset of R’s features, including simulation, use of R’s built-in datasets, basic analyses, and basic plotting. I hope that learning these aspects of R will motivate you to explore its many other features. There are dozens of books and online tutorials that can teach you more about R. A few good ones are listed at the end of the chapter. In the remainder of this chapter, you will install R and complete a tutorial that will prepare you for the exercises in subsequent chapters. More information on basic R commands and object types is available in Appendix B.
Exercise Set 2-1
1) Download and install the current version of R on a computer to which you have regular access. R is available from the Comprehensive R Archive Network (CRAN) at http://www.r-project.org. (For Linux users, the best procedure will depend on your distribution.)
2) Open R. When you have succeeded, you should see a window with the licensing information for R and a cursor ready to take input.
3) (Optional, but recommended) Download and install RStudio. Close your open R session and open RStudio. If you prefer the RStudio interface, then use RStudio to run R for the rest of the exercises in the book.
4) Visit the book’s GitHub repository at github.com/mdedge/stfs/ to view additional resources, including R scripts, among which is a script to run all the code in this chapter.
2 I do not mean to suggest that R has no disadvantages. R cannot return your love:
> I love you, R
Error: unexpected symbol in "I love"
2.1 Interacting with R
The most important thing you can learn about R is how to get help. The two built-in commands for doing so are help() and help.search(). The help() command directs you to information about other R commands. For example, typing help(mean) at the prompt and hitting return will open a web browser and take you to a page with information about how to use the mean() function, which we will see a little later. One downside of help() is that you have to know the name of the R function you want to use. If you don’t know the name of the function you need, you can use help.search(). For example, if we didn’t know that mean() was the function we need to take the mean of a set of numbers, we could try help.search("mean"). This brings up a list of functions that match the query "mean". In this specific case, help.search() is not too helpful: a lot of functions come up, and the one we want, base::mean(), is buried. (We see “base::mean()” because mean() is in the base package, which loads automatically when R is started.) When you don’t know the name of the function you need, a web search is usually more helpful.
As you progress in your use of R, you will find help() to be increasingly useful. But as you start, you may find the help() pages to be hard to understand; one has to learn a little about R before they make sense. This tutorial and the information in Appendix B will help you get comfortable. After that, you can switch to using some of the resources at the end of the chapter. My single favorite resource for beginners is Rob Kabacoff’s free website Quick-R (http://www.statmethods.net/), which has help and examples for most of the tasks you’ll need to perform frequently in R. If you have a specific question, search for it; it has likely been answered on the R forum or Stack Overflow (www.stackoverflow.com/).
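A few related base R tools may be worth knowing alongside help() and help.search(). The sketch below is a supplement, not part of the text's tutorial. The help calls are wrapped in a check for an interactive session only so that the snippet also runs cleanly as a script.

```r
if (interactive()) {
  help(mean)            # can also be written ?mean
  help.search("mean")   # can also be written ??mean
}

# apropos() lists objects on the search path whose names match a
# pattern; "^mean" keeps only names that begin with "mean".
mean.like <- apropos("^mean")
mean.like

# args() prints the arguments a function expects, in order.
args(head)
```

Compared with help.search(), apropos() searches only the names of objects that are currently available, so its output is usually much shorter.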
Once you have opened R, you will see a prompt that looks like this:
>
The simplest way to view R is as a program that responds directly to your commands. You type a command followed by the return key, and R returns an answer. For example, you can ask R to do arithmetic for you:
> 3 + 4
[1] 7
> (9 * 8 * 7 * sqrt(6)) / 3
[1] 411.5143
Here, the “[1]” before the answer means that your answer is the first entry of a vector that R returned, which in this case is a vector of length 1. Vectors are ordered collections of items of the same type. Both of the commands above are expressions, one of the two major types of R commands. R evaluates the expression and prints the answer. The answer is not saved. The other major type of R command is assignment, which we will see shortly.
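The meaning of the bracketed index is easier to see with longer vectors. The following short sketch uses only expressions; the functions c() and length() and the colon operator are base R features used here for illustration.

```r
3 + 4             # prints [1] 7
length(3 + 4)     # a single number is a vector of length 1
c(3 + 4, 9 * 8)   # c() combines values into a longer vector
100:130           # a vector of 31 integers; when the printout wraps,
                  # each line starts with the index of its first entry
```

How many entries fit on each printed line depends on the width of your console, so the bracketed indices you see for the last command may differ from someone else's.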
Anything that appears after a “#” sign on a line is treated as a comment and ignored. So, as far as R is concerned,
> #The next line gives the sum I need
> 2 + 3 #2 and 3 are important numbers
is the same as
> 2 + 3
Writing good comments is essential to writing readable code. Comments will help you understand and debug code after you have set it aside.
You can use the “up” and “down” arrow keys in the R terminal to return to commands that you have already entered. The “up” arrow brings up your most recently issued command; pressing “up” again brings up the command issued before that, etc. This tactic is especially helpful when you need to re-enter a command with a modification.
You can store values of variables in R. For example, if you wanted to assign a variable called “x” to equal 7, you would type
> x <- 7
The combination of keystrokes “<-” is used to assign values to variables. The spaces after the x and before the 7 can be omitted without affecting the way the command works. In general, R is flexible about spaces that do not interrupt the names of variables or functions. You can also type
> x = 7
to do the same thing. Both of the previous two commands are assignment commands. In contrast to the values that result from an expression command, the values of assignments are not printed. Instead, they are stored for later use.
To see the value currently assigned to x, you can type
> x
[1] 7
You can use x in computations in the same way you would use the value assigned to x:
> x * 7
[1] 49
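The difference between the two command types can be seen by running them one after another. This is a small sketch; the variable y is introduced here for illustration only.

```r
x <- 7        # assignment: the value is stored, nothing is printed
x * 7         # expression: the value is printed, nothing is stored
x             # x is unchanged by the expression above
y <- x * 7    # to keep a computed value, assign it to a name
y             # typing the name prints the stored value
```

If a result will be needed again, assigning it to a name avoids recomputing it and makes later commands easier to read.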
Three important notes arise here. First, R is case-sensitive. If you try to refer to “x” as a capital “X”, you will be rewarded with an error message:
> X * 7
Error: object ‘X’ not found
More generally, R cannot find an object or call a function whose name has been misspelled. This is a frequent source of vexation for beginners, and an occasional one throughout one’s R-using life. If R returns errors you do not expect, check carefully for typos.
Second, it is possible to have commands that are spread over multiple lines. If you enter a partial command and then hit return, you will see the typical “>” prompt replaced by a “+” symbol. The “+” indicates that R needs more input before it can evaluate a command. It is easy to miss a closing bracket or parenthesis, for example,
> (1 + 3 + 5)/(2 + 4 + 6
+ )
[1] 0.75
When the “+” prompt appears, finish the command. If you don’t know how to finish the command, you can get a new prompt with the escape key on a Windows or Mac or with ctrl-c on a Linux machine.
You can see that the particulars of what you type are important. Though you can type all your commands directly into R, it is much better to save your R commands in a separate file; this allows for easier correction of typos and quick reproduction of analyses. When you use a text editor to save your commands, do not include the prompt “>” on each line; the prompt symbol is built into R and is not part of the input.
A plain text editor will work fine for saving your commands. Do not use a program that adds formatting to your text, such as Microsoft Word, because the formatting can interfere with the commands themselves. Many text editors, such as gedit, will helpfully highlight your R code to enhance readability if you save your file with a .R extension. There are also integrated development environments that highlight code, track variables, and feed your code straight into the R console. If you want to use an integrated development environment, RStudio is excellent.
Of course, you will want to use R for more than arithmetic. In the next section, we will work through some data summaries and graphical procedures.
2.2 Tutorial: the Iris data
It is time to write some code. Remember to write R code in a separate text editor; you can then paste it into the R console.3 You ought to execute all the code in this section on your own computer. A script including all the code in this section (as well as other scripts, including all the code in this book) is available at github.com/mdedge/stfs/.
In this tutorial, we will conduct some exploratory data analysis of the R dataset iris, which is also discussed in Appendix A. The iris data frame is built into R, along with many other datasets.4 The iris data frame includes a set of measurements on 50 iris flowers from each of three different species: Iris setosa, Iris virginica, and Iris versicolor.
The iris dataset is built into R; you do not need to do anything to install it. You can see the whole dataset by entering iris at the command line. It is usually more useful to examine just the first few rows of a data frame, which you can see using the head() function:
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
head() is an R function. Like mathematical functions (see Appendix A), R functions take inputs, or arguments, and return outputs. In addition to the functions built into R, thousands more are available in add-on packages, and you can also write your own functions.
3 Two other ways to transfer code from the text editor to R: (1) If you are using RStudio, you can highlight code written in the source file and run it by holding ctrl (or, on a Mac, cmd) and hitting the return key. (2) The source() function is another way to run all the commands written in a text file. See help(source).
4 You can see the names of all the built-in datasets using library(help = "datasets"). Typically, one can refer to a built-in dataset simply by typing its name in the R console, unless that name has been used for something else in the session. The built-in datasets will not be listed by the ls() function (or shown among the objects in the environment in RStudio) unless they are loaded with the data() function, as in data("iris").
We have explicitly specified the argument iris, indicating in this case that we want to see the first few lines of the object named iris. There is another argument to head() that we can leave unspecified if we are fine with the default values, but which we could also change if we wanted. For example, the command head(iris, n = 10) would produce the first 10 lines of iris, rather than the first 6 lines. Function arguments have names; the equals sign in n = 10 indicates that we are assigning the argument named n the value 10. There is some flexibility in the explicit naming of arguments; for example, we did not write x = iris even though the first argument of head() is named x. The rule is that if the arguments are given in the order the function expects them (in the case of head(), x first and n second), then one does not have to name them. For example, these five commands give the same results:
> head(iris, 10)
> head(iris, n = 10)
> head(x = iris, 10)
> head(x = iris, n = 10)
> head(n = 10, x = iris)
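The claim that the five calls agree can be checked without comparing printouts by eye. The sketch below stores the five results in a list and compares each with the first; the names calls, all.same, and bad.call are chosen here for illustration. It also captures what happens when unnamed arguments are supplied in the wrong order.

```r
# Store the results of the five equivalent calls.
calls <- list(
  head(iris, 10),
  head(iris, n = 10),
  head(x = iris, 10),
  head(x = iris, n = 10),
  head(n = 10, x = iris)
)

# identical() tests for exact equality; compare each result to the first.
all.same <- all(sapply(calls, identical, calls[[1]]))
all.same

# Unnamed arguments in the wrong order: R treats 10 as x and iris as n.
# tryCatch() captures the resulting error instead of halting the script.
bad.call <- tryCatch(head(10, iris), error = function(e) e)
inherits(bad.call, "error")
```

Naming arguments costs a few keystrokes but makes a call robust to mistakes about argument order, which matters more for functions with many arguments.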
But if the arguments are given in an order different from the one R expects and are not named, you will have problems. The call head(10, iris), for example, gives an error. You can see the arguments that each function expects, in order, using help().
iris is a data frame, which means that it can hold rows and columns of data, that the rows and columns can be named, and that different columns can store data of different types (numeric, character, etc.). Here, we can see that each row corresponds to an individual flower, and the columns contain measurements of different features of each flower. Each column of the data frame is a vector, an ordered collection of data of one type.
We can quickly gain some useful information using the summary() function:
> summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
 Median :5.800   Median :3.000   Median :4.350   Median :1.300
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
       Species
 setosa    :50
 versicolor:50
 virginica :50
If we would rather not see results for the whole dataset at once, we can probe the individual variables. We refer to individual variables in a data frame by typing the name of the data frame first, then $, and then the name of the variable we want to reference. Try entering iris$Sepal.Length into the R terminal. Then try giving that statement as an argument to summary(), as in
> summary(iris$Sepal.Length)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  4.300   5.100   5.800   5.843   6.400   7.900
If we would rather not see all the information provided by the summary, we can also ask for specific values, like the sample mean or median:5
> mean(iris$Sepal.Length)
[1] 5.843333
> median(iris$Sepal.Length)
[1] 5.8
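Both summaries can be reproduced from their definitions, which is a useful check on what mean() and median() are doing. In this sketch the names sepal, n, mean.by.hand, and median.by.hand are chosen for illustration. With an even number of observations, as here, median() averages the two middle values of the sorted data.

```r
sepal <- iris$Sepal.Length
n <- length(sepal)                # 150 observations

# Mean: the sum of the observations divided by their number.
mean.by.hand <- sum(sepal) / n

# Median: sort the data and, because n is even, average the two
# middle values.
sorted <- sort(sepal)
median.by.hand <- (sorted[n / 2] + sorted[n / 2 + 1]) / 2

# Compare with the built-in functions.
c(mean.by.hand, mean(sepal))
c(median.by.hand, median(sepal))
```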
Histograms provide a visual summary of the data for one variable.
> hist(iris$Sepal.Length)
produces a basic histogram of the sepal length data. You can improve it by changing the axis label using the xlab argument. For example, the command
> hist(iris$Sepal.Length, xlab = "Sepal Length", main = "")
produces a plot similar to the one shown in Figure 2-1. (I have made some alterations to the figure for aesthetic reasons; you can view the code used to make Figure 2-1, and all the other R figures in the book, at github.com/mdedge/stfs/.) Notice that the arguments are separated by commas and that when we want to refer to strings of characters rather than named variables, we put text in quotations. You can see other options for hist() using help(hist).
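As one illustration of those other options, the sketch below changes the number of bins and the fill color and then inspects the object that hist() returns. This is not the code used for Figure 2-1; the argument values and the name h are arbitrary choices for this example. The axis label gives centimeters because that is the unit stated in help(iris).

```r
# breaks = 20 is treated as a suggestion for the number of bins;
# col sets the fill color of the bars.
h <- hist(iris$Sepal.Length, breaks = 20, col = "grey",
          xlab = "Sepal length (cm)", main = "")

# hist() also returns (invisibly) a list describing the plot.
h$breaks        # the bin boundaries actually used
h$counts        # the number of observations in each bin
sum(h$counts)   # every observation falls in exactly one bin
```

Trying a few values of breaks is worthwhile, since the apparent shape of a histogram can change with the bin width.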
The Iris data are well-known because R. A. Fisher, probably the twentieth century’s most important statistician, used them in a famous study.6 When Fisher examined the Iris data,
Figure 2-1 Histograms show the empirical distribution of one-dimensional data. The vertical axis shows the count (or proportion) of the observations that fall in the range shown on the horizontal axis.
5 The mean is the arithmetic average, the sum of the observations divided by the number of observations in the set being summed. A median is a number that falls at the 50th percentile of the data, that is, a number than which half of the observations are greater and half of the observations are smaller.
6 The study (Fisher, 1936) appeared in the Annals of Eugenics. It is distressing to learn but important to acknowledge that the development of statistics in the late nineteenth and early twentieth centuries was