Statistical Thinking From Scratch: A Primer For Scientists M. D. Edge


Statistical Thinking from Scratch
A Primer for Scientists

M. D. EDGE

Great Clarendon Street, Oxford, OX2 6DP, United Kingdom

Oxford University Press is a department of the University of Oxford. It furthers the University’s objective of excellence in research, scholarship, and education by publishing worldwide. Oxford is a registered trade mark of Oxford University Press in the UK and in certain other countries

© M. D. Edge 2019

The moral rights of the author have been asserted

First Edition published in 2019

Impression: 1

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press, or as expressly permitted by law, by licence or under terms agreed with the appropriate reprographics rights organization. Enquiries concerning reproduction outside the scope of the above should be sent to the Rights Department, Oxford University Press, at the address above

You must not circulate this work in any other form and you must impose this same condition on any acquirer

Published in the United States of America by Oxford University Press, 198 Madison Avenue, New York, NY 10016, United States of America

British Library Cataloguing in Publication Data
Data available

Library of Congress Control Number: 2019934651

ISBN 978–0–19–882762–7 (hbk.)
ISBN 978–0–19–882763–4 (pbk.)

DOI: 10.1093/oso/9780198827627.001.0001

Printed and bound by CPI Group (UK) Ltd, Croydon, CR0 4YY

Links to third party websites are provided by Oxford in good faith and for information only. Oxford disclaims any responsibility for the materials contained in any third party website referenced in this work.

3.1 De
5.1 Expected values and the law of large numbers
5.3 Joint distributions, covariance, and correlation
5.4 [Optional section] Conditional distribution, expectation, and variance
5.5 The central limit theorem
5.6 A probabilistic model for simple linear regression
6.6 [Optional section] Statistical decision theory and risk
7.1 Standard error
7.2 Confidence intervals
7.3 Frequentist inference I: null hypotheses, test statistics, and p values
7.4 Frequentist inference II: alternative hypotheses and the rejection framework
7.5 [Optional section] Connecting hypothesis tests and confidence intervals
7.6 NHST and the abuse of tests
7.6.1 Lack of replication
7.6.4 Identification of scientific hypothesis with a statistical hypothesis
7.6.5 Neglect of other goals, such as estimation and prediction
7.6.6 A degraded intellectual culture
7.6.7 Evaluating significance tests in light of NHST
7.7 Frequentist inference III: power
7.8 Putting it together: What happens when the sample size increases?
7.9 Chapter summary
7.10 Further reading
8 Semiparametric estimation and inference
8.1 Semiparametric point estimation using the method of moments
8.1.1 Plug-in estimators
8.1.2 The method of moments
8.2 Semiparametric interval estimation using the bootstrap
Box 8-1: Bootstrap pivotal intervals
8.3 Semiparametric hypothesis testing using permutation tests
8.4 Conclusion
9 Parametric estimation and inference
Box 9-1: Logarithms
9.1 Parametric estimation using maximum likelihood
9.1.1 Maximum-likelihood estimation for simple linear regression
9.2 Parametric interval estimation: the direct approach and Fisher information
9.2.1 The direct approach
9.2.2 [Optional subsection] The Fisher information approach
9.3 Parametric hypothesis testing using the Wald test
9.3.1 The Wald test
9.4 [Optional section] Parametric hypothesis testing using the likelihood-ratio test
9.5 Chapter summary
10 Bayesian estimation and inference
10.1 How to choose a prior distribution?
10.2 The unscaled posterior, conjugacy, and sampling from the posterior
10.3 Bayesian point estimation using Bayes estimators
10.4 Bayesian interval estimation using credible intervals
10.5 [Optional section] Bayesian “hypothesis testing” using Bayes factors
10.6 Conclusion: Bayesian vs. frequentist methods
10.7 Chapter summary
10.8 Further reading
B.2 Data structures and data extraction
B.2.3 Data frames
B.3
B.3.1 Getting help
B.3.2 Data creation
B.3.3 Variable information
B.3.4 Math
B.3.5 Plotting
B.3.6 Optimization and model
B.3.7 Distributions
B.3.8 Programming
B.3.9 Data input and output (I/O)

Acknowledgments

When one thinks about probability and random processes, one’s mind sometimes wanders toward the contingencies inherent in life. Almost any event in one’s biography might have happened in a slightly different way, and the unrealized outcomes might have branched into a much-altered story. To give one example out of many from my own life, I took a class in college that shaped my career trajectory, without which this book would not have been written. The person who told me the class would be offered was a friend of my roommate’s who happened to visit the day before the application was due, a person I had not met before and haven’t seen since. Had my roommate’s friend not visited, I would have gone on to live another life that would have followed an autumn term in which I took a different class. Immediately my roommate, his friend, the class instructors, and all the people who set them on a course in which they would interact with me are implicated in the writing of this book. There are thousands of other such events that influenced this book in shallow or deep ways, and each of those events was itself conditional on a chain of contingencies. Taking this view, it is not mystical to say that billions of people, including the living and their ancestors, are connected to this book through a web of interactions. The purpose of an acknowledgments section, then, is not to separate contributors from non-contributors, but to produce a list of the few people whose roles appear most salient according to the faulty memory of the author.

My editors at Oxford University Press, Ian Sherman and Bethany Kershaw, have guided me through the process of improving and finalizing this book. I have always felt that the book has been safe in their hands. Mac Clarke’s perceptive copyedits rescued me in more than a few places. The SPi typesetting team ably handled production, managed by Saranya Jayakumar. Alex Walker designed the cover, which has won me many unearned compliments.

Tim Campellone, Paul Connor, Audun Dahl, Carolyn Fredericks, Arbel Harpak, Aaron Hirsh, Susan Johnston, Emily Josephs, Jaehee Kim, Joe Larkin, Jeff Long, Sandy Lwi, Koshlan Mayer-Blackwell, Tom Pamukcu, Ben Roberts, Maimon Rose, William Ryan, Rav Suri, Mike Whitlock, Rasmus Winther, Ye Xia, Alisa Zhao, several anonymous reviewers, and my Psychology 102 students at UC Berkeley all read sections of the book and provided a combination of encouragement and constructive comments. The book is vastly improved because of their input. Arbel Harpak, Aaron Hirsh, and Sandy Lwi warrant special mention for their detailed comments on large portions of the text. My wonderful students also deserve celebration for their bravery in being the first people to rely on this book as a text, as does Rav Suri for being the first instructor (after me) to adopt it for a course.

This book would have remained merely an idea for a book if it were not for Aaron Hirsh and Ben Roberts. Aaron and Ben, along with my wife, Isabel Edge, convinced me that I might be able to write this book and that it was not entirely ridiculous to try. They were also both instrumental in shaping the content and framing, and they guided me as I navigated the possibilities for publication. Along similar lines, Melanie Mitchell helped me arrive at the scheme that kept me writing consistently for years: requiring 1,000 words of new text from myself each week, with the penalty for dereliction a $5 donation to a pro-astrology outfit.

I would not have had the idea to write this book if I had not been able to train as both an empirical researcher and as a developer of statistical methods. My mentors in these fields, including Tricia Clement, Graham Coop, Emily Drabant, Russ Fernald, James Gross, Sheri Johnson, Viveka Ramel, and Noah Rosenberg, made this possible. My interests were seeded by wonderful teachers, including Cari Kaufman, whose introductory class in mathematical statistics gave me a feeling of empowerment that I have wanted to share ever since, Ron Peet, who did the most to shape my interest in math and gave me my first exposure to statistics, and Bin Yu, who taught me more than I thought possible in one semester about working with data. Gary “GH” Hanel taught me calculus, and I have cribbed his sticky phrases conveying the rules for differentiating and integrating polynomials (“out in front and down by one,” and “up and under”) after finding that they (and several of his other memory aids) always stay with me, no matter how much calculus I forget and relearn.

Finally, my family has supported me during the process of writing (and of preparing to write). My parents, Chloe and Don, have always supported me in my goals for learning and growth. Isabel, my wife, has helped me believe that my efforts are worthwhile and has been a listener and counselor over years of writing. Maceo, our three-year-old, has not provided any input that I have incorporated into the text, but we like him very much nonetheless.

Prelude

Practitioners of every empirical discipline must learn to analyze data. Most students’ first, and sometimes only, training in data analysis comes from a course offered by their home department. In such courses, the first few weeks are typically spent teaching skills for reading data displays and summarizing data. The remainder of the course is spent discussing a sequence of statistical tests that are relevant for the field in which practitioners are being educated: a course in a psychology department might focus on t-tests and analysis of variance (ANOVA); an economics course might develop linear regression and some extensions aimed at causal inference; future physicians might learn about survival analysis and Cox models. There are at least three advantages to this approach. First, given that students may take only one course in data analysis, it is reasonable to teach the skills they need to be functional data analysts as quickly as possible. Second, courses that focus on procedures useful in the students’ major area of study allow instructors to pick relevant examples, encouraging interest. Third, courses like these can teach data analysis while requiring no mathematics beyond arithmetic.

At the same time, the introduction of test after test in the second part of the course comes with major drawbacks. First, as instructors frequently hear from their students, test-after-test litanies can be difficult to understand. The material that conceptually unites the procedures has been squeezed into a short time. As a result, from the students’ view, each procedure is a subject unto itself, and it is difficult to develop an integrated view of statistical thinking. Second, for motivated students, standard introductory sequences can give the impression that, though data may be exciting, statistics is uninteresting. For the student, it can seem as though to master statistics is to memorize a vast tree of assumptions and hypotheses, allowing one to draw the appropriate test from a deck when certain conditions are met. Students who learn this style of data analysis cannot be blamed for failing to see that the discipline of statistics is stimulating or even that it is intellectually rooted. Third, the ability to apply a few well-chosen procedures may allow the student to become a functional researcher, but it is an insufficient foundation for future growth as a data analyst. We have taught a set of recipes (and versatile, germane ones), but we have not trained a chef. When new procedures arise, it will be no easier for our student to learn them than it was for him to learn his first set of procedures. That is to say it will require real labor, and success will depend on an exposition written by someone who can translate statistical writing into the language of his field.

Most university statistics departments train future statisticians differently. First, they require their students to take substantial college-level math before beginning their courses in statistics. At minimum, calculus is required, usually with multivariable calculus and linear algebra, and perhaps a course in real analysis. After meeting the mathematical requirements, future statisticians take an entire course (or two) dedicated strictly to probability theory, followed by a course in mathematical statistics. After at least a year of university-level mathematical preparation and a year of statistics courses, the future statistician has never been asked to apply (and perhaps never even heard of) procedures the future psychologist, for example, has applied midway through an introductory course.

At this stage in her development, the well-trained statistics student may not have applied three-way ANOVA, for example, but she deeply understands the techniques she does know, and she sees the interest and coherence of statistics as a discipline. Moreover, should she need to use a three-way ANOVA, she will be able to learn it quickly with little or no outside assistance.

How can the budding researcher (pressed for time, perhaps minimally trained in mathematics, and needing to apply and interpret a variety of statistical techniques) gain something of the comfortable understanding and versatility of the statistician? This book proposes that the research worker ought to learn at least one procedure in depth, “from scratch.” This exercise will impart an idea of how statistical procedures are designed, a flavor for the philosophical positions one implicitly assumes when applying statistics in research, and a clearer sense of the strengths and weaknesses of statistical techniques.

Though it cannot turn a non-statistician into a statistician, this book will provide a glimpse of the conceptual framework in which statisticians are trained, adding depth and interest to whatever techniques the reader already knows how to apply. It is perhaps most naturally used as a main or supplementary text in an advanced introductory course (for beginning graduate students or advanced undergraduates, for example) or for a second course in data analysis. I assume the reader already has an interest in understanding the reasoning underlying statistical procedures, a sense of the importance of learning from data, and some familiarity with basic data displays and descriptive statistics. Prior exposure to calculus and programming is helpful but not required; the most relevant concepts are introduced briefly in Chapter 2 and in Appendices A and B. Probability theory is taught as needed and not assumed. In some departments, the book would be suitable for a first course, but the mathematical demands are high enough that instructors may find they prefer to use the book with determined students who are already invested in empirical research. Another possible adjustment is to split the course into two terms, with the Interlude serving as a prelude to the second course, supplementing with data examples from the students’ field. The book is also useful as a self-study guide for working researchers seeking to improve their understanding of the techniques they use daily, or for professionals who must interpret results from research studies as part of their work.

There are many excellent statistics textbooks available for non-statisticians, so any new book must make plain how it differs from others. This book has a set of features that are not universal and that in combination might be unique.

First, this book focuses on instruction in exactly one statistical procedure, simple linear regression. The idea is that by learning one procedure from scratch, considering the entire conceptual framework underlying estimation and inference in this one context, one can gain tools, understanding, and intuition that will apply to other contexts. In an era of big data, we are keeping the data small (two variables, and in the dataset we use most often, only 11 observations) and thinking hard about it. In saying that we work “from scratch,” I mean that we attempt to take little for granted, exploring as many fundamental questions as possible with a combination of math, simulations, thought experiments, and examples. I chose simple linear regression as the procedure to analyze both because it is mathematically simple and because many of the most widely applied statistical techniques (including t-tests, multiple regression, and ANOVA, as well as machine-learning methods like lasso and ridge regression) can be viewed as special cases or generalizations of simple linear regression. A few of these generalizations are sketched or exemplified in the final chapter.

A second feature is the mathematical level of the book, which is gentler than most texts intended for statisticians but more demanding than most introductory statistics texts for non-statisticians. One goal is to serve as a bridge for students who have realized that they need to learn to read mathematical content in statistics. Learning to read such content unlocks access to advanced textbooks and courses, and it also makes it easier to talk with statisticians when advice is needed. A second reason for including as much math as I have is to increase the reader’s interest in statistics: many of the richest ideas in statistics are expressed mathematically, and if one does not engage with some math, it is too easy for a statistics course to become a list of prescriptions. The main mathematical requirement is comfort with algebra (or, at least, neglected algebra skills that can be dusted off and put back into service). The book requires some familiarity with the main ideas of calculus, which are introduced briefly in Appendix A, but it does not require much facility with calculus problems. The mathematical demands increase as the book progresses (up to about Chapter 9), on the theory that the reader gains skills and confidence with each chapter. Nearly all equations are accompanied by explanatory prose.

Third, some of the problems in this book are an integral part of the text. The majority of the included problems are intended to be earnestly attempted by the reader; exceptions are marked as optional. Every solution is included, either in the back of the book or at the book’s GitHub repository, github.com/mdedge/stfs/, or companion site, www.oup.co.uk/companion/edge.¹ The problems are interspersed through the text itself and are part of the exposition, providing practice, proofs of key principles, important corollaries, and previews of topics that come later. Many of the problems are difficult, and students should not feel they are failing if they find them tough; the process of making an attempt and then studying the solutions is more important than getting the correct answers. A good approach for students is to spend roughly 75% of their reading time attempting the exercises, referring to the solution for next steps when stuck for more than a few minutes.

Fourth, several of the problems are computational exercises in the free statistical software package R. Many of these problems involve the analysis of simulated data. There are two reasons for this. First, R is the statistical language of choice among statisticians. As of this writing, it is the most versatile statistical software available, and aspiring data analysts ought to gain some comfort in it. Second, it is possible to answer difficult statistical questions in R using simulations. When one is practicing statistics, one often encounters questions that are not readily answered by the mathematics or texts with which one is familiar. When this happens, simulation often provides a serviceable resolution. Readers of this book will use simulation in ways that suggest useful answers. All R code to conduct the demonstrations in the text, complete the exercises, and make the figures is available at the book’s GitHub repository, github.com/mdedge/stfs/, and there is also an R package including functions specific to the book (installation instructions come at the end of Chapter 2, in Exercise Set 2-2, Problem 3).

The chapters are intended to be read in sequence. Jointly, they introduce two important uses of statistics: estimation, which is to use data to guess the values of parameters that describe a data-generating process, and inference, which is to test hypotheses about the processes that might have generated an observed dataset. (A third key application, prediction, is addressed briefly in the Postlude.) Both estimation and inference rely on the idea that the observed data are representative of many other possible datasets that might have been generated by the same process. Statisticians formalize this idea using probability theory, and we will study probability before turning to estimation and inference.

¹ Instructors can obtain separate questions I have used for homework and exams when teaching from this book. Email the author to request additional questions.

Chapter 1 presents some motivating questions. Chapter 2 is a tutorial on the statistical software package R. (Additional background material is in Appendices A and B, with Appendix A devoted to calculus and Appendix B to computer programming and R.) Chapter 3 introduces the idea of summarizing data by drawing a line. Chapters 4 and 5 cover probability. An Interlude chapter marks the traditional divide between probability theory and statistics. Chapter 6 covers estimation, immediately putting the fourth and fifth chapters to work. Chapter 7 covers inference: what kinds of statements can we make about the world given a model (Chapters 4 and 5) and a sample? Chapters 8, 9, and 10 describe three broad approaches to estimation and inference. These three perspectives share goals and share the framework of probability theory, but they make different assumptions. In the Postlude, I discuss some extensions of simple linear regression and point out a few possible directions for future learning.

CHAPTER 1

Encountering data

Key terms: Computation, Data, Statistics, Simple linear regression, R.

If we take in our hand any volume, ... let us ask, Does it contain any abstract reasoning concerning quantity or number? No. Does it contain any experimental reasoning concerning matter of fact and existence? No. Commit it then to the flames: for it can contain nothing but sophistry and illusion.

David Hume, An Enquiry Concerning Human Understanding (1748)

In the passage quoted here, Hume claims that there are two types of arguments we should consider accepting: mathematical reasoning and empirical reasoning, that is, reasoning based on observations about the world. The specifics of Hume’s claims are beyond our scope, but users of statistics are safe from followers of Hume’s dictum. This book will contain some reasoning concerning number, and it will contain examples of “experimental” reasoning about observed facts. Thus, a good Humean need not commit it to the flames.

Hume’s statement gives us a way of thinking about the subject of statistics. We want to make claims about the world on the basis of observations. For example, we might want to know whether a corollary of the wave theory of light matches the results of an experiment. We might ask whether a new therapy for diabetes is effective. We might want to know whether people with college degrees earn more than their peers who do not graduate. Whether we are pursuing physical science, biological science, social science, engineering, medicine, or business, we constantly need answers to questions with empirical content.

But what is Hume’s “reasoning concerning matter of fact”? Collecting data about the world is one thing, but using those data to make conclusions is another. Consider Figure 1-1, which purports to show data on fertilizer consumption and cereal yields in 11 sub-Saharan African countries.¹ In many sub-Saharan African countries, soil nitrogen limits agricultural yield. One way to increase soil nitrogen is to apply fertilizer. Each of the 11 countries in the dataset is represented by a point on the plot. For each point, the x coordinate (that is, the position on the horizontal axis) indicates the country’s fertilizer consumption in a given year. The y coordinate (the position on the vertical axis) represents the country’s yield of cereal grains in that same year. Suppose that on the basis of Figure 1-1, I claim that there is a robust relationship between fertilizer consumption and cereal yield among countries similar to the 11 countries we have sampled. (Note that I have not made claims about any causal relationships that might explain the relationship; I have merely posited that the relationship “exists” in some sense.)

¹ These data are fake (more on their source at the end of the chapter), but the question is real. The data in Figure 1-1 do loosely resemble actual data from some sub-Saharan African countries with low grain yields from 2008 to 2010. For example, in 2010, farmers in Mozambique consumed about 9 kg/hectare of fertilizer, and the cereal yield was about 945 kg/hectare. Actual data are available from the World Bank.

Suppose you disagree. You might counter that the plot isn’t impressive: there aren’t many data, the relationship between the variables strikes you as weak, and you don’t know the source of the data. These counters are at least potentially legitimate. You and I are at an impasse, Dear Reader: I have made a claim based on data, and you have looked at the same data and made a different claim. Without methods for reasoning about data, it is unclear how to make further progress regarding our disagreement.

How can we develop concepts for reasoning from data? The discipline of statistics provides one answer to this question. Statistics takes Hume’s other candidate for nonillusory knowledge, mathematical reasoning, and builds a mathematical framework in which we can set the data.² Once we have framed the problem of reasoning about data mathematically, we can make claims by adopting assumptions and then using mathematical reasoning to proceed. The status of the claims we make will usually hinge on the appropriateness of the assumptions we use to get started. As you read about statistical approaches for reasoning about data, consider whether and under what circumstances they are adequate. We will revisit this question in various forms.

In the Prelude, I promised that this book would be only lightly mathematical, yet here, I have proposed that statistics is a way to harness mathematical thinking to reason about data. How will this book help readers strengthen their statistical understanding without engaging in heavy mathematics?

Statistics took shape as a discipline before modern computers were available, with many of the ideas most important to this book appearing in the late nineteenth and early twentieth centuries. Many of the most important statisticians of this era were well-trained mathematicians. With limited computing power but ample mathematical training, they approached the development of their subject mathematically. Today, advances in computing allow those of us with limited mathematical training to answer questions that would be difficult even for a seasoned mathematician to approach directly. We will use computation to answer statistical questions that will not yield to elementary math.

² This is not to suggest that strong mathematical reasoning skills are the sole or even most important qualification of a data analyst. Facility with data and computers, subject-area knowledge, scientific acumen, and common sense are all important.

Figure 1-1 Fertilizer consumption and cereal yield in 11 sub-Saharan African countries. [Scatterplot; horizontal axis: fertilizer consumption (kg/hectare); vertical axis: cereal yield.]

1.1 Things to come

Before we begin in earnest, let’s take a moment to anticipate the major topics we’ll consider in the rest of the book, motivated by the data in Figure 1-1. We will be focused on understanding simple linear regression, which entails identifying a line that “fits” the data, passing through the cloud of data points in the figure. Linear regression (including both simple linear regression and its generalization, multiple regression) is perhaps the most widely used method in applied statistics, especially when its special cases are considered, including t-tests, correlation analysis, and analysis of variance (ANOVA). It only takes a few commands to run a simple linear regression analysis in the statistical software R. (A tutorial in R is coming in the next chapter.) The data are stored in an R object called anscombe, and to fit the linear regression model, we run

mod.fit <- lm(y1 ~ x1, data = anscombe)

As will be discussed later, the lm() function fits a linear regression model to a dataset. By “fitting” a regression model, we find a line that “best” fits the data shown in Figure 1-1. You can see the data from Figure 1-1 with the “best fit” line drawn in by typing

plot(anscombe$x1, anscombe$y1)
abline(mod.fit)

The plot with the line drawn in (plus a few improvements to labeling and aesthetics³) is shown in Figure 1-2. The plot() function produces a scatterplot, that is, a plot with points located to indicate values of the attributes represented by the x and y axes. The abline() function draws the line implied by the linear regression model. The sense in which this line can be described as the “best” fit is the subject of Chapter 3. As we will see, there are actually many different lines that could be described as “best.” The line in Figure 1-2 is best according to a criterion that has a long history in statistics.

³ Code for generating all the book’s R figures is available at github.com/mdedge/stfs.

Figure 1-2 The agriculture data from Figure 1-1, with the line of “best” fit from the simple linear regression model.
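One way to make a “best” line concrete, offered as a sketch that anticipates Chapter 3 and not as the book’s own development: lm() picks the intercept and slope that minimize the sum of squared vertical distances between the points and the line (the least-squares criterion). The helper sum_sq() and the nudge sizes below are introduced only for this illustration.

```r
# Illustrative sketch; sum_sq() is a hypothetical helper, not from the book.
x <- anscombe$x1
y <- anscombe$y1

# Sum of squared vertical distances from the points to the line y = a + b * x
sum_sq <- function(a, b) sum((y - (a + b * x))^2)

# Closed-form least-squares slope and intercept
b_hat <- cov(x, y) / var(x)
a_hat <- mean(y) - b_hat * mean(x)

# lm() arrives at the same line
fit <- lm(y1 ~ x1, data = anscombe)
print(c(a_hat, b_hat))
print(coef(fit))

# Nudging the intercept or the slope increases the sum of squares
print(sum_sq(a_hat, b_hat))
print(sum_sq(a_hat + 0.5, b_hat))
print(sum_sq(a_hat, b_hat + 0.05))
```

A different criterion, such as the sum of absolute distances, would generally pick a different line, which is one sense in which several lines can each be called “best.”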

The line that’s drawn in Figure 1-2 has an equation, meaning that it can be described as y = a + bx, where a and b are constant numbers. In words, to find the y coordinate of the line at any value x, one starts with a and then adds the product of b and x. The values of a and b for the pictured line are given in the output of the summary() function:

summary(mod.fit)

The output is

Call:
lm(formula = y1 ~ x1, data = anscombe)

Residuals:
     Min       1Q   Median       3Q      Max
-1.92127 -0.45577 -0.04136  0.70941  1.83882

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.0001     1.1247   2.667  0.02573 *
x1            0.5001     0.1179   4.241  0.00217 **

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.237 on 9 degrees of freedom
Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295
F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217

The key part of the output is the regression table, which appears under the heading “Coefficients.” In the first column of the table, labeled “Estimate,” we see the numbers 3 and 0.5. These are the values of the intercept, a, and slope, b, of the line in Figure 1-2. The word “estimate” suggests a way of viewing the line in Figure 1-2 that is different from the one suggested earlier. I initially suggested that the line in Figure 1-2 is the one that “best” fits the data in some sense; in other words, that it is a description of the sample. That’s true. The word “estimate” suggests that it is also a guess about some unknown quantity, perhaps about a property of a larger population or process that the sample is supposed to reflect. If we assume that data are generated by a particular process, then we can make claims about the types of data that might result. That is the topic of Chapters 4 and 5, on probability theory. And once we have decided that we really do want to make estimates (that is, to use data to learn the parameters of an assumed underlying process), how should we design procedures for estimation? What properties should these procedures have? That is the subject of Chapter 6.
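The entries in the “Estimate” column can also be pulled out of the fitted model and plugged into y = a + bx directly. The following is a minimal sketch; the value x_new = 10 is arbitrary and chosen only for illustration.

```r
# Illustrative sketch; x_new is an arbitrary value chosen for this example.
fit <- lm(y1 ~ x1, data = anscombe)

est <- coef(fit)                 # named vector: "(Intercept)" and "x1"
a <- est[["(Intercept)"]]        # intercept of the fitted line
b <- est[["x1"]]                 # slope of the fitted line

x_new <- 10
y_line <- a + b * x_new          # height of the line at x = 10, by hand

# predict() evaluates the same fitted line
y_pred <- predict(fit, newdata = data.frame(x1 = x_new))
print(c(by_hand = y_line, predict = unname(y_pred)))
```

Both numbers describe the height of the fitted line at the chosen x; whether that height is a good guess about the underlying process is the separate question the text raises.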

Moving rightward in the table, we see a column labeled “Std. Error,” an abbreviation for “standard error.” The standard error is an attempt to quantify the precision of an estimate. In a specific sense developed later, the standard error responds to the question, “If we were to sample another dataset from the same population as this, by about how much might we expect the estimate to vary?” So the 0.1179 in the table suggests that if we drew another sample of 11 points generated by the same underlying process (in the example, perhaps agricultural data from 11 other sub-Saharan African countries), we should not be surprised if the slope of the best-fit line differs by ~0.12 from the current estimate. The attempt to identify the precision of estimates is called “interval estimation,” and it is one of the topics of Chapter 7.
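The “sample another dataset” thought experiment can be acted out by simulation. The sketch below rests on assumptions that are mine, not the text’s: the data come from a line with intercept 3 and slope 0.5, the noise is independent and normal with standard deviation 1.237 (the residual standard error reported above), and the x values stay fixed at the observed ones. The names sim_slope and n_sims are hypothetical.

```r
# Illustrative sketch under assumed conditions (not the book's own code):
# straight-line process, independent normal noise, x values held fixed.
set.seed(1)
x <- anscombe$x1
n_sims <- 2000

# Draw one new dataset from the assumed process and return the fitted slope
sim_slope <- function() {
  y_new <- 3 + 0.5 * x + rnorm(length(x), mean = 0, sd = 1.237)
  coef(lm(y_new ~ x))[["x"]]
}

slopes <- replicate(n_sims, sim_slope())

# Spread of the slope across simulated datasets; compare with the
# "Std. Error" entry for x1 in the regression table
print(sd(slopes))
```

Under these assumptions, the standard deviation of the simulated slopes should land close to the tabled standard error; with different assumptions about the process, it need not.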

The final column of the regression table is labeled “Pr(>|t|)”; these numbers are called “p values.” Their interpretation is subtle and often botched. Loosely, p values measure the plausibility of the data under a specific hypothesis. The hypotheses being tested here are that the data were actually generated by a process described by a line with intercept 0 (first row) or slope 0 (second row). Low p values, like the ones in the table, suggest that (a) the hypotheses are false, or (b) some other assumption entailed by the hypothesis test is wrong, or (c) an unlikely event occurred. Hypothesis testing is the other subject, besides interval estimation, of Chapter 7.
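For readers curious where the last two columns come from, here is a hedged sketch of the arithmetic behind the table: each t value is the estimate divided by its standard error, and each p value is a two-sided tail area of a t distribution with the residual degrees of freedom (9 here). The justification for using that distribution depends on modeling assumptions the book takes up later; the object names are chosen only for this example.

```r
# Illustrative sketch: rebuilding the last two columns of the regression table.
fit <- lm(y1 ~ x1, data = anscombe)
tab <- summary(fit)$coefficients   # matrix: Estimate, Std. Error, t value, Pr(>|t|)

# t value: estimate divided by its standard error
t_vals <- tab[, "Estimate"] / tab[, "Std. Error"]

# Two-sided tail area of a t distribution with the residual degrees of freedom
p_vals <- 2 * pt(abs(t_vals), df = df.residual(fit), lower.tail = FALSE)

print(cbind(t_by_hand = t_vals,
            p_by_hand = p_vals,
            p_from_table = tab[, "Pr(>|t|)"]))
```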

I have alluded to “underlying” assumptions, and the goal of much of the rest of the book is to illustrate the ways in which such modeling assumptions are involved in statistical analysis. Depending on the assumptions that the data analyst can justify, different sets of statistical procedures become available. In Chapters 8, 9, and 10, we apply different sets of assumptions to the dataset to arrive at different procedures for point estimation, interval estimation, and hypothesis testing. The assumptions underlying the standard regression table produced by lm() are the same as those used in Chapter 9. In the Postlude, we consider ways in which the principles developed in the book apply to statistical analyses of other sorts of datasets, ones that are not a natural fit for simple linear regression analysis.

I’d like to close this chapter with an argument for undergoing such an extended meditation on simple linear regression. After all, empirical researchers are busy, and it’s possible to teach heuristic interpretations of statistics like those in the regression table. Such teaching is quick, and it allows researchers to get to serviceable answers, at least in the easiest cases. Why spend so much time on statistical thinking? Why not outsource the theory to professional statisticians?

Theanswerisparthoney,partvinegar.Onthepositiveside,dataanalysisismuchmore funandinterestingwhentheanalysthasagenuinesenseofwhatsheisdoing.Withsome understandingofstatisticaltheory,it’spossibletorelatescientificclaimstotheirempirical basis,connectingthedatatothemathematicalframeworkthatjustifiestheclaim.That breedsconfidenceinresearchers,anditalsoallowsforcreativity.Ifyouknowhowthe machineworks,youcantakeitapartandrepurposeit.

Incontrast,roteapproachesthatrelyonheuristicsalonecanbeunsatisfying,anxietyinducing,creativelystifling,and/orgenuinelydangerous.Hereisacautionaryanecdote. Supposeyou fitalinearmodel,aswedidbefore,toanewdataset.Whereasbeforewe fita simplelinearregressiontothevariables y1 and x1 inthe anscombe dataset,wenowwork withvariables y3 and x3:

mod.fit2 <- lm(y3 ~ x3, data = anscombe)
summary(mod.fit2)

The resulting regression table includes

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.0025     1.1245   2.670  0.02562 *
x3            0.4997     0.1179   4.239  0.00218 **

Each entry in the table is nearly identical to its counterpart in the table for the earlier model. For a rote data analyst relying on just the model summary, the interpretation of these models would therefore be the same. But look at the data underlying this analysis, shown in Figure 1-3. Whereas the data in Figure 1-2 seem to be randomly scattered around the line, points in Figure 1-3 appear to follow a much more systematic pattern, with the exception of one point that departs from it. Whereas the line in Figure 1-2 seems like an appropriate summary of the data, Figure 1-3 suggests a need for serious inspection. What is with that outlying point? Why are the others arranged in a perfectly straight line? These questions are urgent, and the regression table can’t be interpreted without knowing their answers.4 In other words, relying strictly on an automatic response to the regression table will lead to foolish conclusions.
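A minimal sketch of this comparison follows. It refits both models, puts the coefficients side by side, plots the y3 and x3 data with the fitted line, and uses the largest residual as one simple way to locate the outlying point. The name mod.fit for the earlier model is an assumption, and the plot is a plain version, not a reproduction of Figure 1-3.

```r
mod.fit  <- lm(y1 ~ x1, data = anscombe)   # assumed to match the earlier fit
mod.fit2 <- lm(y3 ~ x3, data = anscombe)

# The two sets of estimates are nearly identical
round(rbind(model1 = coef(mod.fit), model2 = coef(mod.fit2)), 3)

# But a scatterplot with the fitted line shows a very different picture
plot(anscombe$x3, anscombe$y3, xlab = "x3", ylab = "y3")
abline(mod.fit2)

# One simple way to find the point that departs from the pattern:
# the observation with the largest absolute residual
outlier <- which.max(abs(resid(mod.fit2)))
anscombe[outlier, c("x3", "y3")]
```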

In the remainder of the book, we will provide a basis for a complete interpretation of the numbers in the regression table, of the questions they are trying to answer, and of the ways in which their interpretation depends on the assumptions we are willing to make.

4 These data are drawn from a famous paper by Francis Anscombe, including four fake datasets that give exactly the same regression results but whose plots suggest wildly different interpretations. The paper is “Graphs in statistical analysis” from 1973 in American Statistician.

Figure 1-3 The data underlying the analysis of the variables y3 and x3 in the anscombe dataset.

R and exploratory data analysis

Key terms: Data frame, Exploratory data analysis, for() loop, R, R function, Sample mean, Scatterplot, Vector

I program my home computer

Beam myself into the future.

Kraftwerk, “Home Computer”

Virtually all statistical computations performed for research purposes are carried out using statistical software. In this book, we will use R, which is a programming language designed for statistics and data analysis.

For many students, R is more difficult to learn than most proprietary statistical software. R uses a command-line interface, which means that the user has to write and enter commands rather than use a mouse to select analyses and options from menus.1 Once you become comfortable with R, you will find that it is more powerful than the options that are easier to learn.

Why use R if it is more difficult than some of the alternatives? There are several reasons:

(1) Community: R is the computing lingua franca of professional and academic statisticians and of many data analysts from other fields. There is an active community generating new content and answers to questions.

(2) Adaptability: R users can write packages that add to R’s capabilities. There are thousands of packages available to handle specialized data analysis tasks, to customize graphical displays, to interface with other programs and software, or simply to speed up or ease typical programming tasks. Some of them are wonderful. The adaptability of R means that new statistical techniques are available in R years or even decades before they become available in proprietary packages.

(3) Flexibility: Suppose you want to perform a statistical procedure but modify it slightly. In a proprietary language, this is often difficult: the code used to run the procedure is kept hidden, and the only parameters you are allowed to change are the ones included as options and shown to the user. In R, it is much easier to make changes.

1 There are some graphical user interfaces (GUIs) for R available, such as R Commander. Another way to make R more user-friendly is to use an integrated development environment (IDE), such as RStudio. In this book, I will assume that you are using R without a GUI or IDE, but you are welcome to use one, and RStudio is recommended.

(4) Performance: Compared with many proprietary packages, R is faster and can work with larger datasets. R is not the fastest language available, but it is usually more than fast enough, and when you really need speed, you can program R to interface with faster languages like C++.

(5) Ease of integrating simulation and analysis: In this book, we answer several questions about probability and statistics by simulation. That is, we ask questions like, “What would happen if we used such-and-such estimator on data drawn from such-and-such distribution?” Sometimes, it is possible to answer such questions mathematically, but it is often easier to simulate data of the type we are interested in, apply the technique we are interested in, and see what happens. R provides a framework for carrying out that procedure.

(6) Price: R is free software under the GNU General Public License. The most obvious advantage is that it won’t cost you anything, whereas some of the proprietary packages cost hundreds or thousands of dollars per year for a license. More importantly, R’s community (point 1) is broader than it would otherwise be because R is free.
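To give a flavor of the simulate-apply-inspect procedure described in point (5), here is a minimal sketch. The particular choices (a standard normal distribution, samples of size 25, the sample mean as the estimator, 1000 repetitions) are arbitrary assumptions made for illustration, not an example taken from later chapters.

```r
set.seed(42)             # makes the simulated numbers reproducible
n.sims <- 1000           # number of simulated datasets
n <- 25                  # observations per dataset

sample.means <- numeric(n.sims)   # storage for the estimates
for (i in 1:n.sims) {
  sim.data <- rnorm(n, mean = 0, sd = 1)   # simulate one dataset
  sample.means[i] <- mean(sim.data)        # apply the estimator
}

# See what happened: the estimates center near the true mean of 0,
# and their spread is close to 1 / sqrt(25) = 0.2
mean(sample.means)
sd(sample.means)
```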

I hope that you are convinced that learning R is a good idea for anyone with an interest in statistics or data analysis.2 In addition, if you have never programmed before, I would suggest that learning one’s first programming language is one of the more rewarding intellectual experiences one can have. To program a computer successfully is to be logical, explicit, and correct; few other pursuits force us to take on these qualities in the same way and give us such clear feedback when we have fallen short.

We will focus on a subset of R’s features, including simulation, use of R’s built-in datasets, basic analyses, and basic plotting. I hope that learning these aspects of R will motivate you to explore its many other features. There are dozens of books and online tutorials that can teach you more about R. A few good ones are listed at the end of the chapter. In the remainder of this chapter, you will install R and complete a tutorial that will prepare you for the exercises in subsequent chapters. More information on basic R commands and object types is available in Appendix B.

Exercise Set 2-1

1) Download and install the current version of R on a computer to which you have regular access. R is available from the Comprehensive R Archive Network (CRAN) at http://www.r-project.org. (For Linux users, the best procedure will depend on your distribution.)

2) Open R. When you have succeeded, you should see a window with the licensing information for R and a cursor ready to take input.

3) (Optional, but recommended) Download and install RStudio. Close your open R session and open RStudio. If you prefer the RStudio interface, then use RStudio to run R for the rest of the exercises in the book.

4) Visit the book’s GitHub repository at github.com/mdedge/stfs/ to view additional resources, including R scripts, among which is a script to run all the code in this chapter.

2 I do not mean to suggest that R has no disadvantages. R cannot return your love:

> I love you, R
Error: unexpected symbol in "I love"

2.1 Interacting with R

The most important thing you can learn about R is how to get help. The two built-in commands for doing so are help() and help.search(). The help() command directs you to information about other R commands. For example, typing help(mean) at the prompt and hitting return will open a web browser and take you to a page with information about how to use the mean() function, which we will see a little later. One downside of help() is that you have to know the name of the R function you want to use. If you don’t know the name of the function you need, you can use help.search(). For example, if we didn’t know that mean() was the function we need to take the mean of a set of numbers, we could try help.search("mean"). This brings up a list of functions that match the query "mean". In this specific case, help.search() is not too helpful: a lot of functions come up, and the one we want, base::mean(), is buried. (We see “base::mean()” because mean() is in the base package, which loads automatically when R is started.) When you don’t know the name of the function you need, a web search is usually more helpful.

As you progress in your use of R, you will find help() to be increasingly useful. But as you start, you may find the help() pages to be hard to understand; one has to learn a little about R before they make sense. This tutorial and the information in Appendix B will help you get comfortable. After that, you can switch to using some of the resources at the end of the chapter. My single favorite resource for beginners is Rob Kabacoff’s free website Quick-R (http://www.statmethods.net/), which has help and examples for most of the tasks you’ll need to perform frequently in R. If you have a specific question, search for it; it has likely been answered on the R forum or Stack Overflow (www.stackoverflow.com/).

Once you have opened R, you will see a prompt that looks like this:

>

The simplest way to view R is as a program that responds directly to your commands. You type a command followed by the return key, and R returns an answer. For example, you can ask R to do arithmetic for you:

> 3 + 4
[1] 7

> (9 * 8 * 7 * sqrt(6)) / 3
[1] 411.5143

Here, the “[1]” before the answer means that your answer is the first entry of a vector that R returned, which in this case is a vector of length 1. Vectors are ordered collections of items of the same type. Both of the commands above are expressions, one of the two major types of R commands. R evaluates the expression and prints the answer. The answer is not saved. The other major type of R command is assignment, which we will see shortly.
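To see that even a single number is a vector of length 1, and what the bracketed index looks like for a longer vector, one can try something like the following sketch. The name result and the vector 100:130 are arbitrary choices for illustration.

```r
result <- 3 + 4      # store the answer so that we can inspect it
length(result)       # a single number is a vector of length 1
is.vector(result)    # TRUE

# A longer vector: when the printed output wraps, each line begins with
# the index, in brackets, of the first entry shown on that line
100:130
```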

Anything that appears after a “#” sign on a line is treated as a comment and ignored. So, as far as R is concerned,

> # The next line gives the sum I need
> 2 + 3 # 2 and 3 are important numbers

is the same as

> 2 + 3

Writing good comments is essential to writing readable code. Comments will help you understand and debug code after you have set it aside.

You can use the “up” and “down” arrow keys in the R terminal to return to commands that you have already entered. The “up” arrow brings up your most recently issued command; pressing “up” again brings up the command issued before that, etc. This tactic is especially helpful when you need to re-enter a command with a modification.

You can store values of variables in R. For example, if you wanted to assign a variable called “x” to equal 7, you would type

> x <- 7

The combination of keystrokes “<-” is used to assign values to variables. The spaces after the x and before the 7 can be omitted without affecting the way the command works. In general, R is flexible about spaces that do not interrupt the names of variables or functions. You can also type

> x = 7

to do the same thing. Both of the previous two commands are assignment commands. In contrast to the values that result from an expression command, the values of assignments are not printed. Instead, they are stored for later use.

To see the value currently assigned to x, you can type

> x
[1] 7

You can use x in computations in the same way you would use the value assigned to x:

> x * 7
[1] 49
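The difference between the two kinds of command can be seen in a short script such as the sketch below. Wrapping an assignment in parentheses, which makes R print the value as well as store it, is an extra convenience not covered in the text above; the names y and z are arbitrary.

```r
x <- 7        # assignment: the value is stored, nothing is printed
x             # expression: the stored value is printed
x * 7         # expression using the stored value

y = 7         # "=" makes the same kind of assignment as "<-"
identical(x, y)

(z <- x * 7)  # parentheses around an assignment store AND print the value
```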

Three important notes arise here. First, R is case-sensitive. If you try to refer to “x” as a capital “X”, you will be rewarded with an error message:

> X * 7
Error: object ‘X’ not found

More generally, R cannot find an object or call a function whose name has been misspelled. This is a frequent source of vexation for beginners, and an occasional one throughout one’s R-using life. If R returns errors you do not expect, check carefully for typos.

Second, it is possible to have commands that are spread over multiple lines. If you enter a partial command and then hit return, you will see the typical “>” prompt replaced by a “+” symbol. The “+” indicates that R needs more input before it can evaluate a command. It is easy to miss a closing bracket or parenthesis, for example,

> (1 + 3 + 5) / (2 + 4 + 6
+ )
[1] 0.75

When the “+” prompt appears, finish the command. If you don’t know how to finish the command, you can get a new prompt with the escape key on a Windows or Mac or with ctrl-c on a Linux machine.

You can see that the particulars of what you type are important. Though you can type all your commands directly into R, it is much better to save your R commands in a separate file; this allows for easier correction of typos and quick reproduction of analyses. When you use a text editor to save your commands, do not include the prompt “>” on each line; the prompt symbol is built into R and is not part of the input.

A plain text editor will work fine for saving your commands. Do not use a program that adds formatting to your text, such as Microsoft Word, because the formatting can interfere with the commands themselves. Many text editors, such as gedit, will helpfully highlight your R code to enhance readability if you save your file with a .R extension. There are also integrated development environments that simultaneously highlight code, track variables, and can feed your code straight into the R console. If you want to use an integrated development environment, RStudio is excellent.

Of course, you will want to use R for more than arithmetic. In the next section, we will work through some data summaries and graphical procedures.

2.2 Tutorial: the Iris data

It is time to write some code. Remember to write R code in a separate text editor; you can then paste it into the R console.3 You ought to execute all the code in this section on your own computer. A script including all the code in this section (as well as other scripts, including all the code in this book) is available at github.com/mdedge/stfs/.

In this tutorial, we will conduct some exploratory data analysis of the R dataset iris, which is also discussed in Appendix A. The iris data frame is built into R, along with many other datasets.4 The iris data frame includes a set of measurements on 50 iris flowers from each of three different species: Iris setosa, Iris virginica, and Iris versicolor.

The iris dataset is built into R; you do not need to do anything to install it. You can see the whole dataset by entering iris at the command line. It is usually more useful to examine just the first few rows of a data frame, which you can see using the head() function:

> head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

head() is an R function. Like mathematical functions (see Appendix A), R functions take inputs, or arguments, and return outputs. In addition to the functions built into R, thousands more are available in add-on packages, and you can also write your own functions.

3 Two other ways to transfer code from the text editor to R: (1) If you are using RStudio, you can highlight code written in the source file and run it by holding ctrl (or, on a Mac, cmd) and hitting the return key. (2) The source() function is another way to run all the commands written in a text file. See help(source).

4 You can see the names of all the built-in datasets using library(help="datasets"). Typically, one can refer to a built-in dataset simply by typing its name in the R console, unless that name has been used for something else in the session. The built-in datasets will not be listed by the ls() function (or shown among the objects in the environment in RStudio) unless they are loaded with the data() function, as in data("iris").

We have explicitly specified the argument iris, indicating in this case that we want to see the first few lines of the object named iris. There is another argument to head() that we can leave unspecified if we are fine with the default values, but which we could also change if we wanted. For example, the command head(iris, n=10) would produce the first 10 lines of iris, rather than the first 6 lines. Function arguments have names; the equals sign in n=10 indicates that we are assigning the argument named n the value 10. There is some flexibility in the explicit naming of arguments; for example, we did not write x=iris even though the first argument of head() is named x. The rule is that if the arguments are given in the order the function expects them (in the case of head(), x first and n second), then one does not have to name them. For example, these five commands give the same results:

> head(iris, 10)

> head(iris, n=10)

> head(x=iris, 10)

> head(x=iris, n=10)

> head(n=10, x=iris)
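One way to confirm that differently written calls give the same result is to store the outputs and compare them with identical(). The names by.position and by.name below are arbitrary choices for this sketch.

```r
by.position <- head(iris, 10)          # arguments matched by position
by.name     <- head(n = 10, x = iris)  # arguments matched by name, any order

identical(by.position, by.name)   # TRUE: the two calls give the same result
nrow(by.position)                 # 10 rows, as requested
```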

But if the arguments are given in an order different than R expects and not named, you will have problems. The call head(10, iris), for example, gives an error. You can see the arguments that each function expects in order using help().

iris is a data frame, which means that it can hold rows and columns of data, that the rows and columns can be named, and that different columns can store data of different types (numeric, character, etc.). Here, we can see that each row contains an individual flower, and the columns contain measurements of different features of each flower. Each column of the data frame is a vector, an ordered collection of data of one type.

We can quickly gain some useful information using the summary() function:

> summary(iris)

 Sepal.Length   Sepal.Width    Petal.Length   Petal.Width
 Min.   :4.300  Min.   :2.000  Min.   :1.000  Min.   :0.100
 1st Qu.:5.100  1st Qu.:2.800  1st Qu.:1.600  1st Qu.:0.300
 Median :5.800  Median :3.000  Median :4.350  Median :1.300
 Mean   :5.843  Mean   :3.057  Mean   :3.758  Mean   :1.199
 3rd Qu.:6.400  3rd Qu.:3.300  3rd Qu.:5.100  3rd Qu.:1.800
 Max.   :7.900  Max.   :4.400  Max.   :6.900  Max.   :2.500

       Species
 setosa    :50
 versicolor:50
 virginica :50
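The description of iris as a data frame, with named columns that can hold different types of data, can be checked directly. The functions below are standard R tools for inspecting an object; they are offered as a sketch for exploration. Note that the Species column is stored as a factor, R's type for categorical data, which accounts for the counts shown for it in the summary.

```r
class(iris)           # "data.frame"
dim(iris)             # number of rows and columns: 150 and 5
names(iris)           # column names
sapply(iris, class)   # the type of data stored in each column
str(iris)             # compact overview of the whole data frame
```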

If we would rather not see results for the whole dataset at once, we can probe the individual variables. We refer to individual variables in a data frame by typing the name of the data frame first, then $, and then the name of the variable we want to reference. Try entering iris$Sepal.Length into the R terminal. Then try giving that statement as an argument to summary(), as in

> summary(iris$Sepal.Length)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  4.300   5.100   5.800   5.843   6.400   7.900

If we would rather not see all the information provided by the summary, we can also ask for specific values, like the sample mean or median:5

> mean(iris$Sepal.Length)
[1] 5.843333

> median(iris$Sepal.Length)
[1] 5.8
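To connect these functions to their definitions, the sample mean and median can also be computed "by hand." The sketch below is for illustration only (in practice one would simply call mean() and median()), and the variable names are arbitrary.

```r
x <- iris$Sepal.Length
n <- length(x)                    # number of observations (150)

# Sample mean: the sum of the observations divided by their number
sum(x) / n

# The same sum accumulated one observation at a time with a for() loop
total <- 0
for (value in x) {
  total <- total + value
}
mean.by.hand <- total / n

# Median: with an even number of observations, the average of the two
# middle values after sorting
x.sorted <- sort(x)
median.by.hand <- (x.sorted[n / 2] + x.sorted[n / 2 + 1]) / 2

mean.by.hand
median.by.hand
```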

Histograms provide a visual summary of the data for one variable.

> hist(iris$Sepal.Length)

produces a basic histogram of the sepal length data. You can improve it by changing the axis label using the xlab argument. For example, the command

> hist(iris$Sepal.Length, xlab = "Sepal Length", main = "")

produces a plot similar to the one shown in Figure 2-1. (I have made some alterations to the figure for aesthetic reasons; you can view the code used to make Figure 2-1, and all the other R figures in the book, at github.com/mdedge/stfs/.) Notice that the arguments are separated by commas and that when we want to refer to strings of characters rather than named variables, we put text in quotations. You can see other options for hist() using help(hist).
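As a small sketch of what hist() computes, the call below asks for the histogram's numbers without drawing it, and then redraws the plot with a different number of bins. The choice of roughly 20 bins is arbitrary and is not the setting used for Figure 2-1.

```r
# Compute the histogram without drawing it
h <- hist(iris$Sepal.Length, plot = FALSE)
h$breaks        # boundaries of the bins on the horizontal axis
h$counts        # number of observations falling in each bin
sum(h$counts)   # every one of the 150 flowers is counted once

# Redraw with a suggested number of bins; R treats this as a suggestion
hist(iris$Sepal.Length, xlab = "Sepal Length", main = "", breaks = 20)
```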

The Iris data are well-known because R. A. Fisher (probably the twentieth century’s most important statistician) used them in a famous study.6 When Fisher examined the Iris data,

Figure 2-1 Histograms show the empirical distribution of one-dimensional data. The vertical axis shows the count (or proportion) of the observations that fall in the range shown on the horizontal axis.

5 The mean is the arithmetic average, the sum of the observations divided by the number of observations in the set being summed. A median is a number that falls at the 50th percentile of the data, that is, a number than which half of the observations are greater and half of the observations are smaller.

6 The study (Fisher, 1936) appeared in the Annals of Eugenics. It is distressing to learn but important to acknowledge that the development of statistics in the late nineteenth and early twentieth centuries was
