Introductory Statistics
About the Author
Sheldon M. Ross
Sheldon M. Ross received his Ph.D. in Statistics at Stanford University in 1968 and then joined the Department of Industrial Engineering and Operations Research at the University of California at Berkeley. He remained at Berkeley until Fall 2004, when he became the Daniel J. Epstein Professor of Industrial and Systems Engineering in the Daniel J. Epstein Department of Industrial and Systems Engineering at the University of Southern California. He has published many technical articles and textbooks in the areas of statistics and applied probability. Among his texts are A First Course in Probability (ninth edition), Introduction to Probability Models (eleventh edition), Simulation (fifth edition), and Introduction to Probability and Statistics for Engineers and Scientists (fifth edition).
Professor Ross is the founding and continuing editor of the journal Probability in the Engineering and Informational Sciences. He is a fellow of the Institute of Mathematical Statistics, the Institute for Operations Research and Management Sciences, and a recipient of the Humboldt U.S. Senior Scientist Award.
12.12 Logistic Regression
12.13 Use of R in Regression
12.13.1 Simple Linear Regression
12.13.2 Multiple Linear Regression
12.13.3 Logistic Regression
Key Terms
Review Problems
CHAPTER 13 Chi-Squared Goodness-of-Fit Tests
13.1 Introduction
13.2 Chi-Squared Goodness-of-Fit Tests
13.3 Testing for Independence in Populations
Problems
13.4 Testing for Independence in Contingency Tables
Problems
13.5 Use of R
Key Terms
Summary
Review Problems
CHAPTER 14 Nonparametric Hypotheses Tests
14.1 Introduction
14.2 Sign Test
14.2.1 Testing the Equality of Population Distributions when Samples Are Paired
14.2.2 One-Sided Tests
Problems
14.3 Signed-Rank Test
14.3.1 Zero Differences and Ties
CHAPTER 16 Machine Learning and Big Data
16.1 Introduction
16.2 Late Flight Probabilities
16.3 The Naive Bayes Approach
16.3.1 A Variation of Naive Bayes
Problems
16.4 Distance Based Estimators: The k-Nearest Neighbors Rule
16.4.1 A Distance Weighted Method
Problems
16.5 Assessing the Approaches
16.6 Choosing the Best Probability, A Bandit Problem
Problems
APPENDIX A A Data Set
APPENDIX B Mathematical Preliminaries
APPENDIX C How to Choose a Random Sample
APPENDIX D Tables
APPENDIX E Programs
Preface
Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write.
H. G. Wells (1866–1946)
In today’s complicated world, very few issues are clear-cut and without controversy. In order to understand and form an opinion about an issue, one must usually gather information, or data. To learn from data, one must know something about statistics, which is the art of learning from data.
This introductory statistics text is written for college-level students in any field of study. It can be used in a quarter, semester, or full-year course. Its only prerequisite is high school algebra. Our goal in writing it is to present statistical concepts and techniques in a manner that will teach students not only how and when to utilize the statistical procedures developed, but also to understand why these procedures should be used. As a result we have made a great effort to explain the ideas behind the statistical concepts and techniques presented. Concepts are motivated, illustrated, and explained in a way that attempts to increase one’s intuition. It is only when a student develops a feel or intuition for statistics that she or he is really on the path toward making sense of data.
To illustrate the diverse applications of statistics and to offer students different perspectives about the use of statistics, we have provided a wide variety of text examples and problems to be worked by students. Most refer to real-world issues, such as gun control, stock price models, health issues, driving age limits, school admission ages, public policy issues, gender issues, use of helmets, sports, disputed authorship, scientific fraud, and Vitamin C, among many others. Many of them use data that not only are real but are themselves of interest. The examples have been posed in a clear and concise manner and include many thought-provoking problems that emphasize thinking and problem-solving skills. In addition, some of the problems are designed to be open-ended and can be used as starting points for term projects.
ested in testing whether the proportions of men and of women that favor term limits are the same.
Probably the most widely used statistical inference technique is that of the analysis of variance; this is introduced in Chap. 11. This technique allows us to test inferences about parameters that are affected by many different factors. Both one- and two-factor analysis of variance problems are considered in this chapter.
In Chap. 12 we learn about linear regression and how it can be used to relate the value of one variable (say, the height of a man) to that of another (the height of his father). The concept of regression to the mean is discussed, and the regression fallacy is introduced and carefully explained. We also learn about the relation between regression and correlation. Also, in an optional section, we use regression to the mean along with the central limit theorem to present a simple, original argument to explain why biological data sets often appear to be normally distributed.
In Chap. 13 we present goodness-of-fit tests, which can be used to test whether a proposed model is consistent with data. This chapter also considers populations classified according to two characteristics and shows how to test whether the characteristics of a randomly chosen member of the population are independent.
Chapter 14 deals with nonparametric hypothesis tests, which are tests that can be used in situations where the ones of earlier chapters are inappropriate.
Chapter 15 introduces the subject matter of quality control, a key statistical technique in manufacturing and production processes.
Chapter 16 deals with the topics of machine learning and big data. The techniques described have become popular in recent years due to the preponderance of large amounts of data. A general problem considered is to determine the probability that a cross-country flight will be late, with the flight defined by a characterizing vector giving such information as the airline, the departure airport, the arrival airport, the time of departure, and the weather conditions. We consider a variety of estimation procedures, with names like naive Bayes and distance based approaches. We then consider what are known as bandit problems, which can be applied, among other things, to sequentially choosing among different medications for treating a particular medical condition.
NEW TO THIS EDITION
The fourth edition has many new and updated examples and exercises. In addition, there are the following:
1. A new section (Section 3.8) on Lorenz Curves and the Gini Index. Lorenz curves are plots, as p ranges from 0 to 1, of the fraction of the total in-
Introduction to Statistics
Statisticians have already overrun every branch of science with a rapidity of conquest rivalled only by Attila, Mohammed, and the Colorado beetle.
Maurice Kendall (British statistician)
This chapter introduces the subject matter of statistics, the art of learning from data. It describes the two branches of statistics, descriptive and inferential. The idea of learning about a population by sampling and studying certain of its members is discussed. Some history is presented.
1.1 INTRODUCTION
Is it better for children to start school at a younger or older age? This is certainly a question of interest to many parents as well as to people who set public policy. How can we answer it?
It is reasonable to start by thinking about this question, relating it to your own experiences, and talking it over with friends. However, if you want to convince others and obtain a consensus, it is then necessary to gather some objective information. For instance, in many states, achievement tests are given to children at the end of their first year in school. The children’s results on these tests can be obtained and then analyzed to see whether there appears to be a connection between children’s ages at school entrance and their scores on the test. In fact, such studies have been done, and they have generally concluded that older student entrants have, as a group, fared better than younger entrants. However, it has also been noted that the reason for this may just be that those students who entered at an older age would be older at the time of the examination, and this by itself may be what is responsible for their higher scores. For instance, suppose parents did not send their 6-year-olds to school but rather waited an additional year. Then, since these children will probably learn a great deal at home in that year, they will probably score higher when they take the test at the end of their first year of school than they would have if they had started school at age 6.