Introduction to Data Analysis
Concepts in Chapter 1:
• Scientific Method and Statistical Analysis
• Parameters: Descriptive Characteristics of Populations
• Statistics: Descriptive Characteristics of Samples
• Variable Types: Continuous, Discrete, Ranked, and Categorical
• Measures of Central Tendency: Mean, Median, and Mode
• Measures of Dispersion: Range, Variance, Standard Deviation, and Standard Error
• Descriptive Statistics for Frequency Data
• Effects of Coding on Descriptive Statistics
• Tables and Graphs
• Quartiles and Box Plots
• Accuracy, Precision, and the 30–300 Rule
1.1 Introduction
The modern study of the life sciences includes experimentation, data gathering, and interpretation. This text offers an introduction to the methods used to perform these fundamental activities.
The design and evaluation of experiments, known as the scientific method, is utilized in all scientific fields and is often implied rather than explicitly outlined in many investigations. The components of the scientific method include observation, formulation of a potential question or problem, construction of a hypothesis, followed by a prediction, and the design of an experiment to test the prediction. Let’s consider these components briefly.
Observation of a Particular Event
Generally an observation can be classified as either quantitative or qualitative. Quantitative observations are based on some sort of measurement, for example, length, weight, temperature, and pH. Qualitative observations are based on categories reflecting a quality or characteristic of the observed event, for example, male versus female, diseased versus healthy, and mutant versus wild type.
Statement of the Problem
A series of observations often leads to the formulation of a particular problem or unanswered question. This usually takes the form of a “why” question and implies a cause and effect relationship. For example, suppose upon investigating a remote Fijian island community you realized that the vast majority of the adults suffer from hypertension (abnormally elevated blood pressures with the systolic over 165 mmHg and the diastolic over 95 mmHg). Note that the individual observations here are quantitative while the percentage that are hypertensive is based on a qualitative evaluation of the sample. From these preliminary observations one might formulate the question: Why are so many adults in this population hypertensive?
Formulation of a Hypothesis
A hypothesis is a tentative explanation for the observations made. A good hypothesis suggests a cause and effect relationship and is testable.
The Fijian community may demonstrate hypertension because of diet, lifestyle, genetic makeup, or combinations of these factors. Because we have noticed an extraordinary consumption of octopi in their diet, and knowing that octopods have a very high cholesterol content, we might hypothesize that the high level of hypertension is caused by diet.
Making a Prediction
If the hypothesis is properly constructed, it can and should be used to make predictions. Predictions are based on deductive reasoning and take the form of an “if-then” statement. For example, a good prediction based on the hypothesis above would be: If the hypertension is caused by a high cholesterol diet, then changing the diet to a low cholesterol one should lower the incidence of hypertension.
The criteria for a valid (properly stated) prediction are:
1. An “if” clause stating the hypothesis.
2. A “then” clause that
(a) suggests altering a causative factor in the hypothesis (change of diet);
(b) predicts the outcome (lower level of hypertension);
(c) provides the basis for an experiment.
Design of the Experiment
The entire purpose and design of an experiment is to accomplish one goal, that is, to test the hypothesis. An experiment tests the hypothesis by testing the correctness or incorrectness of the predictions that came from it. Theoretically, an experiment should alter or test only the factor suggested by the prediction, while all other factors remain constant.
How would you design an experiment to test the diet hypothesis in the hypertensive population?
The best way to test the hypothesis above is by setting up a controlled experiment. This might involve using two randomly chosen groups of adults from the community and treating both identically with the exception of the one factor being tested. The control group represents the “normal” situation, has all factors present, and is used as a standard or basis for comparison. The experimental group represents the “test” situation and includes all factors except the variable that has been altered, in this case the diet. If the group with the low cholesterol diet exhibits significantly lower levels of hypertension, the hypothesis is supported by the data. On the other hand, if the change in diet has no effect on hypertension, then a new or revised hypothesis should be formulated and the experimental procedure redesigned. Finally, the generalizations that are drawn by relating the data to the hypothesis can be stated as conclusions.
While these steps outlined above may seem straightforward, they often require considerable insight and sophistication to apply properly. In our example, how the groups are chosen is not a trivial problem. They must be constructed without bias and must be large enough to give the researcher an acceptable level of confidence in the results. Further, how large a change is significant enough to support the hypothesis? What is statistically significant may not be biologically significant.
A foundation in statistical methods will help you design and interpret experiments properly. The field of statistics is broadly defined as the methods and procedures for collecting, classifying, summarizing, and analyzing data, and utilizing the data to test scientific hypotheses. The term statistics is derived from the Latin for state, and originally referred to information gathered in various censuses that could be numerically summarized to describe aspects of the state, for example, bushels of wheat per year, or number of military-aged men. Over time statistics has come to mean the scientific study of numerical data based on natural phenomena. Statistics applied to the life sciences is often called biostatistics or biometry. The foundations of biostatistics go back several hundred years, but statistical analysis of biological systems began in earnest in the late nineteenth century as biology became more quantitative and experimental.
1.2 Populations and Samples
Today we use statistics as a means of informing the decision-making processes in the face of the uncertainties that most real world problems present. Often we wish to make generalizations about populations that are too large or too difficult to survey completely. In these cases we sample the population and use characteristics of the sample to extrapolate to characteristics of the larger population. See Figure 1.1.
Real-world problems concern large groups or populations about which inferences must be made. (Is there a size difference between two color morphs of the same species of sea star? Are the offspring of a certain cross of fruit flies in a 3:1 ratio of normal to eyeless?) Certain characteristics of the population are of particular interest (systolic blood pressure, weight in grams, resting body temperature). The values of these characteristics will vary from individual to individual within the population. These characteristics are called random variables because they vary in an unpredictable way or in a way that appears or is assumed to depend on chance. The different types of variables are described in Section 1.3.
A descriptive measure associated with a random variable when it is considered over the entire population is called a parameter. Examples are the mean weight of all green turtles, Chelonia mydas, or the variance in clutch size of all tiger snakes, Notechis scutatus. In general, such parameters are difficult, if not impossible, to determine because the population is too large or expensive to study in its entirety. Consequently, one is forced to examine a subset or sample of the population and make inferences about the entire population based on this sample. A descriptive measure associated with a random variable of a sample is called a statistic. The mean weight of 25 female green turtles laying eggs on Heron Island or the variability in clutch size of 50 clutches of tiger snake eggs collected in southeastern Queensland are examples of statistics.
Population(s) have traits called random variables. Summary characteristics of the population random variables are called parameters: $\mu$, $\sigma^2$, $N$.
Random samples of size $n$ of the population(s) generate numerical data: $X_i$’s.
These data can be organized into summary statistics: $\bar{X}$, $s^2$, $n$, graphs, and figures (Chapter 1).
The data can be analyzed using an understanding of basic probability (Chapters 2–4) and various tests of hypotheses (Chapters 5–11).
The analyses lead to conclusions or inferences about the population(s) of interest.
FIGURE 1.1. The general approach to statistical analysis.
While such statistics are not equal to the population parameters, it is hoped that they are sufficiently close to the population parameters to be useful or that the potential error involved can be quantified. Sample statistics along with an understanding of probability form the foundation for inferences about population parameters. See Figure 1.1 for review.
Chapter 1 provides techniques for organizing sample data. Chapters 2 through 4 present the necessary probability concepts, and the remaining chapters outline various techniques to test a wide range of predictions from hypotheses.
Concept Checks. At the end of several of the sections in each chapter we include one or two questions designed as a rapid check of your mastery of a central idea of the section’s content. These questions will be most helpful if you do each as you encounter it in the text. Answers to these questions are given at the end of each chapter just before the exercises.
Concept Check 1.1. Which of the following are populations and which are samples?
(a) The weights of 25 randomly chosen eighth grade boys in the Detroit public school system.
(b) The number of eggs found in each osprey nest on Mt. Desert Island in Maine.
(c) The heights of 15 redwood trees measured in the Muir Woods National Monument, an old growth coast redwood forest.
(d) The lengths of all the blind cave fish, Astyanax mexicanus, in a small cavern system in central Mexico.
1.3 Variables or Data Types
There are several data types that arise in statistics. Each statistical test requires that the data analyzed be of a specified type. Here are the most common types of variables.
1. Quantitative variables fall into two major categories:
(a) Continuous variables or interval data can assume any value in some (possibly unbounded) interval of real numbers. Common examples include length, weight, temperature, volume, and height. They arise from measurement.
(b) Discrete variables assume only isolated values. Examples include clutch size, trees per hectare, arms per sea star, or items per quadrat. They arise from counting.
2. Ranked (ordinal) variables are not measured but nonetheless have a natural ordering. For example, candidates for political office can be ranked by individual voters. Or students can be arranged by height from shortest to tallest and correspondingly ranked without ever being measured. The rank values have no inherent meaning outside the “order” that they provide. That is, a candidate ranked 2 is not twice as preferable as the person ranked 1. (Compare this with measurement variables where a plant 2 feet tall is twice as tall as a plant 1 foot tall. With measurement variables such ratios are meaningful, while with ordinal variables they are not.)
3. Categorical data are qualitative data. Some examples are species, gender, genotype, phenotype, healthy/diseased, and marital status. Unlike with ranked data, there is no “natural” ordering that can be assigned to these categories.
When measurement variables are collected for either a population or a sample, the numerical values have to be abstracted or summarized in some way. The summary descriptive characteristics of a population of objects are called population parameters or just parameters. The calculation of a parameter requires knowledge of the measurement variable’s value for every member of the population. These parameters are usually denoted by Greek letters and do not vary within a population. The summary descriptive characteristics of a sample of objects, that is, a subset of the population, are called statistics. Sample statistics can have different values, depending on how the sample of the population was chosen. Statistics are denoted by various symbols, but (almost) never by Greek letters.
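1.4 Measures of Central Tendency: Mean, Median, and Mode
Mean
There are several commonly used measures to describe the location or center of a population or sample. The most widely utilized measure of central tendency is the arithmetic mean or average.
The population mean is the sum of the values of the variable under study divided by the total number of objects in the population. It is denoted by a lowercase $\mu$ (“mu”). Each value is algebraically denoted by an $X$ with a subscript denotation $i$. For example, a small theoretical population whose objects had values 1, 6, 4, 5, 6, 3, 8, 7 would be denoted
\[ X_1 = 1,\quad X_2 = 6,\quad X_3 = 4,\quad X_4 = 5,\quad X_5 = 6,\quad X_6 = 3,\quad X_7 = 8,\quad X_8 = 7. \tag{1.1} \]
We would denote the population size with a capital $N$. In our theoretical population $N = 8$.
The population mean $\mu$ would be
\[ \mu = \frac{1 + 6 + 4 + 5 + 6 + 3 + 8 + 7}{8} = \frac{40}{8} = 5. \]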
FORMULA 1.1. The algebraic shorthand formula for a population mean is
\[ \mu = \frac{\sum_{i=1}^{N} X_i}{N}. \]
The Greek letter $\Sigma$ (“sigma”) indicates summation. The subscript $i = 1$ indicates to start with the first observation and the superscript $N$ means to continue until and including the $N$th observation. The subscript and superscript may represent other starting and stopping points for the summation within the population or sample. For the example above, $\sum_{i=2}^{5} X_i$ would indicate the sum of $X_2 + X_3 + X_4 + X_5$ or $6 + 4 + 5 + 6 = 21$.
Notice also that $\sum\limits_{i=1}^{N} X_i$ is written $\sum_{i=1}^{N} X_i$ when the summation symbol is embedded in a sentence. In fact, to further reduce clutter, the summation sign may not be indexed at all, for example $\sum X_i$. It is implied that the operation of addition begins with the first observation and continues through the last observation in a population, that is,
\[ \sum X_i = \sum_{i=1}^{N} X_i. \]
If sigma notation is new to you or if you wish a quick review of its properties, read Appendix A.1 before continuing.
FORMULA 1.2. The sample mean is defined by
\[ \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}, \]
where $n$ is the sample size. The sample mean is usually reported to one more decimal place than the data and always has appropriate units associated with it.
The symbol $\bar{X}$ (read “X bar”) indicates that the observations of a subset of size $n$ from a population have been averaged. $\bar{X}$ is fundamentally different from $\mu$ because samples from a population can have different values for their sample mean, that is, they can vary from sample to sample within the population. The population mean, however, is constant for a given population.
Again consider the small theoretical population 1, 6, 4, 5, 6, 3, 8, 7. A sample of size 3 may consist of 5, 3, 4 with $\bar{X} = 4$ or 6, 8, 4 with $\bar{X} = 6$.
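Actually there are 56 possible samples of size 3 that could be drawn from the population in (1.1). Only seven of these samples (counting the two observations equal to 6 as distinct objects) have a sample mean the same as the population mean, that is, $\bar{X} = \mu$:
Sample       Sum    $\bar{X}$
1, 6, 8      15     5    (two samples, one for each 6)
3, 4, 8      15     5
3, 5, 7      15     5
3, 6, 6      15     5
4, 5, 6      15     5    (two samples, one for each 6)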
Each sample mean $\bar{X}$ is an unbiased estimate of $\mu$ but depends on the values included in the sample and sample size for its actual value. We would expect the average of all possible $\bar{X}$’s to be equal to the population parameter, $\mu$. This is, in fact, the definition of an unbiased estimator of the population mean.
If you calculate the sample mean for each of the 56 possible samples with $n = 3$ and then average these sample means, they will give an average value of 5, that is, the population mean, $\mu$. Remember that most real populations are too large or too difficult to census completely, so we must rely on using a single sample to estimate or approximate the population characteristics.
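The claim about averaging all 56 sample means can be checked by brute force. The following is a minimal Python sketch, not part of the original text; the variable names are illustrative, and exact fractions are used only to avoid floating-point round-off.

```python
from fractions import Fraction
from itertools import combinations

population = [1, 6, 4, 5, 6, 3, 8, 7]
mu = Fraction(sum(population), len(population))  # population mean

# combinations() selects by position, so the two 6s count as distinct objects.
samples = list(combinations(population, 3))

# Exact mean of every possible sample of size 3.
sample_means = [Fraction(sum(s), len(s)) for s in samples]
mean_of_means = sum(sample_means) / len(sample_means)

print(len(samples))        # number of possible samples of size 3
print(mu, mean_of_means)   # the two values agree
```

Because every observation appears in the same number of samples, the average of the sample means must equal $\mu$; the enumeration simply makes that visible for this small population.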
Median
The second measure of central tendency is the median. The median is the “middle” value of an ordered list of observations. Though this idea is simple enough, it will prove useful to define it in terms of an even simpler notion. The depth of a value is its position relative to the nearest extreme (end) when the data are listed in order from smallest to largest.
EXAMPLE 1.1. The table below gives the circumferences at chest height (CCH) (in cm) and their corresponding depths for 15 sugar maples, Acer saccharum, measured in a forest in southeastern Ohio.
CCH     18   21   22   29   29   36   37   38   56   59   66   70   88   93   120
Depth    1    2    3    4    5    6    7    8    7    6    5    4    3    2     1
The population median $M$ is the observation whose depth is $d = \frac{N+1}{2}$, where $N$ is the population size.
Note that this parameter is not a Greek letter and is seldom computed in practice. Rather a sample median $\tilde{X}$ (read “X tilde”) is the statistic used to approximate or estimate the population median. $\tilde{X}$ is defined as the observation whose depth is $d = \frac{n+1}{2}$, where $n$ is the sample size. In Example 1.1, the sample size is $n = 15$, so the depth of the sample median is $d = 8$. The sample median $\tilde{X} = X_{\frac{n+1}{2}} = X_8 = 38$ cm.
EXAMPLE 1.2. The table below gives CCH (in cm) for 12 cypress pines, Callitris preissii, measured near Brown Lake on North Stradbroke Island.
CCH     17   19   31   39   48   56   68   73   73   75   80   122
Depth    1    2    3    4    5    6    6    5    4    3    2     1
Since $n = 12$, the depth of the median is $\frac{12+1}{2} = 6.5$. Obviously no observation has depth 6.5, so this is interpreted as the average of both observations whose depth is 6 in the list above. So $\tilde{X} = \frac{56+68}{2} = 62$ cm.
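The depth rule translates directly into code. Below is a small Python sketch under the assumption that the data fit in a list; the function name `depth_median` is hypothetical and chosen only for illustration.

```python
def depth_median(values):
    """Median found with the depth rule d = (n + 1) / 2."""
    ordered = sorted(values)
    n = len(ordered)
    d = (n + 1) / 2                      # depth of the median
    k = int(d)                           # whole-number part of the depth
    from_bottom = ordered[k - 1]         # depth k counted from the smallest value
    if d == k:                           # n odd: one observation has depth d
        return from_bottom
    from_top = ordered[n - k]            # depth k counted from the largest value
    return (from_bottom + from_top) / 2  # n even: average the two middle values

maples = [18, 21, 22, 29, 29, 36, 37, 38, 56, 59, 66, 70, 88, 93, 120]  # Example 1.1
pines = [17, 19, 31, 39, 48, 56, 68, 73, 73, 75, 80, 122]               # Example 1.2

print(depth_median(maples))  # 38
print(depth_median(pines))   # 62.0
```

In practice Python's built-in `statistics.median` gives the same results; the explicit version is shown only because it mirrors the depth definition.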
Mode
The mode is defined as the most frequently occurring value in a data set. The mode of Example 1.2 would be 73 cm, while Example 1.1 would have a mode of 29 cm. In symmetrical unimodal distributions the mean, median, and mode are coincident. Bimodal distributions may indicate a mixture of samples from two populations, for example, weights of males and females. While the mode is not often used in biological research, reporting the number of modes, if more than one, can be informative.
Each measure of central tendency has different features. The mean is a purposeful measure only for a quantitative variable, whether it is continuous (for example, height) or discrete (for example, clutch size). The median can be calculated whenever a variable can be ranked (including when the variable is quantitative). Finally, the mode can be calculated for categorical variables, as well as for quantitative and ranked variables.
The sample median expresses less information than the sample mean because it utilizes only the ranks and not the actual values of each measurement. The median, however, is resistant to the effects of outliers. Extreme values or outliers in a sample can drastically affect the sample mean, while having little effect on the median. Consider Example 1.2 with $\bar{X} = 58.4$ cm and $\tilde{X} = 62$ cm. Suppose $X_{12}$ had been mistakenly recorded as 1220 cm instead of 122 cm. The mean $\bar{X}$ would become 149.9 cm while the median $\tilde{X}$ would remain 62 cm.
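The resistance of the median to an outlier is easy to reproduce with the Example 1.2 data. This is an illustrative Python sketch using the standard `statistics` module; the list name `miscoded` is an assumption introduced here.

```python
from statistics import mean, median, mode

pines = [17, 19, 31, 39, 48, 56, 68, 73, 73, 75, 80, 122]  # Example 1.2 (cm)
miscoded = pines[:-1] + [1220]   # X12 recorded as 1220 instead of 122

print(round(mean(pines), 1), median(pines), mode(pines))  # 58.4 62.0 73
print(round(mean(miscoded), 1), median(miscoded))         # 149.9 62.0
```

A single mistyped value moves the mean by more than 90 cm, while the median, which depends only on the middle ranks, does not change.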
1.5 Measures of Dispersion and Variability: Range, Variance, Standard Deviation, and Standard Error
EXAMPLE 1.3. The table that follows gives the weights of two samples of albacore tuna, Thunnus alalunga (in kg). How would you characterize the differences in the samples?
Sample 1   Sample 2
SOLUTION. Upon investigation we see that both samples are the same size and have the same mean, $\bar{X}_1 = \bar{X}_2 = 10.11$ kg. In fact, both samples have the same median. To see this, arrange the data sets in rank order as in Table 1.1. We have $n = 9$, so $\tilde{X} = X_{\frac{n+1}{2}} = X_5$, which is 9.9 kg for both samples.
TABLE 1.1. The ordered samples of Thunnus alalunga
Neither of the samples has a mode. So by all the descriptors in Section 1.4 these samples appear to be identical. Clearly they are not. The difference in the samples is reflected in the scatter or spread of the observations. Sample 1 is much more uniform than Sample 2, that is, the observations tend to cluster much nearer the mean in Sample 1 than in Sample 2. We need descriptive measures of this scatter or dispersion that will reflect these differences.
Range
The simplest measure of dispersion or “spread” of the data is the range.
FORMULAS 1.3. The difference between the largest and smallest observations in a group of data is called the range:
Sample range $= X_n - X_1$
Population range $= X_N - X_1$
When the data are ordered from smallest to largest, the values $X_n$ and $X_1$ are called the sample range limits.
In Example 1.3 we have from Table 1.1 Sample 1: range =
The range for each of these two samples reflects some differences in dispersion, but the range is a rather crude estimator of dispersion because it uses only two of the data points and is somewhat dependent on sample size. As sample size increases, we expect largest and smallest observations to become more extreme and, therefore, the sample range to increase even though the population range remains unchanged. It is unlikely that the sample will include the largest and smallest values from the population, so the sample range usually underestimates the population range and is, therefore, a biased estimator.
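Variance
To develop a measure that uses all the data to form an index of dispersion consider the following. Suppose we express each observation as a distance from the mean $x_i = X_i - \bar{X}$. These differences are called deviates and will be sometimes positive ($X_i$ is above the mean) and sometimes negative ($X_i$ is below the mean).
If we try to average the deviates, they always sum to 0. Because the mean is the central tendency or location, the negative deviates will exactly cancel out the positive deviates. Consider a simple numerical example
\[ X_1 = 2,\quad X_2 = 3,\quad X_3 = 1,\quad X_4 = 8,\quad X_5 = 6. \]
The mean $\bar{X} = 4$, and the deviates are
\[ x_1 = -2,\quad x_2 = -1,\quad x_3 = -3,\quad x_4 = 4,\quad x_5 = 2. \]
Notice that the negative deviates cancel the positive ones so that $\sum (X_i - \bar{X}) = 0$. Algebraically one can demonstrate the same result more generally,
\[ \sum_{i=1}^{n} (X_i - \bar{X}) = \sum_{i=1}^{n} X_i - \sum_{i=1}^{n} \bar{X}. \]
Since $\bar{X}$ is a constant for any sample,
\[ \sum_{i=1}^{n} (X_i - \bar{X}) = \sum_{i=1}^{n} X_i - n\bar{X}. \]
Since $\bar{X} = \frac{\sum X_i}{n}$, then $n\bar{X} = \sum X_i$, so
\[ \sum_{i=1}^{n} (X_i - \bar{X}) = \sum_{i=1}^{n} X_i - \sum_{i=1}^{n} X_i = 0. \]
To circumvent this unfortunate property, the widely used measure of dispersion called the sample variance utilizes the squares of the deviates. The quantity
\[ \sum_{i=1}^{n} (X_i - \bar{X})^2 \]
is the sum of these squared deviates and is referred to as the corrected sum of squares, denoted by CSS. Each observation is corrected or adjusted for its distance from the mean.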
FORMULA 1.4. The corrected sum of squares is utilized in the formula for the sample variance,
\[ s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}. \]
The sample variance is usually reported to two more decimal places than the data and has units that are the square of the measurement units.
This calculation is not as intuitive as the mean or median, but it is a very good indicator of scatter or dispersion. If the above formula had $n$ instead of $n - 1$ in the denominator, it would be exactly the average squared distance from the mean. Returning to Example 1.3, the variance of Sample 1 is 0.641 kg$^2$ and the variance of Sample 2 is 49.851 kg$^2$, reflecting the larger “spread” in Sample 2.
A sample variance is an unbiased estimator of a parameter called the population variance.
FORMULA 1.5. A population variance is denoted by $\sigma^2$ (“sigma squared”) and is defined by
\[ \sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}. \]
It really is the average squared deviation from the mean for the population. The $n - 1$ in Formula 1.4 makes it an unbiased estimate of the population parameter. (See Appendix A.2 for a proof.) Remember that “unbiased” means that the average of all possible values of $s^2$ for a certain size sample will be equal to the population value $\sigma^2$.
Formulas 1.4 and 1.5 are theoretical formulas and are rather tedious to apply directly. Computational formulas utilize the fact that most calculators with statistical registers simultaneously calculate $n$, $\sum X_i$, and $\sum X_i^2$.
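FORMULA 1.6. The corrected sum of squares $\sum (X_i - \bar{X})^2$ may be computed more simply as
\[ \sum (X_i - \bar{X})^2 = \sum X_i^2 - \frac{\left(\sum X_i\right)^2}{n}. \]
$\sum X_i^2$ is the uncorrected sum of squares and $\frac{(\sum X_i)^2}{n}$ is the correction term.
To verify Formula 1.6, using the properties in Appendix A.1 notice that
\[ \sum (X_i - \bar{X})^2 = \sum \left(X_i^2 - 2X_i\bar{X} + \bar{X}^2\right) = \sum X_i^2 - 2\bar{X}\sum X_i + n\bar{X}^2. \]
Remember that $\bar{X} = \frac{\sum X_i}{n}$, so $n\bar{X} = \sum X_i$; hence
\[ \sum (X_i - \bar{X})^2 = \sum X_i^2 - 2\bar{X}(n\bar{X}) + n\bar{X}^2 = \sum X_i^2 - n\bar{X}^2. \]
Substituting $\frac{\sum X_i}{n}$ for $\bar{X}$ yields
\[ \sum (X_i - \bar{X})^2 = \sum X_i^2 - n\left(\frac{\sum X_i}{n}\right)^2 = \sum X_i^2 - \frac{\left(\sum X_i\right)^2}{n}. \]
FORMULA 1.7. Use of the computational formula for the corrected sum of squares gives the computational formula for the sample variance
\[ s^2 = \frac{\sum X_i^2 - \frac{\left(\sum X_i\right)^2}{n}}{n - 1}. \]
Returning to Example 1.3, Sample 2, $\sum X_i = 91$, $\sum X_i^2 = 1318.92$, $n = 9$, so
\[ s^2 = \frac{1318.92 - \frac{(91)^2}{9}}{9 - 1} = \frac{1318.92 - 920.11}{8} = \frac{398.81}{8} = 49.851 \text{ kg}^2. \]
Remember, the numerator can never be negative because it is a sum of squared deviations. Because the variance has units that are the square of the measurement units, such as squared kilograms above, it has no direct physical interpretation. With a similar derivation, the population variance computational formula can be shown to be
\[ \sigma^2 = \frac{\sum X_i^2 - \frac{\left(\sum X_i\right)^2}{N}}{N}. \]
Again, this formula is rarely used since most populations are too large to census directly.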
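To see that the theoretical and computational formulas agree, both can be coded and compared. The sketch below is illustrative Python, not the authors' procedure; the function names are hypothetical, the check uses the Example 1.2 data, and the last lines reuse the sums quoted for Example 1.3, Sample 2.

```python
def css_definitional(xs):
    """Corrected sum of squares from the deviates (numerator of Formula 1.4)."""
    xbar = sum(xs) / len(xs)
    return sum((x - xbar) ** 2 for x in xs)

def css_computational(xs):
    """Corrected sum of squares from the raw sums (Formula 1.6)."""
    return sum(x * x for x in xs) - sum(xs) ** 2 / len(xs)

def variance_from_sums(n, sum_x, sum_x2):
    """Sample variance from n, sum of X, and sum of X squared (Formula 1.7)."""
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)

pines = [17, 19, 31, 39, 48, 56, 68, 73, 73, 75, 80, 122]  # Example 1.2 (cm)
print(css_definitional(pines), css_computational(pines))   # agree up to round-off

s2 = variance_from_sums(9, 91, 1318.92)  # sums given for Example 1.3, Sample 2
print(round(s2, 3))                      # 49.851
```

The computational form needs only running totals, which is why it suits calculators; with floating-point arithmetic on very large values the definitional form is usually the numerically safer choice.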
Standard Deviation
FORMULAS 1.8. A more “natural” calculation is the standard deviation, which is the positive square root of the population or sample variance, respectively.
\[ \sigma = \sqrt{\sigma^2} \qquad \text{and} \qquad s = \sqrt{s^2}. \]
These descriptions have the same units as the original observations and are, in a sense, the average deviation of observations from their mean.
Again, consider Example 1.3.
For Sample 1: $s_1^2 = 0.641$ kg$^2$, so $s_1 = 0.80$ kg.
For Sample 2: $s_2^2 = 49.851$ kg$^2$, so $s_2 = 7.06$ kg.
The standard deviation of a sample is relatively easy to interpret and clearly reflects the greater variability in Sample 2 compared to Sample 1. Like the mean, the standard deviation is usually reported to one more decimal place than the data and always has appropriate units associated with it. Both the variance and standard deviation can be used to demonstrate differences in scatter between samples or populations.
Thinking about Sums of Squares
It has been our experience teaching elementary descriptive statistics that students have little problem understanding measures of central tendency such as the mean and median. The sample variance and standard deviation, on the other hand, are often less intuitive to beginning students. So let’s step back for a moment to carefully consider what these indices of variability are really measuring.
Suppose a small sample of lengths (in cm) of smallmouth bass is collected.
27   32   30   41   35      $X_i$’s
These five fish have an average length of 33.0 cm. Some are smaller and others larger than this mean. To get a sense of this variability, let’s subtract the average from each data point $(X_i - 33) = x_i$, generating what is called the deviate for each value. The data when rescaled by subtracting the mean become
−6   −1   −3   8   2      $x_i$’s
When we add these deviations, their sum is 0, so their mean is also 0. To quantify these deviations and, therefore, the sample’s variability, we square these deviates to prevent them from always summing to 0.
36   1   9   64   4      $x_i^2$’s
The sum of these squared deviates is
\[ 36 + 1 + 9 + 64 + 4 = 114. \]
This calculation is called the corrected or rescaled sum of squares (squared deviates). If we averaged these calculations by dividing the corrected sum of squares by the sample size $n = 5$, we would have a measure of the average squared distance of the observations from their mean. This measure is called the sample variance. However, with samples this calculation usually involves division by $n - 1$ rather than $n$. This modification addresses issues of bias that are discussed in Section 1.5 and Appendix A.2.
The positive square root of the sample variance is called the standard deviation. In this context, standard signifies “usual” or “average.” So the sample variance and standard deviation are just measuring the average amount that observations vary from their center or mean. They are simply averages of variability rather than averages of observation measurement values like the mean. The fish sample had a mean of 33.0 cm with a standard deviation of 5.3 cm.
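The smallmouth bass calculation can be followed step by step in a few lines. This is a minimal Python sketch of the steps just described; the variable names are illustrative only.

```python
from math import sqrt

lengths = [27, 32, 30, 41, 35]           # smallmouth bass lengths (cm)
n = len(lengths)
xbar = sum(lengths) / n                   # sample mean

deviates = [x - xbar for x in lengths]    # rescale by subtracting the mean
css = sum(d ** 2 for d in deviates)       # corrected sum of squares
s2 = css / (n - 1)                        # sample variance, dividing by n - 1
s = sqrt(s2)                              # standard deviation

print(xbar, sum(deviates), css)           # 33.0 0.0 114.0
print(s2, round(s, 1))                    # 28.5 5.3
```

Dividing by $n$ instead of $n - 1$ would give 22.8, the literal average squared distance from the mean for these five fish.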
Standard Error
The most important statistic of central tendency is the sample mean. However, the mean varies from sample to sample (see Section 1.4). We now develop a method to measure the variability of the sample mean.
The variance and standard deviation are measures of dispersion or scatter of the values of the $X$’s in a sample or population. Because means utilize a number of $X$’s in their calculation, they tend to be less variable than the individual $X$’s. An extreme value of $X$ (large or small) contributes only $\frac{1}{n}$th of its value to the sample mean and is, therefore, somewhat dampened out.
A measure of the variability in $\bar{X}$’s then depends on two factors: the variability in the $X$’s and the number of $X$’s averaged to generate the mean $\bar{X}$. We utilize two statistics to estimate this variability.
FORMULAS 1.9. The variance of the sample mean is defined to be
\[ \frac{s^2}{n}, \]
and the standard deviation of the sample mean, more commonly called the standard error, is
\[ \text{SE} = \frac{s}{\sqrt{n}}. \]
The standard error is the more important of these two statistics. Its utility will become clear in Chapter 4 when the Central Limit Theorem is outlined. The standard error is usually reported to one more decimal place than the data, or if $n$ is large, to two more places.
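EXAMPLE 1.4. Calculate the variance of the sample mean and the standard error for the data sets in Example 1.3.
SOLUTION. The sample sizes are both $n = 9$. For Sample 1, $s^2 = 0.641$ kg$^2$, so the variance of the sample mean is
\[ \frac{s^2}{n} = \frac{0.641}{9} = 0.071 \text{ kg}^2 \]
and the standard deviation is $s = 0.80$ kg, so the standard error is
\[ \text{SE} = \frac{s}{\sqrt{n}} = \frac{0.80}{\sqrt{9}} = 0.27 \text{ kg}. \]
For Sample 2, $s^2 = 49.851$ kg$^2$, so the variance of the sample mean is
\[ \frac{s^2}{n} = \frac{49.851}{9} = 5.539 \text{ kg}^2 \]
and the standard deviation is $s = 7.06$ kg, so the standard error is
\[ \text{SE} = \frac{s}{\sqrt{n}} = \frac{7.06}{\sqrt{9}} = 2.35 \text{ kg}. \]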
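Formulas 1.9 are short enough to code directly. The Python sketch below assumes only the sample variances and sample size quoted for Example 1.3; the function names are hypothetical.

```python
from math import sqrt

def variance_of_mean(s2, n):
    """Variance of the sample mean: s^2 / n."""
    return s2 / n

def standard_error(s2, n):
    """Standard error: s / sqrt(n), written here as sqrt(s^2 / n)."""
    return sqrt(s2 / n)

n = 9
for label, s2 in (("Sample 1", 0.641), ("Sample 2", 49.851)):
    print(label,
          round(variance_of_mean(s2, n), 3), "kg^2",
          round(standard_error(s2, n), 2), "kg")
```

Note that the standard error is the square root of the variance of the sample mean, so squaring the reported SE is a quick consistency check on hand calculations.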
Concept Check 1.2. The following data are the carapace (shell) lengths in centimeters of a sample of adult female green turtles, Chelonia mydas, measured while nesting at Heron Island in Australia’s Great Barrier Reef. Calculate the following descriptive statistics for this sample: sample mean, sample median, corrected sum of squares, sample variance, standard deviation, standard error, and range. Remember to use the appropriate number of decimal places in these descriptive statistics and to include the correct units with all statistics.
110   105   117   113   95   115   98   97   93   120
1.6 Descriptive Statistics for Frequency Tables
When large data sets are organized into frequency tables or presented as grouped data, there are shortcut methods to calculate the sample statistics: $\bar{X}$, $s^2$, and $s$.
EXAMPLE 1.5. The following table shows the number of sedge plants, Carex flacca, found in 800 sample quadrats in an ecological study of grasses. Each quadrat was 1 m$^2$.
Plants/quadrat ($X_i$)   Frequency ($f_i$)
0                        268
1                        316
2                        135
3                         61
4                         15
5                          3
6                          1
7                          1
To calculate the sample descriptive statistics using Formulas 1.2, 1.7, and 1.8 would be quite arduous, involving sums and sums of squares of 800 numbers. Fortunately, the following formulas limit the drudgery for these calculations.
It is clear that $X_1 = 0$ occurs $f_1 = 268$ times, $X_2 = 1$ occurs $f_2 = 316$ times, etc., and that the sum of observations in the first category is $f_1 X_1$, the sum in the second category is $f_2 X_2$, etc. The sum of all observations is, therefore,
\[ f_1 X_1 + f_2 X_2 + \cdots + f_c X_c = \sum_{i=1}^{c} f_i X_i, \]
where $c$ denotes the number of categories. The total number of observations is $n = \sum_{i=1}^{c} f_i$, and as a result:
FORMULA 1.10. The sample mean for a grouped data set is given by
\[ \bar{X} = \frac{\sum_{i=1}^{c} f_i X_i}{\sum_{i=1}^{c} f_i}. \]
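Similarly, the computational formula for the sample variance for a grouped data set can be derived directly from
\[ s^2 = \frac{\sum_{i=1}^{c} f_i (X_i - \bar{X})^2}{n - 1}. \]
FORMULA 1.11. The sample variance for a grouped data set is given by
\[ s^2 = \frac{\sum_{i=1}^{c} f_i X_i^2 - \frac{\left(\sum_{i=1}^{c} f_i X_i\right)^2}{n}}{n - 1}, \]
where $n = \sum_{i=1}^{c} f_i$.
To apply Formulas 1.10 and 1.11, we need to calculate only three sums:
• The sample size $n = \sum f_i$
• The sum of observations $\sum f_i X_i$
• The uncorrected sum of squared observations $\sum f_i X_i^2$
Returning to Example 1.5, it is now straightforward to calculate $\bar{X}$, $s^2$, and $s$.
Plants/quadrat ($X_i$)   $f_i$   $f_i X_i$   $f_i X_i^2$
0                        268       0           0
1                        316     316         316
2                        135     270         540
3                         61     183         549
4                         15      60         240
5                          3      15          75
6                          1       6          36
7                          1       7          49
Sum                      800     857        1805
Note that column 4 in the table above is generated by first squaring $X_i$ and then multiplying by $f_i$, not by squaring the values in column 3. In other words, $f_i X_i^2 \neq (f_i X_i)^2$.
The sample mean is
\[ \bar{X} = \frac{\sum f_i X_i}{\sum f_i} = \frac{857}{800} = 1.1 \text{ plants/quadrat}, \]
the sample variance is
\[ s^2 = \frac{1805 - \frac{(857)^2}{800}}{800 - 1} = 1.11 \text{ (plants/quadrat)}^2, \]
and the sample standard deviation is
\[ s = \sqrt{1.11} = 1.1 \text{ plants/quadrat}. \]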
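The three sums needed for Formulas 1.10 and 1.11 can be accumulated directly from the frequency table. The following Python sketch stores the Example 1.5 table as a dictionary, which is an assumption made here for illustration.

```python
from math import sqrt

# Example 1.5: plants per quadrat (X_i) mapped to frequency (f_i)
freq = {0: 268, 1: 316, 2: 135, 3: 61, 4: 15, 5: 3, 6: 1, 7: 1}

n = sum(freq.values())                               # sample size
sum_fx = sum(f * x for x, f in freq.items())         # sum of observations
sum_fx2 = sum(f * x ** 2 for x, f in freq.items())   # square X first, then weight by f

xbar = sum_fx / n                                    # Formula 1.10
s2 = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)           # Formula 1.11
s = sqrt(s2)

print(n, sum_fx, sum_fx2)                            # 800 857 1805
print(round(xbar, 2), round(s2, 2), round(s, 2))     # 1.07 1.11 1.05
```

The values are printed to two decimal places so the intermediate results are visible; rounded to one more decimal place than the data, both the mean and the standard deviation are 1.1 plants/quadrat.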
Example 1.5 summarized data for a discrete variable taking on whole number values from 0 to 7. Continuous variables can also be presented as grouped data in frequency tables.
EXAMPLE 1.6. The following data were collected by randomly sampling a large population of rainbow trout, Salmo gairdnerii. The variable of interest is weight in pounds.
Rainbow trout have weights that can range from almost 0 to 20 lb or more. Moreover their weights can take on any value in that interval. For example, a particular trout may weigh 7.3541 lb. When data are grouped as in Example 1.6 intervals are implied for each class. A fish in the 3-lb class weighs somewhere between 2.50 and 3.49 lb and a fish in the 9-lb class weighs between 8.50 and 9.49 lb. Fish were weighed to the nearest pound allowing analysis of grouped data for a continuous measurement variable. In Example 1.6, $\bar{X} = 7.1$ lb, $s^2 = 5.75$ lb$^2$, and $s = 2.4$ lb.
Again, consider that calculation time is saved by working with 13 classes instead of 110 individual observations. Whether measuring the rainbow trout to the nearest pound was appropriate will be considered in Section 1.10.
1.7 The Effect of Coding Data
While grouping data can save considerable time and effort, coding data may also offer similar savings. Coding involves conversion of measurements or statistics into easier to work with values by simple arithmetic operations. It is sometimes used to change units or to investigate experimental effects.
Additive Coding
Additive coding involves the addition or subtraction of a constant from each observation in a data set. Suppose the data gathered in Example 1.6 were collected using a scale that weighed the fish 2 lb too low. We could go back to the data and add 2 lb to each observation and recalculate the descriptive statistics. A more efficient tack would be to realize that if a fixed amount $c$ is added or subtracted from each observation in a data set, the sample mean will be increased or decreased by that amount, but the variance will be unchanged.
To see why, if $\bar{X}_c$ is the coded mean, then
\[ \bar{X}_c = \frac{\sum (X_i + c)}{n} = \frac{\sum X_i + nc}{n} = \bar{X} + c. \]
If $s_c^2$ is the coded sample variance, then
\[ s_c^2 = \frac{\sum \left[(X_i + c) - (\bar{X} + c)\right]^2}{n - 1} = \frac{\sum (X_i - \bar{X})^2}{n - 1} = s^2; \]
therefore, $s_c = s$.
If the scale weighed 2 lb light in Example 1.6 the new, corrected statistics would be $\bar{X}_c = 7.1 + 2.0 = 9.1$ lb, $s_c^2 = 5.75$ (lb)$^2$, and $s_c = 2.4$ lb.
Multiplicative Coding
Multiplicative coding involves multiplying or dividing each observation in a data set by a constant. Suppose the data in Example 1.6 were to be presented at an international conference and, therefore, had to be presented in metric units (kilograms) rather than English units (pounds). Since 1 kg equals 2.20 lb, we could convert the observations to kilograms by multiplying each observation by 1/2.20 or 0.45 kg/lb. Again, the more efficient approach would be to realize the following.
If each of the observations in a data set is multiplied by a fixed quantity $c$, the new mean is $c$ times the old mean because
\[ \bar{X}_c = \frac{\sum cX_i}{n} = \frac{c\sum X_i}{n} = c\bar{X}. \]
Further the new variance is $c^2$ times the old variance because
\[ s_c^2 = \frac{\sum (cX_i - c\bar{X})^2}{n - 1} = \frac{c^2 \sum (X_i - \bar{X})^2}{n - 1} = c^2 s^2, \]
and from this it follows that the new standard deviation is $c$ times the old standard deviation, $s_c = cs$. (Remember, too, that division is just multiplication by a fraction.)
To convert the summary statistics of Example 1.6 to metric we simply utilize the formulas above with $c = 0.45$ kg/lb.
\[ \bar{X}_c = c\bar{X} = 0.45 \text{ kg/lb} \,(7.1 \text{ lb}) = 3.20 \text{ kg} \]
\[ s_c^2 = c^2 s^2 = (0.45 \text{ kg/lb})^2 (5.75 \text{ lb}^2) = 1.164 \text{ kg}^2 \]
\[ s_c = cs = 0.45 \text{ kg/lb} \,(2.4 \text{ lb}) = 1.08 \text{ kg} \]
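Both coding rules can be confirmed numerically. The Python sketch below uses a small hypothetical set of weights (the individual observations of Example 1.6 are not reproduced here), so the numbers are illustrative only; the constants 2 lb and 0.45 kg/lb are the ones used in the text.

```python
from statistics import mean, variance, stdev

weights_lb = [5.0, 7.5, 6.0, 9.0, 8.0]     # hypothetical weights in pounds

# Additive coding: add the constant 2 lb to every observation.
shifted = [x + 2 for x in weights_lb]
print(mean(shifted) - mean(weights_lb))          # shift in the mean, approximately 2
print(variance(shifted) - variance(weights_lb))  # approximately 0: variance unchanged

# Multiplicative coding: convert pounds to kilograms with c = 0.45 kg/lb.
c = 0.45
scaled = [c * x for x in weights_lb]
print(mean(scaled) / mean(weights_lb))           # approximately c
print(variance(scaled) / variance(weights_lb))   # approximately c squared
print(stdev(scaled) / stdev(weights_lb))         # approximately c
```

The comparisons are described as approximate because floating-point arithmetic can leave tiny round-off differences even though the algebraic identities are exact.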
Our understanding of the effects of coding on descriptive statistics can sometimes help determine the nature of experimental manipulations of variables.
EXAMPLE 1.7. Suppose that a particular variety of strawberry yields an average 50 g of fruit per plant in field conditions without fertilizer. With a high nitrogen fertilizer this variety yields an average of 100 g of fruit per plant. A new “high yield” variety of strawberry yields 150 g of fruit per plant without fertilizer. How much would the yield be expected to increase with the high nitrogen fertilizer?
SOLUTION. We have two choices here: The effect of the fertilizer could be additive, increasing each value by 50 g ($X_i + 50$), or the effect of the fertilizer could be multiplicative, doubling each value ($2X_i$). In the first case we expect the yield of the new variety with fertilizer to be 150 g + 50 g = 200 g. In the second case we expect the yield of the new variety with fertilizer to be 2 × 150 g = 300 g. To differentiate between these possibilities we must look at the variance in yield of the original variety with and without fertilizer. If the effect of fertilizer is additive, the variances with and without fertilizer should be similar because additive coding does not affect the variance: $X_i + 50$ yields $s^2$, the original sample variance. If the effect is to double the yield, the variance of yields with fertilizer should be four times the variance without fertilizer because multiplicative coding increases the variance by the square of the constant used in coding: $2X_i$ yields $4s^2$, so doubling the yield increases the sample variance fourfold.
1.8 Tables and Graphs
The data collected in a sample are often organized into a table or graph as a summary representation. The data presented in Example 1.5 were arranged into a frequency table and could be further organized into a relative frequency table by expressing each row as a percentage of the total observations or into a cumulative frequency distribution by accumulating all observations up to and including each row. The cumulative frequency distribution could be manipulated further into a relative cumulative frequency distribution by expressing each row of the cumulative frequency distribution as a percentage of the total. See columns 3–5 in Table 1.2 for the relative frequency, cumulative frequency, and relative cumulative frequency distributions for Example 1.5. (Here $n = \sum f_i$ and $r$ is the row number.)
TABLE 1.2. The relative frequencies, cumulative frequencies, and relative cumulative frequencies for Example 1.5

X_i              f_i         (f_i/n)(100)   Sum_{i<=r} f_i   (Sum_{i<=r} f_i / n)(100)
Plants/quadrat   Frequency   Relative       Cumulative       Relative cumulative
                             frequency      frequency        frequency
0                268         33.500         268               33.500
1                316         39.500         584               73.000
2                135         16.875         719               89.875
3                 61          7.625         780               97.500
4                 15          1.875         795               99.375
5                  3          0.375         798               99.750
6                  1          0.125         799               99.875
7                  1          0.125         800              100.000
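The derived columns of Table 1.2 can be recomputed from the frequency column alone. The following Python sketch does so for the Example 1.5 frequencies listed in the table; the variable names are illustrative choices, not notation from the text.

```python
from itertools import accumulate

# Frequencies f_i for X_i = 0, 1, ..., 7 plants per quadrat (from Table 1.2).
freqs = [268, 316, 135, 61, 15, 3, 1, 1]
n = sum(freqs)  # total number of observations, n = sum of f_i

# Relative frequency: each row as a percentage of the total.
relative = [100 * f / n for f in freqs]

# Cumulative frequency: running total up to and including each row r.
cumulative = list(accumulate(freqs))

# Relative cumulative frequency: cumulative frequency as a percentage of n.
relative_cumulative = [100 * cf / n for cf in cumulative]

rows = zip(freqs, relative, cumulative, relative_cumulative)
for x, (f, rf, cf, rcf) in enumerate(rows):
    print(f"{x:>2} {f:>4} {rf:>7.3f} {cf:>4} {rcf:>8.3f}")
```

The last cumulative frequency always equals $n$ and the last relative cumulative frequency always equals 100, which makes a convenient check on the arithmetic.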
relative frequencies. See Figure 1.2. In a bar graph the bar heights are the relative frequencies. The bars are of equal width and spaced equidistantly along the horizontal axis. Because these data are discrete, that is, because they can only take certain values along the horizontal axis, the bars do not touch each other.
FIGURE 1.2. A bar graph of relative frequencies for Example 1.5.
The data in Example 1.6 can be summarized in a similar fashion with relative frequency, cumulative frequency, and relative cumulative frequency columns. See Table 1.3.
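TABLE 1.3. The relative frequencies, cumulative frequencies, and relative cumulative frequencies for Example 1.6

X_i    f_i    Relative     Cumulative   Relative cumulative
              frequency    frequency    frequency
...
 8      24     21.82         86          78.18
 9       7      6.36         93          84.55
10       9      8.18        102          92.73
11       2      1.82        104          94.55
12       4      3.64        108          98.18
13       2      1.82        110         100.00
Sum    110    100.00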
The data in Example 1.6 are continuous measurement data, with each class implying a range of possible values for $X_i$. For example, $X_i = 3$ implies each fish weighed between 2.50 lb and 3.49 lb. The pictorial representation of such a data set is therefore a histogram, not a bar graph. Histograms have the observation classes along the horizontal axis. The area of the strip represents the relative frequency. (If the classes of the histogram are of equal width, as they often are, then the heights of the strips will represent the relative frequency, as in a bar graph.) See Figure 1.3. The strips in this case touch each other because each $X$ value corresponds to a range of possible values.
FIGURE 1.3. A histogram for the relative frequencies for Example 1.6.
While the categories in a bar graph are predetermined because the data are discrete, the classes representing ranges of continuous data values must be selected by the investigator. In fact, it is sometimes revealing to create more than one histogram of the same data by employing classes of different widths.
EXAMPLE 1.8. The list below gives snowfall measurements for 50 consecutive years (1951–2000) in Syracuse, NY (in inches per year). The data have been rearranged in order of increasing annual snowfall. Create a histogram using classes of width 30 inches and then create a histogram using narrower classes of width 15 inches. (Source: http://neisa.unh.edu/Climate/IndicatorExcelFiles.zip)
 71.7   73.4   77.8   81.6   84.1   84.1   84.3   86.7   91.3   93.8
 93.9   94.4   97.5   97.6   98.1   99.1   99.9  100.7  101.0  101.9
102.1  102.2  104.8  108.3  108.5  110.2  111.0  113.3  114.2  114.3
116.2  119.2  119.5  122.9  124.0  125.7  126.6  130.1  131.7  133.1
135.3  145.9  148.1  149.2  153.8  160.9  162.6  166.1  172.9  198.7
SOLUTION. Use the same scale for the horizontal axis (inches of annual snowfall) in both histograms. Remember that the area of a strip represents the relative frequency of the associated class. Since the snowfall classes of the second histogram (15 in) are one-half the width of those of the first histogram (30 in), the vertical scale must be multiplied by a factor of 2 so that equal areas in each histogram will represent the same relative frequencies. Thus, a single year in the second histogram will be represented by a strip half as wide but twice as tall as in the first histogram, as indicated in the key in the upper left corner of each diagram.
In this case, the narrower classes of the second histogram provide more information. For example, nearly one-third of all recent winters in Syracuse have produced snowfalls in the 90–105 inch range. There was one year with a very large amount of snowfall of approximately 200 in. While one could garner this same information from the data themselves, normally one would use a (single) histogram to summarize data and not list the entire data set.
FIGURE 1.4. Two histograms for the data in Example 1.8. The areas of the strips represent the relative frequencies. The same area represents the same relative frequency in both graphs.
It is worth emphasizing that to make valid comparisons between two histograms, equal areas must represent equal relative frequencies. Since the relative frequencies of all the classes in a histogram sum to 1, this means that the total area under each of the histograms being compared must be the same. Look at Figure 1.4 for an application of this idea.
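The equal-area requirement can be made concrete with the snowfall data of Example 1.8. The Python sketch below computes, for each class, the relative frequency and the strip height (relative frequency divided by class width). Two choices in it are assumptions made for illustration and are not stated in the text: the classes start at 60 inches, and each class includes its left endpoint but not its right. The function name `histogram` is likewise just a label chosen here.

```python
# Annual snowfall (inches) for Syracuse, NY, 1951-2000, from Example 1.8.
snowfall = [
    71.7, 73.4, 77.8, 81.6, 84.1, 84.1, 84.3, 86.7, 91.3, 93.8,
    93.9, 94.4, 97.5, 97.6, 98.1, 99.1, 99.9, 100.7, 101.0, 101.9,
    102.1, 102.2, 104.8, 108.3, 108.5, 110.2, 111.0, 113.3, 114.2, 114.3,
    116.2, 119.2, 119.5, 122.9, 124.0, 125.7, 126.6, 130.1, 131.7, 133.1,
    135.3, 145.9, 148.1, 149.2, 153.8, 160.9, 162.6, 166.1, 172.9, 198.7,
]

def histogram(data, start, width):
    """Return a list of (left_edge, relative_frequency, height) per class.

    Classes are [left, left + width). The height is the relative frequency
    divided by the class width, so each strip's AREA is its relative frequency.
    """
    n = len(data)
    n_classes = int((max(data) - start) // width) + 1
    counts = [0] * n_classes
    for x in data:
        counts[int((x - start) // width)] += 1
    return [(start + k * width, c / n, c / n / width)
            for k, c in enumerate(counts)]

wide = histogram(snowfall, start=60, width=30)    # assumed starting point 60
narrow = histogram(snowfall, start=60, width=15)

# Total area under each histogram: sum of height * width over all classes.
area_wide = sum(h * 30 for _, _, h in wide)
area_narrow = sum(h * 15 for _, _, h in narrow)
print(area_wide, area_narrow)  # both should be 1 up to rounding

for left, rel, height in narrow:
    print(f"{left:>5}-{left + 15:<5} rel. freq. {rel:.2f}  height {height:.4f}")
```

Under these assumptions the 90–105 inch class holds 15 of the 50 years (relative frequency 0.30), matching the "nearly one-third" noted in the example, and a class containing a single year is twice as tall in the narrow histogram as in the wide one.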
Histograms are often used as graphical tests of the shape of samples, usually testing whether the data are approximately “bell-shaped” or not. We will discuss the importance of this consideration in future chapters.
1.9 Quartiles and Box Plots
In the previous sections we have used sample variance, standard deviation, and range to obtain measures of the spread or variability. Another quick and useful way to visualize the spread of a data set is by constructing a box plot that makes use of quartiles and the sample range.
Quartiles and Five-Number Summaries
As the name suggests, quartiles divide a distribution into quarters. More precisely, the pth percentile of a distribution is the value such that p percent of the observations fall at or below it. For example, the median is just the 50th percentile. Similarly, the lower or first quartile is the 25th percentile and the upper or third quartile is the 75th percentile. Because the second quartile is the same as the median, quartiles are appropriate ways to measure the spread of a distribution when the median is used to measure its center.
Because sample sizes are not always evenly divisible by 4 to form quartiles, we need to agree on how to break a data set up into approximate quarters. Other texts, computer programs, and calculators may use slightly different rules which produce slightly different quartiles.
FORMULA 1.12. To calculate the first and third quartiles, first order the list of observations and locate the median. The first quartile $Q_1$ is the median of the observations falling below the median of the entire sample and the third quartile $Q_3$ is the median of the observations falling above the median of the entire sample. The interquartile range is defined as

$$\text{IQR} = Q_3 - Q_1.$$
The sample IQR describes the spread of the middle 50% of the sample, that is, the difference between the first and third quartiles. As such, it is a measure of variability and is commonly reported with the median.
EXAMPLE 1.9. Find the first and third quartiles and the IQR for the cypress pine data in Example 1.2.
CCH     17   19   31   39   48   56   68   73   73   75   80  122
Depth    1    2    3    4    5    6    6    5    4    3    2    1
SOLUTION. The median depth is $\frac{12+1}{2} = 6.5$. So there are six observations below the median. The quartile depth is the median depth of these six observations: $\frac{6+1}{2} = 3.5$. So the first quartile is $Q_1 = \frac{31+39}{2} = 35$ cm. Similarly, the depth for the third quartile is also 3.5 (from the right), so $Q_3 = \frac{73+75}{2} = 74$ cm. Finally, the

$$\text{IQR} = Q_3 - Q_1 = 74 - 35 = 39\ \text{cm}.$$
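The quartile rule of Formula 1.12 can be expressed as a short Python sketch and applied to the cypress pine data. One interpretive assumption is made here: "observations falling below (above) the median" is read positionally, so for an odd sample size the middle observation is left out of both halves. The function names are illustrative choices. As the text notes, other software may use slightly different rules and return slightly different quartiles.

```python
def median(sorted_values):
    """Median of a list that is already sorted in increasing order."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2

def quartiles(data):
    """Return (Q1, Q3, IQR) using the median-of-halves rule of Formula 1.12.

    Assumption: halves are taken by position; for odd n the middle
    observation belongs to neither half.
    """
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]           # observations below the median position
    upper = values[half + n % 2:]   # observations above the median position
    q1 = median(lower)
    q3 = median(upper)
    return q1, q3, q3 - q1

# Cypress pine CCH measurements (cm) from Example 1.9.
cch = [17, 19, 31, 39, 48, 56, 68, 73, 73, 75, 80, 122]

q1, q3, iqr = quartiles(cch)
print(q1, q3, iqr)  # 35.0 74.0 39.0
```

The output agrees with the hand calculation using depths: $Q_1 = 35$ cm, $Q_3 = 74$ cm, and IQR $= 39$ cm.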