Biostatistics and Computer-based Analysis of Health Data using R
Biostatistics and Health Science Set coordinated by Mounir Mesbah


First published 2016 in Great Britain and the United States by ISTE Press Ltd and Elsevier Ltd
Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:
ISTE Press Ltd
27-37 St George’s Road
London SW19 4EU
UK
www.iste.co.uk

Elsevier Ltd
The Boulevard, Langford Lane
Kidlington, Oxford, OX5 1GB
UK
www.elsevier.com
Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.
Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.
To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.
For information on all our publications visit our website at http://store.elsevier.com/
© ISTE Press Ltd 2016
The rights of Christophe Lalanne and Mounir Mesbah to be identified as the authors of this work have been asserted by them in accordance with the Copyright, Designs and Patents Act 1988.
British Library Cataloguing-in-Publication Data
A CIP record for this book is available from the British Library
Library of Congress Cataloging in Publication Data
A catalog record for this book is available from the Library of Congress
ISBN 978-1-78548-088-1
Printed and bound in the UK and US
A large number of the actions performed with statistical software amount to manipulating, or even transforming, the digital data that literally represent statistical data. It is therefore paramount that we understand how statistical data are represented and how they can be used by software such as R. After importing, recoding and possibly transforming these data, the description of the variables of interest and the summary of their distribution in numerical and graphical form constitute a prior and fundamental step to any statistical modeling, hence the importance of these early stages in a statistical analysis project. In the second step, it is essential to fully master the commands that enable the calculation of the main measures of association in medical research and to know how to implement the conventional explanatory and predictive models: analysis of variance, linear and logistic regression and the Cox model. With few exceptions, using the R commands available upon installation of the software (base commands) is favored over the use of specialized commands in external R packages. The packages that must be installed to follow the applications presented in this book are listed in Chapter 1, in section 1.1.
This book assumes that the reader is already familiar with basic statistical concepts, particularly the calculation of central tendency and dispersion indicators for a continuous variable, contingency tables, analysis of variance and conventional regression models. The objective is to apply this knowledge using data sets described in numerous other works, even if the interpretation of the results remains minimal, and to become familiar quickly with the use of R on actual data. Emphasis is given to the management and manipulation of structured data, as this constitutes 60 to 80% of the work of the statistician. There are many works in French and in English on R, both from the technical and the statistical point of view. Some of these works are oriented towards general aspects [SHA12], others are much more specialized [BIL14] or address more advanced concepts [HOT09]. The purpose of this book is to enable readers to get accustomed to R so that they can perform their own analyses and continue learning autonomously in the field of medical statistics.
In Chapter 1, the base commands for the management of data with R are introduced. The focus is on the creation and manipulation of quantitative and qualitative variables (recoding individual frequencies, counting missing observations), importing databases stored in the form of text files, and basic arithmetic operations (minimum, maximum, arithmetic mean, difference, frequency, etc.). We will also consider how to store preprocessed databases in text or in R formats. The objective is to understand how the data are represented in R and how to work with them.
Chapter 2 focuses on useful commands for the description of a data table comprised of quantitative or qualitative variables. The descriptive approach is strictly univariate, which is the prerequisite for any statistical approach. Basic graphic commands (histograms, density curves, bar or dot charts) will be presented in addition to the usual central tendency (mean, median) and dispersion (variance, quartiles) descriptive summaries. Point and interval estimation using means and empirical proportions will also be addressed. The objective is to become accustomed to the use of simple R commands operating on a variable, optionally specifying some options for the calculation, and also to the selection of statistical units.
Chapter 3 is devoted to the comparison of two samples for quantitative or qualitative measurements. The following hypothesis tests are addressed: Student’s t-test for independent or paired samples, the non-parametric Wilcoxon test, the χ2 test, Fisher’s exact test and the McNemar test, starting from the main measures of association for two variables (mean difference, odds ratio and relative risk). From this chapter on, there will be less emphasis on the univariate description of each variable, but it is advisable to carry out exploratory data analysis as discussed in Chapter 2. The objective is to master the main statistical tests in the case where we are interested in the relationship between a quantitative variable and a qualitative variable or in the case of two qualitative variables.
Chapter 4 is an introduction to the analysis of variance (ANOVA), in which the objective is to explain the variability observed in a numeric response variable by taking into account a group or classification factor, and to the interval estimation of mean differences. We will focus on the construction of an ANOVA table summarizing the various sources of variability and on the graphic methods allowing us to summarize the distribution of individual or aggregated data. The linear trend test will also be studied in the case where the classification factor can be considered as being naturally ordered. The objective is to understand how to construct an explanatory model where there is one or even two explanatory factors and to present the results of such a model numerically and graphically using R.
Chapter 5 focuses on the analysis of the linear relationship between two continuous quantitative variables. In the linear correlation approach, which assumes a symmetrical relation between the two variables, we are interested in quantifying the magnitude and direction of the association in a parametric (Pearson correlation) or non-parametric manner (rank-based Spearman correlation) and in the graphic representation of this relation. Simple linear regression will be used in the event that one of the two numeric variables assumes the function of a response variable and the other one that of an explanatory variable. The useful commands for the estimation of the coefficients of the regression line, the construction of the ANOVA table associated with the regression and prediction will be presented. The objective of this chapter remains identical to that of Chapter 4, i.e. to present the R commands necessary for the construction of a simple statistical model with two variables following an explanatory or predictive perspective.
In Chapter 6, the main measures of association found in epidemiological studies will be discussed: odds ratio, relative risk, prevalence, etc. R commands allowing the (point and interval) estimation and the associated hypothesis tests will be illustrated with cohort or case-control study data. The implementation of a simple logistic regression model makes it possible to complete the range of statistical methods that allow the variability observed in binary response variables to be explained. The aim is to understand which R commands to use when the variables are binary, to summarize a contingency table in the form of association indicators or to model the relationship between a binary response (ill/healthy) and a qualitative explanatory variable from so-called grouped data.
The final chapter provides an introduction to the analysis of censored data, to the main tests related to the construction of a survival curve (log-rank or Wilcoxon tests) and finally to the Cox regression model. The specificity of censored data requires particular care in the coding of data in R, and the objective is to present the R commands essential to the correct representation of survival data in digital form, to their numerical (median survival) and graphical (Kaplan–Meier curve) summary, and to the implementation of common tests of risk measures.
At the end of each chapter, a few applications are provided with a few examples of commands that can be used to answer most questions. It is sometimes possible to obtain identical results by using other approaches or other commands. Outputs from R are not reproduced, but the reader is encouraged to try the proposed R instructions and alternative or additional instructions. It will be assumed that the data files being used are available in the working directory. All of the data files and the R commands used in this book can be downloaded from https://github.com/biostatsante.
The three appendices will help familiarize the reader with RStudio, the lattice package for the management of graphical outputs, and the Hmisc and rms packages for advanced data management and modeling. These appendices are obviously not a substitute for the work of John Verzani, Paul Murrell and Frank Harrell [VER11, MUR05, HAR01].
Due to the design of the layout, some R outputs have been truncated or reformatted. Therefore, there could be differences when the reader attempts to reproduce the commands of this book.
An index of the R commands used in the illustrations is available at the end of the book.
R is more than a simple software program for statistics; it is a language for the manipulation of statistical data [IHA96, VEN02]. This partly explains why it can seem difficult and unfriendly to users accustomed to drop-down menus such as those offered by SPSS (although SPSS also offers a basic macro language). This chapter allows the reader to discover the elements of the language and to become familiar with the mechanisms used to represent statistical data in R. In the illustrations that follow, the R commands are prefixed with the symbol >, which designates the R console prompt. This symbol should therefore not be copied when testing the proposed instructions.
1.1. Before proceeding
1.1.1. Installing R
The installation of R is relatively easy and instructions can be found at the following website: http://cran.r-project.org. The software program is available for Windows, Linux and Mac. In addition to the R program, the installer provides an R script editor, online help and a set of base packages. The packages include commands specific to a certain area (graphics commands, modeling commands, etc.) and make it possible to enhance the base features of R.
1.1.2. RStudio
Although R is sufficient to start or perform statistical analyses, RStudio (www.rstudio.com) provides a particularly enjoyable working environment for R. It includes a powerful R script editor, a console in which the user can execute the commands (or send those inserted in the script editor), online help, a graphics browser, a data viewer and much more [VER11]. A brief introduction to the software is provided in the appendices.
1.1.3. List of useful packages
Although the R base packages allow for most statistical analyses addressed in this book, a number of additional packages must be installed to reproduce some of the proposed applications. In addition to the base packages that are used throughout this book (lattice, foreign, survival, etc.), the following packages can be installed by means of the install.packages() command, in the format install.packages("reshape2"): reshape2, gridExtra, vcd (Chapter 3); epiR, car (Chapter 4); ppcor, psych (Chapter 5); epicalc, ROCR (Chapter 6).
These packages can be installed directly from RStudio using the package manager. Before installing a package, it is possible to consult online help on the CRAN website (http://cran.r-project.org). The CRAN website also offers the ability to search packages by theme (“Task Views”).
To access the commands of a package, the command library() can be used, indicating the name of the package.
1.1.4. Finding help
As with any piece of software, it is important to know how to find help on a specific command or from a keyword. The commands help() and help.search() (or apropos()) serve these two purposes in R. RStudio also provides an integrated search engine that greatly facilitates access to the online help pages.
1.1.5. R scripts
The easiest and recommended way to take advantage of the capabilities of RStudio in the management of a statistical analysis project is to enter the commands in the script editor while ensuring that these commands are saved in an R script file. The commands entered in the RStudio editor can be sent directly to the console and R will display the result in the same console. This interactive approach allows a command script to be built incrementally and enables the commands to be corrected or adjusted as and when needed.
It is also important to document the main stages of the statistical analysis as well as commands that may not be clear to an external reader. In fact, one should always anticipate that the R command script could be reused in the future by a third person. Comments are useful in this case. You can add the prefix # to a line of text so that it is treated as a comment and not as a statement by R. The commands in the R script should also be organized logically, clearly distinguishing the commands related to the importation of data, to univariate and bivariate descriptive statistics, to statistical models, etc. All this must be presented in very distinct sections and rely as little as possible on variables or temporary data tables that would significantly alter the original data table without any documentation or any storage to disk. The original data, as a matter of fact, should always be accessible at any point of the analysis. Intermediate data tables can be saved separately. In the case of large analysis projects, it is preferable to create several script files to store the commands corresponding to the different stages of the analysis.
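As a rough illustration of this advice, a script might be laid out as in the sketch below. The file name analysis.R, the section titles and the variable names are hypothetical and chosen only for this example; the values are a few of the artificial birth weights used in this chapter.

```r
# analysis.R: hypothetical script layout (all names are illustrative)

## 1. Data ------------------------------------------------------------
# The original data are created once and never overwritten afterwards.
weights <- c(2523, 2551, 2557, 2594, 2600)   # birth weights, in grams

## 2. Derived variables -----------------------------------------------
# Work on a new variable so that the original values stay accessible.
weights_kg <- weights / 1000

## 3. Univariate description ------------------------------------------
summary(weights_kg)
```

Keeping derived quantities in separate variables means that each section of the script can be re-run on its own without losing the original data.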
1.2. Data representation in R
1.2.1. Management of numerical variables
Suppose that we have the weight of ten newborns available (variable x, in grams) and that of their mothers (variable y, in kg), as illustrated in Table 1.1.
Table 1.1. Artificial data about weights at birth
In the following example, a variable called x is created in which the observations displayed in Table 1.1 are stored in the form of a simple list of numbers:
> x <- c(2523, 2551, 2557, 2594, 2600, 2622, 2637, 2637, 2663, 2665)
The symbol <- is used (in preference to =) to assign a series of measurements to a named variable. These values can now be displayed in R using the print(x) command or by typing the name of the variable:
> x
[1] 2523 2551 2557 2594 2600 2622 2637 2637 2663 2665
We will make use of the same procedure with y:
> y <- c(82.7, 70.5, 47.7, 49.1, 48.6, 56.4, 53.6, 46.8, 55.9, 51.4)
Spaces have no importance when a list of numbers and, with rare exceptions, other expressions are entered. The decimal separator must be the period, since R follows the English notation for numbers.
It can be verified that R has retained two variables in what is called the workspace with the help of the command ls():
> ls()
[1] "x" "y"
Therefore, it is always possible to access these variables as long as the R session is not closed and as long as neither of these variables has been removed in the meantime.
The number of observations for variable x corresponds to the length of x (number of elements), when there are no missing values:
> length(x)
[1] 10
The values of a variable, or in more statistical terms, the values taken by a variable for a sample of size n, can be selected by specifying the index or number of the observation between square brackets. Section 1.3 will cover how it is possible to select more than one observation at a time, how observations can be selected on the basis of an external criterion and how this approach can be generalized to the selection of a set of observations gathered for the same statistical unit. The fifth observation is obtained in the following manner:
> x[5]
[1] 2600
The observations are indexed from 1 to n, where n is the total number of items in the variable under consideration. The first and the last observations will thus be obtained using the commands x[1] and x[10].
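Rather than hard-coding the position of the last observation, the length of the variable can be used as the index. The sketch below assumes only the ten weights entered above; the names n, first and last are introduced for the illustration.

```r
# Birth weights (in grams) from Table 1.1
x <- c(2523, 2551, 2557, 2594, 2600, 2622, 2637, 2637, 2663, 2665)

n <- length(x)   # number of elements in x
first <- x[1]    # first observation
last <- x[n]     # last observation, equivalent to x[10] here
```

Written this way, the command keeps working if observations are later added to x.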
1.2.2. Operations with a numerical variable
It should be noted that the weights of the babies are expressed in grams, while those of the mothers are in kg. The weight of the 5th baby in kg is obtained using a simple arithmetic division operation:
> x[5] / 1000
[1] 2.6
The same division can be applied to all of the observations, without having to carry out the division for each element of x, as shown in the following example:
> x / 1000
[1] 2.523 2.551 2.557 2.594 2.600 2.622 2.637 2.637 2.663 2.665
The previous operation did not change the values of the original variable x, but it is easy to create a new variable in which the weight of the babies, expressed in kg, will be stored.
> x2 <- x / 1000
> x
[1] 2523 2551 2557 2594 2600 2622 2637 2637 2663 2665
> x2
[1] 2.523 2.551 2.557 2.594 2.600 2.622 2.637 2.637 2.663 2.665
It might also be desirable to replace the old values of x with the new values. In this case, we can replace the statement x2 <- x / 1000 with x <- x / 1000. However, it will not be possible to return to the old values of x after this operation is carried out. As when deleting a variable with the command rm(), any update of the elements of a variable is final for the session in progress.
In addition to the basic arithmetic operators (addition, subtraction, multiplication and division), R offers most of the functions found in a calculator: logarithm, exponential, absolute value, etc. Therefore, log(x) would return the values of x after transformation by the natural logarithm, whereas for the base 10 logarithm, log10(x) should be used.
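A short sketch of these transformations, applied element by element to the weights entered above, could look as follows. The variable names log_e, log_10, ratio and back are chosen for the illustration only.

```r
x <- c(2523, 2551, 2557, 2594, 2600, 2622, 2637, 2637, 2663, 2665)

log_e <- log(x)      # natural logarithm of each value
log_10 <- log10(x)   # base 10 logarithm of each value

# The two transformations differ only by the constant factor log(10).
ratio <- log_e / log_10

# exp() is the inverse of log(), so the original values are recovered.
back <- exp(log_e)
```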
Finally, R offers a wide range of commands that are more specifically oriented towards the statistical processing of data, for example to summarize the distribution of the observed values (central tendency, dispersion, range, etc.). The arithmetic mean of the values of x is obtained with the command mean():
> mean(x)
[1] 2604.9
Other commands, to which we will return in Chapter 2, make it possible for basic information to be obtained about the shape of the distribution of a continuous variable, such as the range or the variance, for example:
> range(x)
[1] 2523 2665
> c(min(x), max(x))
[1] 2523 2665
> var(x)
[1] 2376.767
As can be observed in the previous instructions, it is perfectly possible to combine two commands that return the same type of result, for example c(min(x), max(x)). As in the case of creating a list of numbers, the command c() allows the results of commands returning numbers to be gathered in the same list.
The following illustrations are based on the same idea as the division operation mentioned above: each operation (squaring the values of x, then subtracting their mean from these squared values) is applied on a per-element basis.
> sum(x)
[1] 26049
> sum(x^2)
[1] 67876431
> sum(x^2 - mean(x))
[1] 67850382
In the previous example, mean(x) is a constant (which depends upon the data but does not vary in this case) that we subtract from each squared value x_i^2 (where i is the observation number or index in variable x), which involves 11 different values in total. In the expression x - x^2, on the other hand, the square of x is subtracted from each element of x (ten operations in total, with ten pairs of different numbers each time).
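The position of the parentheses matters in such expressions. As a complement, the sketch below contrasts x^2 - mean(x) with the centered squares (x - mean(x))^2, and checks that the latter lead back to the variance returned by var(). The names ss_shifted, ss_centered and v are introduced for this illustration.

```r
x <- c(2523, 2551, 2557, 2594, 2600, 2622, 2637, 2637, 2663, 2665)
n <- length(x)

# Mean subtracted from each squared value (as in the text above)
ss_shifted <- sum(x^2 - mean(x))

# Values centered on their mean first, then squared
ss_centered <- sum((x - mean(x))^2)

# Dividing the centered sum of squares by n - 1 gives the sample variance.
v <- ss_centered / (n - 1)
```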
The commands sort(), order() and rank() allow the elements of a variable to be sorted, or make it possible to work with the ranks of the observations, that is to say with their position in the sorted variable.
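The difference between these three commands is easier to see on an example. The sketch below uses the mothers' weights entered earlier; the names sorted, ord and rnk are illustrative.

```r
# Weights of the mothers (in kg) from Table 1.1
y <- c(82.7, 70.5, 47.7, 49.1, 48.6, 56.4, 53.6, 46.8, 55.9, 51.4)

sorted <- sort(y)   # the values themselves, in increasing order
ord <- order(y)     # indices of the observations, from smallest to largest
rnk <- rank(y)      # rank of each observation, in the original order

# Indexing y by order(y) reproduces sort(y).
same <- identical(y[ord], sorted)
```

Here the lightest mother is the eighth observation, so ord starts with 8 and the eighth element of rnk is 1.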
1.2.3. Management of categorical variables
Qualitative or categorical variables have their own distinct status in most statistical packages: their modalities are represented by numbers (1, 2, 3, etc.) but generally they are associated with identifiers, called labels in R. In the previous example, data about the weight of babies at birth (x) and the weight of their mother (y) were available. Suppose that it is also known whether the mother was smoking during the first trimester of her pregnancy, and let this variable be called z. When the mother did not smoke during this period, the variable is equal to 1; when the mother was smoking, the variable is equal to 2. The data augmented with variable z are presented in Table 1.2.
Table 1.2. Augmented artificial data about weights at birth
The data will be entered in R as was done for the variables x and y:
> z <- c(1, 1, 2, 2, 2, 1, 1, 1, 2, 2)
> z
[1] 1 1 2 2 2 1 1 1 2 2
The factor(z) command allows this numerical variable to be converted into a categorical variable. To associate labels with numerical codes, the labels= option is used. The levels= option enables the numerical codes of the levels of the variable to be specified. The value 1 will be associated with the NS modality and the value 2 with the S modality:
> factor(z, levels = c(1, 2), labels = c("NS", "S"))
[1] NS NS S  S  S  NS NS NS S  S
Levels: NS S
The result displayed by R is effectively a variable of the factor type with two unordered levels, NS and S. Some operations, such as the calculation of the arithmetic mean, do not generally make any sense with this type of variable, and this is also how R treats them. Simple or crossed tabulation operations however remain entirely valid. Another important aspect: the labels associated with a qualitative variable can be recovered by using the levels() command, while nlevels() returns the total number of modalities:
> z <- factor(z, labels = c("NS", "S"))
> levels(z)
[1] "NS" "S"
> nlevels(z)
[1] 2
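The following sketch illustrates which operations remain meaningful for a factor. It assumes the smoking variable defined above; the names counts, codes and m are introduced for the example.

```r
z <- factor(c(1, 1, 2, 2, 2, 1, 1, 1, 2, 2),
            levels = c(1, 2), labels = c("NS", "S"))

# Tabulation is valid for a factor: number of observations per level.
counts <- table(z)

# The underlying integer codes can still be recovered if needed.
codes <- as.integer(z)

# The arithmetic mean is not defined for a factor: R returns NA
# together with a warning (silenced here for the illustration).
m <- suppressWarnings(mean(z))
```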
1.2.4. Manipulation of categorical variables
It is often necessary to recode a k-class qualitative variable into j < k classes or to associate labels with the numerical modalities of a qualitative variable. For example, suppose that data on forced expiratory volume in 1 second (FEV1) (variable fev1) from ten patients are provided in the form of a three-level ordered variable: critical (1), low (2) and normal (3). Here, the sample() command will be used, which allows random data to be generated by resampling from a list of values. This is an R command which expects three options: the first gives the list of permissible values, the second the number of observations to be generated and the last the type of random draw that is sought: with (replace=TRUE) or without (replace=FALSE) replacement:
> fev1 <- sample(1:3, 10, replace = TRUE)
> fev1
[1] 3 2 1 3 3 3 2 2 1 2
Note that the values are random and will not necessarily be identical when this command is run in another R session. To ensure the reproducibility of such simulations, one should use the set.seed() command.
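A minimal sketch of how set.seed() makes such a random draw reproducible is given below. The seed value 101 is arbitrary, and the values drawn will differ from those printed above.

```r
# Fix the state of the random number generator, then draw.
set.seed(101)   # arbitrary seed, chosen for this example
a <- sample(1:3, 10, replace = TRUE)

# Resetting the same seed before drawing again gives the same values.
set.seed(101)
b <- sample(1:3, 10, replace = TRUE)

reproducible <- identical(a, b)
```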
First, the numerical values will be replaced with the labels described previously (1 = critical, 2 = low and 3 = normal):
> fev1 <- factor(fev1, levels = c(1, 2, 3), labels = c("critical", "low", "normal"))
> fev1
[1] normal   low      critical normal   normal   normal   low
[8] low      critical low
Levels: critical low normal
The distribution of the counts by class can be verified quickly using the table() command, which enables simple or cross-tabulations that will be discussed in Chapter 2:
> table(fev1)
fev1
critical      low   normal
       2        4        4
If the aim is to recode the three-class variable fev1 into a two-class variable by aggregating the first two modalities or levels, the levels() command can be used as follows:
> levels(fev1)
[1] "critical" "low"      "normal"
> levels(fev1)[1:2] <- "critical or low"
> levels(fev1)
[1] "critical or low" "normal"
> table(fev1)
fev1
critical or low          normal
              6               4
It should be observed that the modalities of the variable fev1 have effectively been modified and the frequencies of the first two original levels can now be found in the first level, critical or low. It is advisable to always verify that the recoding operations of the levels of a categorical variable have been carried out as expected, using the levels() or table() commands.
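One possible way to carry out this verification is to recode a copy of the variable and cross-tabulate it against the original. The sketch below assumes the ten values displayed above; the name fev1_2 for the recoded copy is hypothetical.

```r
fev1 <- factor(c(3, 2, 1, 3, 3, 3, 2, 2, 1, 2), levels = c(1, 2, 3),
               labels = c("critical", "low", "normal"))

# Recode a copy so that the original three-level variable is kept.
fev1_2 <- fev1
levels(fev1_2)[1:2] <- "critical or low"

# Rows: original levels; columns: recoded levels.
check <- table(fev1, fev1_2)
check
```

Each row of the cross-tabulation should have a single non-zero cell, showing where each original level has been sent.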
1.3. Selection of observations
1.3.1. Index-based selection
Considering the weight data discussed above, one could select more than one observation, for example the third and the fifth:
> x[c(3, 5)]
[1] 2557 2600
or from the third to the fifth:
> x[3:5]
[1] 2557 2594 2600
The slightly peculiar syntax 3:5 actually designates the sequence of integers starting at 3 and ending at 5. This is, therefore, strictly equivalent to c(3, 4, 5). Thus, to access specific values of a variable, it suffices to provide a list of observation numbers (or indices). The same indexing principle applies of course to qualitative variables:
> fev1[2]
[1] critical or low
Levels: critical or low normal
1.3.2. Criterion-based selection
Suppose now that we want to obtain the weight of babies whose mother weighs less than 50 kg:
> x[y < 50]
[1] 2557 2594 2600 2637
The previous command reads: select the values of x for which the (logical) condition y < 50 holds. This condition can be directly visualized by typing y < 50 in R, which yields the following result:
> y < 50
[1] FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE
This selection principle based on an external criterion remains valid when the criterion is a categorical variable. To obtain the list of weights of infants whose mother is a non-smoker, the following command should be entered:
> x[z == "NS"]
[1] 2523 2551 2622 2637 2637
The double equal sign (==) is used to designate the logical condition "z equal to NS". Finally, it is possible to combine criteria relating to variables of the same type or of different types using the logical operators & (and) and | (or), mainly. The following command returns the values of x such that z is equal to NS and y <= 55 (weight of babies whose mother does not smoke and weighs 55 kg or less):
> x[z == "NS" & y <= 55]
[1] 2637 2637
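Logical conditions can also be counted or converted into positions. The sketch below assumes the three variables x, y and z defined above and illustrates the | operator; the names n_light, idx and sel_or are introduced for the example.

```r
x <- c(2523, 2551, 2557, 2594, 2600, 2622, 2637, 2637, 2663, 2665)
y <- c(82.7, 70.5, 47.7, 49.1, 48.6, 56.4, 53.6, 46.8, 55.9, 51.4)
z <- factor(c(1, 1, 2, 2, 2, 1, 1, 1, 2, 2), labels = c("NS", "S"))

# TRUE counts as 1, so sum() gives the number of mothers under 50 kg.
n_light <- sum(y < 50)

# which() returns the positions at which the condition holds.
idx <- which(y < 50)

# Babies whose mother smokes OR weighs more than 80 kg.
sel_or <- x[z == "S" | y > 80]
```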
1.4. Representation and processing of missing values
In reality, it is rare for full data (without missing values) to be available. Regardless of how missing values are subsequently handled in the statistical analysis, it is important to ensure that they are correctly represented as such by the statistical software.
Returning to the previous example, suppose that the third observation for the weight of babies is actually missing or that we want to process it as such. This datum is represented by a dot (.) in Table 1.3.
Table 1.3. Artificial data about weight at birth including the missing datum
This missing value could have been represented by the lack of value (such as a blank cell in Microsoft Excel) or by any other symbol. Using the same approach as when variable x was created for the first time, a way of entering the weight of the babies would be to replace the missing value with the term NA, which is the term reserved by R to encode missing values:
> c(2523, 2551, NA, 2594, 2600, 2622, 2637, 2637, 2663, 2665)
Since variable x has already been entered, it can simply be updated, in this case replacing the third element with NA:
> x[3] <- NA
> x
[1] 2523 2551   NA 2594 2600 2622 2637 2637 2663 2665
Care should be taken, as the command length(x) still returns 10: there are effectively ten observations, but one of them is missing. The command is.na() makes it possible to identify the missing values in a variable:
> is.na(x)
[1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> which(is.na(x))
[1] 3
As seen in the case of observation selection, the values returned by the command is.na() are Boolean values equal to true (TRUE) when the value of x is missing (NA), and false (FALSE) otherwise. This command will therefore return as many values as variable x contains. To facilitate the identification of the positions of missing observations, it may be preferable to combine is.na() with the command which(). It should be noted that commands can be combined in an intuitive way: the which() command uses the result returned by is.na(), without having to create an intermediate variable to store the result. An alternative consists of flagging the so-called complete cases with the command complete.cases():
> complete.cases(x)
[1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
Using the principle of observation selection used previously, the following command will thus return the values of y for which x has a non-missing value:
> y[complete.cases(x)]
[1] 82.7 70.5 49.1 48.6 56.4 53.6 46.8 55.9 51.4
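Missing values also affect numerical summaries. The sketch below, which assumes the variable x with its third value set to NA, counts the missing values and shows the effect of the na.rm= option of mean(). The names n_missing, m_all and m_obs are introduced for the illustration.

```r
x <- c(2523, 2551, NA, 2594, 2600, 2622, 2637, 2637, 2663, 2665)

# Number of missing values (TRUE counts as 1)
n_missing <- sum(is.na(x))

# By default, a single missing value makes the mean itself missing.
m_all <- mean(x)

# na.rm = TRUE computes the mean over the nine observed values only.
m_obs <- mean(x, na.rm = TRUE)
```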
1.5. Importing and storing data
1.5.1. Univariate data
The three sets of ten measures discussed in the preceding sections were of a sufficiently moderate size to allow their direct entry in R. However, in most cases, the data will have already been stored in an external file and the first step usually consists of importing them into R.
For example, here are the contents of the file poids.dat, which is a simple text file that can be viewed with any text editor:
2523 2551 2557 2594 2600 2622 2637 2637 2663 2665
In a simple case where only one set of measurements has to be imported, the scan() command is sufficient. It is given the location of the data file. The term “location” designates the precise location of the file in the tree structure of the computer’s file system. Here, it will be assumed that the file is in the working directory (which can be changed with the setwd() command or using RStudio utilities):
> x <- scan("poids.dat")
Read 10 items
> head(x, n = 3)
[1] 2523 2551 2557
The head() command enables the first observations of a variable to be displayed. The option n= enables their number to be specified (by default, n=6).
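For readers who do not have the file poids.dat at hand, the sketch below recreates its contents in a temporary file before reading it with scan(). The use of tempfile() and writeLines() is a convenience for this example and is not part of the original procedure.

```r
# Write the ten weights to a temporary text file (stand-in for poids.dat).
tmp <- tempfile(fileext = ".dat")
writeLines("2523 2551 2557 2594 2600 2622 2637 2637 2663 2665", tmp)

# scan() reads whitespace-separated numbers into a numeric variable.
x <- scan(tmp)
first3 <- head(x, n = 3)

unlink(tmp)   # remove the temporary file
```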
1.5.2. Multivariate data
In more complex, more realistic cases, data are multivariate with the variables generally organized in columns and the observations in rows [WIC14]: each line of the file, therefore, represents a statistical unit in which several measurements or data points (variables or fields) have been collected, the latter being separated from one another by commas, semicolons, spaces or tabs. In all cases, the command to use is read.table() (the field delimiter is a space or a tab) or one of its shortcuts: read.csv() (comma-type separator) and read.csv2() (semicolon-type separator).
Consider the file birthwt.dat, whose first three lines are:
0 19 182 2 0 0 0 1 0 2523
0 33 155 3 0 0 0 0 3 2551
0 20 105 1 1 0 0 0 1 2557
There are ten columns (that is, ten variables) and each value is separated from the next by a space. The names of the variables do not appear anywhere, but we know that they are the following variables: weight status of the baby at birth low (= 1 if weight < 2.5 kg, 0 otherwise), age of the mother (years), lwt weight of the mother (in pounds), race ethnicity of the mother (encoded in three classes, 1 = white, 2 = black and 3 = other), smoke (= 1 if consumption of tobacco during pregnancy, 0 otherwise), ptl (number of previous premature labours), ht (= 1 if history of hypertension, 0 otherwise), ui (= 1 if presence of uterine irritability, 0 otherwise), ftv (number of consultations with a gynecologist during the first trimester of the pregnancy), bwt for the weight of the babies at birth (in grams). Here is how these data can be imported in R, always assuming that the data file is located in the working directory of R:
> bt <- read.table("birthwt.dat", header = FALSE)
> varnames <- c("low", "age", "lwt", "race", "smoke", "ptl", "ht", "ui", "ftv", "bwt")
> names(bt) <- varnames
> head(bt)
  low age lwt race smoke ptl ht ui ftv  bwt
1   0  19 182    2     0   0  0  1   0 2523
2   0  33 155    3     0   0  0  0   3 2551
3   0  20 105    1     1   0  0  0   1 2557
4   0  21 108    1     1   0  0  1   2 2594
5   0  18 107    1     1   0  0  1   0 2600
6   0  21 124    3     0   0  0  0   0 2622
If the data file had had the following form:
0,19,182,2,0,0,0,1,0,2523
0,33,155,3,0,0,0,0,3,2551
0,20,105,1,1,0,0,0,1,2557
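which is typically the export format offered by a spreadsheet program such as Excel, then the command read.table() could simply have been replaced by read.csv() (or the option sep= of read.table() could have been changed). Since read.csv() assumes by default that the first line contains the variable names, the option header=FALSE must be kept for a file without a header line.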
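The two approaches can be compared on the three comma-separated lines shown above. In this sketch, the lines are passed through the text= argument instead of a file name, purely so that the example is self-contained; the names csv_lines, bt and bt2 are illustrative.

```r
csv_lines <- c("0,19,182,2,0,0,0,1,0,2523",
               "0,33,155,3,0,0,0,0,3,2551",
               "0,20,105,1,1,0,0,0,1,2557")

# read.csv() assumes a header line by default, hence header = FALSE.
bt <- read.csv(text = csv_lines, header = FALSE)
names(bt) <- c("low", "age", "lwt", "race", "smoke",
               "ptl", "ht", "ui", "ftv", "bwt")

# Equivalent import with read.table() and an explicit separator.
bt2 <- read.table(text = csv_lines, sep = ",", header = FALSE)
```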
1.5.3. Storing the data in an external file
In the univariate case as in the multivariate case, the data processed in R can be exported using write.table() or write.csv() to create text files similar to those discussed above. For example:
> write.csv(bt, file = "bt.csv")
will save the data table imported with the read.table() command in the form of a file where the values of the variables are separated by commas, and it will be possible to open the bt.csv file with a spreadsheet program such as Microsoft Excel.
If data must be directly saved in the R format, save(bt, file = "bt.RData") will have to be used, RData (or rda) constituting the extension reserved for R database files.
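A round trip through both formats can be sketched as follows. The small data frame, the temporary file names and the option row.names = FALSE (which omits the column of row numbers that write.csv() adds by default) are choices made for this example; load() is the base R counterpart of save().

```r
# Small stand-in for the bt data frame (two of its variables only)
bt <- data.frame(low = c(0, 0, 0), bwt = c(2523, 2551, 2557))
original <- bt

csv_file <- tempfile(fileext = ".csv")
rda_file <- tempfile(fileext = ".RData")

# Text export, then re-import
write.csv(bt, file = csv_file, row.names = FALSE)
bt_csv <- read.csv(csv_file)

# R format: save(), remove the object, then restore it with load()
save(bt, file = rda_file)
rm(bt)
load(rda_file)   # recreates the object named bt

unlink(c(csv_file, rda_file))
```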
1.6. Multidimensional data management
1.6.1. Construction of a structured data table
So far, we have created and manipulated different variables (x, y and z, mainly). A more natural way to account for this association between observations is to gather these variables inside the same data structure or table, called a “data frame” in R.
The general structure of a data frame is illustrated in Figure 1.1. The hypothetical data are gathered in a data frame named a. Data are collected on seven statistical units, arranged in rows, and the four variables correspond to a numerical score (score), the gender of the individual (gender), intelligence quotient (IQ) and socio-economic level (SES). A data frame is thus a rectangular table that can hold variables of different types (numbers, characters, factors, etc.), provided that all columns have the same length. Each row designates a statistical unit. The first unit thus has a score of 1, is of male gender, has an IQ of 92 and a socio-economic status of C. The columns of a data frame may be named, which allows the variables to be accessed by name.
This principle can be applied to the variables x, y and z used in the preceding sections. These three variables can be stored in three separate columns by using the command data.frame(). Here is how one can proceed in R:
> d <- data.frame(x, y, z)
> names(d) <- c("weight.baby", "weight.mother", "cig.mother")
> head(d, n = 3)
  weight.baby weight.mother cig.mother
1        2523          82.7         NS
2        2551          70.5         NS
3        2557          47.7          S
It should be noted that we took the opportunity to change the names of the original variables. The first baby therefore weighs 2,523 g, and its mother, who weighs 82.7 kg, does not smoke. To display only this information, the table, named d, is indexed by the corresponding row number, for example:
> d[1, ]
  weight.baby weight.mother cig.mother
1        2523          82.7         NS
If nothing is specified for the column numbers, all columns are selected. In contrast, to display the value of the second variable for the first two observations, we will use:
> d[c(1, 2), 2]
[1] 82.7 70.5
Instead of using column numbers, it is perfectly possible to use the name of the variable. For example, a statement such as d$weight.baby will return all values of the weight.baby variable, so that d$weight.mother[1:2] would return the first two observations of the variable weight.mother, as in the previous case.
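The different indexing notations can be checked against one another. The sketch below rebuilds the first three rows of d from the values printed above; the object names on the left-hand side are illustrative.

```r
# First three observations of x, y and z, as printed above.
x <- c(2523, 2551, 2557)
y <- c(82.7, 70.5, 47.7)
z <- c("NS", "NS", "S")
d <- data.frame(weight.baby = x, weight.mother = y, cig.mother = z)

by_position <- d[c(1, 2), 2]                # row numbers, column number
by_name     <- d[1:2, "weight.mother"]      # row numbers, column name
by_dollar   <- d$weight.mother[1:2]         # extract the column, then index it
by_bracket  <- d[["weight.mother"]][1:2]    # same, with the [[ ]] operator

first_row  <- d[1, ]                                  # a one-row data frame
one_col_df <- d[1:2, "weight.mother", drop = FALSE]   # keep the data frame structure
```

The first four expressions return the same numeric vector. Selecting a single column with `[ , ]` returns a plain vector unless drop = FALSE is specified.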
1.6.2. Birthwt data
Comprehensive data on the weight of the babies discussed previously are included among the example datasets provided with R, and they can be loaded using the data() command. The data come from an epidemiological study carried out in the 1980s [HOS89] to assess the risk factors for low birth weight in American children. This dataset will be used systematically in the following chapters, with the exception of the chapter on the analysis of survival data (Chapter 7). Once loaded, this data table is available under the name birthwt:
> data(birthwt, package = "MASS")
> c(nrow(birthwt), ncol(birthwt))
[1] 189  10
> names(birthwt)
[1] "low"   "age"   "lwt"   "race"  "smoke" "ptl"   "ht"
[8] "ui"    "ftv"   "bwt"
> head(birthwt, n = 2)
   low age lwt race smoke ptl ht ui ftv  bwt
85   0  19 182    2     0   0  0  1   0 2523
86   0  33 155    3     0   0  0  0   3 2551
The birthwt object is a table (data frame) comprising ten variables, each comprising 189 observations. Information on 189 statistical units is therefore available in this prospective study. To access the values of a particular variable, the following notation will be used: the name of the table (birthwt) followed by the name of the variable, prefixed with the dollar sign ($). For example, the first five values for the weight of babies (bwt) are thus:
> birthwt$bwt[1:5]
[1] 2523 2551 2557 2594 2600
The meaning of each variable is given above. Some variables are strictly binary (0/1), such as low, smoke, ht and ui, while the other variables are numeric, either with discrete values, such as ftv, or with values assumed to be continuous, such as lwt or bwt. For the binary variables, a value of 1 means that the sign is present (the mother smokes, the mother has a history of high blood pressure, etc.). It costs nothing to add more informative labels to these variables. Ethnicity (race) is a special case: it is a qualitative variable, but R actually treats it as a numerical variable:
> summary(birthwt$race)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   1.000   1.000   1.847   3.000   3.000
The summary() command allows the distribution of numerical and categorical variables to be summarized, and applies equally to a single variable or to a whole data table. A way to recode the categories of the qualitative variables in the birthwt table is as follows:
> yesno <- c("No", "Yes")
> birthwt$smoke <- factor(birthwt$smoke, labels = yesno)
> birthwt$race <- factor(birthwt$race, levels = c(1, 2, 3),
                         labels = c("White", "Black", "Other"))
The same approach will be used for low, ht and ui (using the labels yesno), and we will then verify that the final data are in the expected format using summary(), which makes it possible to quickly identify the variables treated as numerical (numerical summary with min, max, mean) or as qualitative (table of counts):
> birthwt <- within(birthwt, {
    low <- factor(low, labels = yesno)
    ht <- factor(ht, labels = yesno)
    ui <- factor(ui, labels = yesno)
  })
Rather than repeating the name of the data frame followed by the name of the variable each time, the operations have been grouped using the within() command. It is unnecessary to specify the numeric codes (levels=) for these three variables: they are all binary, coded 0/1, so the sorted codes already match the order of the labels.
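One way to check a recoding of this kind is to cross-tabulate the original codes against the new labels. The following sketch uses a small toy data frame whose values are illustrative and not taken from birthwt; the names toy, recoded, check_low and check_race are likewise assumptions.

```r
# Toy data with the same coding scheme as low (0/1) and race (1/2/3).
toy <- data.frame(low  = c(0, 1, 0, 1, 1),
                  race = c(1, 3, 2, 1, 3))
yesno <- c("No", "Yes")

recoded <- within(toy, {
  low  <- factor(low, labels = yesno)
  race <- factor(race, levels = c(1, 2, 3),
                 labels = c("White", "Black", "Other"))
})

# Each numeric code should fall under exactly one label.
check_low  <- table(code = toy$low,  label = recoded$low)
check_race <- table(code = toy$race, label = recoded$race)
print(check_low)
print(check_race)
```

If a label had been attached to the wrong code, counts would appear off the diagonal of these tables.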
At this point, here is what the recoding of the variables gives when using the summary() command on the first five variables:
> summary(birthwt[, 1:5])
  low          age             lwt           race     smoke
 No :130   Min.   :14.00   Min.   : 80.0   White:96   No :115
 Yes: 59   1st Qu.:19.00   1st Qu.:110.0   Black:26   Yes: 74
           Median :23.00   Median :121.0   Other:67
           Mean   :23.24   Mean   :129.8
           3rd Qu.:26.00   3rd Qu.:140.0
           Max.   :45.00   Max.   :250.0
It is now possible to answer the following questions: what is the average weight of women who smoked during their pregnancy? How many cases of hypertension can be identified in women weighing more than 60 kg (knowing that the measurements in the table are expressed in pounds)? What is the minimum weight of babies whose mothers did not exhibit interuterine pain? The following is a possible solution for each of these questions:
> mean(birthwt$lwt[birthwt$smoke == "Yes"])  ## weight in pounds
[1] 128.1351
> table(birthwt$ht[birthwt$lwt / 2.2 > 60])  ## weight in kg

 No Yes
 55   7
> min(birthwt$bwt[birthwt$ui == "No"])
[1] 1135
From now on, rather than repeating the name of the data frame followed by $ to access one or more variables, as in the previous example, it is preferable to use the with() command, which indicates the data frame in which the variables are to be looked up:
> with(birthwt, lwt[1:5])
[1] 182 155 105 108 107
> with(birthwt, mean(lwt[smoke == "Yes"]))
[1] 128.1351
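The equivalence between the $ notation and with() can be illustrated on a small table. The sketch below assumes a data frame built from the six records printed earlier, with the codes left as numbers; tapply() is added as one possible way to obtain the mean for each group in a single call.

```r
# Four of the ten variables, for the six records printed earlier (numeric codes).
bt <- data.frame(
  lwt   = c(182, 155, 105, 108, 107, 124),
  smoke = c(0, 0, 1, 1, 1, 0),
  ui    = c(1, 0, 0, 1, 1, 0),
  bwt   = c(2523, 2551, 2557, 2594, 2600, 2622)
)

m_dollar <- mean(bt$lwt[bt$smoke == 1])       # explicit data frame prefix
m_with   <- with(bt, mean(lwt[smoke == 1]))   # variables looked up inside bt

min_bwt  <- with(bt, min(bwt[ui == 0]))       # lightest baby among ui == 0
by_smoke <- with(bt, tapply(lwt, smoke, mean))  # mean weight per smoking group
```

Here m_dollar and m_with hold the same value; with() only changes where R looks for the variable names.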
1.7. Key points
– A variable is represented as a list of numbers or characters (single or multiple).
– The variables are generally arranged in columns in a rectangular array (data frame).
– The usual arithmetic operations operate on each element (observation) of a variable.
– It is possible to index observations by their number or position in a list of values, or by using logical expressions.
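These points can be summed up in a few lines. The vector below holds the first five values of lwt printed earlier; the names w and w_kg are illustrative, and the factor 2.2 is the pounds-to-kilograms conversion used in this chapter.

```r
w <- c(182, 155, 105, 108, 107)   # mothers' weights in pounds

w_kg <- w / 2.2                   # arithmetic is applied to every element

second   <- w[2]                  # indexing by position
heavy    <- w[w > 150]            # indexing by a logical expression
n_over60 <- sum(w_kg > 60)        # counting the TRUE values of a condition
```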
1.8. Going further
The book by Phil Spector [SPE08] provides additional information about importing external data sources (including the case of relational databases), merging multiple data sources with the merge() command, handling strings and representing dates in R. Other works are available online, free of cost or for a moderate price, for example R Programming for Data Science by Roger Peng (https://leanpub.com/rprogramming) or The Elements of Data Analytic Style by Jeff Leek (https://leanpub.com/datastyle).
1.9. Applications
1) Plasma viral load describes the amount of virus (for example, HIV) in a blood sample. This viral marker, recorded as the number of copies per milliliter, allows the progression of infection and the effectiveness of treatments to be measured; most measurement instruments have a detection threshold of 50 copies/ml. Here follows a series of measurements, X, expressed in logarithms (base 10), collected on 20 patients:
3.64 2.27 1.43 1.77 4.62 3.04 1.01 2.14 3.02 5.62 5.51 5.51 1.01 1.05 4.19 2.63 4.34 4.85 4.02 5.92
As a reminder, a viral load of 100,000 copies/ml is equivalent to 5 log.
We want to answer the following:
a) Indicate how many patients have a viral load considered as non-detectable.
b) The researcher realizes that the value 3.04 corresponds to a data entry error and must be changed to 3.64. Similarly, she has a doubt about the seventh measurement and decides to consider it as a missing value. Perform the corresponding transformations.
c) What is the median viral load level in copies/ml, for the data considered valid?
First and foremost, it is necessary to express the detection limit (50 copies/ml) in logarithmic units; this is equal to:
> log10(50)
Next, we need to select the observations that do not satisfy the condition X > 1.70 (the exact numeric result will be used, not the rounded value):
> X <- c(3.64, 2.27, 1.43, 1.77, 4.62, 3.04, 1.01, 2.14, 3.02, 5.62, 5.51,
         5.51, 1.01, 1.05, 4.19, 2.63, 4.34, 4.85, 4.02, 5.92)
> length(X[X <= log10(50)])
To replace the observation equal to 3.04, we use a simple logical test, which is safe here because this value occurs only once in the series:
> X[X == 3.04] <- 3.64
Concerning the seventh observation, we proceed in a similar manner, this time updating the value at the corresponding index:
> X[7] <- NA
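Finally, the median viral load for patients with a measurement considered valid can be calculated as follows. Because X[7] is now NA, the logical index carries a missing value into Xc, so na.rm = TRUE is needed for median() to return a number:
> Xc <- X[X > log10(50)]
> round(median(10^Xc, na.rm = TRUE), 0)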
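One point in the last step is worth checking: once X[7] has been set to NA, a logical index keeps a missing value in the result, and median() returns NA unless missing values are dropped. The sketch below replays the whole exercise with the same 20 values; the names threshold, n_undetectable, naive, med and Xv are illustrative.

```r
X <- c(3.64, 2.27, 1.43, 1.77, 4.62, 3.04, 1.01, 2.14, 3.02, 5.62, 5.51,
       5.51, 1.01, 1.05, 4.19, 2.63, 4.34, 4.85, 4.02, 5.92)
threshold <- log10(50)             # detection limit on the log10 scale

# a) number of measurements at or below the detection limit
n_undetectable <- sum(X <= threshold)

# b) correct the entry error, then flag the seventh measurement as missing
X[X == 3.04] <- 3.64
X[7] <- NA

# c) median in copies/ml among detectable, non-missing measurements
Xc    <- X[X > threshold]          # the NA at position 7 is carried over
naive <- median(10^Xc)             # NA, because Xc still contains a missing value
med   <- median(10^Xc, na.rm = TRUE)

# Alternative: which() keeps only the positions where the test is TRUE.
Xv <- X[which(X > threshold)]
round(median(10^Xv), 0)
```

Both na.rm = TRUE and which() remove the missing value before the median is computed, so they give the same result here.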
2) The dosage.txt file contains a series of 15 bioassays, stored in numerical format with three decimal places as follows: 6.379 6.683 5.120 ...
– use scan to read these data (read the online help for this command thoroughly, particularly the what= option);