Biostatistics and Computer-based Analysis of Health Data using R

Biostatistics and Health Science Set coordinated by Mounir Mesbah

Christophe Lalanne Mounir Mesbah

First published 2016 in Great Britain and the United States by ISTE Press Ltd and Elsevier Ltd

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:

ISTE Press Ltd
27-37 St George’s Road
London SW19 4EU
UK
www.iste.co.uk

Elsevier Ltd
The Boulevard, Langford Lane
Kidlington, Oxford, OX5 1GB
UK
www.elsevier.com

Notices

Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.

Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

For information on all our publications visit our website at http://store.elsevier.com/

The rights of Christophe Lalanne and Mounir Mesbah to be identified as the authors of this work have been asserted by them in accordance with the Copyright, Designs and Patents Act 1988.

British Library Cataloguing-in-Publication Data

A CIP record for this book is available from the British Library

Library of Congress Cataloging in Publication Data

A catalog record for this book is available from the Library of Congress

ISBN 978-1-78548-088-1

Printed and bound in the UK and US

A large number of the actions performed by means of statistical software amount to manipulating or even transforming digital data that literally represent statistical data. It is therefore paramount that we understand how statistical data are represented and how they can be used by software such as R. After importing, recoding and possibly transforming these data, the description of the variables of interest and the summary of their distribution in numerical and graphical form constitute a prior and fundamental step to any statistical modeling, hence the importance of these early stages in a statistical analysis project. In the second step, it is essential to fully master the commands that enable the calculation of the main measures of association in medical research and to know how to implement the conventional explanatory and predictive models: analysis of variance, linear and logistic regression and the Cox model. With few exceptions, using the R commands available upon installation of the software (base commands) is favored over the use of specialized commands in external R packages. The packages that must be installed to follow the applications presented in this book are listed in Chapter 1, in section 1.1.

This book assumes that the reader is already familiar with basic statistical concepts, particularly the calculation of central tendency and dispersion indicators for a continuous variable, contingency tables, analysis of variance and conventional regression models. The objective is to apply this knowledge using data sets described in numerous other works, even if the interpretation of the results remains minimal, and to become familiar quickly with the use of R on actual data. Emphasis is given to the management and manipulation of structured data, as this constitutes 60 to 80% of the work of the statistician. There are many works in French and in English on R, both from the technical and statistical point of view. Some of these works are oriented towards general aspects [SHA12], others are much more specialized [BIL14] or address more advanced concepts [HOT09]. The purpose of this book is to enable readers to get accustomed to R so that they can perform their own analyses and continue learning autonomously in the field of medical statistics.

In Chapter 1, the base commands for the management of data with R are introduced. The focus is on the creation and manipulation of quantitative and qualitative variables (recoding individual frequencies, counting missing observations), importing databases stored in the form of text files, and basic arithmetic operations (minimum, maximum, arithmetic mean, difference, frequency, etc.). We will also consider how to store preprocessed databases in text or in R formats. The objective is to understand how the data are represented in R and how to work with them.

Chapter 2 focuses on useful commands for the description of a data table comprised of quantitative or qualitative variables. The descriptive approach is strictly univariate, which is the prerequisite for any statistical approach. Basic graphic commands (histograms, density curves, bar or dot charts) will be presented in addition to the usual central tendency (mean, median) and dispersion (variance, quartiles) descriptive summaries. Pointwise and interval estimation using means and empirical proportions will also be addressed. The objective is to become acclimatized with the use of simple R commands operating on a variable, optionally specifying some options for the calculation, and also with the selection of statistical units.

Chapter 3 is devoted to the comparison of two samples for quantitative or qualitative measurements. The following hypothesis tests are addressed: Student’s t-test for independent or paired samples, non-parametric Wilcoxon test, χ2 test and Fisher’s exact test, and McNemar test, based on the main measures of association for two variables (mean difference, odds ratio and relative risk). From this chapter on, there will be less emphasis on the univariate description of each variable, but it is advisable to carry out exploratory data analysis as discussed in Chapter 2. The objective is to master the main statistical tests in the case where we are interested in the relationship between a quantitative variable and a qualitative variable or in the case of two qualitative variables.

Chapter 4 is an introduction to the analysis of variance (ANOVA), in which the objective is to explain the variability observed in a numeric response variable by taking into account a group or classification factor, together with interval estimation of mean differences. We will focus on the construction of an ANOVA table summarizing the various sources of variability and on the graphic methods allowing us to summarize the distribution of individual or aggregated data. The linear trend test will also be studied in the case where the classification factor can be considered as being naturally ordered. The objective is to understand how to construct an explanatory model where there are one or even two explanatory factors and to present the results of such a model numerically and graphically using R.

Chapter 5 focuses on the analysis of the linear relationship between two continuous quantitative variables. In the linear correlation approach, which assumes a symmetrical relation between the two variables, we are interested in the quantification of the magnitude and direction of the association in a parametric (Pearson correlation) or non-parametric manner (rank-based Spearman correlation) and in the graphic representation of this relation. Simple linear regression will be used in the event that one of the two numeric variables assumes the function of a response variable and the other one that of an explanatory variable. The useful commands for the estimation of the coefficients of the regression line, the construction of the ANOVA table associated with the regression and the prediction will be presented. The objective of this chapter remains identical to that of Chapter 4, i.e. to present the R commands necessary for the construction of a simple statistical model with two variables following an explanatory or predictive perspective.

In Chapter 6, the main measures of association found in epidemiological studies will be discussed: odds ratio, relative risk, prevalence, etc. R commands allowing the (pointwise and interval-based) estimation and the associated hypothesis tests will be illustrated with cohort or case-control study data. The implementation of a simple logistic regression model makes it possible to complete the range of statistical methods that allow the variability observed in binary response variables to be explained. The aim is to understand which R commands to use when the variables are binary, to summarize a contingency table in the form of association indicators or to model the relationship between a binary response (sick/healthy) and a qualitative explanatory variable from so-called grouped data.

The final chapter provides an introduction to the analysis of censored data, to the main tests related to the construction of a survival curve (log-rank or Wilcoxon tests) and finally to the Cox regression model. The specificity of censored data requires particular care in the coding of data in R and the objective is to present the R commands essential to the correct representation of survival data in numerical form, to their numerical (median survival) and graphical (Kaplan–Meier curve) summary, and to the implementation of common tests of risk measures.

At the end of each chapter, a few applications are provided with a few examples of commands that can be used to respond to most questions. It is sometimes possible to obtain identical results by using other approaches or other commands. Outputs from R are not reproduced but the reader is encouraged to try the proposed R instructions and alternative or additional instructions. It will be assumed that the data files being used are available in the working directory. All of the data files and the R commands used in this book can be downloaded from https://github.com/biostatsante.

The three appendices will help familiarize the reader with RStudio, the lattice package for the management of graphical outputs, and the Hmisc and rms packages for advanced data management and modeling. These appendices are obviously not a substitute for the work of John Verzani, Paul Murrell and Frank Harrell [VER11, MUR05, HAR01].

Due to the design of the layout, some R outputs have been truncated or reformatted. Therefore, there could be differences when the reader attempts to reproduce the commands of this book.

An index of the R commands used in the illustrations is available at the end of the book.

R is more than a simple software program for statistics; it is a language for the manipulation of statistical data [IHA96, VEN02]. This partly explains why it can seem difficult and not very user-friendly to users accustomed to drop-down menus such as those offered by SPSS (although SPSS also offers a basic macro language). This chapter allows the reader to discover the elements of the language and to become familiar with the mechanisms used to represent statistical data in R. In the illustrations that follow, the R commands are prefixed with the symbol >, which designates the R console prompt. This symbol must therefore not be copied when testing the proposed instructions.

1.1. Before proceeding

1.1.1. Installing R

The installation of R is relatively easy and instructions can be found at the following website: http://cran.r-project.org. The software program is available for Windows, Linux and Mac. In addition to the R program, the installer provides an R script editor, online help and a set of base packages. The packages include commands specific to a certain area (graphics commands, modeling commands, etc.) and make it possible to enhance the base features of R.

1.1.2. RStudio

Although R is sufficient to start or perform statistical analyses, RStudio (www.rstudio.com) provides a particularly pleasant working environment for R. It includes a powerful R script editor, a console in which the user can execute the commands (or send those entered in the script editor), online help, a graphics browser, a data viewer and much more [VER11]. A brief introduction to the software is provided in the appendices.

1.1.3. List of useful packages

Although the R base packages allow for most statistical analyses addressed in this book, a number of additional packages must be installed to reproduce some of the proposed applications. In addition to the base packages that are used throughout this book (lattice, foreign, survival, etc.), the following packages can be installed by means of the install.packages() command, in the format install.packages("reshape2"): reshape2, gridExtra, vcd (Chapter 3); epiR, car (Chapter 4); ppcor, psych (Chapter 5); epicalc, ROCR (Chapter 6).

These packages can also be installed directly from RStudio using the package manager. Before installing a package, it is possible to consult its online help on the CRAN website (http://cran.r-project.org). The CRAN website also offers the ability to search packages by theme (“Task Views”).

To access the commands of a package, the command library() can be used, indicating the name of the package.

1.1.4. Find help

As with any piece of software, it is important to know how to find help on a specific command or from a keyword. The commands help() and help.search() or apropos() achieve these two functions in R. RStudio also provides an integrated search engine that greatly facilitates access to the online help pages.
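As a minimal sketch of these three commands, the results are stored in objects here (the names h, hs and hits are illustrative only) so that nothing is displayed until the object is printed; typing help("mean") directly at the prompt opens the page at once. The search is restricted to the stats package only to keep it quick.

```r
# Help page for a known command; printing 'h' displays the page
h <- help("mean")

# Keyword search in the documentation of one installed package
hs <- help.search("median", package = "stats")

# Names of available commands matching a pattern (regular expression)
hits <- apropos("^read\\.")
hits
```

Printing hs lists the help pages that mention the keyword, which is useful when the name of the command is not known in advance.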

1.1.5. R scripts

The easiest and recommended way to take advantage of the capabilities of RStudio in the management of a statistical analysis project is to enter the commands in the script editor while ensuring that these commands are saved in an R script file. The commands entered in the RStudio editor can be sent directly to the console and R will display the result in the same console. This interactive approach allows a command script to be built incrementally and enables the commands to be corrected or adjusted as and when needed.

It is also important to document the main stages of the statistical analysis as well as commands that may not be clear to an external reader. In fact, one should always anticipate that the R command script could be reused in the future by a third person. Comments are useful in this case. You can add the prefix # to a line of text so that it is treated as a comment and not as a statement by R. The commands in the R script should also be organized logically, clearly distinguishing the commands related to the importation of data, to univariate and bivariate descriptive statistics, to statistical models, etc. All this must be presented in very distinct sections and rely as little as possible on temporary variables or data tables that would significantly alter the original data table without any documentation or any storage to disk. The original data, as a matter of fact, should always be accessible at any point of the analysis. Intermediate data tables can be saved separately. In the case of large analysis projects, it is preferable to create several script files to store the commands corresponding to the different stages of the analysis.
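The layout described above might look as follows. This is only a sketch: the object names (raw, d), the file name and the small inline data set are hypothetical and stand in for a real import step.

```r
## 1. Data import -------------------------------------------------
# Hypothetical raw data; in practice this would come from a file
raw <- data.frame(weight = c(2523, 2551, 2557, 2594, 2600),
                  smoker = c(1, 1, 2, 2, 2))

## 2. Recoding ----------------------------------------------------
# Work on a copy so that the original table stays accessible
d <- raw
d$smoker <- factor(d$smoker, levels = c(1, 2), labels = c("NS", "S"))

## 3. Univariate description --------------------------------------
summary(d$weight)
table(d$smoker)

## 4. Storage of the intermediate table ---------------------------
# Saved separately (here in a temporary directory)
save(d, file = file.path(tempdir(), "d_recoded.RData"))
```

Keeping raw untouched means that any recoding step can be checked against the original values at any point of the analysis.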

1.2. Data representation in R

1.2.1. Management of numerical variables

Suppose that we have the weight of ten newborns available (variable x, in grams) and that of their mothers (variable y, in kg), as illustrated in Table 1.1.

Table 1.1. Artificial data about weights at birth

In the following example, a variable called x is created in which the observations displayed in Table 1.1 are stored in the form of a simple list of numbers:

> x <- c(2523, 2551, 2557, 2594, 2600, 2622, 2637, 2637, 2663, 2665)

The symbol <- is used (in preference to =) to assign a series of measurements to a named variable. These values can now be displayed in R using the print(x) command or by typing the name of the variable:

> x
[1] 2523 2551 2557 2594 2600 2622 2637 2637 2663 2665

We will make use of the same procedure with y:

> y <- c(82.7, 70.5, 47.7, 49.1, 48.6, 56.4, 53.6, 46.8, 55.9, 51.4)

Spaces have no importance when a list of numbers is entered and, with rare exceptions, the same holds for other expressions. The decimal separator must be the period, since R follows the English notation for numbers.

It can be verified that R has retained two variables in what is called the workspace with the help of the command ls():

> ls()
[1] "x" "y"

Therefore, it is always possible to access these variables as long as the R session is not closed and as long as neither of these variables has been removed in the meantime.

The number of observations for variable x corresponds to the length of x (number of elements), when there are no missing values:

> length(x)
[1] 10

The values of a variable, or in more statistical terms, the values taken by a variable for a sample of size n, can be selected by specifying the index or number of the observation between square brackets. Section 1.3 will cover how it is possible to select more than one observation at a time, how observations can be selected on the basis of an external criterion and how this approach can be generalized to the selection of a set of observations gathered for the same statistical unit. The fifth observation is obtained in the following manner:

> x[5]
[1] 2600

The observations are indexed from 1 to n, where n is the total number of items in the variable under consideration. The first and the last observations will thus be obtained using the commands x[1] and x[10].
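Since the last index is simply the length of the variable, it does not need to be hard-coded. A short sketch using the weights above (the name n is introduced here for illustration):

```r
x <- c(2523, 2551, 2557, 2594, 2600, 2622, 2637, 2637, 2663, 2665)

n <- length(x)  # number of elements in x
x[1]            # first observation
x[n]            # last observation, whatever the length of x
```

Writing x[n] rather than x[10] keeps the command valid if observations are later added to or removed from x.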

1.2.2. Operations with a numerical variable

It should be noted that the weights of the babies are expressed in grams, while those of the mothers are in kg. The weight of the 5th baby in kg is obtained using a simple arithmetic division:

> x[5]/1000
[1] 2.6

The same division can be applied to all of the observations, without having to carry it out for each element of x, as shown in the following example:

> x/1000
[1] 2.523 2.551 2.557 2.594 2.600 2.622 2.637 2.637 2.663 2.665

The previous operation did not change the values of the original variable x, but it is easy to create a new variable in which the weight of the babies, expressed in kg, will be stored.

> x2 <- x/1000
> x
[1] 2523 2551 2557 2594 2600 2622 2637 2637 2663 2665
> x2
[1] 2.523 2.551 2.557 2.594 2.600 2.622 2.637 2.637 2.663 2.665

It might also be desirable to replace the old values of x with the new values. In this case, we can replace the statement x2 <- x/1000 with x <- x/1000. However, it will not be possible to return to the old values of x after this operation is carried out. As when deleting a variable with the command rm(), any update of the elements of a variable is final for the session in progress.

In addition to the basic arithmetic operators (addition, subtraction, multiplication and division), R offers most of the functions found in a calculator: logarithm, exponential, absolute value, etc. Therefore, log(x) would return the values of x after transformation by the natural logarithm, whereas for the base 10 logarithm, log10(x) should be used.
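These calculator-like functions are applied element by element, just like the division above. A minimal sketch on the first three weights (the name w is illustrative):

```r
x <- c(2523, 2551, 2557, 2594, 2600, 2622, 2637, 2637, 2663, 2665)
w <- x[1:3]

log(w)            # natural logarithm
log10(w)          # base 10 logarithm
log(w, base = 2)  # any other base, through the 'base' option
exp(log(w))       # exp() is the inverse of log(): returns w
abs(w - mean(x))  # absolute deviation from the mean of x
```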

Finally, R offers a wide range of commands that are more specifically oriented towards the statistical processing of data, for example to summarize the distribution of the observed values (central tendency, dispersion, range, etc.). The arithmetic mean of the values of x is obtained with the command mean():

> mean(x)
[1] 2604.9

Other commands, to which we will return in Chapter 2, make it possible for basic information to be obtained about the shape of the distribution of a continuous variable, such as the range or the variance, for example:

> range(x)
[1] 2523 2665
> c(min(x), max(x))
[1] 2523 2665
> var(x)
[1] 2376.767

As can be observed in the previous instructions, it is perfectly possible to combine two commands that return the same type of result, for example c(min(x), max(x)). As in the case of creating a list of numbers, the command c() allows the results of commands returning numbers to be gathered in the same list.

The following illustrations are based on the same idea as the division operation mentioned above: each operation (squaring the values of x, subtracting their mean from these squares) is applied on a per-element basis.

> sum(x)
[1] 26049
> sum(x^2)
[1] 67876431
> sum(x^2 - mean(x))
[1] 67850382

In the previous example, mean(x) is a constant (which depends upon the data but does not vary from one element to the next) that we subtract from each x_i^2 (where i is the observation number or index in variable x), so that 11 different values are involved in total. In the expression x - x^2, on the other hand, the square of x is subtracted from each element of x (ten operations in total, each involving a different pair of numbers).
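Note that subtracting the mean from the squared values is not the same as squaring the centered values; only the latter leads to the variance. A minimal sketch, assuming the vector x defined earlier (the name n is introduced here for illustration):

```r
x <- c(2523, 2551, 2557, 2594, 2600, 2622, 2637, 2637, 2663, 2665)
n <- length(x)

sum(x^2 - mean(x))              # mean subtracted from each squared value
sum((x - mean(x))^2)            # sum of squared deviations from the mean
sum((x - mean(x))^2) / (n - 1)  # sample variance computed by hand
var(x)                          # same value, using the built-in command
```

The parentheses decide which operation is carried out first on each element, so they should be placed with care.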

The commands sort(), order() and rank() make it possible to sort the elements of a variable or to work with the ranks of the observations, that is to say with their position in the sorted variable.
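The three commands return different things, which a short sketch on the mothers' weights makes clear: sort() returns sorted values, order() returns the indices that would sort the variable, and rank() returns the rank of each observation in its original position.

```r
y <- c(82.7, 70.5, 47.7, 49.1, 48.6, 56.4, 53.6, 46.8, 55.9, 51.4)

sort(y)                     # values in increasing order
sort(y, decreasing = TRUE)  # values in decreasing order
order(y)                    # indices of the observations, smallest first
rank(y)                     # rank of each observation, in the original order
y[order(y)]                 # indexing by order(y) gives the same result as sort(y)
```

Here order(y) starts with 8 because the eighth mother has the lowest weight (46.8 kg), and accordingly the eighth element of rank(y) is 1.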

1.2.3. Management of categorical variables

Qualitative or categorical variables have their own distinct status in most statistical packages: their modalities are represented by numbers (1, 2, 3, etc.) but generally they are associated with identifiers, called labels in R. In the previous example, data about the weight of babies at birth (x) and the weight of their mother (y) were available. Suppose that it is also known whether the mother smoked during the first trimester of her pregnancy, and let this variable be called z. When the mother did not smoke during this period, the variable is equal to 1; when the mother smoked, the variable is equal to 2. The data augmented with variable z are presented in Table 1.2.

Table 1.2. Augmented artificial data about weights at birth

The data will be entered in R as was done for the variables x and y:

> z <- c(1, 1, 2, 2, 2, 1, 1, 1, 2, 2)
> z
[1] 1 1 2 2 2 1 1 1 2 2

The factor(z) command allows this numerical variable to be converted into a categorical variable. To associate labels with numerical codes, the labels= option is used. The levels= option enables the numerical codes of the levels of the variable to be specified. The value 1 will be associated with the NS modality and the value 2 with the S modality:

> factor(z, levels = c(1, 2), labels = c("NS", "S"))
[1] NS NS S  S  S  NS NS NS S  S
Levels: NS S

When R displays the contents of the result, it is effectively a variable of the factor type, with two levels, which are unordered here: NS and S. Some operations, such as the calculation of the arithmetic mean, do not generally make any sense with this type of variable, and R does not carry them out either. Simple or crossed tabulation operations however remain entirely valid. Another important aspect: the labels associated with a qualitative variable can be recovered by using the levels() command, while nlevels() returns the total number of modalities:

> z <- factor(z, labels = c("NS", "S"))
> levels(z)
[1] "NS" "S"
> nlevels(z)
[1] 2

1.2.4. Manipulation of categorical variables

It is often necessary to recode a k-class qualitative variable into j < k classes or to associate labels with the numerical modalities of a qualitative variable. For example, suppose that data on forced expiratory volume in 1 second (FEV1) (variable fev1) from ten patients are provided in the form of a three-level ordered variable: critical (1), low (2) and normal (3). Here, the sample() command will be used, which allows random data to be generated by resampling from a list of values. This is an R command which expects three options: the first gives the list of permissible values, the second the number of observations to be generated and the last the type of random draw that is sought: with replacement (replace=TRUE) or without replacement (replace=FALSE):

> fev1 <- sample(1:3, 10, replace = TRUE)
> fev1
[1] 3 2 1 3 3 3 2 2 1 2

Note that the values are random and will not necessarily be identical when this command is run in another R session. To ensure the reproducibility of such simulations, one should call the set.seed() command beforehand.
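The effect of set.seed() can be checked by drawing the same sample twice. In this sketch the seed value 101 is arbitrary and the names a and b are illustrative; the values drawn are not expected to match those printed above.

```r
set.seed(101)                          # fix the state of the random generator
a <- sample(1:3, 10, replace = TRUE)

set.seed(101)                          # same seed again
b <- sample(1:3, 10, replace = TRUE)

identical(a, b)                        # TRUE: both draws are the same
```

Recording the seed in the script is enough for a third person to regenerate exactly the same simulated data with the same version of R.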

First, the numerical values will be replaced with the labels described previously (1 = critical, 2 = low and 3 = normal):

> fev1 <- factor(fev1, levels = c(1, 2, 3),
                 labels = c("critical", "low", "normal"))
> fev1
 [1] normal   low      critical normal   normal   normal   low
 [8] low      critical low
Levels: critical low normal

The distribution of the counts by class can be verified quickly using the table() command, which enables simple or cross-tabulations that will be discussed in Chapter 2:

> table(fev1)
fev1
critical      low   normal
       2        4        4

If the aim is to recode the three-class variable fev1 into a two-class variable by aggregating the first two modalities or levels, the levels() command can be used as follows:

> levels(fev1)
[1] "critical" "low"      "normal"
> levels(fev1)[1:2] <- "critical or low"
> levels(fev1)
[1] "critical or low" "normal"
> table(fev1)
fev1
critical or low          normal
              6               4

It should be observed that the modalities of the variable fev1 have effectively been modified and the frequencies of the first two original levels can now be found in the first level, critical or low. It is advisable to always verify that the recoding of the levels of a categorical variable has been carried out as expected, using the levels() or table() commands.
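One way to carry out this verification is to recode a copy of the variable and cross-tabulate the original against the recoded version. A minimal sketch, in which the values of fev1 are entered by hand to match the sample shown above and the name fev1.bin is hypothetical:

```r
fev1 <- factor(c(3, 2, 1, 3, 3, 3, 2, 2, 1, 2), levels = c(1, 2, 3),
               labels = c("critical", "low", "normal"))

fev1.bin <- fev1                            # the original variable is kept intact
levels(fev1.bin)[1:2] <- "critical or low"  # merge the first two levels

# Each original level should fall into exactly one recoded level
tab <- table(original = fev1, recoded = fev1.bin)
tab
```

Every row of the resulting table has a single non-zero cell, which confirms that critical and low were both mapped to the merged level and that normal was left unchanged.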

1.3. Selection of observations

1.3.1. Index-based selection

Considering the weight data discussed above, one could select more than one observation, for example the third and the fifth:

> x[c(3, 5)]
[1] 2557 2600

or from the third to the fifth:

> x[3:5]
[1] 2557 2594 2600

The slightly peculiar syntax 3:5 actually designates the sequence of integers starting at 3 and ending at 5. This is, therefore, strictly equivalent to c(3, 4, 5). Thus, to access specific values of a variable, it suffices to provide a list of observation numbers (or indices). The same indexing principle applies of course to qualitative variables:

> fev1[2]
[1] critical or low
Levels: critical or low normal

1.3.2. Criterion-based selection

Suppose now that we want to obtain the weight of babies whose mother weighs less than 50 kg:

> x[y < 50]
[1] 2557 2594 2600 2637

The previous command reads: select the values of x for which the (logical) condition y < 50 holds. This condition can be directly visualized by typing y < 50 in R, which yields the following result:

> y < 50
[1] FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE

This selection principle based on an external criterion remains valid when the criterion is a categorical variable. To obtain the list of weights of infants whose mother is a non-smoker, the following command should be entered:

> x[z == "NS"]
[1] 2523 2551 2622 2637 2637

The double equal sign (==) is used to designate the logical condition "z equal to NS". Finally, it is possible to combine criteria relating to variables of the same type or of different types using logical operators, mainly & (and) and | (or). The following command returns the values of x such that z is equal to NS and y <= 55 (weight of babies whose mother does not smoke and weighs 55 kg or less):

> x[z == "NS" & y <= 55]
[1] 2637 2637
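A logical condition can also be stored, counted and located before being used for selection. The following sketch re-creates the three variables so that it runs on its own; the name cond is illustrative.

```r
x <- c(2523, 2551, 2557, 2594, 2600, 2622, 2637, 2637, 2663, 2665)
y <- c(82.7, 70.5, 47.7, 49.1, 48.6, 56.4, 53.6, 46.8, 55.9, 51.4)
z <- factor(c(1, 1, 2, 2, 2, 1, 1, 1, 2, 2), labels = c("NS", "S"))

cond <- z == "NS" & y <= 55  # one TRUE/FALSE value per observation
sum(cond)                    # number of observations meeting the condition
which(cond)                  # their positions
x[cond]                      # the corresponding baby weights

x[z == "S" | y < 50]         # mother smokes OR weighs less than 50 kg
```

Summing a logical vector counts its TRUE values, which gives a quick check of how many statistical units a selection will return.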

1.4. Representation and processing of missing values

In reality, it is rare for full data (without missing values) to be available. Regardless of how missing values are to be handled statistically, it is important to ensure that they are properly represented as such by the statistical software.

Returning to the previous example, suppose that the third observation for the weight of babies is actually missing or that we want to process it as such. This datum is represented by a dot (.) in Table 1.3.

Table 1.3. Artificial data about weight at birth including the missing datum

This missing value could have been represented by the lack of value (such as a blank cell in Microsoft Excel) or by any other symbol. Using the same approach as when variable x was first created, a way of entering the weight of the babies would be to replace the missing value with the term NA, which is the term reserved by R to encode missing values:

> c(2523, 2551, NA, 2594, 2600, 2622, 2637, 2637, 2663, 2665)

Since variable x has already been entered, it can simply be updated, in this case replacing the third element with NA:

> x[3] <- NA
> x
[1] 2523 2551   NA 2594 2600 2622 2637 2637 2663 2665

Care should be taken, as the command length(x) still returns 10: there are effectively ten observations, but one of them is missing. The command is.na() makes it possible to identify the missing values in a variable:

> is.na(x)
[1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> which(is.na(x))
[1] 3

As seen in the case of the selection of observations, the values returned by the command is.na() are Boolean values equal to true (TRUE) when the value of x is missing (NA), and false (FALSE) otherwise. This command will therefore return as many values as variable x contains. To facilitate the identification of the observation numbers of the missing values, it may be preferable to combine is.na() with the command which(). It should be noted that commands can be combined in an intuitive way: the which() command uses the result returned by is.na(), without having to create a dummy variable to store the result. An alternative consists of identifying the so-called complete cases with the command complete.cases():

> complete.cases(x)
[1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

Using the principle of observation selection used previously, the following command will thus return the values of y for which x has a non-missing value:

> y[complete.cases(x)]
[1] 82.7 70.5 49.1 48.6 56.4 53.6 46.8 55.9 51.4
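Missing values also propagate through most summary commands unless told otherwise. A minimal sketch with the weights above; na.rm= is a standard option of mean() and similar commands.

```r
x <- c(2523, 2551, NA, 2594, 2600, 2622, 2637, 2637, 2663, 2665)

sum(is.na(x))           # number of missing observations
sum(!is.na(x))          # number of observed values
mean(x)                 # NA: the mean cannot be computed with a missing value
mean(x, na.rm = TRUE)   # mean of the nine observed values
```

Counting the missing values first makes it explicit on how many observations a summary computed with na.rm = TRUE is actually based.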

1.5. Importing and storing data

1.5.1. Univariate data

The three sets of ten measures discussed in the preceding sections were of a sufficiently moderate size to allow their direct entry in R. However, in most cases, the data will have already been stored in an external file and the first step usually consists of importing them into R.

For example, here are the contents of the file poids.dat, which is a simple text file that can be viewed with any text editor:

2523 2551 2557 2594 2600 2622 2637 2637 2663 2665

In a simple case where only one set of measurements has to be imported, the scan() command is sufficient. It is given the location of the data file. The term “location” designates the precise location of the file in the tree structure of the computer’s file system. Here, it will be assumed that the working directory (editable with the setwd() command or using RStudio utilities) is the directory that contains the file:

> x <- scan("poids.dat")
Read 10 items
> head(x, n = 3)
[1] 2523 2551 2557

The head() command enables the first observations of a variable to be displayed. The option n= enables their number to be specified (by default, n=6).
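To try scan() without the book's data files, a file with the same contents can first be written to a temporary directory. This is only a sketch: the path f is hypothetical and the file is created by the example itself.

```r
# Create a small text file holding the ten weights
f <- file.path(tempdir(), "poids.dat")
writeLines("2523 2551 2557 2594 2600 2622 2637 2637 2663 2665", f)

x <- scan(f)    # reads whitespace-separated numbers
length(x)       # number of values read
head(x, n = 3)  # first three observations
tail(x, n = 2)  # last two observations
```

The tail() command is the counterpart of head() and is convenient for checking that the end of the file has been read correctly.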

1.5.2. Multivariate data

In more complex, more realistic cases, data are multivariate, with the variables generally organized in columns and the observations in rows [WIC14]: each line of the file, therefore, represents a statistical unit on which several measurements or data points (variables or fields) have been collected, the latter being separated from one another by commas, semicolons, spaces or tabs. In all cases, the command to use is read.table() (the field delimiter is a space or a tab) or one of its shortcuts: read.csv() (comma-type separator) and read.csv2() (semicolon-type separator).

Consider the file birthwt.dat, whose first three lines are:

0 19 182 2 0 0 0 1 0 2523
0 33 155 3 0 0 0 0 3 2551
0 20 105 1 1 0 0 0 1 2557

There are ten columns (that is, ten variables) and each value is separated from the next by a space. The names of the variables do not appear anywhere, but we know that they are the following: weight status of the baby at birth low (= 1 if weight < 2.5 kg, 0 otherwise), age of the mother (years), lwt weight of the mother (in pounds), race ethnicity of the mother (encoded in three classes, 1 = white, 2 = black and 3 = other), smoke (= 1 if consumption of tobacco during pregnancy, 0 otherwise), ptl (number of previous premature labours), ht (= 1 if history of hypertension, 0 otherwise), ui (= 1 if manifestation of interuterine pain, 0 otherwise), ftv (number of consultations with a gynecologist during the first trimester of the pregnancy), bwt for the weight of the babies at birth (in grams). Here is how these data can be imported in R, always assuming that the data file is located in the working directory of R:

> bt <- read.table("birthwt.dat", header = FALSE)
> varnames <- c("low", "age", "lwt", "race", "smoke", "ptl", "ht",
                "ui", "ftv", "bwt")
> names(bt) <- varnames
> head(bt)
  low age lwt race smoke ptl ht ui ftv  bwt
1   0  19 182    2     0   0  0  1   0 2523
2   0  33 155    3     0   0  0  0   3 2551
3   0  20 105    1     1   0  0  0   1 2557
4   0  21 108    1     1   0  0  1   2 2594
5   0  18 107    1     1   0  0  1   0 2600
6   0  21 124    3     0   0  0  0   0 2622

If the data file had had the following form:

0,19,182,2,0,0,0,1,0,2523
0,33,155,3,0,0,0,0,3,2551
0,20,105,1,1,0,0,0,1,2557

which is typically the export format offered by a spreadsheet program such as Excel, then the command read.table() could simply have been replaced by read.csv() (or the option sep= of read.table() could have been changed).
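The equivalence between the two commands can be checked on the three lines above, passed here as a character string through the text= option instead of a file name. The names csv, bt and bt2 are illustrative; col.names= is used to name the columns at import time rather than afterwards.

```r
csv <- "0,19,182,2,0,0,0,1,0,2523
0,33,155,3,0,0,0,0,3,2551
0,20,105,1,1,0,0,0,1,2557"

varnames <- c("low", "age", "lwt", "race", "smoke", "ptl", "ht",
              "ui", "ftv", "bwt")

# Shortcut for comma-separated values
bt <- read.csv(text = csv, header = FALSE, col.names = varnames)

# Same import with read.table() and an explicit separator
bt2 <- read.table(text = csv, sep = ",", header = FALSE, col.names = varnames)

all.equal(bt, bt2)  # both commands give the same data table
str(bt)             # structure of the imported table
```

It is worth setting header = FALSE explicitly with read.csv(), since this command otherwise treats the first line of the file as variable names.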

1.5.3. Storingthedatainanexternalﬁle

In the univariate case as in the multivariate case, the data processed in R can be exported using write.table() or write.csv() to create text files identical to those discussed above. For example:

> write.csv(bt, file = "bt.csv")

will save the data imported with the read.table() command as a file in which the values of the variables are separated by commas, and it will be possible to open the bt.csv file with a spreadsheet program such as Microsoft Excel.

If the data must be saved directly in the R format, save(bt, file = "bt.RData") will have to be used, RData (or rda) being the extension reserved for R database files.
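The two export routes can be compared with a small round trip. The sketch below uses a three-row data frame and temporary files, all of which are illustrative assumptions. The option row.names = FALSE is added here so that the row labels are not written as an extra first column.

```r
# Small illustrative data frame (three records, three variables)
bt <- data.frame(low = c(0L, 0L, 0L),
                 age = c(19L, 33L, 20L),
                 bwt = c(2523L, 2551L, 2557L))

csv_file   <- tempfile(fileext = ".csv")
rdata_file <- tempfile(fileext = ".RData")

# Text export; row.names = FALSE avoids writing a column of row labels
write.csv(bt, file = csv_file, row.names = FALSE)
bt_csv <- read.csv(csv_file)

# Export in R's own format; load() restores the object under its original
# name, so it is loaded into a separate environment to avoid overwriting bt
save(bt, file = rdata_file)
env <- new.env()
load(rdata_file, envir = env)

all.equal(bt, bt_csv)   # text round trip
all.equal(bt, env$bt)   # RData round trip
unlink(c(csv_file, rdata_file))
```

Unlike a text file, an RData file also keeps R-specific attributes such as factor levels, which is one reason to prefer it when the data will only be reused in R.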

1.6. Multidimensional data management

1.6.1. Construction of a structured data table

So far, we have created and manipulated different variables (x, y and z, mainly). A more natural way to account for this association between observations is to gather these variables inside the same data structure or table, called a "data frame" in R.

The general structure of a data frame is illustrated in Figure 1.1. The hypothetical data are gathered in a data frame named a. Data are collected on seven statistical units, arranged in lines, and the four variables correspond to a numerical score (score), the gender of the individual (gender), intelligence quotient (IQ) and socio-economic level (SES). A data frame is thus a rectangular table whose columns may hold variables of different types (numbers, characters, factors, etc.) but must all have the same length. A row designates a statistical unit. The first unit thus has a score of 1, is of male gender, has an IQ of 92 and a socio-economic status of C. The columns of a data frame may be named, which allows the variables to be accessed by name.

Figure 1.1. Structure of a data frame

This principle can be applied to the variables x, y and z used in the preceding sections. These three variables can be stored in three separate columns by using the command data.frame(). Here is how one can proceed in R:

> d <- data.frame(x, y, z)
> names(d) <- c("weight.baby", "weight.mother", "cig.mother")
> head(d, n = 3)
  weight.baby weight.mother cig.mother
1        2523          82.7         NS
2        2551          70.5         NS
3        2557          47.7          S

It should be noted that we took the opportunity to change the names of the original variables. The first individual therefore weighs 2,523 g, and the mother, who weighs 82.7 kg, does not smoke. To display only this information, the table, named d, is indexed by the corresponding row number, for example:

> d[1, ]
  weight.baby weight.mother cig.mother
1        2523          82.7         NS

If nothing is specified for the column numbers, it means that one wishes to select them all. In contrast, to display the value of the second variable for the first two observations, we will use:

> d[c(1, 2), 2]
[1] 82.7 70.5

Instead of using the column number, it is perfectly possible to use the name of the variable. For example, a statement such as d$weight.baby will return all values of the weight.baby variable, so that d$weight.mother[1:2] would return the first two observations of the variable weight.mother, as in the previous case.
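The different ways of reaching the same values can be put side by side. In this sketch the data frame d is rebuilt from the three rows displayed above rather than from x, y and z. The drop = FALSE variant is an addition that is not discussed in the text.

```r
# Illustrative values taken from the three rows displayed above
d <- data.frame(weight.baby   = c(2523, 2551, 2557),
                weight.mother = c(82.7, 70.5, 47.7),
                cig.mother    = c("NS", "NS", "S"))

d[1:2, 2]                  # by row and column position
d[1:2, "weight.mother"]    # by row position and column name
d$weight.mother[1:2]       # column extracted with $, then indexed
d[["weight.mother"]][1:2]  # column extracted with [[ ]], then indexed

# Keeping the result as a one-column data frame rather than a plain vector
d[1:2, "weight.mother", drop = FALSE]
```

Indexing by name is generally more robust than indexing by position, since it keeps working if columns are later added or reordered.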

1.6.2. Birthwt data

Comprehensive data on the weight of the babies discussed previously are included among the basic examples provided with R, and they can be imported using the data() command. The data come from an epidemiological study carried out in the 1980s [HOS89] to assess the risk factors for low birth weight in American children. This data set will be used systematically in the following chapters, with the exception of the one on the analysis of survival data (Chapter 7). Once imported, this data table is available under the name birthwt:

> data(birthwt, package = "MASS")
> c(nrow(birthwt), ncol(birthwt))
[1] 189  10
> names(birthwt)
 [1] "low"   "age"   "lwt"   "race"  "smoke" "ptl"   "ht"
 [8] "ui"    "ftv"   "bwt"
> head(birthwt, n = 2)
   low age lwt race smoke ptl ht ui ftv  bwt
85   0  19 182    2     0   0  0  1   0 2523
86   0  33 155    3     0   0  0  0   3 2551

The birthwt object is a table (data frame) comprising ten variables, each comprising 189 observations. Information on 189 statistical units is therefore available in this prospective study. To access the values of a particular variable, the following notation will be used: the table name (birthwt) followed by the name of the variable prefixed with the dollar ($) sign. For example, the first five values for the weight of babies (bwt) are thus:

> birthwt$bwt[1:5]
[1] 2523 2551 2557 2594 2600

The meaning of each variable is given above. Some variables are strictly binary (0/1), such as low, smoke, ht and ui, while the other variables are numeric, either with discrete values such as ftv or with values assumed to be continuous such as lwt or bwt. In the case of binary variables, a value of 1 means that the sign is present (the mother smokes, the mother has a history of high blood pressure, etc.). It costs nothing to add more informative labels to these variables. Ethnicity (race) is a special case: it is a qualitative variable, but R actually treats it as a numerical variable:

> summary(birthwt$race)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   1.000   1.000   1.847   3.000   3.000

The summary() command allows the distribution of numerical and categorical variables to be summarized, and applies equally to a single variable and to a data table. A way to recode the modalities of the qualitative variables in the birthwt table is as follows:

> yesno <- c("No", "Yes")
> birthwt$smoke <- factor(birthwt$smoke, labels = yesno)
> birthwt$race <- factor(birthwt$race, levels = c(1, 2, 3),
                         labels = c("White", "Black", "Other"))

The same approach will be used for low, ht and ui (using the labels yesno), and we will then verify with summary() that the final data are in the expected format. This command makes it possible to quickly identify the variables considered numerical (numerical summary with min, max, mean) or qualitative (table of counts):

> birthwt <- within(birthwt, {
    low <- factor(low, labels = yesno)
    ht <- factor(ht, labels = yesno)
    ui <- factor(ui, labels = yesno)
  })

Rather than repeating the name of the data frame followed by the name of the variable each time, the operations have been grouped using the within() command. It is unnecessary to specify the numeric codes (levels=) for these three variables since they are all binary.
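The recoding can be checked on a small scale. The sketch below rebuilds three of the variables from the six records displayed earlier by head(bt). Here levels= is given explicitly even for the binary variables, as a precaution that the text does not require: with labels= alone, factor() fails if one of the two codes happens to be absent from the data.

```r
# Six records as displayed earlier by head(bt); only three variables are kept
bt <- data.frame(race  = c(2, 3, 1, 1, 1, 3),
                 smoke = c(0, 0, 1, 1, 1, 0),
                 ui    = c(1, 0, 0, 1, 1, 0))
race_codes <- bt$race   # keep the numeric codes for the check below

yesno <- c("No", "Yes")
bt <- within(bt, {
  smoke <- factor(smoke, levels = c(0, 1), labels = yesno)
  ui    <- factor(ui, levels = c(0, 1), labels = yesno)
  race  <- factor(race, levels = c(1, 2, 3),
                  labels = c("White", "Black", "Other"))
})

# Cross-tabulating the numeric codes against the new labels is a quick way
# to confirm that each code was mapped to the intended label
table(code = race_codes, label = bt$race)
levels(bt$smoke)
```

A cross-tabulation of this kind should show non-zero counts only where a code meets its own label.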

At this point, here is what the recoding of variables gives when using the summary() command with the first five variables:

> summary(birthwt[, 1:5])
  low           age             lwt           race    smoke
 No :130   Min.   :14.00   Min.   : 80.0   White:96   No :115
 Yes: 59   1st Qu.:19.00   1st Qu.:110.0   Black:26   Yes: 74
           Median :23.00   Median :121.0   Other:67
           Mean   :23.24   Mean   :129.8
           3rd Qu.:26.00   3rd Qu.:140.0
           Max.   :45.00   Max.   :250.0

It is now possible to answer the following questions: what is the average weight of women who smoked during their pregnancy? How many cases of hypertension can be identified in women weighing more than 60 kg (knowing that the measurements in the table are expressed in pounds)? What is the minimum weight of babies with mothers who have not demonstrated interuterine pain? The following is a possible solution for each of these questions:

> mean(birthwt$lwt[birthwt$smoke == "Yes"])  ## weight in pounds
[1] 128.1351
> table(birthwt$ht[birthwt$lwt/2.2 > 60])  ## weight in kg
 No Yes
 55   7
> min(birthwt$bwt[birthwt$ui == "No"])
[1] 1135

From now on, rather than repeating the name of the data frame followed by $ to access one or more variables, as in the previous example, it is preferable to use the with() command, which indicates in which data frame the variables are to be looked up:

> with(birthwt, lwt[1:5])
[1] 182 155 105 108 107
> with(birthwt, mean(lwt[smoke == "Yes"]))
[1] 128.1351
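To make the equivalence concrete, the same conditional mean can be written in several ways. The sketch below uses a six-row data frame built from the records shown by head(bt). The subset() and tapply() variants are additions that the text does not use at this point.

```r
# Six records (mother's weight and smoking status) taken from head(bt)
bt <- data.frame(lwt   = c(182, 155, 105, 108, 107, 124),
                 smoke = factor(c(0, 0, 1, 1, 1, 0), levels = c(0, 1),
                                labels = c("No", "Yes")))

# Mean weight of smoking mothers, written in three equivalent ways
m1 <- mean(bt$lwt[bt$smoke == "Yes"])       # explicit $ notation
m2 <- with(bt, mean(lwt[smoke == "Yes"]))   # variables looked up in bt
m3 <- mean(subset(bt, smoke == "Yes")$lwt)  # rows filtered first

# Mean weight for every level of smoke in a single call
with(bt, tapply(lwt, smoke, mean))
```

The with() form is mostly a matter of readability: the expression is evaluated exactly as before, but variable names are searched for in the data frame first.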

1.7. Key points

– A variable will be represented in the form of a list of numbers or characters (single or multiple).

– The variables will generally be arranged in columns in a rectangular array (data frame).

– The usual arithmetic operations operate on each element (observation) of a variable.

– It is possible to index observations by their number or position in a list of values, or by using logical expressions.
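The last two points can be illustrated in a few lines. This is a minimal sketch using the first six mothers' weights (in pounds) shown earlier. The object names lwt and lwt_kg are chosen here for illustration.

```r
lwt <- c(182, 155, 105, 108, 107, 124)   # mothers' weights in pounds

lwt_kg <- lwt / 2.2     # the division is applied to each element
round(lwt_kg, 1)

lwt[c(1, 3)]            # indexing by position
lwt[lwt_kg > 60]        # indexing by a logical expression
which(lwt_kg > 60)      # positions of the selected observations
sum(lwt_kg > 60)        # number of observations meeting the condition
```

The logical expression lwt_kg > 60 itself returns one TRUE or FALSE per observation, which is why it can serve both as an index and, through sum(), as a counter.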

1.8. Going further

The book by Phil Spector [SPE08] provides additional information about importing external data sources (including the case of relational databases), merging multiple data sources with the merge() command, handling strings and the representation of dates in R. Other works are available online, free of cost or for a moderate price, for example R Programming for Data Science by Roger Peng (https://leanpub.com/rprogramming) or The Elements of Data Analytic Style by Jeff Leek (https://leanpub.com/datastyle).

1.9. Applications

1) Plasma viral load is used to describe the viral load (for example, HIV) in a blood sample. This viral marker, which allows the progression of infection and the effectiveness of treatments to be measured, is recorded as a number of copies per milliliter, and most measurement instruments have a detectability threshold of 50 copies/ml. Here follows a series of measurements, X, expressed in logarithms (base 10), collected on 20 patients:

3.64 2.27 1.43 1.77 4.62 3.04 1.01 2.14 3.02 5.62 5.51 5.51 1.01
1.05 4.19 2.63 4.34 4.85 4.02 5.92

As a reminder, a viral load of 100,000 copies/ml is equivalent to 5 log.

We want to answer the following:

a) Indicate how many patients have a viral load considered as non-detectable.

b) The researcher realizes that the value 3.04 corresponds to a data entry error and must be changed to 3.64. Similarly, she has a doubt about the seventh measurement and decides to consider it as a missing value. Perform the corresponding transformations.

c) What is the median viral load level in copies/ml, for the data considered valid?

First and foremost, it is necessary to express the detection limit (50 copies/ml) in logarithmic units; this is, in fact, equal to:

> log10(50)

Next, we need to filter the observations that do not verify the condition X > 1.70 (the exact numeric result will be used, not the approximate value):

> X <- c(3.64, 2.27, 1.43, 1.77, 4.62, 3.04, 1.01, 2.14, 3.02, 5.62, 5.51,
         5.51, 1.01, 1.05, 4.19, 2.63, 4.34, 4.85, 4.02, 5.92)
> length(X[X <= log10(50)])

To replace the observation equal to 3.04, a simple logical test suffices, since this value appears only once in the series of observations:

> X[X == 3.04] <- 3.64

Concerning the seventh observation, we will proceed in the same manner by updating the value of the observation with the corresponding index:

> X[7] <- NA

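Finally, the median viral load for patients with a measurement considered valid can be calculated as follows, the missing value introduced in the previous step being excluded with na.rm = TRUE:

> Xc <- X[X > log10(50)]
> round(median(10^Xc, na.rm = TRUE), 0)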
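Because a missing value propagates through comparisons, selections and summaries, it is worth seeing its effect on a short vector. The four values of v below are illustrative and are not the exercise data. The which() variant is an alternative that the text does not use.

```r
v <- c(3.64, NA, 1.43, 5.62)   # illustrative log10 values, one missing

v > log10(50)                  # the comparison is NA where v is missing
v[v > log10(50)]               # the NA is carried into the selection
median(v[v > log10(50)])       # and therefore into the summary

# Two ways of excluding the missing value
median(v[v > log10(50)], na.rm = TRUE)   # drop NA inside median()
median(v[which(v > log10(50))])          # which() keeps only TRUE positions
```

Either form gives the median of the detectable, non-missing values only; na.rm = TRUE is the more common idiom, while which() removes the NA at the selection step.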

2) The dosage.txt file contains a series of 15 bioassays, stored in numerical format with three decimal places as follows: 6.379 6.683 5.120 ...

– use scan to read these data (thoroughly read the online help regarding the use of this command, particularly the what= option);