Biostatistics and Computer-based Analysis of Health Data using R

Biostatistics and Health Science Set coordinated by Mounir Mesbah

Christophe Lalanne Mounir Mesbah

First published 2016 in Great Britain and the United States by ISTE Press Ltd and Elsevier Ltd

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:

ISTE Press Ltd
27-37 St George’s Road
London SW19 4EU
UK
www.iste.co.uk

Elsevier Ltd
The Boulevard, Langford Lane
Kidlington, Oxford, OX5 1GB
UK
www.elsevier.com

Notices

Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.

Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

For information on all our publications visit our website at http://store.elsevier.com/

The rights of Christophe Lalanne and Mounir Mesbah to be identified as the authors of this work have been asserted by them in accordance with the Copyright, Designs and Patents Act 1988.

British Library Cataloguing-in-Publication Data

A CIP record for this book is available from the British Library

Library of Congress Cataloging in Publication Data

A catalog record for this book is available from the Library of Congress

ISBN 978-1-78548-088-1

Printed and bound in the UK and US

A large number of the actions performed by means of statistical software amount to manipulating or even transforming digital data representing statistical data literally. It is therefore paramount that we understand how statistical data are represented and how they can be used by software such as R. After importing, recoding and the possible transformation of these data, the description of the variables of interest and the summary of their distribution in numerical and graphical form constitute a prior and fundamental step to any statistical modeling, hence the importance of these early stages in a statistical analysis project. In the second step, it is essential to fully control the commands that enable the calculation of the main measures of association in medical research and to know how to implement the conventional explanatory and predictive models: analysis of variance, linear and logistic regression and the Cox model. With few exceptions, using the R commands available during the installation of the software (base commands) is favored over the use of specialized commands in external R packages. The packages that must be installed to follow the applications presented in this book are listed in Chapter 1, in section 1.1.

This book assumes that the reader is already familiar with basic statistical concepts, particularly the calculation of central tendency and dispersion indicators for a continuous variable, contingency tables, analysis of variance and conventional regression models. The objective is to apply this knowledge using data sets described in numerous other works, even if the interpretation of the results remains minimal, and familiarize oneself quickly with the use of R with actual data. Emphasis is given to the management and manipulation of structured data, as this constitutes 60 to 80% of the work of the statistician. There are many works in French and in English on R, both from the technical and statistical point of view. Some of these works are oriented towards general aspects [SHA12], others are much more specialized [BIL14] or address more advanced concepts [HOT09]. The purpose of this book is to enable the reader to get accustomed to R so that he can perform his own analyses

the work of John Verzani, Paul Murrell and Frank Harrell [VER11, MUR05, HAR01].

Due to the design of the layout, some R outputs have been truncated or reformatted. Therefore, there could be differences when the reader attempts to reproduce the commands of this book.

An index of the R commands used in the illustrations is available at the end of the book.

R is more than a simple software program for statistics; it is a language for the manipulation of statistical data [IHA96, VEN02]. This partly explains its difficult, non-user-friendly approach for users accustomed to drop-down menus such as those offered by SPSS (although SPSS also offers a basic macro language). This chapter allows the reader to discover the elements of the language and to become familiarized with the mechanisms by which to represent statistical data in R. In the illustrations that follow, the R commands are prefixed with the symbol >, which designates the R console prompt. It is, therefore, unnecessary to copy this symbol to test the proposed instructions.

1.1. Before proceeding

1.1.1. Installing R

The installation of R is relatively easy and instructions can be found at the following website: http://cran.r-project.org. The software program is available for Windows, Linux and Mac. In addition to the R program, the installer provides an R script editor, online help and a set of base packages. The packages include commands specific to a certain area (graphics commands, modeling commands, etc.) and make it possible to enhance the base features of R.

1.1.2. RStudio

Although R is sufficient to start or perform statistical analyses, RStudio (www.rstudio.com) provides a particularly enjoyable working environment for R. It includes a powerful R script editor, a console in which the user can execute the commands (or send those inserted in the script editor), online help, a graphics browser, a data

anticipate that the R command script could be reused in the future by a third person. Comments are useful in this case. You can add the prefix # to a line of text so that it is treated as a comment and not as a statement by R. The commands in the R script should also be organized logically, effectively distinguishing the commands related to the importation of data, to univariate and bivariate descriptive statistics and to statistical models, etc. All this must be presented in very distinct sections and rely as little as possible on variables or tables of temporary data that would significantly alter the original data table, without any documentation or any storage to disk. The original data, as a matter of fact, should always be accessible at any point of the analysis. Intermediate data tables can be saved separately. In the case of large analysis projects, it is preferable to create several script files to store the commands corresponding to the different stages of the analysis.

1.2. Data representation in R

1.2.1. Management of numerical variables

Suppose that we have the weight of ten newborns available (variable x, in grams) and that of their mother (variable y, in kg), as illustrated in Table 1.1.

Table 1.1. Artificial data about weights at birth

In the following example, a variable called x was created in which the observations displayed in Table 1.1 are stored in the form of a simple list of numbers:

> x <- c(2523, 2551, 2557, 2594, 2600, 2622, 2637, 2637, 2663, 2665)

The symbol <- is being used (in preference to =) to associate a series of measurements to a defined variable. These values can now be displayed in R using the print(x) command or by typing the name of the variable:

> x
[1] 2523 2551 2557 2594 2600 2622 2637 2637 2663 2665

We will make use of the same procedure with y:

> y <- c(82.7, 70.5, 47.7, 49.1, 48.6, 56.4, 53.6, 46.8, 55.9, 51.4)

> var(x)
[1] 2376.767

As can be observed in the previous instructions, it is perfectly possible to combine two commands that return the same type of result, for example c(min(x), max(x)).

As in the case of creating a list of numbers, the command c() allows the results of commands returning numbers to be associated with the same list.
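As a brief illustration of combining results with c(), the sketch below collects several summaries of x into a single vector. The names extremes and spread are introduced here for illustration only and do not appear in the text.

```r
# Weights at birth (grams), as entered in the text
x <- c(2523, 2551, 2557, 2594, 2600, 2622, 2637, 2637, 2663, 2665)

# c() concatenates the two single-number results into one vector
extremes <- c(min(x), max(x))
print(extremes)

# range() is a base R shortcut that returns the same pair of values
print(range(x))

# Elements can be named, which makes the printed result easier to read
spread <- c(mean = mean(x), sd = sd(x))
print(spread)
```

Naming the elements is optional, but it avoids having to remember the order in which the summaries were combined.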

The following illustrations are based on the same idea as the division operation that is mentioned above: each operation (squaring of the values of x and centering these values on their average) is applied on a per-element basis.

> sum(x)
[1] 26049
> sum(x^2)
[1] 67876431
> sum(x^2 - mean(x))
[1] 67850382

In the previous example, mean(x) is a constant (which depends upon the data but that does not vary in this case) that we subtract from each x_i^2 (where i is the observation number or index in variable x), which amounts to 11 different values in total. In the expression x - x^2, on the other hand, the square of x is subtracted from each element of x (ten operations in total, with ten pairs of different numbers each time).
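The difference between subtracting a constant from each squared value and centering before squaring can be checked numerically. The following is a minimal sketch; the names n, a, b and ss are chosen here for illustration.

```r
x <- c(2523, 2551, 2557, 2594, 2600, 2622, 2637, 2637, 2663, 2665)
n <- length(x)

# mean(x) is a single number, recycled against each of the n squared values
a <- sum(x^2 - mean(x))
b <- sum(x^2) - n * mean(x)   # the same quantity, written out explicitly

# Centering BEFORE squaring gives the sum of squared deviations instead
ss <- sum((x - mean(x))^2)

print(c(a, b))
print(ss / (n - 1))   # sample variance computed by hand
print(var(x))         # agrees with the line above
```

The parentheses matter: x^2 - mean(x) and (x - mean(x))^2 are both evaluated element by element, but only the second corresponds to the numerator of the variance.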

The commands sort(), order() and rank() allow the elements of a variable to be sorted or work with the ranks of the observations, that is to say with their position in the variable.
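The three commands are easily confused, so a short sketch on the weight data may help. It assumes the vectors x and y as entered earlier; the name x_by_y is hypothetical.

```r
x <- c(2523, 2551, 2557, 2594, 2600, 2622, 2637, 2637, 2663, 2665)
y <- c(82.7, 70.5, 47.7, 49.1, 48.6, 56.4, 53.6, 46.8, 55.9, 51.4)

# sort() returns the values themselves, in increasing order
print(sort(y))

# order() returns the positions that would sort the variable,
# so y[order(y)] gives the same result as sort(y)
print(order(y))

# order() on one variable can be used to rearrange another one:
# babies' weights listed from the lightest to the heaviest mother
x_by_y <- x[order(y)]
print(x_by_y)

# rank() returns the rank of each observation, in the original order;
# tied values (the two babies weighing 2637 g) share an average rank
print(rank(x))
```

By default, rank() assigns the mean of the tied ranks; other tie-handling rules are available through its ties.method argument.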

1.2.3. Management of categorical variables

Qualitative or categorical variables have their own distinct status in most statistical packages: their modalities are represented by numbers (1, 2, 3, etc.) but generally they are associated with identifiers, called labels in R. In the previous example, data about the weight of babies at birth (x) and the weight of their mother (y) were available. Suppose that it is also known if the mother was smoking during the first trimester of her pregnancy, and let this variable be called z. When the mother did not smoke during this period, the variable is equal to 1; when the mother was smoking, the variable is equal to 2. The augmented data of variable z are presented in Table 1.2.

Table 1.2. Augmented artificial data about weights at birth

The data will be entered in R as was done for the variables x and y:

> z <- c(1, 1, 2, 2, 2, 1, 1, 1, 2, 2)
> z
[1] 1 1 2 2 2 1 1 1 2 2

The factor(z) command allows this numerical variable to be converted into a categorical variable. To associate labels to numerical codes, the labels= option is used. The levels= option enables the numerical codes of the levels of the variable to be specified. The value 1 will be associated with the NS modality and the value 2 with the S modality:

> factor(z, levels = c(1, 2), labels = c("NS", "S"))
[1] NS NS S S S NS NS NS S S
Levels: NS S

When R displays the contents of variable z, it is effectively a variable of the factor type, with two levels, which are unordered here: NS and S. Some operations, such as the calculation of the arithmetic mean, do not generally make any sense with this type of variable, and this is also the case in R. Simple or crossed tabulation operations, however, remain entirely valid. Another important aspect: the labels associated with a qualitative variable can be recovered by using the levels command, while nlevels returns the total number of modalities:

> z <- factor(z, labels = c("NS", "S"))
> levels(z)
[1] "NS" "S"
> nlevels(z)
[1] 2
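The behaviour described above can be verified with a few lines. This is a sketch only: the names zf and m are introduced for illustration, and the warning issued by mean() is silenced on purpose so that the returned value can be inspected.

```r
z <- c(1, 1, 2, 2, 2, 1, 1, 1, 2, 2)

# Explicit levels= and labels= make the code-to-label mapping unambiguous
zf <- factor(z, levels = c(1, 2), labels = c("NS", "S"))

print(levels(zf))      # the labels
print(nlevels(zf))     # the number of modalities
print(table(zf))       # tabulation remains valid for a factor

# Internally, a factor stores integer codes that point to the labels
print(as.integer(zf))

# The arithmetic mean is not defined for a factor: R warns and returns NA
m <- suppressWarnings(mean(zf))
print(m)
```

Keeping the factor in a new variable (zf) rather than overwriting z leaves the original numerical codes available for checking.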

1.2.4. Manipulation of categorical variables

It is often necessary to recode a k-class qualitative variable into j < k classes or associate labels with the numerical modalities of a qualitative variable. For example, suppose that data on forced expiratory volume in 1 second (FEV1) (variable fev1) from ten patients are provided in the form of a three-tier ordered variable: critical (1), low (2) and normal (3). Here, the sample() command will be used, which allows random data to be generated by resampling from a list of values. This is an R command which expects three options: the first shows the list of permissible values, the second the number of observations to be generated and the last the type of random draw that is sought: with (replace=TRUE) or without (replace=FALSE) replacement:

> fev1 <- sample(1:3, 10, replace = TRUE)
> fev1
[1] 3 2 1 3 3 3 2 2 1 2

Note that the values are random and will not necessarily be identical when this command is run in another R session. To obtain the same values every time, one should call the set.seed() command beforehand, which ensures the reproducibility of these simulations.
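A minimal sketch of this use of set.seed() follows. The seed value 101 is arbitrary and chosen here for illustration; it will not reproduce the values printed in the text, which depend on the seed and R version used by the authors.

```r
# Fixing the seed before the draw makes the result reproducible
set.seed(101)   # arbitrary seed, an assumption of this example
draw1 <- sample(1:3, 10, replace = TRUE)

# Resetting the same seed and repeating the call gives the same values
set.seed(101)
draw2 <- sample(1:3, 10, replace = TRUE)

print(draw1)
print(identical(draw1, draw2))
```

The seed must be set immediately before the random draw; any other random number generated in between would change the result.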

First, the numerical values will be replaced with the labels described previously (1 = critical, 2 = low and 3 = normal):

> fev1 <- factor(fev1, levels = c(1, 2, 3),
                 labels = c("critical", "low", "normal"))
> fev1
[1] normal low critical normal normal normal low
[8] low critical low
Levels: critical low normal

The distribution of the counts by class can be verified quickly using the table() command which enables simple or cross-tabulations that will be discussed in Chapter 2:

> table(fev1)
fev1
critical      low   normal
       2        4        4

If the aim is to recode the three-class variable fev1 into a two-class variable by aggregating the first two modalities or levels, the levels() command can be used as follows:

> levels(fev1)
[1] "critical" "low" "normal"
> levels(fev1)[1:2] <- "critical or low"
> levels(fev1)
[1] "critical or low" "normal"
> table(fev1)
fev1
critical or low          normal
              6               4

It should be observed that the modalities of the variable fev1 have effectively been modified and the frequencies of the first two original levels can now be found in the first level as critical or low. It is advisable to always verify that the recoding operations of the levels of a categorical variable have been correctly carried out as expected, using the levels() or table() commands.
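One way to carry out the recommended verification is to recode a copy of the variable and cross-tabulate the old levels against the new ones. The sketch below hard-codes the ten values printed in the text instead of drawing them at random; the name fev2 is hypothetical.

```r
# The ten values shown in the text, entered directly for reproducibility
fev1 <- c(3, 2, 1, 3, 3, 3, 2, 2, 1, 2)
fev1 <- factor(fev1, levels = c(1, 2, 3),
               labels = c("critical", "low", "normal"))

# Recode a copy so that the original three-level variable is preserved
fev2 <- fev1
levels(fev2)[1:2] <- "critical or low"

print(levels(fev2))
print(table(fev2))

# Cross-tabulating old against new levels shows where each count went
print(table(fev1, fev2))
```

Every original level should fall into exactly one new level in the cross-tabulation, which makes a recoding mistake immediately visible.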

1.3. Selection of observations

1.3.1. Index-based selection

Considering the weight data discussed above, one could select more than one observation, for example the third and the fifth:

> x[c(3, 5)]
[1] 2557 2600

or from the third to the fifth:

> x[3:5]
[1] 2557 2594 2600

The slightly peculiar syntax 3:5 actually designates the sequence of integers starting at 3 and ending at 5. This is, therefore, strictly equivalent to c(3,4,5). Thus to access specific values of a variable, it suffices to provide a list of observation numbers (or index). The same indexing principle applies of course to qualitative variables:

> fev1[2]
[1] critical or low
Levels: critical or low normal

1.3.2. Criterion-based selection

Suppose now that we want to obtain the weight of babies whose mother weighs less than 50 kg:

> x[y < 50]
[1] 2557 2594 2600 2637

The previous command reads: select the values of x for which the (logical) condition y < 50 holds. This condition can be directly visualized by typing y < 50 in R, which yields the following result:

> y < 50
[1] FALSE FALSE TRUE TRUE TRUE FALSE FALSE TRUE FALSE FALSE

This selection principle based on an external criterion remains valid when the criterion is a categorical variable. To obtain the list of weights of infants whose mother is a non-smoker, the following command should be entered:

> x[z == "NS"]
[1] 2523 2551 2622 2637 2637

The double equal sign (==) is being used to designate the logical condition "z equal to NS". Finally, it is possible to combine criteria relating to variables of the same type or of different types, mainly using the logical operators & (and) and | (or). The following command returns the values of x such that z is equal to NS and y <= 55 (weight of babies whose mother does not smoke and weighs 55 kg or less):

> x[z == "NS" & y <= 55]
[1] 2637 2637
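Storing the logical condition in a variable makes it easier to inspect before using it for selection. The sketch below assumes the three variables as defined in the text; the names sel and either are hypothetical.

```r
x <- c(2523, 2551, 2557, 2594, 2600, 2622, 2637, 2637, 2663, 2665)
y <- c(82.7, 70.5, 47.7, 49.1, 48.6, 56.4, 53.6, 46.8, 55.9, 51.4)
z <- factor(c(1, 1, 2, 2, 2, 1, 1, 1, 2, 2), labels = c("NS", "S"))

# Both conditions must hold (&): non-smoking mother weighing 55 kg or less
sel <- z == "NS" & y <= 55
print(which(sel))   # positions of the selected observations
print(sum(sel))     # TRUE counts as 1, so this is the number selected
print(x[sel])

# At least one condition must hold (|): non-smoker OR weight under 50 kg
either <- z == "NS" | y < 50
print(x[either])
```

Counting with sum() works because R treats TRUE as 1 and FALSE as 0 in arithmetic.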

1.4. Representation and processing of missing values

In reality, it is rare for full data (without missing values) to be available. Regardless of how missing values are to be processed statistically, it is important to ensure that they are correctly represented as such by the statistical software.

Returning to the previous example, suppose that the third observation for the weight of babies is actually missing or that we want to process it as such. This datum is represented by a dot (.) in Table 1.3.

Table 1.3. Artificial data about weight at birth including the missing datum

This missing value could have been represented by the lack of value (such as a blank cell in Microsoft Excel) or by any other symbol. Using the same approach as when variable x was set for the first time, a way of capturing the weight of the babies would be to replace the missing value with the term NA, which is the term reserved by R to encode the missing values:

> c(2523, 2551, NA, 2594, 2600, 2622, 2637, 2637, 2663, 2665)

Since variable x has already been entered, it can simply be updated, in this case replacing the third element with NA:

> x[3] <- NA
> x
[1] 2523 2551 NA 2594 2600 2622 2637 2637 2663 2665

Care should be taken, as the command length(x) still returns 10: there are effectively ten observations, but one of them is missing. The command is.na() makes it possible to verify the missing values in a variable:

> is.na(x)
[1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> which(is.na(x))
[1] 3

As seen in the case of the observation selection, the values returned by the command is.na() are Boolean values equal to true (TRUE) when the value of x is missing (NA), and false (FALSE) otherwise. This command will therefore return as many values as variable x contains. To facilitate the identification of the position of the missing observations, it may be preferable to combine is.na() with the command which(). It should be noted that commands can be combined in an intuitive way: the which() command uses the result returned by is.na(), without having to create a dummy variable to store the result. An alternative consists of flagging the so-called complete cases with the command complete.cases:

> complete.cases(x)
[1] TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Using the principle of observation selection used previously, the following command will thus return the values of y for which x has a non-missing value:

> y[complete.cases(x)]
[1] 82.7 70.5 49.1 48.6 56.4 53.6 46.8 55.9 51.4
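Missing values also affect numerical summaries. The following sketch shows how they can be counted and how they propagate through mean(); the na.rm= option used here is a standard base R argument that is not discussed in the passage above.

```r
x <- c(2523, 2551, NA, 2594, 2600, 2622, 2637, 2637, 2663, 2665)

# Number of missing and of complete observations
n_missing  <- sum(is.na(x))
n_complete <- sum(complete.cases(x))
print(c(missing = n_missing, complete = n_complete))

# By default, a single NA makes the result of mean() missing as well
print(mean(x))

# na.rm = TRUE removes the missing values before the calculation
m <- mean(x, na.rm = TRUE)
print(m)

# Equivalent: select the non-missing values explicitly
print(mean(x[!is.na(x)]))
```

Returning NA by default is a deliberate safeguard: it forces the analyst to decide explicitly how missing data should be handled.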

1.5. Importing and storing data

1.5.1. Univariate data

The three sets of ten measures discussed in the preceding sections were of a sufficiently moderate size to allow their direct entry in R. However, in most cases, the data will have already been stored in an external file and the first step usually consists of importing them into R.

For example, here are the contents of the file poids.dat, which is a simple text file that can be visualized with any text editor:

2523 2551 2557 2594 2600 2622 2637 2637 2663 2665

In a simple case where only one set of measurements has to be imported, the scan() command is sufficient. It takes the location of the data file as its argument. The term “location” designates the precise location of the file in the tree structure of the computer’s file system. Here, it will be assumed that the file is located in the working directory (which can be changed with the setwd() command or using RStudio utilities):

> x <- scan("poids.dat")
Read 10 items
> head(x, n = 3)
[1] 2523 2551 2557

The head() command enables the first observations of a variable to be displayed. The option n= enables their number to be specified (by default, n=6).
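For readers who do not have the data file at hand, the import step can be tried on a temporary file. In the sketch below, the temporary file is a hypothetical stand-in for poids.dat, and the names tmp, w and w_int are chosen for illustration.

```r
# Create a temporary file containing the ten weights (stand-in for poids.dat)
tmp <- tempfile(fileext = ".dat")
writeLines("2523 2551 2557 2594 2600 2622 2637 2637 2663 2665", tmp)

# Checking that the file exists avoids a confusing error if the path is wrong
stopifnot(file.exists(tmp))

# quiet = TRUE suppresses the "Read 10 items" message
w <- scan(tmp, quiet = TRUE)
print(length(w))
print(head(w, n = 3))

# By default scan() reads doubles; what= sets the expected type
w_int <- scan(tmp, what = integer(), quiet = TRUE)
print(class(w_int))

unlink(tmp)   # remove the temporary file
```

With a real file, getwd() shows the current working directory, which helps to diagnose a file that R cannot find.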

1.5.2. Multivariate data

In more complex, more realistic cases, data are multivariate with the variables generally organized in columns and the observations in lines [WIC14]: each line of the file, therefore, represents a statistical unit in which several measurements or data points (variables or fields) have been collected, the latter being separated one from another by commas, semicolons, spaces or tabs. In all cases, the command to use is read.table() (the field delimiter is a space or a tab) or one of its shortcuts: read.csv() (comma-type separator) and read.csv2() (semicolon-type separator).

Consider the file birthwt.dat, whose first three lines are:

0 19 182 2 0 0 0 1 0 2523
0 33 155 3 0 0 0 0 3 2551
0 20 105 1 1 0 0 0 1 2557

There are ten columns (that is ten variables) and each value is separated from the next by a space. The name of the variables does not appear anywhere, but we know that they are the following variables: weight status of the baby at birth low (= 1 if weight < 2.5 kg, 0 otherwise), age of the mother (years), lwt weight of the mother (in pounds), race ethnicity of the mother (encoded in three classes, 1 = white, 2 = black and 3 = other), smoke (= 1 if consumption of tobacco during pregnancy, 0 otherwise), ptl (number of previous premature labours), ht (= 1 if history of hypertension, 0 otherwise), ui (= 1 if manifestation of interuterine pain, 0 otherwise), ftv (number of consultations with a gynecologist during the first trimester of the pregnancy), bwt for the weight of the babies at birth (in grams). Here is how these data can be imported in R, always assuming that the data file is located in the working directory of R:

> bt <- read.table("birthwt.dat", header = FALSE)
> varnames <- c("low", "age", "lwt", "race", "smoke", "ptl", "ht",
                "ui", "ftv", "bwt")
> names(bt) <- varnames
> head(bt)
  low age lwt race smoke ptl ht ui ftv  bwt
1   0  19 182    2     0   0  0  1   0 2523
2   0  33 155    3     0   0  0  0   3 2551
3   0  20 105    1     1   0  0  0   1 2557
4   0  21 108    1     1   0  0  1   2 2594
5   0  18 107    1     1   0  0  1   0 2600
6   0  21 124    3     0   0  0  0   0 2622

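If the data file had had the following form:

0,19,182,2,0,0,0,1,0,2523
0,33,155,3,0,0,0,0,3,2551
0,20,105,1,1,0,0,0,1,2557

which is typically the export format offered by a spreadsheet program such as Excel, then the command read.table() could have simply been replaced by read.csv() (or the option sep= of read.table() could have been changed). In both cases the header=FALSE option must be kept, since read.csv() assumes by default that the first line contains the variable names.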
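The equivalence between these import commands can be checked without any external file by passing the lines through the text= argument of read.table(). This is a sketch: the names txt_space, txt_comma, bt1, bt2 and bt3 are hypothetical, and col.names= is used here as a shortcut for the separate names() assignment.

```r
varnames <- c("low", "age", "lwt", "race", "smoke", "ptl", "ht",
              "ui", "ftv", "bwt")

# The three lines shown in the text, space-separated and comma-separated
txt_space <- "0 19 182 2 0 0 0 1 0 2523
0 33 155 3 0 0 0 0 3 2551
0 20 105 1 1 0 0 0 1 2557"
txt_comma <- gsub(" ", ",", txt_space)

# Space-separated values: the read.table() defaults are sufficient
bt1 <- read.table(text = txt_space, header = FALSE, col.names = varnames)

# Comma-separated values: read.csv(), or read.table() with sep = ","
# header = FALSE is required because read.csv() expects a header by default
bt2 <- read.csv(text = txt_comma, header = FALSE, col.names = varnames)
bt3 <- read.table(text = txt_comma, header = FALSE, sep = ",",
                  col.names = varnames)

print(bt1)
print(all.equal(bt1, bt2))
print(all.equal(bt2, bt3))
```

Calling str() on the imported table is a quick way to confirm that each column has been read with the expected type.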

1.5.3. Storing the data in an external file

In the univariate case as in the multivariate case, the data processed in R can be exported using write.table() or write.csv() to create text files of the same kind as those discussed above. For example:

> write.csv(bt, file = "bt.csv")

will save the data imported with the read.table() command in the form of a file where the values of the variables are separated by commas, and it will be possible to open the bt.csv file with a spreadsheet program such as Microsoft Excel.

If data must be directly saved in the R format, save(bt, file="bt.RData") will have to be used, RData (or rda) constituting the extension reserved for R database files.
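Both storage formats can be tested with a round trip: write the data, read them back and compare. The sketch below uses temporary files and a small three-row table built from the values shown earlier; the names csv_file, rda_file, bt_csv and loaded are hypothetical. The row.names = FALSE option is an addition of this example: without it, write.csv() also writes the row numbers as an extra first column.

```r
bt <- data.frame(low = c(0, 0, 0), age = c(19, 33, 20),
                 bwt = c(2523, 2551, 2557))

csv_file <- tempfile(fileext = ".csv")
rda_file <- tempfile(fileext = ".RData")

# Text export, then re-import
write.csv(bt, file = csv_file, row.names = FALSE)
bt_csv <- read.csv(csv_file)
print(bt_csv)

# Export in R format; load() restores the object under its original name
save(bt, file = rda_file)
rm(bt)                     # remove the table from the session
loaded <- load(rda_file)   # returns the names of the restored objects
print(loaded)
print(bt)

unlink(c(csv_file, rda_file))   # remove the temporary files
```

Unlike read.csv(), load() does not return the data themselves: the saved objects reappear in the workspace, and only their names are returned.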

> data(birthwt, package = "MASS")
> c(nrow(birthwt), ncol(birthwt))
[1] 189  10
> names(birthwt)
[1] "low" "age" "lwt" "race" "smoke" "ptl" "ht"
[8] "ui" "ftv" "bwt"
> head(birthwt, n = 2)
   low age lwt race smoke ptl ht ui ftv  bwt
85   0  19 182    2     0   0  0  1   0 2523
86   0  33 155    3     0   0  0  0   3 2551

The birthwt variable is a table (data frame) comprising ten variables, each comprising 189 observations. Therefore, information on 189 statistical units is available in this prospective study. To access the values of a particular variable, the following notation will be used: table name (birthwt) followed by the name of the variable prefixed with the dollar ($) sign. For example, the first five values for the weight of babies (bwt) are thus:

> birthwt$bwt[1:5]
[1] 2523 2551 2557 2594 2600

The meaning of each variable is given above. It is known that some variables are strictly binary (0/1), such as low, smoke, ht and ui, while the other variables are numeric, either with discrete values such as ftv or with values assumed to be continuous such as lwt or bwt. We know that in the case of binary variables, a value of 1 means that the sign is present (the mother smokes, the mother has a history of high blood pressure, etc.). It costs nothing to add more informative labels to these variables. Ethnicity (race) is more specific because this is a qualitative variable but is actually processed by R as a numerical variable:

> summary(birthwt$race)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   1.000   1.000   1.847   3.000   3.000

The summary() command allows the distribution of numerical and categorical variables to be summarized, and applies equally to a single variable as to a data table. A way to recode the modalities of the qualitative variables for the birthwt table is as follows:

> yesno <- c("No", "Yes")
> birthwt$smoke <- factor(birthwt$smoke, labels = yesno)
> birthwt$race <- factor(birthwt$race, levels = c(1, 2, 3),
                         labels = c("White", "Black", "Other"))
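The effect of this recoding on summary() can be seen on a small table. The sketch below does not load the full data set; it uses a hypothetical data frame d containing only the race and smoke values of the six rows displayed earlier.

```r
# race and smoke for the six rows shown in the text
d <- data.frame(race  = c(2, 3, 1, 1, 1, 3),
                smoke = c(0, 0, 1, 1, 1, 0))

# Before recoding, race is summarized as if it were a measurement
print(summary(d$race))

yesno <- c("No", "Yes")
d$smoke <- factor(d$smoke, levels = c(0, 1), labels = yesno)
d$race  <- factor(d$race, levels = c(1, 2, 3),
                  labels = c("White", "Black", "Other"))

# After recoding, summary() returns the count of each modality
race_counts <- summary(d$race)
print(race_counts)

# Factors can then be cross-tabulated with readable labels
print(table(d$race, d$smoke))
```

Specifying levels= together with labels= guards against a level being absent from the data, in which case the labels would otherwise be assigned to the wrong codes.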

As a reminder, a viral load of 100,000 copies/ml is equivalent to 5 log.

We want to answer the following:

a) Indicate how many patients have a viral load considered as non-detectable.

b) The researcher realizes that the value 3.04 corresponds to a data entry error and must be changed to 3.64. Similarly, she has a doubt about the seventh measurement and decides to consider it as a missing value. Perform the corresponding transformations.

c) What is the median viral load level in copies/ml, for the data considered valid?

First and foremost, it is necessary to express the detection limit (50 copies/ml) in logarithmic units; this is, in fact, equal to:

> log10(50)

Next, we need to filter the observations that do not satisfy the condition X > 1.70 (the exact numeric result will be used, not the approximate value):

> X <- c(3.64, 2.27, 1.43, 1.77, 4.62, 3.04, 1.01, 2.14, 3.02, 5.62, 5.51,
         5.51, 1.01, 1.05, 4.19, 2.63, 4.34, 4.85, 4.02, 5.92)
> length(X[X <= log10(50)])

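To replace the observation equal to 3.04, we will use a simple logical test, which is safe here because this value is not duplicated in the series of observations:

> X[X == 3.04] <- 3.64

Concerning the seventh observation, we will proceed in the same manner by updating the value of the observation with the corresponding index:

> X[7] <- NA

Finally, the median viral load for patients with a measurement considered valid can be calculated as follows (the option na.rm=TRUE is needed because the comparison X > log10(50) yields NA for the missing seventh observation, which is therefore carried over into Xc):

> Xc <- X[X > log10(50)]
> round(median(10^Xc, na.rm = TRUE), 0)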
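The three steps of this exercise can be gathered in a single script, which also shows how the missing value behaves during selection. This is one possible way of writing the solution, not necessarily the authors'; the names limit, nondetect, naive and med are hypothetical.

```r
X <- c(3.64, 2.27, 1.43, 1.77, 4.62, 3.04, 1.01, 2.14, 3.02, 5.62, 5.51,
       5.51, 1.01, 1.05, 4.19, 2.63, 4.34, 4.85, 4.02, 5.92)
limit <- log10(50)   # detection limit in log10 units

# a) number of non-detectable viral loads (TRUE counts as 1)
nondetect <- sum(X <= limit)
print(nondetect)

# b) correct the data entry error and flag the seventh value as missing
X[X == 3.04] <- 3.64
X[7] <- NA

# Selecting with a condition that is NA keeps an NA in the result,
# so the median of this subset is itself missing
naive <- median(10^X[X > limit])
print(naive)

# c) exclude the missing value explicitly, then back-transform to copies/ml
Xc <- X[!is.na(X) & X > limit]
med <- round(median(10^Xc), 0)
print(length(Xc))
print(med)
```

Filtering with !is.na(X) and calling median() with na.rm = TRUE lead to the same result; the explicit filter has the advantage that length(Xc) then gives the true number of valid measurements.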

2) The dosage.txt file contains a series of 15 bioassays, stored in numerical format with three decimal places as follows: 6.379 6.683 5.120 ...

– use scan to read these data (thoroughly read the online help regarding the use of this command, particularly the what= option);