Biostatistics and Computer-based Analysis of Health Data using R
Biostatistics and Health Science Set coordinated by Mounir Mesbah


First published 2016 in Great Britain and the United States by ISTE Press Ltd and Elsevier Ltd
Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:
ISTE Press Ltd
27-37 St George’s Road
London SW19 4EU
UK
www.iste.co.uk

Elsevier Ltd
The Boulevard, Langford Lane
Kidlington, Oxford, OX5 1GB
UK
www.elsevier.com
Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.
Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.
To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.
For information on all our publications visit our website at http://store.elsevier.com/
© ISTE Press Ltd 2016
The rights of Christophe Lalanne and Mounir Mesbah to be identified as the authors of this work have been asserted by them in accordance with the Copyright, Designs and Patents Act 1988.
British Library Cataloguing-in-Publication Data
A CIP record for this book is available from the British Library
Library of Congress Cataloging in Publication Data
A catalog record for this book is available from the Library of Congress
ISBN 978-1-78548-088-1
Printed and bound in the UK and US
A large number of the actions performed by means of statistical software amount to manipulating or even transforming digital data that literally represent statistical data. It is therefore paramount that we understand how statistical data are represented and how they can be used by software such as R. After importing, recoding and the possible transformation of these data, the description of the variables of interest and the summary of their distribution in numerical and graphical form constitute a prior and fundamental step to any statistical modeling, hence the importance of these early stages in a statistical analysis project. In the second step, it is essential to fully control the commands that enable the calculation of the main measures of association in medical research and to know how to implement the conventional explanatory and predictive models: analysis of variance, linear and logistic regression and the Cox model. With few exceptions, using the R commands available upon installation of the software (base commands) is favored over the use of specialized commands in external R packages. The packages that must be installed to follow the applications presented in this book are listed in Chapter 1, in section 1.1.
This book assumes that the reader is already familiar with basic statistical concepts, particularly the calculation of central tendency and dispersion indicators for a continuous variable, contingency tables, analysis of variance and conventional regression models. The objective is to apply this knowledge using data sets described in numerous other works, even if the interpretation of the results remains minimal, and to become familiar quickly with the use of R on actual data. Emphasis is given to the management and manipulation of structured data, as this constitutes 60 to 80% of the work of the statistician. There are many works in French and in English on R, both from the technical and the statistical point of view. Some of these works are oriented towards general aspects [SHA12], others are much more specialized [BIL14] or address more advanced concepts [HOT09]. The purpose of this book is to enable the reader to get accustomed to R so that he can perform his own analyses
the work of John Verzani, Paul Murrell and Frank Harrell [VER11, MUR05, HAR01].
Due to the design of the layout, some R outputs have been truncated or reformatted. Therefore, there could be differences when the reader attempts to reproduce the commands of this book.
An index of the R commands used in the illustrations is available at the end of the book.
R is more than a simple software program for statistics; it is a language for the manipulation of statistical data [IHA96, VEN02]. This partly explains why it can seem difficult and unfriendly to users accustomed to drop-down menus such as those offered by SPSS (although SPSS also offers a basic macro language). This chapter allows the reader to discover the elements of the language and to become familiar with the mechanisms used to represent statistical data in R. In the illustrations that follow, the R commands are prefixed with the symbol >, which designates the R console prompt. It is, therefore, unnecessary to copy this symbol when testing the proposed instructions.
1.1. Before proceeding
1.1.1. Installing R
The installation of R is relatively easy and instructions can be found at the following website: http://cran.r-project.org. The software program is available for Windows, Linux and Mac. In addition to the R program, the installer provides an R script editor, online help and a set of base packages. The packages include commands specific to a certain area (graphics commands, modeling commands, etc.) and make it possible to enhance the base features of R.
1.1.2. RStudio
Although R is sufficient to start or perform statistical analyses, RStudio (www.rstudio.com) provides a particularly enjoyable working environment for R. It includes a powerful R script editor, a console in which the user can execute the commands (or send those inserted in the script editor), online help, a graphics browser, a data
anticipate that the R command script could be reused in the future by a third person. Comments are useful in this case. You can add the prefix # to a line of text so that it is treated as a comment and not as a statement by R. The commands in the R script should also be organized logically, clearly distinguishing the commands related to the importation of data, to univariate and bivariate descriptive statistics, to statistical models, etc. All this must be presented in very distinct sections and rely as little as possible on variables or temporary data tables that would significantly alter the original data table without any documentation or any storage to disk. The original data, as a matter of fact, should always be accessible at any point of the analysis. Intermediate data tables can be saved separately. In the case of large analysis projects, it is preferable to create several script files to store the commands corresponding to the different stages of the analysis.
1.2. Data representation in R
1.2.1. Management of numerical variables
Suppose that we have the weight of ten newborns available (variable x, in grams) and that of their mother (variable y, in kg), as illustrated in Table 1.1.
Table 1.1. Artificial data about weights at birth
In the following example, a variable called x was created in which the observations displayed in Table 1.1 are stored in the form of a simple list of numbers:
> x <- c(2523, 2551, 2557, 2594, 2600, 2622, 2637, 2637, 2663, 2665)
The symbol <- is used (in preference to =) to assign a series of measurements to a named variable. These values can now be displayed in R using the print(x) command or by typing the name of the variable:
> x
 [1] 2523 2551 2557 2594 2600 2622 2637 2637 2663 2665
We will make use of the same procedure with y:
> y <- c(82.7, 70.5, 47.7, 49.1, 48.6, 56.4, 53.6, 46.8, 55.9, 51.4)
> var(x)
[1] 2376.767
As can be observed in the previous instructions, it is perfectly possible to combine two commands that return the same type of result, for example c(min(x), max(x)).
As in the case of creating a list of numbers, the command c() allows the results of commands returning numbers to be gathered in the same list.
The following illustrations are based on the same idea as the division operation that is mentioned above: each operation (squaring of the values of x and centering these values on their average) is applied on a per-element basis.
> sum(x)
[1] 26049
> sum(x^2)
[1] 67876431
> sum(x^2 - mean(x))
[1] 67850382
In the previous example, mean(x) is a constant (which depends upon the data but does not vary in this case) that we subtract from each $x_i^2$ (where i is the observation number or index in variable x), which amounts to 11 different values in total. In the expression x-x^2, on the other hand, the square of x is subtracted from each element of x (ten operations in total, with ten pairs of different numbers each time).
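As a minimal sketch of this element-wise arithmetic, the sample variance can be rebuilt by hand and compared with var(). The names n, centered and v_manual are introduced here for illustration only.

```r
# Birth weights (in grams) from Table 1.1
x <- c(2523, 2551, 2557, 2594, 2600, 2622, 2637, 2637, 2663, 2665)

n <- length(x)                          # number of observations
centered <- x - mean(x)                 # the constant mean(x) is subtracted from each element
v_manual <- sum(centered^2) / (n - 1)   # sample variance computed by hand

# The hand-made value agrees with the built-in command
all.equal(v_manual, var(x))

# Operator precedence: x^2 - mean(x) squares each element first, then subtracts the mean
all.equal(sum(x^2 - mean(x)), sum(x^2) - n * mean(x))
```

Comparing with all.equal() rather than == avoids spurious differences due to floating-point rounding.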
The commands sort(), order() and rank() allow the elements of a variable to be sorted or make it possible to work with the ranks of the observations, that is to say with their position in the variable.
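The difference between these three commands is easiest to see on a short, deliberately unsorted series. The following sketch uses five of the weights from Table 1.1 together with the matching maternal weights.

```r
x <- c(2600, 2523, 2637, 2551, 2637)   # a few birth weights, deliberately unsorted
y <- c(48.6, 82.7, 53.6, 70.5, 46.8)   # matching maternal weights (kg)

sort(x)    # the values themselves, in increasing order
order(x)   # the indices that would sort x
rank(x)    # the rank of each observation; tied values share the average rank by default

# order() makes it possible to reorder a second variable consistently with x
y[order(x)]
```

Note that sort(x) and x[order(x)] return the same values; order() is the more general tool because its result can be reused on other variables.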
1.2.3. Management of categorical variables
Qualitative or categorical variables have their own distinct status in most statistical packages: their modalities are represented by numbers (1, 2, 3, etc.) but generally they are associated with identifiers, called labels in R. In the previous example, data about the weight of babies at birth (x) and the weight of their mother (y) were available. Suppose that it is also known whether the mother smoked during the first trimester of her pregnancy, and let this variable be called z. When the mother did not smoke during this period, the variable is equal to 1; when the mother smoked, the variable is equal to 2. The augmented data including variable z are presented in Table 1.2.
Table 1.2. Augmented artificial data about weights at birth
The data will be entered in R as was done for the variables x and y:
> z <- c(1, 1, 2, 2, 2, 1, 1, 1, 2, 2)
> z
 [1] 1 1 2 2 2 1 1 1 2 2
The factor(z) command allows this numerical variable to be converted into a categorical variable. To associate labels with the numerical codes, the labels= option is used. The levels= option enables the numerical codes of the levels of the variable to be specified. The value 1 will be associated with the NS modality and the value 2 with the S modality:
> factor(z, levels=c(1,2), labels=c("NS","S"))
 [1] NS NS S  S  S  NS NS NS S  S
Levels: NS S
When R displays the result of this command, it is effectively a variable of the factor type, with two levels, NS and S, which are unordered here. Some operations, such as the calculation of the arithmetic mean, do not generally make any sense with this type of variable, and R treats them accordingly. Simple or crossed tabulation operations however remain entirely valid. Another important aspect: the labels associated with a qualitative variable can be recovered by using the levels() command, while nlevels() returns the total number of modalities:
> z <- factor(z, labels=c("NS","S"))
> levels(z)
[1] "NS" "S"
> nlevels(z)
[1] 2
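A short sketch may help to see what remains meaningful once a variable is a factor. The name zf is introduced here for illustration, so that the original numerical codes stay available.

```r
z  <- c(1, 1, 2, 2, 2, 1, 1, 1, 2, 2)
zf <- factor(z, levels = c(1, 2), labels = c("NS", "S"))   # zf: illustrative name

is.factor(zf)    # zf is categorical
table(zf)        # counts per level remain meaningful
as.integer(zf)   # underlying integer codes (1 = first level, 2 = second level)

# mean(zf) would return NA with a warning; a proportion has to be requested explicitly
mean(zf == "S")  # proportion of smokers
```

Storing the factor under a new name, rather than overwriting z, is only a design choice for this example; it keeps the raw codes at hand for checking.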
1.2.4. Manipulation of categorical variables
It is often necessary to recode a k-class qualitative variable into j < k classes or to associate labels with the numerical modalities of a qualitative variable. For example, suppose that data on forced expiratory volume in 1 second (FEV1) (variable fev1) from ten patients are provided in the form of a three-level ordered variable: critical
(1), low (2) and normal (3). Here, the sample() command will be used, which allows random data to be generated by resampling from a list of values. This R command expects three arguments: the first gives the list of permissible values, the second the number of observations to be generated and the last the type of random draw that is sought: with replacement (replace=TRUE) or without (replace=FALSE):
> fev1 <- sample(1:3, 10, replace=TRUE)
> fev1
 [1] 3 2 1 3 3 3 2 2 1 2
Note that the values are random and will not necessarily be identical when this command is run in another R session. To ensure the reproducibility of such simulations, one should call the set.seed() command beforehand.
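The effect of set.seed() can be checked with a minimal sketch. The seed value 101 is arbitrary, and draw1 and draw2 are illustrative names.

```r
set.seed(101)                              # arbitrary seed value
draw1 <- sample(1:3, 10, replace = TRUE)

set.seed(101)                              # same seed, hence the same sequence
draw2 <- sample(1:3, 10, replace = TRUE)

identical(draw1, draw2)                    # the two draws coincide
```

The actual values drawn for a given seed may still differ between R versions, since the default sampling algorithm changed in R 3.6.0; no output is therefore shown here.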
First, the numerical values will be replaced with the labels described previously (1 = critical, 2 = low and 3 = normal):
> fev1 <- factor(fev1, levels=c(1,2,3),
                 labels=c("critical","low","normal"))
> fev1
 [1] normal   low      critical normal   normal   normal   low
 [8] low      critical low
Levels: critical low normal
The distribution of the counts by class can be verified quickly using the table() command, which produces simple or cross-tabulations that will be discussed in Chapter 2:
> table(fev1)
fev1
critical      low   normal
       2        4        4
If the aim is to recode the three-class variable fev1 into a two-class variable by aggregating the first two modalities or levels, the levels() command can be used as follows:
> levels(fev1)
[1] "critical" "low"      "normal"
> levels(fev1)[1:2] <- "critical or low"
> levels(fev1)
[1] "critical or low" "normal"
> table(fev1)
fev1
critical or low          normal
              6               4
It should be observed that the modalities of the variable fev1 have effectively been modified and the frequencies of the first two original levels can now be found in the first level, critical or low. It is advisable to always verify that the recoding operations of the levels of a categorical variable have been carried out as expected, using the levels() or table() commands.
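One possible way to carry out this verification is to recode a copy of the variable and cross-tabulate the old levels against the new ones. In this sketch the numerical codes are those displayed earlier in this section, and fev1_2 is an illustrative name.

```r
codes <- c(3, 2, 1, 3, 3, 3, 2, 2, 1, 2)   # the values displayed earlier
fev1  <- factor(codes, levels = 1:3, labels = c("critical", "low", "normal"))

fev1_2 <- fev1                             # work on a copy
levels(fev1_2)[1:2] <- "critical or low"   # merge the first two levels

# Each row shows where the observations of an original level ended up
table(original = fev1, recoded = fev1_2)
```

Working on a copy keeps the original three-level variable available, in line with the recommendation never to alter the original data without a trace.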
1.3. Selection of observations
1.3.1. Index-based selection
Considering the weight data discussed above, one could select more than one observation, for example the third and the fifth:
> x[c(3,5)]
[1] 2557 2600
or from the third to the fifth:
> x[3:5]
[1] 2557 2594 2600
The slightly peculiar syntax 3:5 actually designates the sequence of integers starting at 3 and ending at 5. This is, therefore, strictly equivalent to c(3,4,5). Thus, to access specific values of a variable, it suffices to provide a list of observation numbers (or indices). The same indexing principle applies of course to qualitative variables:
> fev1[2]
[1] critical or low
Levels: critical or low normal
1.3.2. Criterion-based selection
Suppose now that we want to obtain the weight of babies whose mother weighs less than 50 kg:
> x[y < 50]
[1] 2557 2594 2600 2637
The previous command reads: select the values of x for which the (logical) condition y < 50 holds. This condition can be directly visualized by typing y < 50 in R, which yields the following result:
> y < 50
 [1] FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE
This selection principle based on an external criterion remains valid when the criterion is a categorical variable. To obtain the list of weights of infants whose mother is a non-smoker, the following command should be entered:
> x[z == "NS"]
[1] 2523 2551 2622 2637 2637
The double equal sign (==) is used to designate the logical condition "z equal to NS". Finally, it is possible to combine criteria relating to variables of the same type or of different types using logical operators, mainly & (and) and | (or). The following command returns the values of x such that z is equal to NS and y <= 55 (weight of babies whose mother does not smoke and weighs 55 kg or less):
> x[z == "NS" & y <= 55]
[1] 2637 2637
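The logical vector behind such a selection can also be stored and inspected, as in the following sketch based on the data of Table 1.2. The name keep is introduced here for illustration.

```r
x <- c(2523, 2551, 2557, 2594, 2600, 2622, 2637, 2637, 2663, 2665)
y <- c(82.7, 70.5, 47.7, 49.1, 48.6, 56.4, 53.6, 46.8, 55.9, 51.4)
z <- factor(c(1, 1, 2, 2, 2, 1, 1, 1, 2, 2), labels = c("NS", "S"))

keep <- z == "NS" & y <= 55   # logical vector, one value per observation
which(keep)                   # positions of the selected observations
sum(keep)                     # TRUE counts as 1, so this is the number selected
x[keep]                       # same result as x[z == "NS" & y <= 55]

x[z == "S" | y < 50]          # "or": mother smokes, or weighs less than 50 kg
```

Storing the condition under a name is useful when the same subset has to be applied to several variables in turn.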
1.4. Representation and processing of missing values
In reality, it is rare for complete data (without missing values) to be available. Regardless of the manner in which missing values are to be processed statistically, it is important to ensure that they are properly represented as such by the statistical software.
Returning to the previous example, suppose that the third observation for the weight of babies is actually missing or that we want to process it as such. This datum is represented by a dot (.) in Table 1.3.
Table 1.3. Artificial data about weight at birth including the missing datum
This missing value could have been represented by the lack of value (such as a blank cell in Microsoft Excel) or by any other symbol. Using the same approach as when variable x was first created, a way of entering the weight of the babies would be to replace the missing value with the term NA, which is the term reserved by R to encode missing values:
> c(2523, 2551, NA, 2594, 2600, 2622, 2637, 2637, 2663, 2665)
Since variable x has already been entered, it can simply be updated, in this case replacing the third element with NA:
> x[3] <- NA
> x
 [1] 2523 2551   NA 2594 2600 2622 2637 2637 2663 2665
Care should be taken, as the command length(x) still returns 10: there are effectively ten observations, but one of them is missing. The command is.na() makes it possible to identify the missing values in a variable:
> is.na(x)
 [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> which(is.na(x))
[1] 3
As seen in the case of the observation selection, the values returned by the command is.na() are Boolean values equal to true (TRUE) when the value of x is missing (NA), and false (FALSE) otherwise. This command will therefore return as many values as variable x contains. To locate the missing observations more easily, it may be preferable to combine is.na() with the command which(), which returns their positions. It should be noted that commands can be combined in an intuitive way: the which() command uses the result returned by is.na(), without having to create a dummy variable to store the result. An alternative consists of identifying the so-called complete cases with the command complete.cases():
> complete.cases(x)
 [1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
Using the principle of observation selection used previously, the following command will thus return the values of y for which x has a non-missing value:
> y[complete.cases(x)]
[1] 82.7 70.5 49.1 48.6 56.4 53.6 46.8 55.9 51.4
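Missing values also affect numerical summaries. The sketch below, using the weights with the third value set to NA, shows how a missing value propagates and how the na.rm= option of mean() removes it before the calculation.

```r
x <- c(2523, 2551, NA, 2594, 2600, 2622, 2637, 2637, 2663, 2665)

sum(is.na(x))           # number of missing values
mean(x)                 # NA: a single missing value propagates to the result
mean(x, na.rm = TRUE)   # mean of the nine observed values
x[!is.na(x)]            # the observed values only
```

Counting with sum(is.na(x)) relies on the fact that TRUE is treated as 1 and FALSE as 0 in arithmetic.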
1.5. Importing and storing data
1.5.1. Univariate data
The three sets of ten measures discussed in the preceding sections were of a sufficiently moderate size to allow their direct entry in R. However, in most cases, the data will have already been stored in an external file and the first step usually consists of importing them into R.
For example, here are the contents of the file poids.dat, which is a simple text file that can be viewed with any text editor:
2523 2551 2557 2594 2600 2622 2637 2637 2663 2665
In a simple case where only one set of measurements has to be imported, the scan() command is sufficient. It is given the location of the data file. The term “location” designates the precise location of the file in the tree structure of the computer’s file system. Here, it will be assumed that the file is in the working directory (which can be changed with the setwd() command or using RStudio utilities):
> x <- scan("poids.dat")
Read 10 items
> head(x, n=3)
[1] 2523 2551 2557
The head() command enables the first observations of a variable to be displayed. The option n= enables their number to be specified (by default, n=6).
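To try scan() without the original file, a temporary file can stand in for poids.dat, as in the following sketch. The temporary file and the name tmp are assumptions made for illustration.

```r
tmp <- tempfile(fileext = ".dat")   # hypothetical stand-in for poids.dat
writeLines("2523 2551 2557 2594 2600 2622 2637 2637 2663 2665", tmp)

x <- scan(tmp, quiet = TRUE)        # quiet = TRUE suppresses the "Read 10 items" message
length(x)                           # number of values read
head(x, n = 3)                      # first three values

unlink(tmp)                         # remove the temporary file
```

By default scan() expects numbers separated by white space, which is why no further option is needed here.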
1.5.2. Multivariate data
In more complex, more realistic cases, data are multivariate with the variables generally organized in columns and the observations in rows [WIC14]: each line of the file, therefore, represents a statistical unit on which several measurements or data points (variables or fields) have been collected, the latter being separated from one another by commas, semicolons, spaces or tabs. In all cases, the command to use is read.table() (the field delimiter is a space or a tab) or one of its shortcuts: read.csv() (comma-type separator) and read.csv2() (semicolon-type separator).
Consider the file birthwt.dat, whose first three lines are:
0 19 182 2 0 0 0 1 0 2523
0 33 155 3 0 0 0 0 3 2551
0 20 105 1 1 0 0 0 1 2557
There are ten columns (that is, ten variables) and each value is separated from the next by a space. The names of the variables do not appear anywhere, but we know that they are the following: low, weight status of the baby at birth (= 1 if weight < 2.5 kg, 0 otherwise); age, age of the mother (years); lwt, weight of the mother (in pounds); race, ethnicity of the mother (encoded in three classes, 1 = white, 2 = black and 3 = other); smoke (= 1 if consumption of tobacco during pregnancy, 0 otherwise); ptl (number of previous premature labours); ht (= 1 if history of hypertension, 0 otherwise); ui (= 1 if presence of uterine irritability, 0 otherwise); ftv (number of consultations with a gynecologist during the first trimester of the pregnancy); bwt, weight of the baby at birth (in grams). Here is how these data can be imported in R, again assuming that the data file is located in the working directory of R:
> bt <- read.table("birthwt.dat", header=FALSE)
> varnames <- c("low","age","lwt","race","smoke","ptl","ht",
                "ui","ftv","bwt")
> names(bt) <- varnames
> head(bt)
  low age lwt race smoke ptl ht ui ftv  bwt
1   0  19 182    2     0   0  0  1   0 2523
2   0  33 155    3     0   0  0  0   3 2551
3   0  20 105    1     1   0  0  0   1 2557
4   0  21 108    1     1   0  0  1   2 2594
5   0  18 107    1     1   0  0  1   0 2600
6   0  21 124    3     0   0  0  0   0 2622
If the data file had had the following form:
0,19,182,2,0,0,0,1,0,2523
0,33,155,3,0,0,0,0,3,2551
0,20,105,1,1,0,0,0,1,2557
which is typically the export format offered by a spreadsheet program such as Excel, then the command read.table() could simply have been replaced by read.csv() (or the sep= option of read.table() could have been changed).
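The two import routes can be compared on temporary files containing the three lines shown above. This is only a sketch: the temporary files and the names f_space, f_csv, bt1 and bt2 are assumptions made for illustration.

```r
varnames <- c("low", "age", "lwt", "race", "smoke", "ptl", "ht", "ui", "ftv", "bwt")

# Space-separated version (stand-in for birthwt.dat)
f_space <- tempfile(fileext = ".dat")
writeLines(c("0 19 182 2 0 0 0 1 0 2523",
             "0 33 155 3 0 0 0 0 3 2551",
             "0 20 105 1 1 0 0 0 1 2557"), f_space)
bt1 <- read.table(f_space, header = FALSE, col.names = varnames)

# Comma-separated version of the same lines
f_csv <- tempfile(fileext = ".csv")
writeLines(c("0,19,182,2,0,0,0,1,0,2523",
             "0,33,155,3,0,0,0,0,3,2551",
             "0,20,105,1,1,0,0,0,1,2557"), f_csv)
# read.csv() assumes a header line by default, so header = FALSE is still needed
bt2 <- read.csv(f_csv, header = FALSE, col.names = varnames)

all.equal(bt1, bt2)            # both routes give the same data table
unlink(c(f_space, f_csv))      # remove the temporary files
```

The col.names= option is used here as a shortcut for assigning the variable names after the import with names().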
1.5.3. Storing the data in an external file
In the univariate case as in the multivariate case, the data processed in R can be exported using write.table() or write.csv() to create text files similar to those discussed above. For example:
> write.csv(bt, file="bt.csv")
will save the table imported with the read.table() command in the form of a file where the values of the variables are separated by commas, and it will be possible to open the bt.csv file with a spreadsheet program such as Microsoft Excel.
If data must be saved directly in the R format, save(bt, file="bt.RData") will have to be used, RData (or rda) being the extension reserved for R data files.
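Both storage formats can be checked with a round trip: write the table, read it back and compare. The toy table and the temporary file names below are assumptions made for illustration.

```r
# Toy table with two observations (stand-in for bt)
bt <- data.frame(low = c(0L, 0L), age = c(19L, 33L), bwt = c(2523L, 2551L))

# Text export: row.names = FALSE avoids writing an extra first column of row numbers
csv_file <- tempfile(fileext = ".csv")
write.csv(bt, file = csv_file, row.names = FALSE)
bt_csv <- read.csv(csv_file)

# R format: save() stores the object together with its name
rda_file <- tempfile(fileext = ".RData")
save(bt, file = rda_file)
rm(bt)
load(rda_file)                 # restores the object under its original name, bt

all.equal(bt, bt_csv)          # the table survives both round trips
unlink(c(csv_file, rda_file))  # remove the temporary files
```

Unlike read.csv(), load() does not return the data: it recreates the saved objects in the workspace, which is why no assignment is needed.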
> data(birthwt, package="MASS")
> c(nrow(birthwt), ncol(birthwt))
[1] 189  10
> names(birthwt)
 [1] "low"   "age"   "lwt"   "race"  "smoke" "ptl"   "ht"
 [8] "ui"    "ftv"   "bwt"
> head(birthwt, n=2)
   low age lwt race smoke ptl ht ui ftv  bwt
85   0  19 182    2     0   0  0  1   0 2523
86   0  33 155    3     0   0  0  0   3 2551
The birthwt variable is a table (data frame) comprising ten variables, each comprising 189 observations. Therefore, information on 189 statistical units is available in this prospective study. To access the values of a particular variable, the following notation will be used: the name of the table (birthwt) followed by the name of the variable prefixed with the dollar ($) sign. For example, the first five values for the weight of babies (bwt) are thus:
> birthwt$bwt[1:5]
[1] 2523 2551 2557 2594 2600
The meaning of each variable is given above. It is known that some variables are strictly binary (0/1), such as low, smoke, ht and ui, while the other variables are numeric, either with discrete values such as ftv or with values assumed to be continuous such as lwt or bwt. We know that in the case of binary variables, a value of 1 means that the sign is present (the mother smokes, the mother has a history of high blood pressure, etc.). It costs nothing to add more informative labels to these variables. Ethnicity (race) is a special case because it is a qualitative variable but is actually processed by R as a numerical variable:
> summary(birthwt$race)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   1.000   1.000   1.847   3.000   3.000
The summary() command allows the distribution of numerical and categorical variables to be summarized, and applies equally to a single variable and to a data table. A way to recode the modalities of the qualitative variables for the birthwt table is as follows:
> yesno <- c("No","Yes")
> birthwt$smoke <- factor(birthwt$smoke, labels=yesno)
> birthwt$race <- factor(birthwt$race, levels=c(1,2,3),
                         labels=c("White","Black","Other"))
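The effect of this recoding on summary() can be seen on a small table. The sketch below uses a toy data frame d built from three of the rows shown earlier, as a stand-in for birthwt, so that it runs without loading any package.

```r
# Toy table (stand-in for birthwt): three observations, three variables
d <- data.frame(race = c(2, 3, 1), smoke = c(0, 0, 1), bwt = c(2523, 2551, 2557))

summary(d$race)   # before recoding: quartiles and mean of the numerical codes

yesno <- c("No", "Yes")
d$smoke <- factor(d$smoke, levels = c(0, 1), labels = yesno)
d$race  <- factor(d$race, levels = c(1, 2, 3), labels = c("White", "Black", "Other"))

summary(d$race)   # after recoding: counts per level
str(d)            # shows which columns are now factors
```

Giving levels= explicitly, as done here for both variables, guards against a label being attached to the wrong code when one of the codes happens to be absent from the data.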
As a reminder, a viral load of 100,000 copies/ml is equivalent to 5 log.
We want to answer the following:
a) Indicate how many patients have a viral load considered as non-detectable.
b) The researcher realizes that the value 3.04 corresponds to a data entry error and must be changed to 3.64. Similarly, she has a doubt about the seventh measurement and decides to consider it as a missing value. Perform the corresponding transformations.
c) What is the median viral load level in copies/ml, for the data considered valid?
First and foremost, it is necessary to express the detection limit (50 copies/ml) in logarithmic units; this is, in fact, equal to:
> log10(50)
Next, we need to filter the observations that do not verify the condition X > 1.70 (the exact numeric result will be used, not the approximate value):
> X <- c(3.64, 2.27, 1.43, 1.77, 4.62, 3.04, 1.01, 2.14, 3.02, 5.62, 5.51,
         5.51, 1.01, 1.05, 4.19, 2.63, 4.34, 4.85, 4.02, 5.92)
> length(X[X <= log10(50)])
To replace the observation equal to 3.04, we will perform a simple logical test, since this value appears only once in the series of observations:
> X[X == 3.04] <- 3.64
Concerning the seventh observation, we will proceed in the same manner by updating the value of the observation with the corresponding index:
> X[7] <- NA
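Finally, the median viral load for patients with a measurement considered valid can be calculated as follows; the missing value is excluded explicitly, because a comparison involving NA itself returns NA and would otherwise leave an NA in the selection:
> Xc <- X[!is.na(X) & X > log10(50)]
> round(median(10^Xc), 0)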
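The interaction between a missing value and a logical selection can be examined step by step. The following sketch replays the exercise; the names limit, n_undetectable, with_na and valid are introduced here for illustration.

```r
X <- c(3.64, 2.27, 1.43, 1.77, 4.62, 3.04, 1.01, 2.14, 3.02, 5.62, 5.51,
       5.51, 1.01, 1.05, 4.19, 2.63, 4.34, 4.85, 4.02, 5.92)
limit <- log10(50)                    # detection limit in log10 units

n_undetectable <- sum(X <= limit)     # question a), counted before any recoding

X[X == 3.04] <- 3.64                  # question b): correct the data entry error
X[7] <- NA                            # and treat the seventh measurement as missing

with_na <- X[X > limit]               # the test is NA for X[7], so an NA is kept
valid   <- X[!is.na(X) & X > limit]   # the missing value is dropped explicitly

anyNA(with_na)                        # TRUE
median(10^with_na)                    # NA: the missing value propagates
round(median(10^valid), 0)            # question c): median in copies/ml
```

An equivalent option would be to keep the simpler selection and call median() with na.rm = TRUE.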
2) The dosage.txt file contains a series of 15 bioassays, stored in numerical format with three decimal places as follows: 6.379 6.683 5.120 ...
– use scan to read these data (thoroughly read the online help regarding the use of this command, particularly the what= option);