
Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins 4th Edition Andreas D. Baxevanis


Bioinformatics


Fourth Edition

This fourth edition first published 2020
© 2020 John Wiley & Sons, Inc.

Edition History

Wiley-Blackwell (1e, 2000), Wiley-Blackwell (2e, 2001), Wiley-Blackwell (3e, 2005)

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Andreas D. Baxevanis, Gary D. Bader, and David S. Wishart to be identified as the authors of the editorial material in this work has been asserted in accordance with law.

Registered Office

John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

Editorial Office

John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.

Limit of Liability/Disclaimer of Warranty

While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Cataloging-in-Publication Data

Names: Baxevanis, Andreas D., editor. | Bader, Gary D., editor. | Wishart, David S., editor.

Title: Bioinformatics / edited by Andreas D. Baxevanis, Gary D. Bader, David S. Wishart.

Other titles: Bioinformatics (Baxevanis)

Description: Fourth edition. | Hoboken, NJ : Wiley, 2020. | Includes bibliographical references and index.

Identifiers: LCCN 2019030489 (print) | ISBN 9781119335580 (cloth) | ISBN 9781119335962 (adobe pdf) | ISBN 9781119335955 (epub)

Subjects: MESH: Computational Biology–methods | Sequence Analysis–methods | Base Sequence | Databases, Nucleic Acid | Databases, Protein

Classification: LCC QH324.2 (print) | LCC QH324.2 (ebook) | NLM QU 550.5.S4 | DDC 570.285–dc23

LC record available at https://lccn.loc.gov/2019030489

LC ebook record available at https://lccn.loc.gov/2019030490

Cover Design: Wiley

Cover Images: © David Wishart, background © Suebsiri/Getty Images

Set in 9.5/12.5pt STIXTwoText by SPi Global, Chennai, India

10 9 8 7 6 5 4 3 2 1

Contents

Foreword

Preface

Contributors

About the Companion Website

1 Biological Sequence Databases
Andreas D. Baxevanis

2 Information Retrieval from Biological Databases
Andreas D. Baxevanis

3 Assessing Pairwise Sequence Similarity: BLAST and FASTA
Andreas D. Baxevanis

4 Genome Browsers
Tyra G. Wolfsberg

5 Genome Annotation
David S. Wishart

6 Predictive Methods Using RNA Sequences
Michael F. Sloma, Michael Zuker, and David H. Mathews

7 Predictive Methods Using Protein Sequences
Jonas Reeb, Tatyana Goldberg, Yanay Ofran, and Burkhard Rost

8 Multiple Sequence Alignments
Fabian Sievers, Geoffrey J. Barton, and Desmond G. Higgins

9 Molecular Evolution and Phylogenetic Analysis
Emma J. Griffiths and Fiona S.L. Brinkman

10 Expression Analysis
Marieke L. Kuijjer, Joseph N. Paulson, and John Quackenbush

11 Proteomics and Protein Identification by Mass Spectrometry
Sadhna Phanse and Andrew Emili

12 Protein Structure Prediction and Analysis
David S. Wishart

13 Biological Networks and Pathways
Gary D. Bader

14 Metabolomics
David S. Wishart

15 Population Genetics
Lynn B. Jorde and W. Scott Watkins

16 Metagenomics and Microbial Community Analysis
Robert G. Beiko

17 Translational Bioinformatics
Sean D. Mooney and Stephen J. Mooney

18 Statistical Methods for Biologists
Hunter N.B. Moseley

Appendices

Glossary

Index

Foreword

As I review the material presented in the fourth edition of Bioinformatics I am moved in two ways, related to both the past and the future.

Looking to the past, I am moved by the amazing evolution that has occurred in our field since the first edition of this book appeared in 1998. Twenty-one years is a long, long time in any scientific field, but especially so in the agile field of bioinformatics. To use the well-trodden metaphor of the “biology moonshot,” the launchpad at the beginning of the twenty-first century was the determination of the human genome. Discovery is not the right word for what transpired – we knew it was there and what was needed. Synergy is perhaps a better word; synergy of technological development, experiment, computation, and policy. A truly collaborative effort to continuously share, in a reusable way, the collective efforts of many scientists. Bioinformatics was born from this synergy and has continued to grow and flourish based on these principles.

That growth is reflected in both the scope and depth of what is covered in these pages. These attributes are a reflection of the increased complexity of the biological systems that we study (moving from “simple” model organisms to the human condition) and the scales at which those studies take place. As a community we have professed multiscale modeling without much to show for it, but it would seem to be finally here. We now have the ability to connect the dots from molecular interactions, through the pathways to which those molecules belong to the cells they affect, to the interactions between those cells through to the effects they have on individuals within a population. Tools and methodologies that were novel in earlier editions of this book are now routine or obsolete, and newer, faster, and more accurate procedures are now with us. This will continue, and as such this book provides a valuable snapshot of the scope and depth of the field as it exists today.

Looking to the future, this book provides a foundation for what is to come. For me this is a field more aptly referred to (and perhaps a new subtitle for the next edition) as Biomedical Data Science. Sitting as I do now, as Dean of a School of Data Science which collaborates openly across all disciplines, I see rapid change akin to what happened to birth bioinformatics 20 or more years ago. It will not take 20 years for other disciplines to catch up; I predict it will take 2! The accomplishments outlined in this book can help define what other disciplines will accomplish with their own data in the years to come. Statistical methods, cloud computing, data analytics, notably deep learning, the management of large data, visualization, ethics policy, and the law surrounding data are generic. Bioinformatics has so much to offer, yet it will also be influenced by other fields in a way that has not happened before. Forty-five years in academia tells me that there is nothing to compare across campuses to what is happening today. This is both an opportunity and a threat. The editors and authors of this edition should be complimented for setting the stage for what is to come.

Preface

In putting together this textbook, we hope that students from a range of fields – including biology, computer science, engineering, physics, mathematics, and statistics – benefit by having a convenient starting point for learning most of the core concepts and many useful practical skills in the field of bioinformatics, also known as computational biology.

Students interested in bioinformatics often ask how they should acquire training in such an interdisciplinary field as this one. In an ideal world, students would become experts in all the fields mentioned above, but this is actually not necessary and realistically too much to ask. All that is required is to combine their scientific interests with a foundation in biology and any single quantitative field of their choosing. While the most common combination is to mix biology with computer science, incredible discoveries have been made through finding creative intersections with any number of quantitative fields. Indeed, many of these quantitative fields typically overlap a great deal, especially given their foundational use of mathematics and computer programming. These natural relationships between fields provide the foundation for integrating diverse expertise and insights, especially in the context of performing bioinformatic analyses.

While bioinformatics is often considered an independent subfield of biology, it is likely that the next generation of biologists will not consider bioinformatics as being separate and will instead consider gaining bioinformatics and data science skills as naturally as they learn how to use a pipette. They will learn how to program a computer, likely starting in elementary school. Other data science knowledge areas, such as math, statistics, machine learning, data processing, and data visualization will also be part of any core curriculum. Indeed, the children of one of the editors recently learned how to construct bar plots and other data charts in kindergarten! The same editor is teaching programming in R (an important data science programming language) to all incoming biology graduate students at his university starting this year.

As bioinformatics and data science become more naturally integrated in biology, it is worth noting that these fields actively espouse a culture of open science. This culture is motivated by thinking about why we do science in the first place. We may be curious or like problem solving. We could also be motivated by the benefits to humanity that scientific advances bring, such as tangible health and economic benefits. Whatever the motivating factor, it is clear that the most efficient way to solve hard problems is to work together as a team, in a complementary fashion and without duplication of effort. The only way to make sure this works effectively is to efficiently share knowledge and coordinate work across disciplines and research groups. Presenting scientific results in a reproducible way, such as freely sharing the code and data underlying the results, is also critical. Fortunately, there are an increasing number of resources that can help facilitate these goals, including the bioRxiv preprint server, where papers can be shared before the very long process of peer review is completed; GitHub, for sharing computer code; and data science notebook technology that helps combine code, figures, and text in a way that makes it easier to share reproducible and reusable results.

We hope this textbook helps catalyze this transition of biology to a quantitative, data science-intensive field. As biological research advances become ever more built on interdisciplinary, open, and team science, progress will dramatically speed up, laying the groundwork for fantastic new discoveries in the future.


We also deeply thank all of the chapter authors for contributing their knowledge and time to help the many future readers of this book learn how to apply the myriad bioinformatic techniques covered within these pages to their own research questions.

Andreas D. Baxevanis
Gary D. Bader
David S. Wishart

Contributors

Gary D. Bader, PhD is a Professor at The Donnelly Centre at the University of Toronto, Toronto, Canada, and a leader in the field of Network Biology. Gary completed his postdoctoral work in Chris Sander’s group in the Computational Biology Center (cBio) at Memorial Sloan-Kettering Cancer Center in New York. Gary completed his PhD in the laboratory of Christopher Hogue in the Department of Biochemistry at the University of Toronto and a BSc in Biochemistry at McGill University in Montreal. Dr. Bader uses molecular interaction, pathway, and -omics data to gain a “causal” mechanistic understanding of normal and disease phenotypes. His laboratory develops novel computational approaches that combine molecular interaction and pathway information with -omics data to develop clinically predictive models and identify therapeutically targetable pathways. He also helps lead the Cytoscape, GeneMANIA, and Pathway Commons pathway and network analysis projects.

Geoffrey J. Barton, PhD is Professor of Bioinformatics and Head of the Division of Computational Biology at the University of Dundee School of Life Sciences, Dundee, UK. Before moving to Dundee in 2001, he was Head of the Protein Data Bank in Europe and the leader of the Research and Development Team at the EMBL European Bioinformatics Institute (EBI). Prior to joining EMBL-EBI, he was Head of Genome Informatics at the Wellcome Trust Centre for Human Genetics, University of Oxford, a position he held concurrently with a Royal Society University Research Fellowship in the Department of Biochemistry. Geoff’s longest running research interest is using computational methods to study the relationship between a protein’s sequence, its structure, and its function. His group has contributed many tools and techniques in the field of protein sequence and structure analysis and structure prediction. Two of the best known are the Jalview multiple alignment visualization and analysis workbench, which is in use by over 70 000 groups for research and teaching, and the JPred multi-neural net protein secondary structure prediction algorithm, which performs predictions on up to 500 000 proteins/month for users worldwide. In addition to his work related to protein sequence and structure, Geoff has collaborated on many projects that probe biological processes using proteomic and high-throughput sequencing approaches. Geoff’s group has deep expertise in RNA-seq methods and has recently published a two-condition 48-replicate RNA-seq study that is now a key reference work for users of this technology.

Andreas D. Baxevanis, PhD is the Director of Computational Biology for the National Institutes of Health’s (NIH) Intramural Research Program. He is also a Senior Scientist leading the Computational Genomics Unit at the NIH’s National Human Genome Research Institute, Bethesda, MD, USA. His research program is centered on probing the interface between genomics and developmental biology, focusing on the sequencing and analysis of invertebrate genomes that can yield insights of relevance to human health, particularly in the areas of regeneration, allorecognition, and stem cell biology. His accomplishments have been recognized by the Bodossaki Foundation’s Academic Prize in Medicine and Biology in 2000, Greece’s highest award for young scientists of Greek heritage. In 2014, he was elected to the Johns Hopkins Society of Scholars, recognizing alumni who have achieved marked distinction in their field of study. He was the recipient of the NIH’s Ruth L. Kirschstein Mentoring Award in 2015, in recognition of his commitment to scientific training, education, and mentoring. In 2016, Dr. Baxevanis was elected as a Senior Member of the International Society for Computational Biology for his sustained contributions to the field and, in 2018, he was elected as a Fellow of the American Association for the Advancement of Science for his distinguished contributions to the field of comparative genomics.

Robert G. Beiko, PhD is a Professor and Associate Dean for Research in the Faculty of Computer Science at Dalhousie University, Halifax, Nova Scotia, Canada. He is a former Tier II Canada Research Chair in Bioinformatics (2007–2017), an Associate Editor at mSystems and BMC Bioinformatics, and a founding organizer of the Canadian Bioinformatics Workshops in Metagenomics and Genomic Epidemiology. He is also the lead editor of the recently published book Microbiome Analysis in the Methods in Molecular Biology series. His research focuses on microbial genomics, evolution, and ecology, with concentrations in the area of lateral gene transfer and microbial community analysis.

Fiona S.L. Brinkman, PhD is a Professor in Bioinformatics and Genomics in the Department of Molecular Biology and Biochemistry at Simon Fraser University, Vancouver, British Columbia, Canada, with cross-appointments in Computing Science and the Faculty of Health Sciences. She is most known for her research and development of widely used computer software that aids both microbe (PSORTb, IslandViewer) and human genomic (InnateDB) evolutionary/genomics analyses, along with her insights into pathogen evolution. She is currently co-leading a national effort – the Integrated Rapid Infectious Disease Analysis Project – the goal of which is to use microbial genomes as a fingerprint to better track and understand the spread and evolution of infectious diseases. She has also been leading development into an approach to integrate very diverse data for the Canadian CHILD Study birth cohort, including microbiome, genomic, epigenetic, environmental, and social data. She coordinates community-based genome annotation and database development for resources such as the Pseudomonas Genome Database. She also has a strong interest in bioinformatics education, including developing the first undergraduate curricula used as the basis for the first White Paper on Canadian Bioinformatics Training in 2002. She is on several committees and advisory boards, including the Board of Directors for Genome Canada; she chairs the Scientific Advisory Board for the European Nucleotide Archive (EMBL-EBI). She has received a number of awards, including a TR100 award from MIT, and, most recently, was named as a Fellow of the Royal Society of Canada.

Andrew Emili, PhD is a Professor in the Departments of Biochemistry (Medical School) and Biology (Arts and Sciences) at Boston University (BU), Boston, MA, USA, and the inaugural Director of the BU Center for Network Systems Biology (CNSB). Prior to Boston, Dr. Emili was a founding member and Principal Investigator for 18 years at the Donnelly Center for Cellular and Biomolecular Research at the University of Toronto, one of the premier research centers in integrative molecular biology. Dr. Emili is an internationally recognized leader in functional proteomics, systems biology, and precision mass spectrometry. His group develops and applies innovative technologies to systematically map protein interaction networks and macromolecular complexes of cells and tissues on a global scale, publishing “interactome” maps of unprecedented quality, scope, and resolution.

Tatyana Goldberg, PhD is a postdoctoral scientist at the Technical University of Munich, Germany. She obtained her PhD in Bioinformatics under the supervision of Dr. Burkhard Rost. Her research focuses on developing models that can predict the localization of proteins within cells. The results of her study contribute to a variety of applications, including the development of pharmaceuticals for the treatment of Alzheimer disease and cancer.

Emma J. Griffiths, PhD is a research associate in the Department of Pathology and Laboratory Medicine at the University of British Columbia in Vancouver, Canada, working with Dr. William Hsiao. Dr. Griffiths received her PhD from the Department of Biochemistry and Biomedical Sciences at McMaster University in Hamilton, Canada, with her doctoral work focusing on the evolutionary relationships between different groups of bacteria. She has since pursued postdoctoral training in the fields of chemical and fungal genetics and microbial genomics with Dr. Fiona Brinkman in the Department of Biochemistry and Molecular Biology at Simon Fraser University in Vancouver, Canada. Her current work focuses on the development of ontology-driven applications designed to improve pathogen genomics contextual data (“metadata”) exchange during public health investigations.

Desmond G. Higgins, PhD is Professor of Bioinformatics in University College Dublin, Ireland, where his laboratory works on genomic data analysis and sequence alignment algorithms. He earned his doctoral degree in zoology from Trinity College Dublin, Ireland, and has worked in the field of bioinformatics since 1985. His group maintains and develops the Clustal package for multiple sequence alignment in collaboration with groups in France, Germany, and the United Kingdom. Dr. Higgins wrote the first version of Clustal in Dublin in 1988. He then moved to the EMBL Data Library group located in Heidelberg in 1990 and later to EMBL-EBI in Hinxton. This coincided with the release of Clustal W and, later, Clustal X, which has been extremely widely used and cited. Currently, he has run out of version letters so is working on Clustal Omega, specifically designed for making extremely large protein alignments.

Lynn B. Jorde, PhD has been on the faculty of the University of Utah School of Medicine, Salt Lake City, UT, USA, since 1979 and holds the Mark and Kathie Miller Presidential Endowed Chair in Human Genetics. He was appointed Chair of the Department of Human Genetics in September 2009. Dr. Jorde’s laboratory has published scientific articles on human genetic variation, high-altitude adaptation, the genetic basis of human limb malformations, and the genetics of common diseases such as hypertension, juvenile idiopathic arthritis, and inflammatory bowel disease. Dr. Jorde is the lead author of Medical Genetics, a textbook that is now in its fifth edition and translated into multiple foreign languages. He is the co-recipient of the 2008 Award for Excellence in Education from the American Society of Human Genetics (ASHG). He served two 3-year terms on the Board of Directors of ASHG and, in 2011, he was elected as president of ASHG. In 2012, he was elected as a Fellow of the American Association for the Advancement of Science.

Marieke L. Kuijjer, PhD is a Group Leader at the Centre for Molecular Medicine Norway (NCMM, a Nordic EMBL partner), University of Oslo, Norway, where she runs the Computational Biology and Systems Medicine group. She obtained her doctorate in the laboratory of Dr. Pancras Hogendoorn in the Department of Pathology at the Leiden University Medical Center in the Netherlands. After this, she continued her scientific training as a postdoctoral researcher in the laboratory of Dr. John Quackenbush at the Dana-Farber Cancer Institute and Harvard T.H. Chan School of Public Health, during which she won a career development award and a postdoctoral fellowship. Dr. Kuijjer’s research focuses on solving fundamental biological questions through the development of new methods in computational and systems biology and on implementing these techniques to better understand gene regulation in cancer. Dr. Kuijjer serves on the editorial board of Cancer Research.

David H. Mathews, MD, PhD is a professor of Biochemistry and Biophysics and also of Biostatistics and Computational Biology at the University of Rochester Medical Center, Rochester, NY, USA. He also serves as the Associate Director of the University of Rochester’s Center for RNA Biology. His involvement in education includes directing the Biophysics PhD program and teaching a course in Python programming and algorithms for doctoral students without a programming background. His group studies RNA biology and develops methods for RNA secondary structure prediction and molecular modeling of three-dimensional structure. His group developed and maintains RNAstructure, a widely used software package for RNA structure prediction and analysis.

Sean D. Mooney, PhD has spent his career as a researcher and group leader in biomedical informatics. He now leads Research IT for UW Medicine and is leading efforts to support and build clinical research informatic platforms as its first Chief Research Information Officer (CRIO) and as a Professor in the Department of Biomedical Informatics and Medical Education at the University of Washington, Seattle, WA, USA. Previous to being appointed as CRIO, he was an Associate Professor and Director of Bioinformatics at the Buck Institute for Research on Aging. As an Assistant Professor, he was appointed in Medical and Molecular Genetics at Indiana University School of Medicine and was the founding Director of the Indiana University School of Medicine Bioinformatics Core. In 1997, he received his BS with Distinction in Biochemistry and Molecular Biology from the University of Wisconsin at Madison. He received his PhD from the University of California in San Francisco in 2001, then pursued his postdoctoral studies under an American Cancer Society John Peter Hoffman Fellowship at Stanford University.

Stephen J. Mooney, PhD is an Acting Assistant Professor in the Department of Epidemiology at the University of Washington, Seattle, WA, USA. He developed the CANVAS system for collecting data from Google Street View imagery as a graduate student, and his research focuses on contextual influences on physical activity and transport-related injury. He’s a methods geek at heart.

Hunter N.B. Moseley, PhD is an Associate Professor in the Department of Molecular and Cellular Biochemistry at the University of Kentucky, Lexington, KY, USA. He is also the Informatics Core Director within the Resource Center for Stable Isotope Resolved Metabolomics, Associate Director for the Institute for Biomedical Informatics, and a member of the Markey Cancer Center. His research interests include developing computational methods, tools, and models for analyzing and interpreting many types of biological and biophysical data that enable new understanding of biological systems and related disease processes. His formal education spans multiple disciplines including chemistry, mathematics, computer science, and biochemistry, with expertise in algorithm development, mathematical modeling, structural bioinformatics, and systems biochemistry, particularly in the development of automated analyses of nuclear magnetic resonance and mass spectrometry data as well as knowledge–data integration.

Yanay Ofran, PhD is a Professor and head of the Laboratory of Functional Genomics and Systems Biology at Bar Ilan University in Tel Aviv, Israel. His research focuses on biomolecular recognition and its role in health and disease. Professor Ofran is also the founder of Biolojic Design, a biopharmaceutical company that uses artificial intelligence approaches to design epitope-specific antibodies. He is also the co-founder of Ukko, a biotechnology company that uses computational tools to design safe proteins for the food and agriculture sectors.

Joseph N. Paulson, PhD is a Statistical Scientist within Genentech’s Department of Biostatistics, San Francisco, CA, USA, working on designing clinical trials and biomarker discovery. Previously, he was a Research Fellow in the Department of Biostatistics and Computational Biology at the Dana-Farber Cancer Institute and Department of Biostatistics at the Harvard T.H. Chan School of Public Health. He graduated with a PhD in Applied Mathematics, Statistics, and Scientific Computation from the University of Maryland, College Park where he was a National Science Foundation Graduate Fellow. As a statistician and computational biologist, his interests include clinical trial design, biomarker discovery, development of computational methods for the analysis of high-throughput sequencing data while accounting for technical artifacts, and the microbiome.

Sadhna Phanse, MSc is a Bioinformatics Analyst at the Donnelly Centre for Cellular and Biomolecular Research at the University of Toronto, Toronto, Canada. She has been active in the field of proteomics since 2006 as a member of the Emili research group. Her current work involves the use of bioinformatics methods to investigate biological systems and molecular association networks in human cells and model organisms.

John Quackenbush, PhD is Professor of Computational Biology and Bioinformatics and Chair of the Department of Biostatistics at the Harvard T.H. Chan School of Public Health, Boston, MA, USA. He also holds appointments in the Channing Division of Network Medicine of Brigham and Women’s Hospital and at the Dana-Farber Cancer Institute. He is a recognized expert in computational and systems biology and its applications to the study of a wide range of human diseases and the factors that drive those diseases and their responses to therapy. Dr. Quackenbush has long been an advocate for open science and reproducible research. As a founding member and past president of the Functional Genomics Data Society (FGED), he was a developer of the Minimal Information About a Microarray Experiment (MIAME) and other data-reporting standards. Dr. Quackenbush was honored by President Barack Obama in 2013 as a White House Open Science Champion of Change.

Jonas Reeb, MSc is a PhD student in the laboratory of Burkhard Rost at the Technical University of Munich, Germany (TUM). During his studies at TUM, he has worked on predictive methods for the analysis and evaluation of transmembrane proteins; he has also worked on the NYCOMPS structural genomics pipeline. His doctoral thesis focuses on the effect of sequence variants and their prediction.

Burkhard Rost, PhD is a professor and Alexander von Humboldt Award recipient at the Technical University of Munich, Germany (TUM). He was the first to combine machine learning with evolutionary information, using this combination to accurately predict secondary structure. Since that time, his group has repeated this success in developing many other tools that are actively used to predict and understand aspects of protein structure and function. All tools developed by his research group are available through the first internet server in the field of protein structure prediction (PredictProtein), a resource that has been online for over 25 years. Over the last several years, his research group has been shifting its focus to the development of methods that predict and annotate the effect of sequence variation and their implications for precision medicine and personalized health.

Fabian Sievers, PhD is currently a postdoctoral research fellow in the laboratory of Des Higgins at University College Dublin, Ireland. He works on multiple sequence alignment algorithms and, in particular, on the development of Clustal Omega. He received his PhD in mathematics from Trinity College, Dublin and has worked in industry in the fields of algorithm development and high-performance computing.

Michael F. Sloma, PhD is a data scientist at Xometry, Gaithersburg, MD, USA. He received his BA degree in Chemistry from Wells College. He earned his doctoral degree in Biochemistry in the laboratory of David Mathews at the University of Rochester, where his research focused on computational methods to predict RNA structure from sequence.

W. Scott Watkins, MS is a researcher and laboratory manager in the Department of Human Genetics at the University of Utah, Salt Lake City, UT, USA. He has a long-standing interest in human population genetics and evolution. His current interests include the development and application of high-throughput computational methods to mobile element biology, congenital heart disease, and personalized medicine.

David S. Wishart, PhD is a Distinguished University Professor in the Departments of Biological Sciences and Computing Science at the University of Alberta, Edmonton, Alberta, Canada. Dr. Wishart has been developing bioinformatics programs and databases since the early 1980s and has made bioinformatics an integral part of his research program for nearly four decades. His interest in bioinformatics led to the development of a number of widely used bioinformatics tools for structural biology, bacterial genomics, pharmaceutical research, and metabolomics. Some of Dr. Wishart’s most widely known bioinformatics contributions include the Chemical Shift Index (CSI) for protein secondary structure identification by nuclear magnetic resonance spectroscopy, PHAST for bacterial genome annotation, the DrugBank database for drug research, and MetaboAnalyst for metabolomic data analysis. Over the course of his academic career, Dr. Wishart has published more than 400 research papers, with many being in the field of bioinformatics. In addition to his long-standing interest in bioinformatics research, Dr. Wishart has been a passionate advocate for bioinformatics education and outreach. He is one of the founding members of the Canadian Bioinformatics Workshops (CBW) – a national bioinformatics training program that has taught more than 3000 students over the past two decades. In 2002 he established Canada’s first undergraduate bioinformatics degree program at the University of Alberta and has personally mentored nearly 130 undergraduate and graduate students, many of whom have gone on to establish successful careers in bioinformatics.

Tyra G. Wolfsberg, PhD is the Associate Director of the Bioinformatics and Scientific Programming Core at the National Human Genome Research Institute (NHGRI), National Institutes of Health (NIH), Bethesda, MD, USA. Her research program focuses on developing methodologies to integrate sequence, annotation, and experimentally generated data so that bench biologists can quickly and easily obtain results for their large-scale experiments. She maintains a long-standing commitment to bioinformatics education and outreach. She has authored a chapter on genomic databases for previous editions of this textbook, as well as a chapter on the NCBI Map Viewer for Current Protocols in Bioinformatics and Current Protocols in Human Genetics. She serves as the co-chair of the NIH lecture series Current Topics in Genome Analysis; these lectures are archived online and have been viewed over 1 million times to date. In addition to teaching bioinformatics courses at NHGRI, she served for 13 years as a faculty member in bioinformatics at the annual AACR Workshop on Molecular Biology in Clinical Oncology.

Michael Zuker, PhD retired as a Professor of Mathematical Sciences at Rensselaer Polytechnic Institute, Troy, NY, USA, in 2016. He was an Adjunct Professor in the RNA Institute at the University of Albany and remains affiliated with the RNA Institute. He works on the development of algorithms to predict folding, hybridization, and melting profiles in nucleic acids. His nucleic acid folding and hybridization web servers have been running at the University of Albany since 2010. His educational activities include developing and teaching his own bioinformatics course at Rensselaer and participating in both a Chautauqua short course in bioinformatics for college teachers and an intensive bioinformatics course at the University of Michigan. He currently serves on the Scientific Advisory Board of Expansion Therapeutics, Inc. at the Scripps Research Institute in Jupiter, Florida.

About the Companion Website

This book is accompanied by a companion website:

www.wiley.com/go/baxevanis/Bioinformatics_4e

The website includes:

• Test Samples

• Word Samples

Scan this QR code to visit the companion website.

Biological Sequence Databases

Introduction

Over the past several decades, there has been a feverish push to understand, at the most elementary of levels, what constitutes the basic “book of life.” Biologists (and scientists in general) are driven to understand how the millions or billions of bases in an organism’s genome contain all of the information needed for the cell to conduct the myriad metabolic processes necessary for the organism’s survival, information that is propagated from generation to generation. To have a basic understanding of how the collection of individual nucleotide bases drives the engine of life, large amounts of sequence data must be collected and stored in a way that these data can be searched and analyzed easily. To this end, much effort has gone into the design and maintenance of biological sequence databases. These databases have had a significant impact on the advancement of our understanding of biology not just from a computational standpoint but also through their integrated use alongside studies being performed at the bench.

The history of sequence databases began in the early 1960s, when Margaret Dayhoff and colleagues (1965) at the National Biomedical Research Foundation (NBRF) collected all of the protein sequences known at that time (all 65 of them) and published them in a book called the Atlas of Protein Sequence and Structure. It is important to remember that, at this point in the history of biology, the focus was on sequencing proteins through traditional techniques such as the Edman degradation rather than on sequencing DNA, hence the overall small number of available sequences. By the late 1970s, when a significant number of nucleotide sequences became available, those were also included in later editions of the Atlas. As this collection evolved, it included text-based descriptions to accompany the protein sequences, as well as information regarding the evolution of many protein families. This work, in essence, was the first annotated sequence database, even though it was in printed form. Over time, the amount of data contained in the Atlas became unwieldy and the need for it to be available in electronic form became obvious. From the early 1970s to the late 1980s, the contents of the Atlas were distributed electronically by NBRF (and later by the Protein Information Resource, or PIR) on magnetic tape, and the distribution included some basic programs that could be used to search and evaluate distant evolutionary relationships.

The next phase in the history of sequence databases was precipitated by the veritable explosion in the amount of nucleotide sequence data available to researchers by the end of the 1970s. To address the need for more robust public sequence databases, the Los Alamos National Laboratory (LANL) created the Los Alamos DNA Sequence Database in 1979, which became known as GenBank in 1982 (Benson et al. 2018). Meanwhile, the European Molecular Biology Laboratory (EMBL) created the EMBL Nucleotide Sequence Data Library in 1980. Throughout the 1980s, EMBL (then based in Heidelberg, Germany), LANL, and (later) the National Center for Biotechnology Information (NCBI, part of the National Library of Medicine at the National Institutes of Health) jointly contributed DNA sequence data to these databases. This was done by having teams of curators manually transcribing and interpreting what was published in print journals to an electronic format more appropriate for computational analyses. The DNA Databank of Japan (DDBJ; Kodama et al. 2018) joined this DNA data-collecting collaboration a few years later. By the late 1980s, the quantity of DNA sequence data being produced was so overwhelming that print journals began asking scientists to electronically submit their DNA sequences directly to these databases, rather than publishing them in printed journals or papers. In 1988, after a meeting of these three groups (now referred to as the International Nucleotide Sequence Database Collaboration, or INSDC; Karsch-Mizrachi et al. 2018), there was an agreement to use a common data exchange format and to have each database update only the records that were directly submitted to it. Thanks to this agreement, all three centers (EMBL, DDBJ, and NCBI) now collect direct DNA sequence submissions and distribute them so that each center has copies of all of the sequences, with each center acting as a primary distribution center for these sequences. DDBJ/EMBL/GenBank records are updated automatically every 24 hours at all three sites, meaning that all sequences can be found within DDBJ, the European Nucleotide Archive (ENA; Silvester et al. 2018), and GenBank in short order. That said, each database within the INSDC has the freedom to display and annotate the sequence data as it sees fit.

Bioinformatics, Fourth Edition. Edited by Andreas D. Baxevanis, Gary D. Bader, and David S. Wishart. © 2020 John Wiley & Sons, Inc. Published 2020 by John Wiley & Sons, Inc. Companion Website: www.wiley.com/go/baxevanis/Bioinformatics_4e

In parallel with the early work being done on DNA sequence databases, the foundations for the Swiss-Prot protein sequence database were also being laid in the early 1980s by Amos Bairoch, who recounts its history from an engaging perspective in a first-person review (Bairoch 2000). Bairoch converted PIR’s Atlas to a format similar to that used by EMBL for its nucleotide database. In this initial release, called PIR+, additional information about each of the proteins was added, increasing its value as a curated, well-annotated source of information on proteins. In the summer of 1986, Bairoch began distributing PIR+ on the US BIONET (a precursor to the Internet), renaming it Swiss-Prot. At that time, it contained the grand sum of 3900 protein sequences. This was seen as an overwhelming amount of data, in stark contrast to today’s standards. As Swiss-Prot and EMBL followed similar formats, a natural collaboration developed between these two groups, and these collaborative efforts strengthened when both EMBL’s and Swiss-Prot’s operations were moved to EMBL’s European Bioinformatics Institute (EBI; Cook et al. 2018) in Hinxton, UK. One of the first collaborative projects undertaken by the Swiss-Prot and EMBL teams was to create a new and much larger protein sequence database supplement to Swiss-Prot. As maintaining the high quality of Swiss-Prot entries was a time-consuming process involving extensive sequence analysis and detailed curation by expert annotators (Apweiler 2001), and to allow the quick release of protein data not yet annotated to Swiss-Prot’s stringent standards, a new database called TrEMBL (for “translation of EMBL nucleotide sequences”) was created. This supplement to Swiss-Prot initially consisted of computationally annotated sequence entries derived from the translation of all coding sequences (CDSs) found in INSDC databases. In 2002, a new effort involving the Swiss Institute of Bioinformatics, EMBL-EBI, and PIR was launched, called the UniProt consortium (UniProt Consortium 2017). This effort gave rise to the UniProt Knowledgebase (UniProtKB), consisting of Swiss-Prot, TrEMBL, and PIR. A similar effort also gave rise to the NCBI Protein Database, which brings together data from numerous sources and is described more fully in the text that follows.

The completion of human genome sequencing and the sequencing of numerous model genomes, as well as the existence of a gargantuan number of sequences in general, provides a golden opportunity for biological scientists, owing to the inherent value of these data. At the same time, the sheer magnitude of data also presents a conundrum to the inexperienced user, resulting not just from the size of the “sequence information space” but from the fact that the information space continues to get larger by leaps and bounds. Indeed, the sequencing landscape has changed significantly in recent years with the development of new high-throughput technologies that generate more and more sequence data in a way that is best described as “better, cheaper, faster,” with these advances feeding into the “insatiable appetite” that scientists have for more and more sequence data (Green et al. 2017). Given the inherent value of the data contained within these sequence databases, this chapter will focus on providing the reader with a solid understanding of these major public sequence databases, as a first step toward being able to perform robust and accurate bioinformatic analyses.

Nucleotide Sequence Databases

As described above, the major sources of nucleotide sequence data are the databases involved in INSDC (DDBJ, ENA, and GenBank), with new or updated data being shared between these three entities once every 24 hours. This transfer is facilitated by the use of common data formats for the kinds of information described in detail below.

The elementary format underlying the information held in sequence databases is a text file called the flatfile. The correspondence between individual flatfile formats greatly facilitates the daily exchange of data between each of these databases. In most cases, fields can be mapped on a one-to-one basis from one flatfile format to the other. Over time, various file formats have been adopted and have found continued widespread use; others have fallen to the wayside for a variety of reasons. The success of a given format depends on its usefulness in a variety of contexts, as well as its power in effectively containing and representing the types of biological data that need to be archived and communicated to scientists.

In its simplest form, a sequence record can be represented as a string of nucleotides with some basic tag or identifier. The most widely used of these simple formats is FASTA, originally introduced as part of the FASTA software suite developed by Lipman and Pearson (1985) that is described in detail in Chapter 3. This inherently simple format provides an easy way of handling primary data for both humans and computers, taking the following form.

>U54469.1
CGGTTGCTTGGGTTTTATAACATCAGTCAGTGACAGGCATTTCCAGAGTTGCCCTGTTCAACAATCGATA
GCTGCCTTTGGCCACCAAAATCCCAAACTTAATTAAAGAATTAAATAATTCGAATAATAATTAAGCCCAG
TAACCTACGCAGCTTGAGTGCGTAACCGATATCTAGTATACATTTCGATACATCGAAATCATGGTAGTGT
TGGAGACGGAGAAGGTAAGACGATGATAGACGGCGAGCCGCATGGGTTCGATTTGCGCTGAGCCGTGGCA
GGGAACAACAAAAACAGGGTTGTTGCACAAGAGGGGAGGCGATAGTCGAGCGGAAAAGAGTGCAGTTGGC

For brevity, only the first few lines of the sequence are shown. In the simplest incarnation of the FASTA format, the “greater than” character (>) designates the beginning of a new sequence record; this line is referred to as the definition line (commonly called the “defline”). A unique identifier (in this case, the accession.version number, U54469.1) is followed by the nucleotide sequence, in either uppercase or lowercase letters, usually with 60 characters per line. The accession number is the number that is always associated with this sequence (and should be cited in publications), while the version number suffix allows users to easily determine whether they are looking at the most up-to-date record for a particular sequence. The version number suffix is incremented by one each time the sequence is updated.
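The FASTA layout described above is simple enough to read with a few lines of code. The following is a minimal sketch in Python, not a full-featured parser; the function names read_fasta and split_accession_version are illustrative, and the sketch assumes that the identifier is the whole definition line, as in the simple record shown here.

```python
from typing import Iterator, List, Optional, Tuple


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (definition line, sequence) pairs from FASTA-formatted text."""
    defline: Optional[str] = None
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new ">" line closes the previous record, if there was one.
            if defline is not None:
                yield defline, "".join(chunks)
            defline, chunks = line[1:], []
        elif defline is not None:
            # Sequence lines are concatenated regardless of line width.
            chunks.append(line)
    if defline is not None:
        yield defline, "".join(chunks)


def split_accession_version(identifier: str) -> Tuple[str, Optional[int]]:
    """Split 'U54469.1' into ('U54469', 1); version is None if absent."""
    accession, _, version = identifier.partition(".")
    return accession, int(version) if version else None


# First two sequence lines of the record shown in the text.
example = """>U54469.1
CGGTTGCTTGGGTTTTATAACATCAGTCAGTGACAGGCATTTCCAGAGTTGCCCTGTTCAACAATCGATA
GCTGCCTTTGGCCACCAAAATCCCAAACTTAATTAAAGAATTAAATAATTCGAATAATAATTAAGCCCAG
"""

records = list(read_fasta(example))
first_defline, first_sequence = records[0]
print(first_defline, len(first_sequence), split_accession_version(first_defline))
```

Because the sequence lines are simply joined, the sketch does not depend on how many characters appear on each line.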

Additional information can be included on the definition line to make this simple format a bit more informative, as follows.

>ENA|U54469|U54469.1 Drosophila melanogaster eukaryotic initiation factor 4E (eIF4E) gene, complete cds, alternatively spliced.

This modified FASTA definition line now has information on the source database (ENA), its accession.version number (U54469.1), and a short description of what biological entity is represented by the sequence.
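As a hedged illustration, the pieces of this extended definition line can be pulled apart as follows. The sketch assumes the exact layout shown above (three pipe-delimited fields, then a space, then free text); other databases and tools use different defline conventions, so parse_defline is a hypothetical helper for this one layout only.

```python
def parse_defline(defline: str) -> dict:
    """Parse a defline of the form '>DB|ACCESSION|ACCESSION.VERSION description'."""
    identifier, _, description = defline.lstrip(">").partition(" ")
    fields = identifier.split("|")
    if len(fields) != 3:
        raise ValueError(f"expected three '|'-delimited fields, got {fields!r}")
    database, accession, accession_version = fields
    base, _, version = accession_version.partition(".")
    if base != accession or not version.isdigit():
        raise ValueError(f"inconsistent accession.version: {accession_version!r}")
    return {
        "database": database,
        "accession": accession,
        "version": int(version),
        "description": description,
    }


example_defline = (
    ">ENA|U54469|U54469.1 Drosophila melanogaster eukaryotic initiation "
    "factor 4E (eIF4E) gene, complete cds, alternatively spliced."
)
parsed = parse_defline(example_defline)
print(parsed["database"], parsed["accession"], parsed["version"])
```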

Nucleotide Sequence Flatfiles: A Dissection

As flatfiles represent the elementary unit of information within sequence databases and facilitate the interchange of information between these databases, it is important to understand what each individual field within the flatfile represents and what kinds of information can be found in varying parts of the record. While there are minor differences in flatfile formats, they can all be separated into three major parts: the header, containing information and descriptors pertaining to the entire record; the feature table, which provides relevant annotations to the sequence; and the sequence itself.

The Header

The header is the most database-specific part of the record. Here, we will use the ENA version of the record for discussion (shown in its entirety in Appendix 1.1), with the corresponding DDBJ and GenBank versions of the header appearing in Appendix 1.2. The first line of the record provides basic identifying information about the sequence contained in the record, appropriately named the ID line; this corresponds to the LOCUS line in DDBJ/GenBank.

ID   U54469; SV 1; linear; genomic DNA; STD; INV; 2881 BP.

The accession number is shown on the ID line, followed by its sequence version (here, the first version, or SV 1). As this is SV 1, this is equivalent to writing U54469.1, as described above. This is then followed by the topology of the DNA molecule (linear) and the molecule type (genomic DNA). The next element represents the ENA data class for this sequence (STD, denoting a “standard” annotated and assembled sequence). Data classes are used to group sequence records within functional divisions, enabling users to query specific subsets of the database. A description of these functional divisions can be found in Box 1.1. Finally, the ID line presents the taxonomic division for the sequence of interest (INV, for invertebrate; see Internet Resources) and its length (2881 base pairs). The accession number will also be shown separately on the AC line that immediately follows the ID line.
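The seven semicolon-delimited elements of the ID line described above map naturally onto named fields. The sketch below is one way to do this in Python, assuming the ID line has exactly the layout of the example; parse_id_line and the dictionary keys are illustrative names, not part of any ENA tool.

```python
def parse_id_line(line: str) -> dict:
    """Split an ENA ID line into its seven semicolon-delimited elements."""
    if not line.startswith("ID"):
        raise ValueError("not an ID line")
    body = line[2:].strip().rstrip(".")
    parts = [part.strip() for part in body.split(";")]
    if len(parts) != 7:
        raise ValueError(f"expected 7 fields, found {len(parts)}")
    accession, version, topology, mol_type, data_class, division, length = parts
    return {
        "accession": accession,
        "sequence_version": int(version.split()[1]),  # "SV 1" -> 1
        "topology": topology,
        "molecule_type": mol_type,
        "data_class": data_class,
        "taxonomic_division": division,
        "length_bp": int(length.split()[0]),          # "2881 BP" -> 2881
    }


id_line = "ID   U54469; SV 1; linear; genomic DNA; STD; INV; 2881 BP."
id_fields = parse_id_line(id_line)

# Accession plus sequence version reproduces the accession.version form.
accession_version = f"{id_fields['accession']}.{id_fields['sequence_version']}"
print(accession_version, id_fields["length_bp"])
```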

Box 1.1 Functional Divisions in Nucleotide Databases

The organization of nucleotide sequence records into discrete functional types provides a way for users to query specific subsets of the records within these databases. In addition, knowledge that a particular sequence is from a given technique-oriented database allows users to interpret the data from the proper biological point of view. Several of these divisions are described below, and examples of each of these functional divisions (called “data classes” by ENA) can be found by following the example links listed on the ENA Data Formats page listed in the Internet Resources section of this chapter.

CON  Constructed (or “contigged”) records of chromosomes, genomes, and other long DNA sequences resulting from whole-genome sequencing efforts. The records in this division do not contain sequence data; rather, they contain instructions for the assembly of sequence data found within multiple database records.

EST  Expressed Sequence Tags. These records contain short (300–500 bp) single reads from mRNA (cDNA) that are usually produced in large numbers. ESTs represent a snapshot of what is expressed in a given tissue or at a given developmental stage. They represent tags (some coding, some not) of expression for a given cDNA library.

GSS  Genome Survey Sequences. Similar to the EST division, except that the sequences are genomic in origin. The GSS division contains (but is not limited to) single-pass read genome survey sequences, bacterial artificial chromosome (BAC) or yeast artificial chromosome (YAC) ends, exon-trapped genomic sequences, and Alu polymerase chain reaction (PCR) sequences.

HTG  High-Throughput Genome sequences. Unfinished DNA sequences generated by high-throughput sequencing centers, made available in an expedited fashion to the scientific community for homology and similarity searches. Entries in this division contain keywords indicating their phase within the sequencing process. Once finished, HTG sequences are moved into the appropriate database taxonomic division.

STD  A record containing a standard, annotated, and assembled sequence.

STS  Sequence-Tagged Sites. Short (200–500 bp) operationally unique sequences that identify a combination of primer pairs used in a PCR assay, generating a reagent that maps to a single position within the genome. The STS division is intended to facilitate cross-comparison of STSs with sequences in other divisions for the purpose of correlating map positions of anonymous sequences with known genes.

WGS  Whole-Genome Shotgun sequences. Sequence data from projects using shotgun approaches that generate large numbers of short sequence reads that can then be assembled by computer algorithms into sequence contigs, higher-order scaffolds, and sometimes into near-chromosome- or chromosome-length sequences.

Following the ID line are one or more date lines (denoted by DT), indicating when the entry was first created or last updated. For our sequence of interest, the entry was originally created on May 19, 1996 and was last updated in ENA on June 23, 2017:

DT   19-MAY-1996 (Rel. 47, Created)
DT   23-JUN-2017 (Rel. 133, Last updated, Version 5)

The release number in each line indicates the first quarterly release made after the entry was created or last updated. The version number for the entry appears on the second line and allows the user to determine easily whether they are looking at the most up-to-date record for a particular sequence. Please note that this is different from the accession.version format described above: while some element of the record may have changed, the sequence may have remained the same, so these two different types of version numbers may not always correspond to one another.

The next part of the header contains the definition lines, providing a succinct description of the kinds of biological information contained within the record. The definition line (DE in ENA, DEFINITION in DDBJ/GenBank) takes the following form.

DE   Drosophila melanogaster eukaryotic initiation factor 4E (eIF4E) gene,
DE   complete cds, alternatively spliced.

Much care is taken in the generation of these definition lines and, although many of them can be generated automatically from other parts of the record, they are reviewed to ensure that consistency and richness of information are maintained. Obviously, it is quite impossible to capture all of the biology underlying a sequence in a single line of text, but that wealth of information will follow soon enough in downstream parts of the same record.

Continuing down the flatfile record, one finds the full taxonomic information on the sequence of interest. The OS line (or SOURCE line in DDBJ/GenBank) provides the preferred scientific name of the organism from which the sequence was derived, followed by the common name of the organism in parentheses. The OC lines (or ORGANISM lines in DDBJ/GenBank) contain the complete taxonomic classification of the source organism. The classification is listed top-down, as nodes in a taxonomic tree, with the most general grouping (Eukaryota) given first.

OS   Drosophila melanogaster (fruit fly)
OC   Eukaryota; Metazoa; Ecdysozoa; Arthropoda; Hexapoda; Insecta; Pterygota;
OC   Neoptera; Holometabola; Diptera; Brachycera; Muscomorpha; Ephydroidea;
OC   Drosophilidae; Drosophila; Sophophora.

Each record must have at least one reference or citation, noted within what are called reference blocks. These reference blocks offer scientific credit and set a context explaining why this particular sequence was determined. The reference blocks take the following form.

RN   [1]
RP   1-2881
RX   DOI; 10.1074/jbc.271.27.16393.
RX   PUBMED; 8663200.
RA   Lavoie C.A., Lachance P.E., Sonenberg N., Lasko P.;
RT   "Alternatively spliced transcripts from the Drosophila eIF4E gene produce
RT   two different Cap-binding proteins";
RL   J Biol Chem 271(27):16393-16398(1996).
XX
RN   [2]
RP   1-2881
RA   Lasko P.F.;
RT   ;
RL   Submitted (09-APR-1996) to the INSDC.
RL   Paul F. Lasko, Biology, McGill University, 1205 Avenue Docteur Penfield,
RL   Montreal, QC H3A 1B1, Canada

In this case, two references are shown, one referring to a published paper and the other referring to the submission of the sequence record itself. In the example above, the second block provides information on the senior author of the paper listed in the first block, as well as the author’s postal address. While the date shown in the second block indicates when the sequence (and accompanying information) was submitted to the database, it does not indicate when the record was first made public, so no inferences or claims based on first public release can be made based on this date. Additional submitter blocks may be added to the record each time the sequence is updated.

Some headers may contain COMMENT (DDBJ/GenBank) or CC (ENA) lines. These lines can include a great variety of notes and comments (descriptors) that refer to the entire record. Often, genome centers will use these lines to provide contact information and to confer acknowledgments. Comments also may include the history of the sequence. If the sequence of a particular record is updated, the comment will contain a pointer to the previous versions of the record. Alternatively, if an earlier version of the record is retrieved, the comment will point forward to the newer version, as well as backwards, if there was a still earlier version. Finally, there are database cross-reference lines (marked DR) that provide links to allied databases containing information related to the sequence of interest. Here, a cross-reference to FlyBase can be seen in the complete header for this record in Appendix 1.1. Note that the corresponding DDBJ/GenBank header in Appendix 1.2 does not contain these cross-references.

The Feature Table

Early on in the collaboration between INSDC partner organizations, an effort was made to come up with a common way to represent the biological information found within a given database record. This common representation is called the feature table, consisting of feature keys (a single word or abbreviation indicating the described biological property), location information denoting where the feature is located within the sequence, and additional qualifiers providing additional descriptive information about the feature. The online INSDC feature table documentation is extensive and describes in great detail what features are allowed and what qualifiers can be used with each individual feature. Wording within the feature table uses common biological research terminology wherever possible and is consistent between DDBJ, ENA, and GenBank entries.

Here, we will dissect the feature table for the eukaryotic initiation factor 4E (eIF4E) gene from Drosophila melanogaster, shown in its entirety in both Appendices 1.3 (in ENA format) and 1.4 (in DDBJ/GenBank format). This particular sequence is alternatively spliced, producing two distinct gene products, 4E-I and 4E-II. The first block of information in the feature table is always the source feature, indicating the biological source of the sequence and additional information relating to the entire sequence. This feature must be present in all INSDC entries, as all DNA or RNA sequences derive from some specific biological source, including synthetic DNA.

FT   source          1..2881
FT                   /organism="Drosophila melanogaster"
FT                   /chromosome="3"
FT                   /map="67A8-B2"
FT                   /mol_type="genomic DNA"
FT                   /db_xref="taxon:7227"
FT   gene            80..2881
FT                   /gene="eIF4E"

In the first line of the source key, notice that the numbering scheme shows the range of positions covered by this feature key as two numbers separated by two dots (1..2881). As the source key pertains to the entire sequence, we can infer that the sequence described in this entry is 2881 nucleotides in length. The various ways in which the location of any given feature can be indicated are shown in Table 1.1, accounting for a wide range of biological scenarios. The qualifiers then follow, each preceded by a slash. The full scientific name of the organism is provided, as are specific mapping coordinates, indicating that this sequence is at map location 67A8-B2 on chromosome 3. Also indicated is the type of molecule that was sequenced (genomic DNA). Finally, the last line indicates a database cross-reference (abbreviated as db_xref) to the NCBI taxonomy database, where taxon 7227 corresponds to D. melanogaster. In general, these cross-references are controlled qualifiers that allow entries to be connected to an external database, using an identifier that is unique to that external database. Following the source block above is the gene feature, indicating that the gene itself is a subset of the entire sequence in this entry, starting at position 80 and ending at position 2881.

FT   mRNA            join(80..224,892..1458,1550..1920,1986..2085,2317..2404,
FT                   2466..2881)
FT                   /gene="eIF4E"
FT                   /product="eukaryotic initiation factor 4E-I"
FT   mRNA            join(80..224,1550..1920,1986..2085,2317..2404,2466..2881)
FT                   /gene="eIF4E"
FT                   /product="eukaryotic initiation factor 4E-II"

Table 1.1 Indicating locations within the feature table.

Location                  Description
345                       Single position within the sequence
345..500                  A continuous range of positions bounded by and including the indicated positions
<345..500                 A continuous range of positions, where the exact lower boundary is not known; the feature begins somewhere prior to position 345 but ends at position 500
345..>500                 A continuous range of positions, where the exact upper boundary is not known; the feature begins at position 345 but ends somewhere after position 500
<1..888                   The feature starts before the first sequenced base and continues to position 888
(102.110)                 Indicates that the exact location is unknown, but that it is one of the positions between 102 and 110, inclusive
123^124                   Points to a site between positions 123 and 124
123^177                   Points to a site between two adjacent nucleotides or amino acids anywhere between positions 123 and 177
join(12..78,134..202)     Regions 12–78 and 134–202 are joined to form one contiguous sequence
complement(4918..5126)    The sequence complementary to that found from 4918 to 5126 in the sequence record
J00194:100..202           Positions 100–202, inclusive, in the entry in this database having accession number J00194
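To make the range notation in Table 1.1 concrete, the following Python sketch handles only the continuous-range forms, with or without an unknown lower (<) or upper (>) boundary. It is deliberately incomplete: single positions, the caret (^) site notation, join, complement, and remote accessions are rejected rather than interpreted. The names Range and parse_range are hypothetical.

```python
import re
from typing import NamedTuple


class Range(NamedTuple):
    start: int
    end: int
    start_unknown: bool  # True when the location is written "<start..end"
    end_unknown: bool    # True when the location is written "start..>end"


_RANGE = re.compile(r"(<?)(\d+)\.\.(>?)(\d+)")


def parse_range(location: str) -> Range:
    """Parse 'a..b', '<a..b', or 'a..>b' into a Range."""
    match = _RANGE.fullmatch(location.strip())
    if match is None:
        raise ValueError(f"not a simple range: {location!r}")
    lower_flag, start, upper_flag, end = match.groups()
    return Range(int(start), int(end), lower_flag == "<", upper_flag == ">")


for text in ("345..500", "<345..500", "345..>500", "<1..888"):
    print(text, parse_range(text))
```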

The next feature in this example indicates which regions form the two mRNA transcripts for this gene, the first for eukaryotic initiation factor 4E-I and the second for eukaryotic initiation factor 4E-II. In the first case (shown above), the join line indicates that six distinct DNA segments are transcribed to form the mature RNA transcript while, in the second case, the second region is missing, with only five distinct DNA segments transcribed into the mature RNA transcript; hence the two splice variants that are ultimately encoded by this molecule.
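The difference between the two transcripts can be checked directly from their join locations. The sketch below, which assumes only plain a..b segments inside the join (as in this record), parses each location, counts the segments, and sums their lengths; parse_join and spliced_length are illustrative names.

```python
import re
from typing import List, Tuple

Segment = Tuple[int, int]


def parse_join(location: str) -> List[Segment]:
    """Return (start, end) pairs from 'join(a..b,c..d,...)' or a bare 'a..b'."""
    inner = location.strip()
    if inner.startswith("join(") and inner.endswith(")"):
        inner = inner[len("join("):-1]
    segments = []
    for part in inner.split(","):
        match = re.fullmatch(r"\s*(\d+)\.\.(\d+)\s*", part)
        if match is None:
            raise ValueError(f"unsupported segment: {part!r}")
        segments.append((int(match.group(1)), int(match.group(2))))
    return segments


def spliced_length(segments: List[Segment]) -> int:
    """Total length of the joined segments; both endpoints are inclusive."""
    return sum(end - start + 1 for start, end in segments)


mrna_4e_1 = parse_join(
    "join(80..224,892..1458,1550..1920,1986..2085,2317..2404,2466..2881)"
)
mrna_4e_2 = parse_join("join(80..224,1550..1920,1986..2085,2317..2404,2466..2881)")

# Segments present in 4E-I but absent from 4E-II.
only_in_4e_1 = sorted(set(mrna_4e_1) - set(mrna_4e_2))

print(len(mrna_4e_1), spliced_length(mrna_4e_1))
print(len(mrna_4e_2), spliced_length(mrna_4e_2))
print(only_in_4e_1)
```

Under these coordinates, the single segment that distinguishes the two transcripts is 892..1458, which accounts for the 567-nucleotide difference in their lengths.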

FT   CDS             join(201..224,1550..1920,1986..2085,2317..2404,2466..2629)
FT                   /codon_start=1
FT                   /gene="eIF4E"
FT                   /product="eukaryotic initiation factor 4E-II"
FT                   /note="Method: conceptual translation with partial peptide
FT                   sequencing"
FT                   /db_xref="GOA:P48598"
FT                   /db_xref="InterPro:IPR001040"
FT                   /db_xref="InterPro:IPR019770"
FT                   /db_xref="InterPro:IPR023398"
FT                   /db_xref="PDB:4AXG"
FT                   /db_xref="PDB:4UE8"
FT                   /db_xref="PDB:4UE9"
FT                   /db_xref="PDB:4UEA"
FT                   /db_xref="PDB:4UEB"
FT                   /db_xref="PDB:4UEC"
FT                   /db_xref="PDB:5ABU"
FT                   /db_xref="PDB:5ABV"
FT                   /db_xref="PDB:5T47"
FT                   /db_xref="PDB:5T48"
FT                   /db_xref="UniProtKB/Swiss-Prot:P48598"
FT                   /protein_id="AAC03524.1"
FT                   /translation="MVVLETEKTSAPSTEQGRPEPPTSAAAPAEAKDVKPKEDPQETGE
FT                   PAGNTATTTAPAGDDAVRTEHLYKHPLMNVWTLWYLENDRSKSWEDMQNEITSFDTVED
FT                   FWSLYNHIKPPSEIKLGSDYSLFKKNIRPMWEDAANKQGGRWVITLNKSSKTDLDNLWL
FT                   DVLLCLIGEAFDHSDQICGAVINIRGKSNKISIWTADGNNEEAALEIGHKLRDALRLGR
FT                   NNSLQYQLHKDTMVKQGSNVKSIYTL"

Following the mRNA feature is the CDS feature shown above, describing the region that ultimately encodes the protein product. Focusing just on eukaryotic initiation factor 4E-II, the CDS feature also shows a join line with coordinates that are slightly different from those shown in the mRNA feature, specifically at the beginning and end positions. The difference lies in the fact that the 5′ and 3′ untranslated regions (UTRs) are included in the mRNA feature but not in the CDS feature. The CDS feature corresponds to the sequence of amino acids found in the translated protein product whose sequence is shown in the /translation qualifier above. The /codon_start qualifier indicates that the amino acid translation of the first codon begins at the first position of this joined region, with no offset.
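The relationship between the mRNA and CDS coordinates for 4E-II can be verified with simple arithmetic. The following Python sketch uses the coordinates and the /translation value from the record; the variable names are illustrative, and the statement that the final codon is a stop codon is an inference from the numbers (the record is labeled “complete cds”), not something read from the sequence itself.

```python
# Coordinates copied from the 4E-II mRNA and CDS features of the record.
mrna_segments = [(80, 224), (1550, 1920), (1986, 2085), (2317, 2404), (2466, 2881)]
cds_segments = [(201, 224), (1550, 1920), (1986, 2085), (2317, 2404), (2466, 2629)]

# Value of the /translation qualifier, joined across its continuation lines.
translation = (
    "MVVLETEKTSAPSTEQGRPEPPTSAAAPAEAKDVKPKEDPQETGE"
    "PAGNTATTTAPAGDDAVRTEHLYKHPLMNVWTLWYLENDRSKSWEDMQNEITSFDTVED"
    "FWSLYNHIKPPSEIKLGSDYSLFKKNIRPMWEDAANKQGGRWVITLNKSSKTDLDNLWL"
    "DVLLCLIGEAFDHSDQICGAVINIRGKSNKISIWTADGNNEEAALEIGHKLRDALRLGR"
    "NNSLQYQLHKDTMVKQGSNVKSIYTL"
)


def total_length(segments):
    """Sum of segment lengths; both endpoints are inclusive."""
    return sum(end - start + 1 for start, end in segments)


cds_length = total_length(cds_segments)
codons = cds_length // 3

# The UTRs are the parts of the first and last mRNA segments outside the CDS.
five_prime_utr = cds_segments[0][0] - mrna_segments[0][0]
three_prime_utr = mrna_segments[-1][1] - cds_segments[-1][1]

print(cds_length, codons, len(translation))
print(five_prime_utr, three_prime_utr)
```

The coding region spans one more codon than there are residues in the translation, which is consistent with the last codon being a stop codon, and the two UTR lengths plus the CDS length add up to the full length of the 4E-II transcript.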

The /protein_id qualifier shows the accession number for the corresponding entry in the protein databases (AAC03524.1) and is hyperlinked, enabling the user to go directly to that entry. These unique identifiers use a “3 + 5” format: three letters, followed by five digits. Versions are indicated by the decimal that follows; when the protein sequence in the record changes, the version is incremented by one. The assignment of a gene product or protein name (via the /product qualifier) often is subjective, sometimes being assigned via weak similarities to other (and sometimes poorly annotated) sequences. Given the potential for the transitive propagation of poor annotations (that is, bad data tend to beget more bad data), users are advised to consult curated nucleotide and protein sequence databases for the most up-to-date, accurate information regarding the putative function of a given sequence. Finally, notice the extensive cross-referencing via the /db_xref qualifier to entries in InterPro, the Protein Data Bank (PDB), and UniProtKB/Swiss-Prot, as well as to a Gene Ontology annotation (GOA; Gene Ontology Consortium 2017).

Implicit in the source feature and the organism that is assigned to it is the genetic code used to translate the nucleic acid sequence into a protein sequence when a CDS feature is present in the record. Also, the DNA-centric nature of these feature tables means that all features are mapped through a DNA coordinate system, not that of amino acid reference points, as shown in the examples in Appendices 1.3 and 1.4.

SQ   Sequence 2881 BP; 849 A; 699 C; 585 G; 748 T; 0 other;
     cggttgcttgggttttataacatcagtcagtgacaggcatttccagagttgccctgttca        60
     acaatcgatagctgcctttggccaccaaaatcccaaacttaattaaagaattaaataatt       120
     cgaataataattaagcccagtaacctacgcagcttgagtgcgtaaccgatatctagtata       180
     <truncated for brevity>
     aaacggaaccccctttgttatcaaaaatcggcataatataaaatctatccgctttttgta      2820
     gtcactgtcaataatggattagacggaaaagtatattaataaaaacctacattaaaaccg      2880
     g                                                                 2881
//

Finally, at the end of every nucleotide sequence record, one finds the actual nucleotide sequence, with 60 bases per row. Note that, in the SQ line signaling the beginning of this section of the record, not only is the overall length of the sequence provided, but a count of how many of each individual type of nucleotide base is also provided, making it quite easy to compute the GC content of this sequence.
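As a brief illustration of that calculation, the sketch below extracts the base counts from the SQ line of this record and computes the GC fraction. The helper names base_counts and gc_fraction are hypothetical, and the sketch assumes the SQ line follows the layout shown above; the “other” count, which is zero here, is ignored.

```python
import re
from typing import Dict


def base_counts(sq_line: str) -> Dict[str, int]:
    """Extract the A, C, G, and T counts from an ENA SQ line."""
    pairs = re.findall(r"(\d+)\s+([ACGT]);", sq_line)
    return {base: int(count) for count, base in pairs}


def gc_fraction(counts: Dict[str, int]) -> float:
    """Fraction of counted bases that are G or C."""
    total = sum(counts.values())
    return (counts["G"] + counts["C"]) / total


sq_line = "SQ   Sequence 2881 BP; 849 A; 699 C; 585 G; 748 T; 0 other;"
counts = base_counts(sq_line)
gc_percent = round(100 * gc_fraction(counts), 2)
print(counts, gc_percent)
```

For this record, (699 + 585) / 2881 gives a GC content of roughly 44.6%.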

Graphical Interfaces

Graphical interfaces have been developed to facilitate the interpretation of the data found within text-based flatfiles, with an example of the graphical view of the ENA record for our sequence of interest (U54469.1) shown in Figure 1.1. These graphical views are particularly useful when there is a long list of documented biological features within the feature table, enabling the user to visualize potential interactions or relationships between biological features. An additional example of the use of graphical views to assist in the interpretation of the information found within a database record is provided in the discussion of the NCBI Entrez discovery pathway in Chapter 2, as well as later in this chapter.

RefSeq

As one might expect, especially given the breakneck speed at which DNA sequence data are currently being produced, there is a significant amount of redundancy within the major sequence databases, with a good number of sequences being represented more than once. This is often problematic for the end user, who may find themselves confused as to which sequence to use after performing a search that returns numerous results. To address this issue, NCBI developed RefSeq, the goal of which is to provide a single reference sequence for each molecule of the central dogma: DNA, RNA, and protein. The distinguishing features of RefSeq go beyond its non-redundant nature, with individual entries including the biological attributes of the gene, gene transcript, or protein. RefSeq entries encompass a wide taxonomic range, and entries are updated and curated on an ongoing basis to reflect current knowledge about the individual entries. Additional information on RefSeq can be found in Box 1.2.

Figure 1.1 The landing page for ENA record U54469.1, providing a graphical view of biological features found within the sequence of the Drosophila melanogaster eukaryotic initiation factor 4E (eIF4E) gene. The tracks within the graphical view show the position of the gene, mRNAs, and coding regions (marked CDS) within the 2881 bp sequence reported in this record.

Box 1.2 RefSeq

The first several chapters of this book describe a variety of ways in which sequence data and sequence annotations find their way into public databases. While the combination of data derived from systematic sequencing projects and individual investigators’ laboratories yields a rich and highly valuable set of sequence data, some problems are apparent. The most important issue is that a single biological entity may be represented by many different entries in various databases. It also may not be clear whether a given sequence has been experimentally determined or is simply the result of a computational prediction. To address these issues, NCBI developed the RefSeq project, the major goal of which is to provide a reference sequence for each molecule in the central dogma (DNA, mRNA, and protein). As each biological entity is represented only once, RefSeq is, by definition, non-redundant. Nucleotide and protein sequences in RefSeq are explicitly linked to one
