
Compact Data Structures: A Practical Approach

Compact data structures help represent data in reduced space while allowing querying, navigating, and operating it in compressed form. They are essential tools for efficiently handling massive amounts of data by exploiting the memory hierarchy. They also reduce the resources needed in distributed deployments and make better use of the limited memory in low-end devices.

The field has developed rapidly, reaching a level of maturity that allows practitioners and researchers in application areas to benefit from the use of compact data structures. This first comprehensive book on the topic focuses on the structures that are most relevant for practical use. Readers will learn how the structures work, how to choose the right ones for their application scenario, and how to implement them. Researchers and students in the area will find in the book a definitive guide to the state of the art in compact data structures.

Gonzalo Navarro is Professor of Computer Science at the University of Chile. He has worked for 20 years on the relation between compression and data structures. He has directed or participated in numerous large projects on web research, information retrieval, compressed data structures, and bioinformatics. He is the Editor in Chief of the ACM Journal of Experimental Algorithmics and also a member of the editorial board of the journals Information Retrieval and Information Systems. His publications include the book Flexible Pattern Matching in Strings (with M. Raffinot), 20 book chapters, more than 100 journal papers and 200 conference papers; he has also chaired eight international conferences.

Compact Data Structures: A Practical Approach

Gonzalo Navarro

Department of Computer Science, University of Chile

One Liberty Plaza, 20th Floor, New York, NY 10006, USA

Cambridge University Press is part of the University of Cambridge.

It furthers the University’s mission by disseminating knowledge in the pursuit of education, learning, and research at the highest international levels of excellence.

www.cambridge.org

Information on this title: www.cambridge.org/9781107152380

© Gonzalo Navarro 2016

This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.

First published 2016

Printed in the United States of America by Sheridan Books, Inc.

A catalogue record for this publication is available from the British Library.

Library of Congress Cataloging-in-Publication Data

Names: Navarro, Gonzalo, 1969– author.

Title: Compact data structures : a practical approach / Gonzalo Navarro, Universidad de Chile.

Description: New York, NY : University of Cambridge, [2016] | Includes bibliographical references and index.

Identifiers: LCCN 2016023641 | ISBN 9781107152380 (hardback : alk. paper)

Subjects: LCSH: Data structures (Computer science) | Computer algorithms.

Classification: LCC QA76.9.D35 N38 2016 | DDC 005.7/3–dc23

LC record available at https://lccn.loc.gov/2016023641

ISBN 978-1-107-15238-0 Hardback

Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party Internet Web sites referred to in this publication and does not guarantee that any content on such Web sites is, or will remain, accurate or appropriate.

To Aylén, Facundo, and Martina, who still believe in me.
To Betina, who still puts up with me.

To my father, to my sister, and to the memory of my mother.

Contents

List of Algorithms
Foreword
Acknowledgments

1 Introduction
  1.1 Why Compact Data Structures?
  1.2 Why This Book?
  1.3 Organization
  1.4 Software Resources
  1.5 Mathematics and Notation
  1.6 Bibliographic Notes

2 Entropy and Coding
  2.1 Worst-Case Entropy
  2.2 Shannon Entropy
  2.3 Empirical Entropy
    2.3.1 Bit Sequences
    2.3.2 Sequences of Symbols
  2.4 High-Order Entropy
  2.5 Coding
  2.6 Huffman Codes
    2.6.1 Construction
    2.6.2 Encoding and Decoding
    2.6.3 Canonical Huffman Codes
    2.6.4 Better than Huffman
  2.7 Variable-Length Codes for Integers
  2.8 Jensen’s Inequality
  2.9 Application: Positional Inverted Indexes
  2.10 Summary
  2.11 Bibliographic Notes

3 Arrays
  3.1 Elements of Fixed Size
  3.2 Elements of Variable Size
    3.2.1 Sampled Pointers
    3.2.2 Dense Pointers
  3.3 Partial Sums
  3.4 Applications
    3.4.1 Constant-Time Array Initialization
    3.4.2 Direct Access Codes
    3.4.3 Elias-Fano Codes
    3.4.4 Differential Encodings and Inverted Indexes
    3.4.5 Compressed Text Collections
  3.5 Summary
  3.6 Bibliographic Notes

4 Bitvectors
  4.1 Access
    4.1.1 Zero-Order Compression
    4.1.2 High-Order Compression
  4.2 Rank
    4.2.1 Sparse Sampling
    4.2.2 Constant Time
    4.2.3 Rank on Compressed Bitvectors
  4.3 Select
    4.3.1 A Simple Heuristic
    4.3.2 An O(log log n) Time Solution
    4.3.3 Constant Time
  4.4 Very Sparse Bitvectors
    4.4.1 Constant-Time Select
    4.4.2 Solving Rank
    4.4.3 Bitvectors with Runs
  4.5 Applications
    4.5.1 Partial Sums Revisited
    4.5.2 Predecessors and Successors
    4.5.3 Dictionaries, Sets, and Hashing
  4.6 Summary
  4.7 Bibliographic Notes

5 Permutations
  5.1 Inverse Permutations
  5.2 Powers of Permutations
  5.3 Compressible Permutations
  5.4 Applications
    5.4.1 Two-Dimensional Points
    5.4.2 Inverted Indexes Revisited
  5.5 Summary
  5.6 Bibliographic Notes

6 Sequences
  6.1 Using Permutations
    6.1.1 Chunk-Level Granularity
    6.1.2 Operations within a Chunk
    6.1.3 Construction
    6.1.4 Space and Time
  6.2 Wavelet Trees
    6.2.1 Structure
    6.2.2 Solving Rank and Select
    6.2.3 Construction
    6.2.4 Compressed Wavelet Trees
    6.2.5 Wavelet Matrices
  6.3 Alphabet Partitioning
  6.4 Applications
    6.4.1 Compressible Permutations Again
    6.4.2 Compressed Text Collections Revisited
    6.4.3 Non-positional Inverted Indexes
    6.4.4 Range Quantile Queries
    6.4.5 Revisiting Arrays of Variable-Length Cells
  6.5 Summary
  6.6 Bibliographic Notes

7 Parentheses
  7.1 A Simple Implementation
    7.1.1 Range Min-Max Trees
    7.1.2 Forward and Backward Searching
    7.1.3 Range Minima and Maxima
    7.1.4 Rank and Select Operations
  7.2 Improving the Complexity
    7.2.1 Queries inside Buckets
    7.2.2 Forward and Backward Searching
    7.2.3 Range Minima and Maxima
    7.2.4 Rank and Select Operations
  7.3 Multi-Parenthesis Sequences
    7.3.1 Nearest Marked Ancestors
  7.4 Applications
    7.4.1 Succinct Range Minimum Queries
    7.4.2 XML Documents
  7.5 Summary
  7.6 Bibliographic Notes

8 Trees
  8.1 LOUDS: A Simple Representation
    8.1.1 Binary and Cardinal Trees
  8.2 Balanced Parentheses
    8.2.1 Binary Trees Revisited
  8.3 DFUDS Representation
    8.3.1 Cardinal Trees Revisited
  8.4 Labeled Trees
  8.5 Applications
    8.5.1 Routing in Minimum Spanning Trees
    8.5.2 Grammar Compression
    8.5.3 Tries
    8.5.4 LZ78 Compression
    8.5.5 XML and XPath
    8.5.6 Treaps
    8.5.7 Integer Functions
  8.6 Summary
  8.7 Bibliographic Notes

9 Graphs
  9.1 General Graphs
    9.1.1 Using Bitvectors
    9.1.2 Using Sequences
    9.1.3 Undirected Graphs
    9.1.4 Labeled Graphs
    9.1.5 Construction
  9.2 Clustered Graphs
    9.2.1 K²-Tree Structure
    9.2.2 Queries
    9.2.3 Reducing Space
    9.2.4 Construction
  9.3 K-Page Graphs
    9.3.1 One-Page Graphs
    9.3.2 K-Page Graphs
    9.3.3 Construction
  9.4 Planar Graphs
    9.4.1 Orderly Spanning Trees
    9.4.2 Triangulations
    9.4.3 Construction
  9.5 Applications
    9.5.1 Binary Relations
    9.5.2 RDF Datasets
    9.5.3 Planar Routing
    9.5.4 Planar Drawings
  9.6 Summary
  9.7 Bibliographic Notes

10 Grids
  10.1 Wavelet Trees
    10.1.1 Counting
    10.1.2 Reporting
    10.1.3 Sorted Reporting
  10.2 K²-Trees
    10.2.1 Reporting
  10.3 Weighted Points
    10.3.1 Wavelet Trees
    10.3.2 K²-Trees
  10.4 Higher Dimensions
  10.5 Applications
    10.5.1 Dominating Points
    10.5.2 Geographic Information Systems
    10.5.3 Object Visibility
    10.5.4 Position-Restricted Searches on Suffix Arrays
    10.5.5 Searching for Fuzzy Patterns
    10.5.6 Indexed Searching in Grammar-Compressed Text
  10.6 Summary
  10.7 Bibliographic Notes

11 Texts
  11.1 Compressed Suffix Arrays
    11.1.1 Replacing A with Ψ
    11.1.2 Compressing Ψ
    11.1.3 Backward Search
    11.1.4 Locating and Displaying
  11.2 The FM-Index
  11.3 High-Order Compression
    11.3.1 The Burrows-Wheeler Transform
    11.3.2 High-Order Entropy
    11.3.3 Partitioning L into Uniform Chunks
    11.3.4 High-Order Compression of Ψ
  11.4 Construction
    11.4.1 Suffix Array Construction
    11.4.2 Building the BWT
    11.4.3 Building Ψ
  11.5 Suffix Trees
    11.5.1 Longest Common Prefixes
    11.5.2 Suffix Tree Operations
    11.5.3 A Compact Representation
    11.5.4 Construction
  11.6 Applications
    11.6.1 Finding Maximal Substrings of a Pattern
    11.6.2 Labeled Trees Revisited
    11.6.3 Document Retrieval
    11.6.4 XML Retrieval Revisited
  11.7 Summary
  11.8 Bibliographic Notes

12 Dynamic Structures
  12.1 Bitvectors
    12.1.1 Solving Queries
    12.1.2 Handling Updates
    12.1.3 Compressed Bitvectors
  12.2 Arrays and Partial Sums
  12.3 Sequences
  12.4 Trees
    12.4.1 LOUDS Representation
    12.4.2 BP Representation
    12.4.3 DFUDS Representation
    12.4.4 Dynamic Range Min-Max Trees
    12.4.5 Labeled Trees
  12.5 Graphs and Grids
    12.5.1 Dynamic Wavelet Matrices
    12.5.2 Dynamic k²-Trees
  12.6 Texts
    12.6.1 Insertions
    12.6.2 Document Identifiers
    12.6.3 Samplings
    12.6.4 Deletions
  12.7 Memory Allocation
  12.8 Summary
  12.9 Bibliographic Notes

13 Recent Trends
  13.1 Encoding Data Structures
    13.1.1 Effective Entropy
    13.1.2 The Entropy of RMQs
    13.1.3 Expected Effective Entropy
    13.1.4 Other Encoding Problems
  13.2 Repetitive Text Collections
    13.2.1 Lempel-Ziv Compression
    13.2.2 Lempel-Ziv Indexing
    13.2.3 Faster and Larger Indexes
    13.2.4 Compressed Suffix Arrays and Trees
  13.3 Secondary Memory
    13.3.1 Bitvectors
    13.3.2 Sequences
    13.3.3 Trees
    13.3.4 Grids and Graphs
    13.3.5 Texts

Index

List of Algorithms

2.1 Building a prefix code given the desired lengths
2.2 Building a Huffman tree
2.3 Building a Canonical Huffman code representation
2.4 Reading a symbol with a Canonical Huffman code
2.5 Various integer encodings
3.1 Reading and writing on bit arrays
3.2 Reading and writing on fixed-length cell arrays
3.3 Manipulating initializable arrays
3.4 Reading from a direct access code representation
3.5 Creating direct access codes from an array
3.6 Finding optimal piece lengths for direct access codes
3.7 Intersection of inverted lists
4.1 Encoding and decoding bit blocks as pairs (c, o)
4.2 Answering access on compressed bitvectors
4.3 Answering rank with sparse sampling
4.4 Answering rank with dense sampling
4.5 Answering rank on compressed bitvectors
4.6 Answering select with sparse sampling
4.7 Building the select structures
4.8 Answering select and rank on very sparse bitvectors
4.9 Building the structures for very sparse bitvectors
4.10 Building a perfect hash function
5.1 Answering π⁻¹ with shortcuts
5.2 Building the shortcut structure
5.3 Answering πᵏ with the cycle decomposition
5.4 Answering π and π⁻¹ on compressible permutations
5.5 Building the compressed permutation representation, part 1
5.6 Building the compressed permutation representation, part 2
6.1 Answering queries with the permutation-based structure
6.2 Building the permutation-based representation of a sequence
6.3 Answering access and rank with wavelet trees
6.4 Answering select with wavelet trees
6.5 Building a wavelet tree
6.6 Answering access and rank with wavelet matrices
6.7 Answering select with wavelet matrices
6.8 Building a wavelet matrix
6.9 Building a suitable Huffman code for wavelet matrices
6.10 Building a wavelet matrix from Huffman codes
6.11 Answering queries with alphabet partitioning
6.12 Building the alphabet partitioning representation
6.13 Answering π and π⁻¹ using sequences
6.14 Inverted list intersection using a sequence representation
6.15 Non-positional inverted list intersection
6.16 Solving range quantile queries on wavelet trees
7.1 Converting between leaf numbers and positions of rmM-trees
7.2 Building the C table for the rmM-trees
7.3 Building the rmM-tree
7.4 Scanning a block for fwdsearch(i, d)
7.5 Computing fwdsearch(i, d)
7.6 Computing bwdsearch(i, d)
7.7 Scanning a block for min(i, j)
7.8 Computing the minimum excess in B[i, j]
7.9 Computing mincount(i, j)
7.10 Computing minselect(i, j, t)
7.11 Computing rank10(i) on B
7.12 Computing select10(j) on B
7.13 Finding the smallest segment of a type containing a position
7.14 Solving rmq_A with 2n parentheses
7.15 Building the structure for succinct RMQs
8.1 Computing the ordinal tree operations using LOUDS
8.2 Computing lca(u, v) on the LOUDS representation
8.3 Building the LOUDS representation
8.4 Computing the cardinal tree operations using LOUDS
8.5 Computing basic binary tree operations using LOUDS
8.6 Building the BP representation of an ordinal tree
8.7 Computing the simple BP operations on ordinal trees
8.8 Computing the complex BP operations on ordinal trees
8.9 Building the BP representation of a binary tree
8.10 Computing basic binary tree operations using BP
8.11 Computing advanced binary tree operations using BP
8.12 Building the DFUDS representation
8.13 Computing the simple DFUDS operations on ordinal trees
8.14 Computing the complex DFUDS operations on ordinal trees
8.15 Computing the additional cardinal tree operations on DFUDS
8.16 Computing the labeled tree operations on LOUDS or DFUDS
8.17 Enumerating the path from u to v with LOUDS
8.18 Extraction and pattern search in tries
8.19 Extraction of a text substring from its LZ78 representation
8.20 Reporting the largest values in a range using a treap
8.21 Computing fᵏ(i) with the compact representation
8.22 Computing f⁻ᵏ(i) with the compact representation
9.1 Operations on general directed graphs
9.2 Operations on general undirected graphs
9.3 Operations on labeled directed graphs
9.4 Label-specific operations on directed graphs
9.5 Operation adj on a k²-tree
9.6 Operations neigh and rneigh on a k²-tree
9.7 Building the k²-tree
9.8 Operations on one-page graphs
9.9 Operations degree and neigh on k-page graphs
9.10 Operation adj on k-page graphs
9.11 Operations on planar graphs
9.12 Finding which neighbor of u is v on planar graphs
9.13 Additional operations on the planar graph representation
9.14 Operations neigh and degree on triangular graphs
9.15 Operation adj on triangular graphs
9.16 Object-object join on RDF graphs using k²-trees
9.17 Subject-object join on RDF graphs using k²-trees
9.18 Routing on a planar graph through locally maximum benefit
9.19 Routing on a planar graph through face traversals
9.20 Two-visibility drawing of a planar graph
10.1 Answering count with a wavelet matrix
10.2 Procedures for report on a wavelet matrix
10.3 Finding the leftmost point in a range with a wavelet matrix
10.4 Finding the highest points in a range with a wavelet matrix
10.5 Procedure for report on a k²-tree
10.6 Answering top with a wavelet matrix
10.7 Prioritized traversal for top on a k²-tree
10.8 Recursive traversal for top on a k²-tree
10.9 Procedure for closest on a k²-tree
10.10 Searching for P in a grammar-compressed text T
11.1 Comparing P with T[A[i], n] using Ψ
11.2 Backward search on a compressed suffix array
11.3 Obtaining A[i] on a compressed suffix array
11.4 Displaying T[j, j + 1] on a compressed suffix array
11.5 Backward search on an FM-index
11.6 Obtaining A[i] on an FM-index
11.7 Displaying T[j, j + 1] on an FM-index
11.8 Building the BWT of a text T in compact space
11.9 Generating the partition of A for BWT construction
11.10 Computing the suffix tree operations
11.11 Building the suffix tree components
11.12 Finding the maximal intervals of P that occur often in T
11.13 Emulating operations on virtual suffix tree nodes
11.14 Subpath search on BWT-like encoded labeled trees
11.15 Navigation on BWT-like encoded labeled trees
11.16 Document listing
12.1 Answering access and rank queries on a dynamic bitvector
12.2 Answering select queries on a dynamic bitvector
12.3 Processing insert on a dynamic bitvector
12.4 Processing delete on a dynamic bitvector, part 1
12.5 Processing delete on a dynamic bitvector, part 2
12.6 Processing bitset and bitclear on a dynamic bitvector
12.7 Answering access queries on a sparse dynamic bitvector
12.8 Inserting and deleting symbols on a dynamic wavelet tree
12.9 Inserting and deleting symbols on a dynamic wavelet matrix
12.10 Inserting and deleting leaves in a LOUDS representation
12.11 Inserting and deleting leaves in a LOUDS cardinal tree
12.12 Inserting and deleting nodes in a BP representation
12.13 Inserting and deleting nodes in a DFUDS representation
12.14 Inserting parentheses on a dynamic rmM-tree
12.15 Computing fwdsearch(i, d) on a dynamic rmM-tree
12.16 Computing the minimum excess in a dynamic rmM-tree
12.17 Inserting and deleting grid points using a wavelet matrix
12.18 Inserting and deleting grid points using a k²-tree
12.19 Inserting a document on a dynamic FM-index
12.20 Locating and displaying on a dynamic FM-index
12.21 Deleting a document on a dynamic FM-index
13.1 Reporting τ-majorities from an encoding
13.2 Performing the LZ76 parsing
13.3 Reporting occurrences on the LZ76-index
13.4 Answering count with a wavelet matrix on disk
13.5 Backward search on a reduced FM-index


Foreword

This is a delightful book on data structures that are both time and space efficient. Space as well as time efficiency is crucial in modern information systems. Even if we have extra space somewhere, it is unlikely to be close to the processors. The space used by most such systems is overwhelmingly for structural indexing, such as B-trees, hash tables, and various cross-references, rather than for “raw data.” Indeed data, such as text, take far too much space in raw form and must be compressed. A system that keeps both data and indices in a compact form has a major advantage.

Hence the title of the book. Gonzalo Navarro uses the term “compact data structures” to describe a newly emerging research area. It has developed from two distinct but interrelated topics. The older is that of text compression, dating back to the work of Shannon, Fano, and Huffman (among others) in the late 1940s and early 1950s (although text compression as such was not their main concern). Through the last half of the 20th century, as the size of the text to be processed increased and computing platforms became more powerful, algorithmics and information theory became much more sophisticated. The goal of data compression, at least until the year 2000 or so, simply meant compressing information as well as possible and then decompressing each time it was needed. A hallmark of compact data structures is working with text in compressed form saving both decompression time and space. The newer contributing area evolved in the 1990s after the work of Jacobson and is generally referred to as “succinct data structures.” The idea is to represent a combinatorial object, such as a graph, tree, or sparse bitvector, in a number of bits that differs from the information theory lower bound by only a lower order term. So, for example, a binary tree on n nodes takes only 2n + o(n) bits. The trick is to perform the necessary operations, e.g., find child, parent, or subtree size, in constant time.

Compact data structures take into account both “data” and “structures” and are a little more tolerant of “best effort” than one might be with exact details of information theoretic lower bounds. Here the subtitle, “A Practical Approach,” comes into play. The emphasis is on methods that are reasonable to implement and appropriate for today’s (and tomorrow’s) data sizes, rather than on the asymptotics that one sees with the “theoretical approach.”

Reading the book, I was taken with the thorough coverage of the topic and the clarity of presentation. Finding, easily, specific results was, well, easy, as suits the experienced researcher in the field. On the other hand, the careful exposition of key concepts, with elucidating examples, makes it ideal as a graduate text or for the researcher from a tangentially related area. The book covers the historical and mathematical background along with the key developments of the 1990s and early years of the current century, which form its core. Text indexing has been a major driving force for the area, and techniques for it are nicely covered. The final two chapters point to long-term challenges and recent advances. Updates to compact data structures have been a problem for as long as the topic has been studied. The treatment here is not only state of the art but will undoubtedly be a major influence on further improvements to dynamic structures, a key aspect of improving their applicability. The final chapter focuses on encodings, working with repetitive text, and issues of the memory hierarchy. The book will be a key reference and guiding light in the field for years to come.

J. Ian Munro
University of Waterloo

Acknowledgments

I am indebted to Joshimar Córdova and Simon Gog, who took the time to exhaustively read large portions of the book. They made a number of useful comments and killed many dangerous bugs. Several other students and colleagues read parts of the book and also made useful suggestions: Travis Gagie, Patricio Huepe, Roberto Konow, Susana Ladra, Veli Mäkinen, Miguel Ángel Martínez-Prieto, Ian Munro, and Alberto Ordóñez. Others, like Yakov Nekrich, Rajeev Raman, and Kunihiko Sadakane, saved me hours of searching by providing instant answers to my questions. Last but not least, Renato Cerro carefully polished my English grammar. It is most likely that some bugs remain, for which I am the only one to blame.

Ian Munro enthusiastically agreed to write the Foreword of the book. My thanks, again, to a pioneer of this beautiful area.

I would also like to thank my family for bearing with me along this two-year-long effort. It has been much more fun for me than for them.

Finally, I wish to thank the Department of Computer Science at the University of Chile for giving me the opportunity of a life dedicated to academia in a friendly and supportive environment.

Chapter 1: Introduction

1.1 Why Compact Data Structures?

Google’s stated mission, “to organize the world’s information and make it universally accessible and useful,” could not better capture the immense ambition of modern society for gathering all kinds of data and putting them to use to improve our lives. We are collecting not only huge amounts of data from the physical world (astronomical, climatological, geographical, biological), but also human-generated data (voice, pictures, music, video, books, news, Web contents, emails, blogs, tweets) and society-based behavioral data (markets, shopping, traffic, clicks, Web navigation, likes, friendship networks).

Our hunger for more and more information is flooding our lives with data. Technology is improving and our ability to store data is growing fast, but the data we are collecting also grow fast – in many cases faster than our storage capacities. While our ability to store the data in secondary or perhaps tertiary storage does not yet seem to be compromised, performing the desired processing of these data in the main memory of computers is becoming more and more difficult. Since accessing a datum in main memory is about 10⁵ times faster than on disk, operating in main memory is crucial for carrying out many data-processing applications.

In many cases, the problem is not so much the size of the actual data, but that of the data structures that must be built on the data in order to efficiently carry out the desired processing or queries. In some cases the data structures are one or two orders of magnitude larger than the data! For example, the DNA of a human genome, of about 3.3 billion bases, requires slightly less than 800 megabytes if we use only 2 bits per base (A, C, G, T), which fits in the main memory of any desktop PC. However, the suffix tree, a powerful data structure used to efficiently perform sequence analysis on the genome, requires at least 10 bytes per base, that is, more than 30 gigabytes.

The main techniques to cope with the growing size of data over recent years can be classified into three families:

Efficient secondary-memory algorithms. While accessing a random datum from disk is comparatively very slow, subsequent data are read much faster, only 100 times slower than from main memory. Therefore, algorithms that minimize the random accesses to the data can perform reasonably well on disk. Not every problem, however, admits a good disk-based solution.

Streaming algorithms. In these algorithms one goes to the extreme of allowing only one or a small number of sequential passes over the data, storing intermediate values on a comparatively small main memory. When only one pass over the data is allowed, the algorithm can handle situations in which the data cannot even be stored on disk, because they either are too large or flow too fast. In many cases streaming algorithms aim at computing approximate information from the data.

Distributed algorithms. These are parallel algorithms that work on a number of computers connected through a local-area network. Network transfer speeds are around 10 times slower than those of disks. However, some algorithms are amenable to parallelization in a way that the data can be partitioned over the processors and little transfer of data is needed.

Each of these approaches pays a price in terms of performance or accuracy, and none of them is always applicable. There are also cases where memory is limited and a large secondary memory is not at hand: routers, smartphones, smartwatches, sensors, and a large number of low-end embedded devices that are more and more frequently seen everywhere (indeed, they are the stars of the promised Internet of Things).

A topic that is strongly related to the problem of managing large volumes of data is compression, which seeks a way of representing data using less space. Compression builds on Information Theory, which studies the minimum space necessary to represent the data.

Most compression algorithms require decompressing all of the data from the beginning before we can access a random datum. Therefore, compression generally serves as a space-saving archival method: the data can be stored using less space but must be fully decompressed before being used again. Compression is not useful for managing more data in main memory, except if we need only to process the data sequentially.

Compact data structures aim precisely at this challenge. A compact data structure maintains the data, and the desired extra data structures over it, in a form that not only uses less space, but is able to access and query the data in compact form, that is, without decompressing them. Thus, a compact data structure allows us to fit and efficiently query, navigate, and manipulate much larger datasets in main memory than what would be possible if we used the data represented in plain form and classical data structures on top.
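As a small illustration of reading data in compact form, the 2-bit DNA encoding from the genome example earlier in this section can be sketched in Python. This is only a sketch under stated assumptions, not code from the book: the names pack, access, packed_mib, and suffix_tree_gib are hypothetical, and reading "megabytes" and "gigabytes" as 2^20 and 2^30 bytes is an assumption chosen because it agrees with the figures quoted in the text.

```python
# Illustrative sketch only: names and byte layout are assumptions, not the book's code.
BASES = "ACGT"                                  # 2-bit codes: A=0, C=1, G=2, T=3
CODE = {base: code for code, base in enumerate(BASES)}


def pack(dna: str) -> bytearray:
    """Pack a DNA string into 2 bits per base (4 bases per byte)."""
    packed = bytearray((len(dna) + 3) // 4)
    for i, base in enumerate(dna):
        packed[i // 4] |= CODE[base] << (2 * (i % 4))
    return packed


def access(packed: bytearray, i: int) -> str:
    """Read base i straight from the packed bytes, without unpacking the rest."""
    return BASES[(packed[i // 4] >> (2 * (i % 4))) & 0b11]


def packed_mib(num_bases: int) -> float:
    """Size of the 2-bit encoding in megabytes (assumed to mean 2**20 bytes)."""
    return num_bases * 2 / 8 / 2**20


def suffix_tree_gib(num_bases: int, bytes_per_base: int = 10) -> float:
    """Suffix tree size at 10 bytes per base, in gigabytes (assumed 2**30 bytes)."""
    return num_bases * bytes_per_base / 2**30


if __name__ == "__main__":
    genome = "GATTACA"
    packed = pack(genome)
    # Seven bases fit in two bytes, and each base is still individually readable.
    print(len(packed), "".join(access(packed, i) for i in range(len(genome))))
    # Sizes for a genome of about 3.3 billion bases.
    print(packed_mib(3_300_000_000), suffix_tree_gib(3_300_000_000))
```

The packed array supports only plain access; the structures discussed in the book go further by also answering queries such as counting or searching over the compact form.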

Compact data structures lie at the intersection of Data Structures and Information Theory. One looks at data representations that not only need space close to the minimum possible (as in compression) but also require that those representations allow one to efficiently carry out some operations on the data. In terms of information, data structures are fully redundant: they can be reconstructed from the data itself. However, they are built for efficiency reasons: once they are built from the data, data structures speed up operations significantly. When designing compact data structures, one struggles with this tradeoff: supporting the desired operations as efficiently as possible while increasing the space as little as possible. In some lucky cases, a compact data structure reaches almost the minimum possible space to represent the data and provides a rich functionality that encompasses what is provided by a number of independent data structures. General trees and text collections are probably the two most striking success stories of compact data structures (and they have been combined to store the human genome and its suffix tree in less than 4 gigabytes!).

Compact data structures usually require more steps than classical data structures to complete the same operations. However, if these operations are carried out on a faster memory, the net result is a faster (and smaller) representation. This can occur at any level of the memory hierarchy; for example, a compact data structure may be faster because it fits in cache when the classical one does not. The most dramatic improvement, however, is seen when the compact data structure fits in main memory while the classical one needs to be handled on disk (even if it is a solid-state device). In some cases, such as limited-memory devices, compact data structures may be the only approach to operate on larger datasets.

The other techniques we have described can also benefit from the use of compact data structures. For example, distributed algorithms may use fewer computers to carry out the same task, as their aggregated memory is virtually enlarged. This reduces hardware, communication, and energy costs. Secondary-memory algorithms may also benefit from a virtually larger main memory by reducing the amount of disk transfers. Streaming algorithms may store more accurate estimations within the same main memory budget.

1.2 Why This Book?

The starting point of the formal study of compact data structures can be traced back to the 1988 Ph.D. thesis of Jacobson, although earlier works, in retrospect, can also be said to belong to this area. Since then, the study of these structures has flourished, and research articles appear routinely in most conferences and journals on algorithms, compression, and databases. Various software repositories offer mature libraries implementing generic or problem-specific compact data structures. There are also indications of the increasing use of compact data structures inside the products of Google, Facebook, and others.

We believe that compact data structures have reached a level of maturity that deserves a book to introduce them. There are already established compact data structures to represent bitvectors, sequences, permutations, trees, grids, binary relations, graphs, tries, text collections, and others. Surprisingly, there are no other books on this topic as far as we know, and for many relevant structures there are no survey articles.

This book aims to introduce the reader to the fascinating algorithmic world of the compact data structures, with a strong emphasis on practicality. Most of the structures we present have been implemented and found to be reasonably easy to code and efficient in space and time. A few of the structures we present have not yet been implemented, but based on our experience we believe they will be practical as well. We have obtained the material from the large universe of published results and from our own experience, carefully choosing the results that should be most relevant to a practitioner. Each chapter finishes with a list of selected references to guide the reader who wants to go further.

On the other hand, we do not leave aside the theory, which is essential for a solid understanding of why and how the data structures work, and thus for applying and extending them to face new challenges. We gently introduce the reader to the beauty of the algorithmics and the mathematics that are behind the study of compact data structures. Only a basic background is expected from the reader. From algorithmics, knowledge of sorting, binary search, dynamic programming, graph traversals, hashing, lists, stacks, queues, priority queues, trees, and O-notation suffices (we will briefly review this notation later in this chapter). This material corresponds to a first course on algorithms and data structures. From mathematics, understanding of induction, basic combinatorics, probability, summations, and limits, that is, a first-year university course on algebra or discrete mathematics, is sufficient.

We expect this book to be useful for advanced undergraduate students, graduate students, researchers, and professionals interested in algorithmic topics. Hopefully you will enjoy the reading as much as I have enjoyed writing it.

1.3 Organization

The book is divided into 13 chapters. Each chapter builds on previous ones to introduce a new concept and includes a section on applications and a bibliographic discussion at the end. Applications are smaller or more specific problems where the described data structures provide useful solutions. Most can be safely skipped if the reader has no time, but we expect them to be inspiring. The bibliography contains annotated references pointing to the best sources of the material described in the chapter (which are not always the first publications), the most relevant historic landmarks in the development of the results, and open problems. This section is generally denser and can be safely skipped by readers not interested in going deeper, especially into the theoretical aspects.

Pseudocode is included for most of the procedures we describe. The pseudocode is presented in an algorithmic language, not in any specific programming language. For example, well-known variables are taken as global without notice, widely known procedures such as a binary search are not detailed, and tedious but obvious details are omitted (with notice). This lets us focus on the important aspects that we want the pseudocode to clear up; our intention is not that the pseudocode is a cut-and-paste text to get the structures running without understanding them. We refrain from making various programming-level optimizations to the pseudocode to favor clarity; any good programmer should be able to considerably speed up a verbatim implementation of the pseudocodes without altering their logic.

After this introductory chapter, Chapter 2 introduces the concepts of Information Theory and compression needed to follow the book. In particular, we introduce the concepts of worst-case, Shannon, and empirical entropy and their relations. This is the most mathematical part of the book. We also introduce Huffman codes and codes suitable for small integers.

[Figure 1.1: diagram of chapter dependencies; recoverable node labels: Permutations, Arrays.]

The subsequent chapters describe compact data structures for different problems. Each compact data structure stores some kind of data and supports a well-defined set of operations. Chapter 3 considers arrays, which support the operations of reading and writing values at arbitrary positions. Chapter 4 describes bitvectors, arrays of bits that in addition support a couple of bit-counting operations. Chapter 5 covers representations of permutations that support both the application of the permutation and its inverse as well as powers of the permutation. Chapter 6 considers sequences of symbols, which, apart from accessing the sequence, support a couple of symbol-counting operations. Chapter 7 addresses hierarchical structures described with balanced sequences of parentheses and operations to navigate them. Chapter 8 deals with the representation of general trees, which support a large number of query and navigation operations. Chapter 9 considers graph representations, both general ones and for some specific families such as planar graphs, allowing navigation toward neighbors. Chapter 10 considers discrete two-dimensional grids of points, with operations for counting and reporting points in a query rectangle. Chapter 11 shows how text collections can be represented so that pattern search queries are supported.

As said, each chapter builds upon the structures described previously, although most of them can be read independently with only a conceptual understanding of what the operations on previous structures mean. Figure 1.1 shows the most important dependencies for understanding why previous structures reach the claimed space and time performance.

These chapters are dedicated to static data structures, that is, those that are built once and then serve many queries. These are the most developed and generally the most efficient ones. We pay attention to construction time and, especially, construction space, ensuring that structures that take little space can also be built within little extra memory, or that the construction is disk-friendly. Structures that support updates are called dynamic and are considered in Chapter 12.

The book concludes in Chapter 13, which surveys some current research topics on compact data structures: encoding data structures, indexes for repetitive text collections, and data structures for secondary storage. Those areas are not general or mature enough to be included in previous chapters, yet they are very promising and will probably be the focus of much research in the upcoming years. The chapter then also serves as a guide to current research topics in this area.


Figure 1.1. The most important dependencies among Chapters 2–11. (Diagram nodes: Compression, Arrays, Bitvectors, Permutations, Sequences, Parentheses, Trees, Graphs, Grids, Texts.)

Although we have done our best to make the book error-free, and have manually verified the algorithms several times, it is likely that some errors remain. A Web page with comments, updates, and corrections on the book will be maintained at http://www.dcc.uchile.cl/gnavarro/CDSbook.

1.4 Software Resources

Although this book focuses on understanding the compact data structures so that the readers can implement them by themselves, it is worth noting that there are several open-source software repositories with mature implementations, both for general and for problem-specific compact data structures. These are valuable both for practitioners that need a structure implemented efficiently, well tested, and ready to be used, and for students and researchers that wish to build further structures on top of them. In both cases, understanding why and how each structure works is essential to making the right decisions on which structure to use for which problem, how to parameterize it, and what can be expected from it.

Probably the most general, professional, exhaustive, and well tested of all these libraries is Simon Gog’s Succinct Data Structure Library (SDSL), available at https://github.com/simongog/sdsl-lite. It contains C++ implementations of compact data structures for bitvectors, arrays, sequences, text indexes, trees, range minimum queries, and suffix trees, among others. The library includes tools to verify correctness and measure efficiency along with tutorials and examples.

Another generic library is Francisco Claude’s Library of Compact Data Structures (LIBCDS), available at https://github.com/fclaude/libcds. It contains optimized and well-tested C++ implementations of bitvectors, sequences, permutations, and others. A tutorial on how to use the library and how it works is included.

Sebastiano Vigna’s Sux library, available at http://sux.di.unimi.it, contains high-quality C++ and/or Java implementations of various compact data structures, including bitvectors, arrays with cells of varying lengths, and (general and monotone) minimal perfect hashing. Other projects accessible from there include sophisticated tools to manage inverted indexes and Web graphs in compressed form.

Giuseppe Ottaviano’s Succinct library provides efficient C++ implementations of bitvectors, arrays of fixed and variable-length cells, range minimum queries, and others. It is available at https://github.com/ot/succinct.

Finally, Nicola Prezza’s Dynamic library provides C++ implementations of various data structures supporting insertions of new elements: partial sums, bitvectors, sparse arrays, strings, and text indexes. It is available at https://github.com/nicolaprezza/DYNAMIC.

The authors of many of these libraries have explored much deeper practical aspects of the implementation, including cache efficiency, address translation, word alignments, machine instructions for long computer words, instruction pipelining, and other issues beyond the scope of this book.

Many other authors of articles on practical compact data structures for specific problems have left their implementations publicly available or are willing to share them upon request. There are too many to list here, but browsing the personal pages of the authors, or requesting the code, is probably a fast way to obtain a good implementation.

1.5 Mathematics and Notation

This final technical section is a reminder of the mathematics behind the $O$-notation, which we use to describe the time performance of algorithms and the space usage of data structures. We also introduce other notation used throughout the book.

$O$-notation. This notation is used to describe the asymptotic growth of functions (for example, the cost of an algorithm as a function of the size of the input) in a way that considers only sufficiently large values of the argument (hence the name “asymptotic”) and ignores constant factors.

Formally, $O(f(n))$ is the set of all functions $g(n)$ for which there exist constants $c > 0$ and $n_0 > 0$ such that, for all $n > n_0$, it holds $|g(n)| \le c \cdot |f(n)|$. We say that $g(n)$ is $O(f(n))$, meaning that $g(n) \in O(f(n))$. Thus, for example, $3n^2 + 6n - 3$ is $O(n^2)$ and also $O(n^3)$, but it is not $O(n \log n)$. In particular, $O(1)$ is used to denote a function that is always below some constant. For example, the cost of an algorithm that, independently of the input size, performs 3 accesses to tables and terminates is $O(1)$. An algorithm taking $O(1)$ time is said to be constant-time.
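The definition can be spot-checked numerically on the example $3n^2 + 6n - 3$. The following Python sketch is only an illustration: the witnesses $c = 4$ and $n_0 = 5$ and the helper name `bounded_by` are choices made here, not taken from the text, and a finite check is of course not a proof.

```python
import math


def g(n: int) -> int:
    """The example function from the text: 3n^2 + 6n - 3."""
    return 3 * n * n + 6 * n - 3


def bounded_by(g, f, c: float, n0: int, n_max: int) -> bool:
    """Check |g(n)| <= c * |f(n)| for every n0 < n <= n_max (a finite spot check)."""
    return all(abs(g(n)) <= c * abs(f(n)) for n in range(n0 + 1, n_max + 1))


# Illustrative witnesses c = 4, n0 = 5: 3n^2 + 6n - 3 <= 4n^2 holds once n >= 6.
ok_quadratic = bounded_by(g, lambda n: n * n, c=4, n0=5, n_max=10_000)

# The same constant fails at n = 5, which is why the definition allows an n0.
fails_at_5 = abs(g(5)) > 4 * 5 * 5

# Against f(n) = n log n the ratio g(n)/f(n) keeps growing, so no constant c can work.
ratios = [g(n) / (n * math.log2(n)) for n in (2**4, 2**8, 2**12, 2**16)]
ratio_grows = all(a < b for a, b in zip(ratios, ratios[1:]))
```

Any larger $c$ or $n_0$ would serve equally well; the definition only asks that some pair of constants exists.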

It is also common to abuse the notation and write $g(n) = O(f(n))$ to mean $g(n) \in O(f(n))$, and even to write, say, $g(n) < 2n + O(\log n)$, meaning that $g(n)$ is smaller than $2n$ plus a function that is $O(\log n)$. Sometimes we will write, for example, $g(n) = 2n - O(\log n)$, to stress that $g(n) \le 2n$ and the function that separates $g(n)$ from $2n$ is $O(\log n)$.

Several other notations are related to $O$. Mostly for lower bounds, we write $g(n) \in \Omega(f(n))$, meaning that there exist constants $c > 0$ and $n_0 > 0$ such that, for all $n > n_0$, it holds $|g(n)| \ge c \cdot |f(n)|$. Alternatively, we can define $g(n) \in \Omega(f(n))$ iff $f(n) \in O(g(n))$. We say that $g(n)$ is $\Theta(f(n))$ to mean that $g(n)$ is $O(f(n))$ and also $\Omega(f(n))$. This means that both functions grow, asymptotically, at the same speed, except for a constant factor.

To denote functions that are asymptotically negligible compared to $f(n)$, we use $g(n) = o(f(n))$, which means that $\lim_{n \to \infty} g(n)/f(n) = 0$. For example, saying that a data structure uses $2n + o(n)$ bits means that it uses $2n$ plus a number of bits that grows sublinearly with $n$, such as $2n + O(n/\log n)$. The notation $o(1)$ denotes a function that tends to zero as $n$ tends to infinity, for example, $\log\log n/\log n = o(1)$. Finally, the opposite of the $o(\cdot)$ notation is $\omega(\cdot)$, where $g(n) = \omega(f(n))$ iff $f(n) = o(g(n))$. In particular, $\omega(1)$ denotes a function that tends to infinity (no matter how slowly) when $n$ tends to infinity. For example, $\log\log n = \omega(1)$.
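A few sample values make these limits tangible. The sketch below is a numerical illustration only; the concrete lower-order term $n/\log n$ and the sample sizes are assumptions picked for the example.

```python
import math


def extra_bits(n: int) -> float:
    """A hypothetical lower-order space term of n / log2(n) bits."""
    return n / math.log2(n)


sizes = [2**k for k in (10, 20, 30, 40)]

# o(n): the extra term divided by n shrinks toward zero (it equals 1 / log2 n).
fractions = [extra_bits(n) / n for n in sizes]

# o(1): log log n / log n also shrinks toward zero.
loglog_ratios = [math.log2(math.log2(n)) / math.log2(n) for n in sizes]

# omega(1): log log n keeps growing, however slowly.
loglogs = [math.log2(math.log2(n)) for n in sizes]
```

Here `fractions` and `loglog_ratios` decrease along the sample sizes while `loglogs` increases, matching the $o(\cdot)$ and $\omega(\cdot)$ statements.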

When several variables are used, as in $o(n \log \sigma)$, it must be clear to which the $o(\cdot)$ notation refers. For example, $n \log\log \sigma$ is $o(n \log \sigma)$ if the variable is $\sigma$, or if the variable is $n$ but $\sigma$ grows with $n$ (i.e., $\sigma = \omega(1)$ as a function of $n$). Otherwise, if we refer to $n$ but $\sigma$ is a constant, then $n \log\log \sigma$ is not $o(n \log \sigma)$.

These notations are also used on decreasing functions of $n$, to describe error margins. For example, we may approximate the harmonic number $H_n = \sum_{k=1}^{n} \frac{1}{k} = \ln n + \gamma + \frac{1}{2n} - \frac{1}{12n^2} + \frac{1}{120n^4} - \cdots$, where $\gamma \approx 0.577$ is a constant, with any of the following formulas, having a decreasing level of detail:¹

$$\begin{aligned}
H_n &= \ln n + \gamma + \frac{1}{2n} + O\left(\frac{1}{n^2}\right) \\
    &= \ln n + \gamma + O\left(\frac{1}{n}\right) \\
    &= \ln n + O(1) \\
    &= O(\log n),
\end{aligned}$$

depending on the degree of accuracy we want. We can also use $o(\cdot)$ to give less detail about the error level, for example,

$$\begin{aligned}
H_n &= \ln n + \gamma + \frac{1}{2n} + o\left(\frac{1}{n}\right) \\
    &= \ln n + \gamma + o(1) \\
    &= \ln n + o(\log n).
\end{aligned}$$

We can also write the error in relative form, for example,

$$\begin{aligned}
H_n &= \ln n + \gamma + \frac{1}{2n} \cdot \left(1 + O\left(\frac{1}{n}\right)\right) \\
    &= \ln n \left(1 + O\left(\frac{1}{\log n}\right)\right) \\
    &= \ln n \, (1 + o(1)).
\end{aligned}$$

When using the notation to denote errors, the identity $\frac{1}{1+x} = 1 - x + x^2 - \cdots = 1 - O(x)$, for any $0 < x < 1$, allows us to write $\frac{1}{1+o(1)} = 1 + o(1)$, which is useful for moving error terms from the denominator to the numerator.

¹ In the first line, we use the fact that the tail of the series converges to $c/n^2$, for some constant $c$.

Logarithm. This is a very important function in Information Theory, as it is the key to describing the entropy, or amount of information, in an object. When the entropy (or information) is described in bits, the logarithm must be to the base 2. We use $\log$ to denote the logarithm to the base 2. When we use a logarithm to some other base $b$, we write $\log_b$. As shown, the natural logarithm is written as $\ln$. Of course, the base of the logarithm makes no difference inside $O$-formulas (unless it is in the exponent!).

The inequality $\frac{x}{1+x} \le \ln(1+x) \le x$ is useful in many cases, in particular in combination with the $O$-notation. For example,

$$\ln(n(1 + o(1))) = \ln n + \ln(1 + o(1)) \le \ln n + o(1).$$

It also holds

$$\ln(n(1 + o(1))) \ge \ln n + \frac{o(1)}{1 + o(1)} = \ln n + o(1).$$

Therefore, $\ln(n(1 + o(1))) = \ln n + o(1)$. More generally, if $f(n) = o(1)$, and $b$ is any constant, we can write $\log_b(n(1 + f(n))) = \log_b n + O(f(n))$. For example, $\log(n + \log n) = \log n + O(\log n/n)$.
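Both the harmonic-number approximations and the logarithm inequality can be checked numerically. The following Python sketch is an illustration under stated assumptions: the function names, the sample values of $n$, and the decimal value used for $\gamma$ (the Euler-Mascheroni constant) are supplied here and are not part of the text.

```python
import math

EULER_GAMMA = 0.5772156649015329  # the constant gamma of the text, to double precision


def harmonic(n: int) -> float:
    """H_n computed directly as the sum of 1/k for k = 1..n."""
    return math.fsum(1.0 / k for k in range(1, n + 1))


def scaled_errors(n: int) -> dict:
    """Error of three approximations of H_n, each multiplied by its claimed order."""
    h = harmonic(n)
    return {
        # H_n = ln n + gamma + 1/(2n) + O(1/n^2): n^2 * error should stay bounded.
        "n^2 * err": n * n * abs(h - (math.log(n) + EULER_GAMMA + 1 / (2 * n))),
        # H_n = ln n + gamma + O(1/n): n * error should stay bounded.
        "n * err": n * abs(h - (math.log(n) + EULER_GAMMA)),
        # H_n = ln n + O(1): the error itself should stay bounded.
        "err": abs(h - math.log(n)),
    }


table = {n: scaled_errors(n) for n in (10, 100, 1000)}


def log_sandwich_holds(x: float) -> bool:
    """Check x/(1+x) <= ln(1+x) <= x for one value x > 0."""
    return x / (1 + x) <= math.log1p(x) <= x
```

In `table`, each scaled error stays bounded as $n$ grows, which is what the corresponding $O(\cdot)$ term asserts; the first one settles near $1/12$, the coefficient of the next term of the series.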

Model of computation. We consider realistic computers, with a computer word of $w$ bits, where we can carry out in constant time all the basic arithmetic ($+$, $-$, $\cdot$, $/$, mod, ceilings and floors, etc.) and logic operations (bitwise and, or, not, xor, bit shifts, etc.). In modern computers $w$ is almost always 32 or 64, but several architectures allow for larger words to be handled natively, reaching, for example, 128, 256, or 512 bits.

When connecting with theory, this essentially corresponds to the RAM model of computation, where we do not pay attention to restrictions in some branches of the RAM model that are unrealistic on modern computers (for example, some variants disallow multiplication and division). In the RAM model, it is usually assumed that the computer word has $w = \Theta(\log n)$ bits, where $n$ is the size of the data in memory. This logarithmic model of growth of the computer word is appropriate in practice, as $w$ has been growing approximately as the logarithm of the size of main memories. It is also reasonable to expect that we can store any memory address in a constant number of words (and in constant time).

For simplicity and practicality, we will use the assumption $w \ge \log n$, which means that with one computer word we can address any data element. While the assumption $w = O(\log n)$ may also be justified (we may argue that the data should be large enough for the compact storage problem to be of interest), this is not always the case. For example, the dynamic structures (Chapter 12) may grow and shrink over time. Therefore, we will not rely on this assumption. Thus, for example, we will say that the cost of an algorithm that inspects $n$ bits by chunks of $w$ bits, processing each chunk in constant time, is $O(n/w) = O(n/\log n) = o(n)$. Instead, we will not take an $O(w)$-time algorithm to be $O(\log n)$.
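The chunk-wise cost argument can be sketched as follows. This is a minimal Python illustration, assuming a word size of $w = 64$; the names `W`, `pack_bits`, and `count_ones` are hypothetical, and `bin(word).count("1")` merely stands in for the constant-time population count that the model (and most hardware) provides on a machine word.

```python
W = 64  # assumed word size w, in bits


def pack_bits(bits: list) -> list:
    """Pack a list of 0/1 values into W-bit words (the last word is zero-padded)."""
    words = []
    for start in range(0, len(bits), W):
        word = 0
        for offset, bit in enumerate(bits[start:start + W]):
            word |= bit << offset
        words.append(word)
    return words


def count_ones(words: list) -> int:
    """Count the 1s by visiting ceil(n/W) words instead of n individual bits."""
    return sum(bin(word).count("1") for word in words)


bits = [1, 0, 1] * 50          # n = 150 bits
words = pack_bits(bits)        # ceil(150 / 64) = 3 words
total = count_ones(words)
```

The loop in `count_ones` runs once per word, so under the model's constant-time word operations its cost is $O(n/w)$ rather than $O(n)$.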

Strings, sequences, and intervals. In most cases, our arrays start at position 1. With $[a, b]$ we denote the set $\{a, a+1, a+2, \ldots, b\}$, unless we explicitly state it is a real interval. For example, $A[1, n]$ denotes an array of $n$ elements $A[1], A[2], \ldots, A[n]$.

A string is an array of elements drawn from a finite universe, called the alphabet. Alphabets are usually denoted $\Sigma = [1, \sigma]$, where $\sigma$ is some integer, meaning that $\Sigma = \{1, 2, \ldots, \sigma\}$. The alphabet elements are called symbols, characters, or letters. The length of the string $S[1, n]$ is $|S| = n$. The set of all the strings of length $n$ over alphabet $\Sigma$ is denoted $\Sigma^n$, and the set of all the strings of any length over $\Sigma$ is denoted $\Sigma^* = \cup_{n \ge 0} \Sigma^n$. Strings and sequences are basically synonyms in this book; however, substring and subsequence are different concepts. Given a string $S[1, n]$, a substring $S[i, j]$ is, precisely, the array $S[i], S[i+1], \ldots, S[j]$. Particular cases of substrings are prefixes, of the form $S[1, j]$, and suffixes, of the form $S[i, n]$. When $i > j$, $S[i, j]$ denotes the empty string $\varepsilon$, that is, the only string of length zero. A subsequence is more general than a substring: it can be any $S[i_1] \,.\, S[i_2] \ldots S[i_r]$ for $i_1 < i_2 < \ldots < i_r$, where we use the dot to denote concatenation of symbols (we might also simply write one symbol after the other, or mix strings and symbols in a concatenation). Sometimes we will also use $\langle a, b \rangle$ to denote the same as $[a, b]$ or write sequences as $\langle a_1, a_2, \ldots, a_n \rangle$. Finally, given a string $S[1, n]$, $S^{rev}$ denotes the reversed string, $S[n] \,.\, S[n-1] \ldots S[2] \,.\, S[1]$.
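Since the book indexes from position 1 with inclusive ranges, while most programming languages use 0-based half-open slices, a small translation layer helps when experimenting. The sketch below is one possible mapping onto Python strings; all function names are hypothetical.

```python
def substring(s: str, i: int, j: int) -> str:
    """S[i, j] with 1-based inclusive positions; the empty string when i > j."""
    return s[i - 1:j] if i <= j else ""


def prefix(s: str, j: int) -> str:
    """S[1, j]."""
    return substring(s, 1, j)


def suffix(s: str, i: int) -> str:
    """S[i, n], where n = |S|."""
    return substring(s, i, len(s))


def is_subsequence(t: str, s: str) -> bool:
    """True if t equals S[i1] . S[i2] ... S[ir] for some i1 < i2 < ... < ir."""
    remaining = iter(s)
    # Each membership test consumes the iterator up to the match, preserving order.
    return all(symbol in remaining for symbol in t)


def reverse(s: str) -> str:
    """S^rev = S[n] . S[n-1] ... S[1]."""
    return s[::-1]


S = "abracadabra"
```

For instance, `"acda"` is a subsequence of `S` (positions 1, 5, 7, 8) but not a substring, whereas every substring is also a subsequence.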


1.6 Bibliographic Notes

Growth of information and computing power. Google’s mission is stated in http://www.google.com/about/company.

There are many sources that describe the amount of information the world is gathering. For example, a 2011 study from International Data Corporation (IDC) found that we are generating a few zettabytes per year (a zettabyte is $2^{70}$, or roughly $10^{21}$, bytes), and that data are more than doubling per year, outperforming Moore’s law (which governs the growth of hardware capacities).² A related discussion from 2013, arguing that we are much better at storing than at using all these data, can be read in Datamation.³ For a shocking and graphical message, the 2012 poster of Domo is also telling.⁴

There are also many sources about the differences in performance between CPU, caches, main memory, and secondary storage, as well as how these have evolved over the years. In particular, we used the book of Hennessy and Patterson (2012, Chap. 1) for the rough numbers shown here.

Examples of books about the mentioned algorithmic approaches to solve the problem of data growth are, among many others, Vitter (2008) for secondary-memory algorithms, Muthukrishnan (2005) for streaming algorithms, and Roosta (1999) for distributed algorithms.

Suffix trees. The book by Gusfield (1997) provides a good introduction to suffix trees in the context of bioinformatics. Modern books pay more attention to space issues and make use of some of the compact data structures we describe here (Ohlebusch, 2013; Mäkinen et al., 2015). Our size estimates for compressed suffix trees are taken from the Ph.D. thesis of Gog (2011).

Compact data structures. Despite some previous isolated results, the Ph.D. thesis of Jacobson (1988) is generally taken as the starting point of the systematic study of compact data structures. Jacobson coined the term succinct data structure to denote a data structure that uses $\log N + o(\log N)$ bits, where $N$ is the total number of different objects that can be encoded. For example, succinct data structures for arrays of $n$ bits must use $n + o(n)$ bits, since $N = 2^n$. To exclude mere data compressors, succinct data structures are sometimes required to support queries in constant time (Munro, 1996).
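The $\log N$ term is simply the number of bits needed to tell $N$ objects apart. As a hedged illustration, the Python sketch below computes $\lceil \log N \rceil$ exactly with integer arithmetic; the helper name `min_bits` is hypothetical, and the permutation case ($N = n!$) is an extra example added here, not one given in the paragraph above.

```python
import math


def min_bits(num_objects: int) -> int:
    """ceil(log2 N): bits needed to give each of N >= 1 objects a distinct code."""
    return (num_objects - 1).bit_length()


n = 1000

# Arrays of n bits: N = 2^n distinct objects, hence exactly n bits.
bits_for_bit_arrays = min_bits(2 ** n)

# Permutations of n elements: N = n! (an added illustration, not from the text).
bits_for_permutations = min_bits(math.factorial(n))
```

A succinct structure for either family may exceed these amounts only by a lower-order $o(\log N)$ term.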

2 http://www.emc.com/about/news/press/2011/20110628-01.htm

3 http://www.datamation.com/applications/big-data-analytics-overview.html

4 http://www.domo.com/blog/2012/06/how-much-data-is-created-every-minute.

In this book we use the term compact data structure, which refers to the broader class of data structures that aim at using little space and query time. Other related terms are used in the literature (not always consistently) to refer to particular subclasses of data structures (Ferragina and Manzini, 2005; Gál and Miltersen, 2007; Fischer and Heun, 2011; Raman, 2015): compressed or opportunistic data structures are those using $H + o(\log N)$ bits, where $H$ is the entropy of the data under some compression model (such as the bit array representations we describe in Section 4.1.1); data structures using $H + o(H)$ bits are sometimes called fully compressed (for example, the Huffman-shaped wavelet trees of Section 6.2.4 are almost fully compressed). A data structure that adds $o(\log N)$ bits and operates on any raw data representation that offers basic access (such as the rank and select structures for bitvectors in Sections 4.2 and 4.3) is sometimes called a succinct index or a systematic data structure. Many indexes supporting different functionalities may coexist over the same raw data, adding up to $\log N + o(\log N)$ bits. A non-systematic or encoding data structure, instead, needs to encode the data in a particular format, which may be unsuitable for another non-systematic data structure (like the wavelet trees of Section 6.2); in exchange it may use less space than the best systematic data structure (see Section 4.7 for the case of bitvectors). A data structure that uses $o(\log N)$ bits and does not need to access the data at all is also called non-systematic or an encoding, in the sense that it does not access the raw data. Such small encodings are special, however, because they cannot possibly reproduce the original data; they answer only some types of queries on it (an example is given in Section 7.4.1; then we study encodings with more detail in Section 13.1).
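To make the idea of a systematic structure concrete, the sketch below layers a small index over an untouched list of bits and consults those raw bits at query time. It is a deliberately simplified Python illustration, not the structures of Sections 4.2 and 4.3: the class name, the sampling step `BLOCK`, and the binary-search `select1` are assumptions made here, the counters are plain integers rather than a carefully sized $o(n)$-bit directory, and the operations follow the usual definitions (rank counts the 1s up to a position, select finds the position of the $j$-th 1).

```python
BLOCK = 8  # hypothetical sampling step; real structures tune this to reach o(n) extra bits


class RankIndex:
    """Sampled counts over a raw bit list; the raw bits are read but never re-encoded."""

    def __init__(self, bits):
        self.bits = bits    # reference to the raw data, accessed at query time
        self.samples = [0]  # samples[k] = number of 1s among the first k * BLOCK bits
        for start in range(0, len(bits), BLOCK):
            self.samples.append(self.samples[-1] + sum(bits[start:start + BLOCK]))

    def rank1(self, i: int) -> int:
        """Number of 1s in B[1, i] (1-based, inclusive); rank1(0) is 0."""
        block = i // BLOCK
        return self.samples[block] + sum(self.bits[block * BLOCK:i])

    def select1(self, j: int) -> int:
        """Position of the j-th 1 (1-based), assuming 1 <= j <= total number of 1s."""
        lo, hi = 1, len(self.bits)
        while lo < hi:  # binary search for the smallest i with rank1(i) >= j
            mid = (lo + hi) // 2
            if self.rank1(mid) >= j:
                hi = mid
            else:
                lo = mid + 1
        return lo


raw = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
index = RankIndex(raw)
```

Because `raw` is left in its original form, other indexes could be built over the same list, which is the coexistence property described above.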

The second edition of the Encyclopedia of Algorithms (Kao, 2016) contains good short surveys on many of the structures we discuss in the book.

Required knowledge. Good books on algorithms, which serve as a complement to follow this book, are by Cormen et al. (2009), Sedgewick and Wayne (2011), and Aho et al. (1974), among too many others to cite here. The last one (Aho et al., 1974) is also a good reference for the RAM model of computation. Authoritative sources on algorithmics (yet possibly harder to read for the novice) are the monumental works of Knuth (1998) and Mehlhorn (1984). Books on algorithms generally cover analysis and $O$-notation as well. Rawlins (1992) has a nice book that is more focused on analysis. The books by Graham et al. (1994) and by Sedgewick and Flajolet (2013) give a deeper treatment, and the handbook by Abramowitz and Stegun (1964) is an outstanding reference. Cover and Thomas (2006) offer an excellent book on Information Theory and compression fundamentals; we will cover the required concepts in Chapter 2.

Implementations. Some of the compact data structure libraries we have described have associated publications, for example, Gog’s (Gog and Petri, 2014) and Ottaviano’s (Grossi and Ottaviano, 2013). Another recent publication (Agarwal et al., 2015) reports on Succinct, a distributed string store for column-oriented databases that supports updates and sophisticated string searches, achieving high performance through the use of compact data structures. No public code is reported for the latter, however.

A couple of recent articles hint at the interest inside Google for the development of compact tries (Chapter 8) for speech recognition in Android devices (Lei et al., 2013) and for machine translation (Sorensen and Allauzen, 2011). A related implementation, called MARISA tries, is available at https://code.google.com/p/marisa-trie.

Facebook’s Folly library (https://github.com/facebook/folly) now contains an implementation of Elias-Fano codes (Chapter 3).⁵

An example of the use of compressed text indexes (Chapter 11) in bioinformatic applications is the Burrows-Wheeler Aligner (BWA) software (Li and Durbin, 2010), available from http://bio-bwa.sourceforge.net.


5 https://github.com/facebook/folly/blob/master/folly/experimental/EliasFanoCoding.h

Bibliography

Abramowitz, M. and Stegun, I. A. (1964). Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. Dover, 9th edition.

Agarwal, R., Khandelwal, A., and Stoica, I. (2015). Succinct: Enabling queries on compressed data. In Proc. 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI), pages 337–350.

Aho, A. V., Hopcroft, J. E., and Ullman, J. D. (1974). The Design and Analysis of Computer Algorithms. Addison-Wesley.

Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. (2009). Introduction to Algorithms. MIT Press, 3rd edition.

Cover, T. and Thomas, J. (2006). Elements of Information Theory. Wiley, 2nd edition.

Ferragina, P. and Manzini, G. (2005). Indexing compressed texts. Journal of the ACM, 52(4), 552–581.

Fischer, J. and Heun, V. (2011). Space-efficient preprocessing schemes for range minimum queries on static arrays. SIAM Journal on Computing, 40(2), 465–492.

Gál, A. and Miltersen, P. B. (2007). The cell probe complexity of succinct data structures. Theoretical Computer Science, 379(3), 405–417.

Gog, S. (2011). Compressed Suffix Trees: Design, Construction, and Applications. Ph.D. thesis, Ulm University, Germany.

Gog, S. and Petri, M. (2014). Optimized succinct data structures for massive data. Software Practice and Experience, 44(11), 1287–1314.

Graham, R. L., Knuth, D. E., and Patashnik, O. (1994). Concrete Mathematics – A Foundation for Computer Science. Addison-Wesley, 2nd edition.

Grossi, R. and Ottaviano, G. (2013). Design of practical succinct data structures for large data collections. In Proc. 12th International Symposium on Experimental Algorithms (SEA), LNCS 7933, pages 5–17.

Gusfield, D. (1997). Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press.

Hennessy, J. L. and Patterson, D. A. (2012). Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 5th edition.

Jacobson, G. (1988). Succinct Data Structures. Ph.D. thesis, Carnegie Mellon University.

Kao, M.-Y., editor (2016). Encyclopedia of Algorithms. Springer, 2nd edition.

Knuth, D. E. (1998). The Art of Computer Programming, volume 3: Sorting and Searching. Addison-Wesley, 2nd edition.

Lei, X., Senior, A., Gruenstein, A., and Sorensen, J. (2013). Accurate and compact large vocabulary speech recognition on mobile devices. In Proc. 14th Annual Conference of the International Speech Communication Association (INTERSPEECH), pages 662–665.

Li, H. and Durbin, R. (2010). Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics, 26(5), 589–595.

Mäkinen, V., Belazzougui, D., Cunial, F., and Tomescu, A. I. (2015). Genome-Scale Algorithm Design. Cambridge University Press.

Mehlhorn, K. (1984). Data Structures and Algorithms 1: Sorting and Searching. EATCS Monographs on Theoretical Computer Science. Springer-Verlag.

Munro, J. I. (1996). Tables. In Proc. 16th Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS), LNCS 1180, pages 37–42.

Muthukrishnan, S. (2005). Data Streams: Algorithms and Applications. Now Publishers.

Ohlebusch, E. (2013). Bioinformatics Algorithms: Sequence Analysis, Genome Rearrangements, and Phylogenetic Reconstruction. Oldenbusch Verlag.


Raman, R. (2015). Encoding data structures. In Proc. 9th International Workshop on Algorithms and Computation (WALCOM), LNCS 8973, pages 1–7.

Rawlins, G. J. E. (1992). Compared to What? An Introduction to the Analysis of Algorithms. Computer Science Press.

Roosta, S. H. (1999). Parallel Processing and Parallel Algorithms: Theory and Computation. Springer.

Sedgewick, R. and Flajolet, P. (2013). An Introduction to the Analysis of Algorithms. Addison-Wesley-Longman, 2nd edition.

Sedgewick, R. and Wayne, K. (2011). Algorithms. Addison-Wesley, 4th edition.

Sorensen, J. and Allauzen, C. (2011). Unary data structures for language models. In Proc. 12th Annual Conference of the International Speech Communication Association (INTERSPEECH), pages 1425–1428.

Vitter, J. S. (2008). Algorithms and Data Structures for External Memory. Now Publishers.

