
An Introduction to String Algorithms
Description
Alles über E-Books | Antworten auf Fragen rund um E-Books, Kopierschutz und Dateiformate finden Sie in unserem Info- & Hilfebereich.
An essential introduction to the building blocks of modern text processing
String algorithms make it possible to process, store, and manipulate text with computational efficiency, with applications ranging from search engines and social networks that regularly process terabytes of information to areas like genomics, where the genome of an organism can be encoded as a long string of letters. This book provides an incisive introduction to the concepts and applications that every practitioner in the field needs to know. Ideal for the classroom and self-study, it guides readers from the fundamentals of string processing to advanced computational methods, presenting useful data structures and proof techniques for strings and other data and serving as an on-ramp to doing cutting-edge research in string algorithms.
- Discusses topics ranging from exact string matching and efficient edit distance computation to modern string data structures, sketching methods, and generative models of strings
- Covers data structures such as suffix trees, suffix arrays, wavelet trees, the Burrows-Wheeler transform, the FM index, and compressed bit vectors
- Presents an array of algorithms along with their proofs of correctness and running time
- Develops the skills needed to design and implement new string algorithms as well as various algorithmic techniques that are applicable beyond string algorithms
- Invaluable for anyone interested in processing large collections of string data, including genomic sequences and text for training large language models
- Includes hundreds of exercises and explanatory figures
- An indispensable resource for graduate students, advanced undergraduates, researchers, and practitioners
More details
Other editions
Additional editions

Person
Content
- Cover
- Contents
- Preface
- 1. Introduction
- 1.1 What are strings?
- 1.2 Why study string algorithms?
- 1.3 What are our goals?
- 1.4 A roadmap
- 1.5 A Primer
- 1.5.1 String notations and definitions
- 1.5.2 Graph theory
- 1.5.3 Running times
- 1.5.4 Sets and combinatorics
- 1.5.5 Numbers
- 1.5.6 Data structures
- 1.5.7 Probability
- I. Exact Matching
- 2. The Z Algorithm
- 2.1 The exact matching problem
- 2.2 Simple (slow) solution
- 2.3 The Z algorithm
- 2.4 Computing the Z values
- 2.5 Summary and notes
- 2.6 Exercises
- 3. Boyer-Moore
- 3.1 High-level description
- 3.1.1 First rule: Next Matching Character
- 3.1.2 Second rule: Good Shift Rule
- 3.1.3 Complete algorithm
- 3.2 Computing the Ri values for the next matching character rule
- 3.3 Formalizing the Good Shift Rule
- 3.4 Implementing the Good Shift Rule
- 3.4.1 Computing the L(i) values
- 3.4.2 Computing the l(i) values
- 3.5 Summary and notes
- 3.6 Exercises
- 4. Knuth-Morris-Pratt
- 4.1 KMP via deterministic finite automata
- 4.1.1 The KMP DFA
- 4.1.2 Using the memo array for string search
- 4.1.3 Correctness and running time
- 4.1.4 Computing memo
- 4.2 KMP via the Z-values
- 4.2.1 Running time of the spmi version of KMP
- 4.2.2 Computing the spmi array
- 4.3 Summary and notes
- 4.4 Exercises
- 5. Seminumerical String Matching
- 5.1 Rabin-Karp fingerprinting
- 5.1.1 Computing p
- 5.1.2 Computing ts for every position s
- 5.1.3 Time for addition, product, and comparison
- 5.1.4 Quantifying and reducing false positives
- 5.1.5 Reducing the error probability
- 5.2 The Shift-And algorithm
- 5.2.1 Space and runtime
- 5.2.2 Extension to approximate matching
- 5.3 Summary and notes
- 5.4 Exercises
- 6. Searching for Multiple Patterns
- 6.1 Aho-Corasick-a prefix-based approach
- 6.1.1 Search
- 6.1.2 Handling patterns contained in other patterns
- 6.1.3 Running time
- 6.1.4 Computing f
- 6.2 Wu-Manber-a suffix-based approach
- 6.3 Summary and notes
- 6.4 Exercises
- II. Edit Distance
- 7. Edit Distance for Inexact Matching
- 7.1 Edit distance and alignments
- 7.2 The string alignment problem
- 7.2.1 Algorithm for minimum cost alignments
- 7.2.2 Implementing the recursive algorithm
- 7.3 Dynamic programming
- 7.4 Finding the actual alignment: traceback
- 7.5 Local and semi-global alignment
- 7.5.1 Local alignment
- 7.5.2 Semi-global alignment
- 7.6 Summary and notes
- 7.7 Exercises
- 8. Edit Distance in Linear Space
- 8.1 Using linear space to compute edit distance values
- 8.2 Finding the actual alignment in linear space
- 8.2.1 Hirschberg's algorithm
- 8.2.2 Proof of running time
- 8.3 Summary and notes
- 8.4 Exercises
- 9. Faster Edit Distance via the "Four Russians" Trick
- 9.1 Dynamic programming blocks
- 9.2 Precomputing f
- 9.2.1 Offset encoding
- 9.2.2 Number of possible inputs to f
- 9.2.3 Storing f for quick access
- 9.3 Total running time
- 9.4 Finding the alignment
- 9.5 Summary and notes
- 9.6 Exercises
- 10. General and Affine Gap Penalties
- 10.1 General gap penalties
- 10.2 Affine gap penalties
- 10.3 Why do we "need" 3 matrices?
- 10.4 Summary and notes
- 10.5 Exercises
- 11. Output-Sensitive Algorithms for Edit Distance
- 11.1 Banded alignment
- 11.2 Myers' Minimum Edit Script algorithm
- 11.3 Landau-Vishkin*
- 11.3.1 The k differences problem
- 11.3.2 An alternative dynamic program
- 11.3.3 The Landau-Vishkin algorithm
- 11.4 Summary and notes
- 11.5 Exercises
- 12. Aligning Multiple Strings
- 12.1 Multiple sequence alignments
- 12.2 Progressive alignment heuristic
- 12.3 Approximation algorithm for one MSA formulation
- 12.4 Summary and notes
- 12.5 Exercises
- III. Data Structures
- 13. Suffix Trees
- 13.1 The fundamental string data structure
- 13.2 Tries
- 13.3 Defining suffix trees
- 13.4 Applications of suffix trees
- 13.4.1 Simple queries
- 13.4.2 Longest common substring of two strings
- 13.4.3 Generalized suffix trees
- 13.4.4 All pairs suffix-prefix matches
- 13.5 Summary and notes
- 13.6 Exercises
- 14. More Applications of Suffix Trees
- 14.1 Lempel-Ziv compression
- 14.2 Longest common extension
- 14.3 Finding palindromes
- 14.4 Finding all k-mismatch occurrences of a pattern
- 14.5 Longest common substring with online queries
- 14.6 Summary and notes
- 14.7 Exercises
- 15. Suffix Tree Construction
- 15.1 Constructing suffix tries
- 15.2 Ukkonen's algorithm
- 15.2.1 Updating the tree
- 15.2.2 Starting each phase
- 15.2.3 Example run of Ukkonen's algorithm
- 15.2.4 Running time of Ukkonen's algorithm
- 15.3 Summary and notes
- 15.4 Exercises
- 16. Suffix Arrays
- 16.1 Suffix tree operations in less space
- 16.2 Suffix array suffix tree
- 16.3 Searching for substrings using the suffix array
- 16.4 Faster search
- 16.4.1 Obtaining the lcp(i, j) values
- 16.5 Summary and notes
- 16.6 Exercises
- 17. Suffix Array Construction
- 17.1 An O(|T| log |T|) algorithm for suffix array construction
- 17.1.1 Radix sort
- 17.2 A linear-time construction algorithm
- 17.2.1 Running time
- 17.3 Summary and notes
- 17.4 Exercises
- 18. Burrows-Wheeler Transform
- 18.1 Definition of the BWT
- 18.2 Why is this useful for compression
- 18.3 Recovering the original string from its BWT
- 18.3.1 The LF mapping
- 18.3.2 Faster inversion algorithm
- 18.4 Faster computation of the BWT
- 18.5 Applications of the BWT
- 18.5.1 Search
- 18.5.2 Compression
- 18.6 Summary and notes
- 18.7 Exercises
- 19. RRR Compressed Bit Vectors
- 19.1 Rank and select operations
- 19.2 RRR bit vector data structure
- 19.3 Computing rank1 (S, i)
- 19.4 Compactly encoding the RRR information
- 19.4.1 Prefix-sum data structure
- 19.4.2 Space to store the vectors of integers
- 19.4.3 Space for the tables
- 19.4.4 Space for the p values
- 19.5 Supporting the select operation*
- 19.5.1 The select data structures
- 19.5.2 Answering select queries
- 19.6 Summary and notes
- 19.7 Exercises
- 20. Wavelet Trees
- 20.1 The wavelet tree data structure
- 20.1.1 The access S[i] operation
- 20.1.2 The rankc(S, i) operation
- 20.1.3 The selectc(S, j) operation
- 20.1.4 Running times
- 20.2 Applications of wavelet trees
- 20.2.1 Inverted indices
- 20.2.2 Document retrieval
- 20.2.3 Storing graphs
- 20.3 Summary and notes
- 20.4 Exercises
- 21. The FM Index for Compressed Searching
- 21.1 BWT and wavelet trees
- 21.2 A more compression-focused approach
- 21.2.1 Computing rank on the compressed string
- 21.2.2 Space usage for count arrays
- 21.2.3 Computing occurrence counts
- 21.2.4 Computing counts within a block
- 21.3 Summary and notes
- 21.4 Exercises
- 22. Storing Text for Editing
- 22.1 Gap buffers
- 22.2 Piece tables
- 22.3 Ropes
- 22.4 Summary and notes
- 22.5 Exercises
- IV. Sketching
- 23. Locality Sensitive Hashing
- 23.1 Estimating similarity between strings
- 23.2 An LSH family for the Jaccard similarity
- 23.3 An LSH family for the weighted Jaccard
- 23.4 An LSH family for a cosine-related similarity
- 23.5 A gapped LSH family for edit distance
- 23.6 Summary and notes
- 23.7 Exercises
- 24. Bloom Filters
- 24.1 How do Bloom filters work
- 24.2 How should we set the filter parameters
- 24.3 Notes on and extensions to Bloom filters
- 24.4 Cascading Bloom filters
- 24.5 Summary and notes
- 24.6 Exercises
- 25. Sketching with Minimizers
- 25.1 Accelerating read overlap computation
- 25.2 Definition of minimizers
- 25.3 Properties of minimizers
- 25.4 Expected density
- 25.5 Random minimizers
- 25.6 Selecting minimizers
- 25.7 De Bruijn sequences and de Bruijn graphs
- 25.8 Set of unavoidable words
- 25.9 Summary and notes
- 25.10 Exercises
- V. Graphs and Generative Models
- 26. Regular Grammars and the Chomsky Hierarchy
- 26.1 Regular grammars
- 26.1.1 Finite state automata
- 26.1.2 Deterministic vs. non-deterministic FSAs
- 26.1.3 Regular grammars and regular expressions
- 26.1.4 Summary of regular grammars
- 26.2 Context-free grammars
- 26.2.1 CFG parse trees
- 26.2.2 Chomsky normal form
- 26.3 Context-sensitive and unrestricted grammars
- 26.4 Summary and notes
- 26.5 Exercises
- 27. Hidden Markov Models
- 27.1 Definition of hidden Markov models
- 27.2 HMM decoding problems
- 27.3 Viterbi algorithm
- 27.4 The Forward-Backward algorithm
- 27.4.1 Forward algorithm
- 27.4.2 Backward algorithm
- 27.5 Estimating HMM parameters
- 27.5.1 Labeled training
- 27.5.2 Viterbi training
- 27.5.3 Baum-Welch algorithm
- 27.6 Summary and notes
- 27.7 Exercises
- 28. Stochastic Context-Free Grammars
- 28.1 The Inside algorithm
- 28.2 Cocke-Younger-Kasami (CYK) algorithm
- 28.3 Estimating the production probabilities
- 28.3.1 The Outside algorithm
- 28.4 Summary and notes
- 28.5 Exercises
- 29. Genome Graphs
- 29.1 Some types of genome graphs
- 29.1.1 De Bruijn graphs
- 29.1.2 Colored de Bruijn graphs
- 29.1.3 Variation graphs
- 29.2 Constructing genome graphs via compression
- 29.3 Alignment of strings to genome graphs
- 29.3.1 Alignment graph
- 29.4 Comparing genome graphs
- 29.5 Summary and notes
- 29.6 Exercises
- 30. BWT on Labeled Trees
- 30.1 String trees
- 30.2 String tree BWT
- 30.2.1 The interval corresponding to the ith node
- 30.2.2 A child interval
- 30.3 Extending to substring intervals
- 30.4 Backwards search
- 30.5 Summary and notes
- 30.6 Exercises
- 31. Transformers
- 31.1 Embeddings
- 31.2 Encoder-decoder architectures
- 31.3 Feedforward neural networks
- 31.4 Attention
- 31.5 Transformers
- 31.5.1 The transformer encoder
- 31.5.2 The transformer decoder
- 31.6 Training and using transformers
- 31.7 Summary and notes
- 31.8 Exercises
- VI. Miscellaneous
- 32. Huffman Codes
- 32.1 Prefix-free codes
- 32.1.1 UTF-8 encodings
- 32.2 Huffman coding algorithm
- 32.2.1 Proof of optimality
- 32.2.2 Linear-time code construction
- 32.3 Summary and notes
- 32.4 Exercises
- 33. Shortest Superstring
- 33.1 Relationship to the traveling salesman problem
- 33.2 The Cycle Cover algorithm
- 33.3 Some lemmas about repeated strings
- 33.4 The Cycle Cover algorithm gives a 4-approximation
- 33.5 Summary and notes
- 33.6 Exercises
- 34. Limits on What's Possible
- 34.1 A brief introduction to NP-completeness
- 34.2 Shortest superstring is NP-complete
- 34.2.1 The reduction
- 34.2.2 The proof
- 34.2.3 Additional notes
- 34.3 Shortest supersequence is NP-complete
- 34.4 MSA with SP-score is NP-complete
- 34.5 Summary and notes
- 34.6 Exercises
- 35. Conclusion
- Bibliography
- List of Algorithms
- Index
System requirements
File format: PDF
Copy protection: Watermark-DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Use the free software Adobe Reader, Adobe Digital Editions, or any other PDF viewer of your choice (see eBook Help).
- Tablet/Smartphone (Android; iOS): Install the free app Adobe Digital Editions or another reading app for eBooks, e.g., PocketBook (see eBook Help).
- E-reader: Bookeen, Kobo, Pocketbook, Sony, Tolino and many more (only limited: Kindle).
The file format PDF always displays a book page identically on any hardware. This makes PDF suitable for complex layouts such as those used in textbooks and reference books (images, tables, columns, footnotes). Unfortunately, on the small screens of e-readers or smartphones, PDFs are rather annoying, requiring too much scrolling.
This eBook uses Watermark-DRM, a „soft” copy protection. This means that there are no technical restrictions to prevent illegal distribution. However, there is a personalised watermark embedded in the eBook that can be used to identify the purchaser of the eBook in the event of misuse and to provide evidence for legal purposes.
For more information, see our eBook Help page.