Suffix Arrays
What It’s For
What problem does a suffix array solve — fast substring search and string analysis — with far less memory than a suffix tree?
The Core Idea: Sorted Suffix Indices
What does \( \text{SA}[i] \) hold, and how is the array just the starting positions of all suffixes sorted lexicographically?
Naive Construction
Why does sorting all suffixes directly cost so much, and where does the \( O(n^2 \log n) \) come from in the comparisons?
Prefix-Doubling Construction
Trace how sorting by the first \( 2k \) characters reuses the ranks from sorting by the first \( k \), doubling the compared length each round.
Rank Arrays and Radix Sort
How do (rank, next-rank) pairs let each round sort in linear time with a radix/counting sort instead of a comparison sort?
Why Doubling Terminates Fast
Why are \( \lceil \log_2 n \rceil \) rounds enough for every suffix to be fully distinguished?
The LCP Array
What does the longest-common-prefix array store between adjacent suffixes in SA, and why does it unlock most string queries?
Kasai’s Linear LCP
What’s the key observation — that LCP drops by at most one as you move from the suffix at position \( p \) to position \( p+1 \) — that makes LCP construction \( O(n) \)?
Pattern Search via Binary Search
Trace locating all occurrences of a pattern of length \( m \) by binary searching the sorted suffixes in \( O(m \log n) \).
Applications via SA + LCP
How do you get longest repeated substring, longest common substring, distinct substring count, and range-minimum-on-LCP from these arrays?
Time Complexity
Separate construction from query, and compare construction methods.
Construction: Naive vs Doubling vs Linear
Why is naive \( O(n^2 \log n) \), doubling \( O(n \log n) \) (or \( O(n \log^2 n) \) with comparison sort), and what do DC3/SA-IS achieve? Name the bottleneck each method removes.
LCP Construction
Why is Kasai’s algorithm \( O(n) \) despite seeming to recompute prefixes — what does the “drop by at most one” invariant save?
Pattern Search
Why is binary-search pattern matching \( O(m \log n) \), and how does adding LCP information cut it toward \( O(m + \log n) \)?
Space Complexity
Account for every array you keep.
The Core Arrays
SA, rank, and LCP are each one integer per character. Why is that \( O(n) \) total, and why is it far smaller (constant factor) than a suffix tree?
Construction Scratch Space
What temporary arrays does prefix doubling or a linear constructor need, and why are they still \( O(n) \)?
Suffix Array vs Suffix Tree
Why is a suffix array usually preferred in practice — memory, cache, simplicity — and what does a suffix tree still do more directly?
Pitfalls
What goes wrong without a sentinel, with off-by-one in the doubling ranks, or assuming SA alone (no LCP) answers everything?
Implementation Walkthrough
Plan prefix-doubling construction before writing it.
Initial Ranks from Characters
How do you seed the rank array from the raw characters for \( k = 1 \)? Prompts only.
One Doubling Round
What sort key (current rank, rank \( k \) ahead) do you build, how do you sort it (radix/counting), and how do you reassign ranks afterward — handling ties?
Building LCP with Kasai
What order do you process suffixes in, and how do you carry the previous LCP value forward minus one? Sketch as a prompt.
Pattern Search Routine
How do you binary-search SA comparing the pattern against a suffix prefix, and how do you find the full range of matches?
Implementation
// implement from scratch here
Summary
Recap the whole topic in a few lines — the core idea, the key complexities and trade-offs, and when to reach for it. The cheat-sheet you’d skim before an exam or interview.
My Notes
Fill this in as you learn.