Cache-Oblivious Algorithms

What It’s For

What problem do these solve — getting good memory-hierarchy performance without hard-coding the cache size or block size into the algorithm?

Oblivious vs Cache-Aware

What’s the difference between an algorithm tuned to a known \( B \) and \( M \) and one that performs well without ever naming them?

The Ideal-Cache Model

What are the model’s assumptions: two memory levels, automatic block transfers, and an optimal replacement policy?

Why Tuning-Free Is Powerful

Why does being optimal in this model imply good behavior across every level of a real multi-level hierarchy at once?

The Tall-Cache Assumption

What does \( M = \Omega(B^2) \) assume about the cache’s shape, and why do several optimal bounds quietly depend on it?

Recursive Divide-and-Conquer for Locality

Why does repeatedly halving a problem until subproblems fit in cache yield good locality “for free,” with no knowledge of cache size?

The “Free Blocking” Argument

At some recursion level a subproblem just fits in \( M \) (and below that, in \( B \)). Why does the algorithm reach that level automatically without being told where it is?

The van Emde Boas Memory Layout

How does laying out a search tree recursively (top half, then each bottom subtree contiguously) cut search I/Os, and why is the naive array-order (BFS) layout worse?

Cache-Oblivious Matrix Multiply

How does recursively blocking a matrix product by quartering each matrix achieve a strong I/O bound without choosing a tile size?

Cache-Oblivious Sorting (Funnelsort)

What is the high-level idea of merging through recursively built “funnels” to match the external sorting bound?

Time Complexity (Work)

Separate ordinary running time from I/O cost.

Why Work Stays Asymptotically Normal

Why does the recursive blocking not change the classic operation count (e.g. matrix multiply is still \( \Theta(n^3) \) work, sorting \( \Theta(n\log n) \))?

Cache Complexity (I/O Cost)

This is the whole point of the model — reason about block transfers.

Scanning and Search Bounds

Why is a linear scan \( O(n/B) \) I/Os, and why does the vEB-laid-out tree search cost \( O(\log_B n) \) like a B-tree despite being binary?

Matrix Multiply and Sorting Bounds

State the cache-oblivious matrix-multiply I/O bound and the sorting bound, and point to where the tall-cache assumption is used.

Space Complexity

Account for the recursion and any auxiliary layout.

Layout and Recursion Overhead

Does the vEB layout need extra space beyond the data? How deep does the divide-and-conquer recursion stack get?

Trade-offs vs Cache-Aware Code

What do you pay for portability — constant factors, recursion overhead — versus a hand-tiled cache-aware kernel?

Real Uses

Where does this matter — portable numeric libraries, B-tree alternatives, code that must run well across unknown hardware?

Pitfalls

Why can recursion overhead and the tall-cache assumption bite, and where does the model diverge from real caches (associativity, prefetch)?

Implementation Walkthrough

Plan one concrete example — say recursive blocked matrix multiply — before writing it.

Choosing the Recursive Split

How do you decide which dimension to halve at each step, and what’s the base case where you switch to a direct loop?

Memory Layout Decisions

For a vEB tree layout, how do you compute where each recursive sub-tree’s nodes live in the flat array? Prompts only.

Base Case Tuning

Why does a tiny base case hurt constant factors, and how do you pick a cutoff without making the algorithm cache-aware?

Implementation

// implement from scratch here

Summary

Recap the whole topic in a few lines — the core idea, the key complexities and trade-offs, and when to reach for it. The cheat-sheet you’d skim before an exam or interview.

My Notes

Fill this in as you learn.

Keyboard shortcuts

Data Structures and Algorithms