Simulating 2-minute (2880 frames @ 24 FPS) video generation with different attention mechanisms (the KV cache is not bounded)
Autoregressive video diffusion models enable streaming generation, opening the door to long-form synthesis, video world models, and interactive neural game engines. However, their core attention layers become a major bottleneck at inference time: as generation progresses, the KV cache grows, causing both increasing latency and escalating GPU memory usage, which in turn restrict the usable temporal context and harm long-range consistency. In this work, we study redundancy in autoregressive video diffusion and identify three persistent sources: near-duplicate cached keys across frames, slowly evolving (largely semantic) queries/keys that make many attention computations redundant, and cross-attention over long prompts where only a small subset of tokens matters per frame.
Building on these observations, we propose a unified, training-free attention framework for autoregressive diffusion: TempCache compresses the KV cache via temporal correspondence to bound cache growth; AnnCA accelerates cross-attention by selecting frame-relevant prompt tokens using fast approximate nearest neighbor (ANN) matching; and AnnSA sparsifies self-attention by restricting each query to semantically matched keys, again via a lightweight ANN. Together, these modules reduce attention compute and memory and are compatible with existing autoregressive diffusion backbones and world models. Experiments demonstrate up to 5×–10× end-to-end speedups while preserving near-identical visual quality and, crucially, maintaining stable throughput and nearly constant peak GPU memory usage over long rollouts, where prior methods progressively slow down and suffer from increasing memory usage.
Autoregressive video diffusion enables streaming generation: instead of denoising an entire video at once, the model repeatedly generates the next segment conditioned on the growing history. This unlocks multi-minute generation and video world models, but it also exposes a core bottleneck: the attention KV cache grows with every generated frame. As the context grows, attention becomes slower and memory consumption rises, eventually forcing short context windows that hurt long-range consistency.
These observations lead to our goal: keep the quality of dense attention while making long rollouts fast and memory-stable. We do this with a plug-and-play, training-free attention framework built around TempCache, AnnSA, and AnnCA.
Our approach is a training-free, drop-in replacement for the attention modules inside autoregressive video diffusion and world models. The high-level idea is simple: don’t attend to everything, and don’t store everything. We reinterpret attention as a fast approximate nearest-neighbor (ANN) retrieval problem and use it to (1) bound KV cache growth over time, (2) sparsify self-attention, and (3) prune cross-attention.
Standard attention computes scores against all keys in the cache, which becomes expensive as the cache grows. Instead, for each query we first retrieve a small candidate set of keys that are likely to have the largest dot-products (via lightweight ANN such as locality-sensitive hashing or quantized similarity search), and then compute attention only over those candidates using sparse attention kernels.
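To make the retrieve-then-attend pattern concrete, here is a minimal sketch in PyTorch. It uses an exact top-k search as a stand-in for the ANN index (LSH or quantized search would avoid materializing the full score matrix in practice), and `k_candidates` is a hypothetical budget, not a parameter from our method.

```python
import torch
import torch.nn.functional as F

def candidate_sparse_attention(q, k, v, k_candidates=64):
    """Attend only over a small candidate set of keys per query.

    q: (n_q, d), k/v: (n_kv, d). The top-k search below is exact and serves
    only as a stand-in for an ANN lookup, which would avoid computing the
    full (n_q, n_kv) score matrix.
    """
    d = q.shape[-1]
    k_candidates = min(k_candidates, k.shape[0])
    scores = q @ k.T                                   # (n_q, n_kv) -- illustration only
    idx = scores.topk(k_candidates, dim=-1).indices    # (n_q, k_candidates)
    k_sel, v_sel = k[idx], v[idx]                      # (n_q, k_candidates, d)
    attn = F.softmax((q.unsqueeze(1) * k_sel).sum(-1) / d ** 0.5, dim=-1)
    return (attn.unsqueeze(-1) * v_sel).sum(dim=1)     # (n_q, d)
```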
TempCache targets the main scaling issue in autoregressive inference: the KV cache keeps growing. We observe that many cached keys persist nearly unchanged across frames (near-duplicates), so they can be merged. To find which tokens correspond over time, we use attention-derived temporal correspondence: for each current-frame query, we retrieve its top-1 nearest-neighbor key from previous frames using the same ANN machinery. Keys that track the same content are grouped, and we keep a single representative (typically the most recent) in the cache.
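A minimal sketch of this cache update, assuming exact cosine similarity in place of the ANN top-1 lookup; `sim_thresh` is a hypothetical merge threshold, and a real implementation would batch this per layer and head.

```python
import torch
import torch.nn.functional as F

def tempcache_update(cache_k, cache_v, frame_q, frame_k, frame_v, sim_thresh=0.9):
    """Bound KV cache growth by merging temporally corresponding tokens.

    cache_k/cache_v: (n_cache, d) cached keys/values from previous frames.
    frame_q/frame_k/frame_v: (n_new, d) queries/keys/values of the new frame.
    """
    if cache_k.shape[0] == 0:
        return frame_k.clone(), frame_v.clone()

    # Top-1 temporal correspondence: each new-frame query retrieves its
    # nearest cached key (exact search here, ANN in practice).
    sim = F.normalize(frame_q, dim=-1) @ F.normalize(cache_k, dim=-1).T
    best_sim, best_idx = sim.max(dim=-1)
    merged = best_sim > sim_thresh

    # Matched tokens overwrite their cached slot, keeping only the most
    # recent representative (if several tokens hit the same slot, the
    # last write wins in this sketch).
    cache_k, cache_v = cache_k.clone(), cache_v.clone()
    cache_k[best_idx[merged]] = frame_k[merged]
    cache_v[best_idx[merged]] = frame_v[merged]

    # Unmatched (novel) tokens are the only ones that enlarge the cache.
    cache_k = torch.cat([cache_k, frame_k[~merged]], dim=0)
    cache_v = torch.cat([cache_v, frame_v[~merged]], dim=0)
    return cache_k, cache_v
```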
In cross-attention, latent queries attend to all prompt tokens at every layer and denoising step—even though most tokens are irrelevant for the current frame. AnnCA prunes prompt tokens per frame: we project current-frame latent queries and prompt keys into a shared ANN space and keep only prompt tokens that have a nearby latent query. The result is cross-attention computed on a much smaller, frame-relevant prompt subset.
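The selection step can be sketched as follows, again with exact cosine similarity standing in for the shared ANN space; `keep_ratio` is a hypothetical pruning budget. Cross-attention then runs only over the returned subset of prompt keys and values.

```python
import torch
import torch.nn.functional as F

def prune_prompt_tokens(latent_q, prompt_k, prompt_v, keep_ratio=0.25):
    """Keep only the prompt tokens that some current-frame latent query is close to.

    latent_q: (n_latent, d), prompt_k/prompt_v: (n_prompt, d).
    """
    sim = F.normalize(latent_q, dim=-1) @ F.normalize(prompt_k, dim=-1).T  # (n_latent, n_prompt)
    relevance = sim.max(dim=0).values        # best-matching latent query per prompt token
    n_keep = max(1, int(keep_ratio * prompt_k.shape[0]))
    keep = relevance.topk(n_keep).indices
    # Cross-attention is then computed on this much smaller, frame-relevant subset.
    return prompt_k[keep], prompt_v[keep]
```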
Self-attention in video diffusion is highly structured: tokens mostly attend to semantically related regions. AnnSA exploits this by assigning tokens to ANN buckets (semantic neighborhoods) and restricting each query to attend only to keys within the same bucket(s). This yields a sparse attention pattern that remains high-recall while dramatically reducing dot-products and bandwidth.
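A minimal random-hyperplane (LSH) bucketing sketch of this idea; a single hash round and a per-bucket Python loop keep it readable, whereas a real implementation would use several hash rounds, balanced bucket sizes, and fused sparse attention kernels. `n_planes` and the fallback behavior for empty buckets are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def lsh_bucketed_attention(q, k, v, n_planes=4, seed=0):
    """Self-attention restricted to LSH buckets: each query attends only to
    keys that share its random-hyperplane hash code. q/k/v: (n, d)."""
    d = q.shape[-1]
    planes = torch.randn(d, n_planes, generator=torch.Generator().manual_seed(seed)).to(q)
    powers = 2 ** torch.arange(n_planes, device=q.device)
    q_code = (((q @ planes) > 0).long() * powers).sum(-1)   # bucket id per query
    k_code = (((k @ planes) > 0).long() * powers).sum(-1)   # bucket id per key

    out = torch.zeros_like(q)
    for b in q_code.unique():
        q_idx = (q_code == b).nonzero(as_tuple=True)[0]
        k_idx = (k_code == b).nonzero(as_tuple=True)[0]
        if k_idx.numel() == 0:
            continue  # empty bucket: a real kernel would fall back to a local window
        attn = F.softmax(q[q_idx] @ k[k_idx].T / d ** 0.5, dim=-1)
        out[q_idx] = attn @ v[k_idx]
    return out
```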
Side-by-side comparison of different sparse attention methods on an autoregressive video diffusion model
Our approach preserves the visual fidelity and temporal consistency of Dense FA3 across diverse prompts, while current sparsification baselines (SVG1/2) often exhibit artifacts and drift; RadialAttention is more stable but still degrades in challenging scenes
Side-by-side comparison of different KV cache compression methods combined with sparse attention