Simulating 2-minute (2880 frames @ 24 FPS) video generation with different attention mechanisms (the KV cache is not bounded)
Autoregressive video diffusion models enable streaming generation, opening the door to long-form synthesis, video world models, and interactive neural game engines. However, their core attention layers become a major bottleneck at inference time: as generation progresses, the KV cache grows, causing both increasing latency and escalating GPU memory usage, which in turn restrict the usable temporal context and harm long-range consistency. In this work, we study redundancy in autoregressive video diffusion and identify three persistent sources: near-duplicate cached keys across frames, slowly evolving (largely semantic) queries/keys that make many attention computations redundant, and cross-attention over long prompts where only a small subset of tokens matters per frame.
Building on these observations, we propose a unified, training-free attention framework for autoregressive diffusion: TempCache compresses the KV cache via temporal correspondence to bound cache growth; AnnCA accelerates cross-attention by selecting frame-relevant prompt tokens using fast approximate nearest neighbor (ANN) matching; and AnnSA sparsifies self-attention by restricting each query to semantically matched keys, again via a lightweight ANN. Together, these modules reduce attention compute and memory and are compatible with existing autoregressive diffusion backbones and world models. Experiments demonstrate up to 5×–10× end-to-end speedups while preserving near-identical visual quality and, crucially, maintaining stable throughput and nearly constant peak GPU memory usage over long rollouts, where prior methods progressively slow down and suffer from increasing memory usage.
Autoregressive video diffusion enables streaming generation: instead of denoising an entire video at once, the model repeatedly generates the next segment conditioned on the growing history. This unlocks multi-minute generation and video world models, but it also exposes a core bottleneck: the attention KV cache grows with every generated frame. As the context grows, attention becomes slower and memory consumption rises, eventually forcing short context windows that hurt long-range consistency.
These observations lead to our goal: keep the quality of dense attention while making long rollouts fast and memory-stable. We do this with a plug-and-play, training-free attention framework built around TempCache, AnnSA, and AnnCA.
Our approach is a training-free, drop-in replacement for the attention modules inside autoregressive video diffusion and world models. The high-level idea is simple: don’t attend to everything, and don’t store everything. We reinterpret attention as a fast approximate nearest-neighbor (ANN) retrieval problem and use it to (1) bound KV cache growth over time, (2) sparsify self-attention, and (3) prune cross-attention.
Standard attention computes scores against all keys in the cache, which becomes expensive as the cache grows. Instead, for each query we first retrieve a small candidate set of keys that are likely to have the largest dot-products (via lightweight ANN such as locality-sensitive hashing or quantized similarity search), and then compute attention only over those candidates using sparse attention kernels.
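To make the retrieve-then-attend pattern concrete, here is a minimal sketch in PyTorch. It uses an exact top-k search as a stand-in for the ANN index (LSH or quantized search would avoid materializing the full score matrix in practice), and `k_candidates` is a hypothetical budget, not a parameter from our method.

```python
import torch
import torch.nn.functional as F

def candidate_sparse_attention(q, k, v, k_candidates=64):
    """Attend only over a small candidate set of keys per query.

    q: (n_q, d), k/v: (n_kv, d). The top-k search below is exact and serves
    only as a stand-in for an ANN lookup, which would avoid computing the
    full (n_q, n_kv) score matrix.
    """
    d = q.shape[-1]
    k_candidates = min(k_candidates, k.shape[0])
    scores = q @ k.T                                   # (n_q, n_kv) -- illustration only
    idx = scores.topk(k_candidates, dim=-1).indices    # (n_q, k_candidates)
    k_sel, v_sel = k[idx], v[idx]                      # (n_q, k_candidates, d)
    attn = F.softmax((q.unsqueeze(1) * k_sel).sum(-1) / d ** 0.5, dim=-1)
    return (attn.unsqueeze(-1) * v_sel).sum(dim=1)     # (n_q, d)
```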
TempCache targets the main scaling issue in autoregressive inference: the KV cache keeps growing. We observe that many cached keys persist nearly unchanged across frames (near-duplicates), so they can be merged. To find which tokens correspond over time, we use attention-derived temporal correspondence: for each current-frame query, we retrieve its top-1 nearest-neighbor key from previous frames using the same ANN machinery. Keys that track the same content are grouped, and we keep a single representative (typically the most recent) in the cache.
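A minimal sketch of this cache update, assuming exact cosine similarity in place of the ANN top-1 lookup; `sim_thresh` is a hypothetical merge threshold, and a real implementation would batch this per layer and head.

```python
import torch
import torch.nn.functional as F

def tempcache_update(cache_k, cache_v, frame_q, frame_k, frame_v, sim_thresh=0.9):
    """Bound KV cache growth by merging temporally corresponding tokens.

    cache_k/cache_v: (n_cache, d) cached keys/values from previous frames.
    frame_q/frame_k/frame_v: (n_new, d) queries/keys/values of the new frame.
    """
    if cache_k.shape[0] == 0:
        return frame_k.clone(), frame_v.clone()

    # Top-1 temporal correspondence: each new-frame query retrieves its
    # nearest cached key (exact search here, ANN in practice).
    sim = F.normalize(frame_q, dim=-1) @ F.normalize(cache_k, dim=-1).T
    best_sim, best_idx = sim.max(dim=-1)
    merged = best_sim > sim_thresh

    # Matched tokens overwrite their cached slot, keeping only the most
    # recent representative (if several tokens hit the same slot, the
    # last write wins in this sketch).
    cache_k, cache_v = cache_k.clone(), cache_v.clone()
    cache_k[best_idx[merged]] = frame_k[merged]
    cache_v[best_idx[merged]] = frame_v[merged]

    # Unmatched (novel) tokens are the only ones that enlarge the cache.
    cache_k = torch.cat([cache_k, frame_k[~merged]], dim=0)
    cache_v = torch.cat([cache_v, frame_v[~merged]], dim=0)
    return cache_k, cache_v
```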
In cross-attention, latent queries attend to all prompt tokens at every layer and denoising step—even though most tokens are irrelevant for the current frame. AnnCA prunes prompt tokens per frame: we project current-frame latent queries and prompt keys into a shared ANN space and keep only prompt tokens that have a nearby latent query. The result is cross-attention computed on a much smaller, frame-relevant prompt subset.
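The selection step can be sketched as follows, again with exact cosine similarity standing in for the shared ANN space; `keep_ratio` is a hypothetical pruning budget. Cross-attention then runs only over the returned subset of prompt keys and values.

```python
import torch
import torch.nn.functional as F

def prune_prompt_tokens(latent_q, prompt_k, prompt_v, keep_ratio=0.25):
    """Keep only the prompt tokens that some current-frame latent query is close to.

    latent_q: (n_latent, d), prompt_k/prompt_v: (n_prompt, d).
    """
    sim = F.normalize(latent_q, dim=-1) @ F.normalize(prompt_k, dim=-1).T  # (n_latent, n_prompt)
    relevance = sim.max(dim=0).values        # best-matching latent query per prompt token
    n_keep = max(1, int(keep_ratio * prompt_k.shape[0]))
    keep = relevance.topk(n_keep).indices
    # Cross-attention is then computed on this much smaller, frame-relevant subset.
    return prompt_k[keep], prompt_v[keep]
```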
Self-attention in video diffusion is highly structured: tokens mostly attend to semantically related regions. AnnSA exploits this by assigning tokens to ANN buckets (semantic neighborhoods) and restricting each query to attend only to keys within the same bucket(s). This yields a sparse attention pattern that remains high-recall while dramatically reducing dot-products and bandwidth.
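A minimal random-hyperplane (LSH) bucketing sketch of this idea; a single hash round and a per-bucket Python loop keep it readable, whereas a real implementation would use several hash rounds, balanced bucket sizes, and fused sparse attention kernels. `n_planes` and the fallback behavior for empty buckets are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def lsh_bucketed_attention(q, k, v, n_planes=4, seed=0):
    """Self-attention restricted to LSH buckets: each query attends only to
    keys that share its random-hyperplane hash code. q/k/v: (n, d)."""
    d = q.shape[-1]
    planes = torch.randn(d, n_planes, generator=torch.Generator().manual_seed(seed)).to(q)
    powers = 2 ** torch.arange(n_planes, device=q.device)
    q_code = (((q @ planes) > 0).long() * powers).sum(-1)   # bucket id per query
    k_code = (((k @ planes) > 0).long() * powers).sum(-1)   # bucket id per key

    out = torch.zeros_like(q)
    for b in q_code.unique():
        q_idx = (q_code == b).nonzero(as_tuple=True)[0]
        k_idx = (k_code == b).nonzero(as_tuple=True)[0]
        if k_idx.numel() == 0:
            continue  # empty bucket: a real kernel would fall back to a local window
        attn = F.softmax(q[q_idx] @ k[k_idx].T / d ** 0.5, dim=-1)
        out[q_idx] = attn @ v[k_idx]
    return out
```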
Side-by-side comparison of different sparse attention methods on an autoregressive video diffusion model
Our approach preserves the visual fidelity and temporal consistency of Dense FA3 across diverse prompts, while current sparsification baselines (SVG1/2) often exhibit artifacts and drift; RadialAttention is more stable but still degrades in challenging scenes
Side-by-side comparison of different KV cache compression methods combined with sparse attention