Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention

Dvir Samuel, Issar Tzachor, Matan Levy, Michael Green, Gal Chechik, Rami Ben-Ari

Real-Time Generation Simulation

Simulated 2-minute (2880 frames @ 24 FPS) video generation with different attention mechanisms (the KV cache is not bounded).

[Interactive demo: live frame counters and generation FPS for Dense Attention (FlashAttention3), Ours, and FlowCache + RadialAttention, with plots of Generation FPS vs. Generated Frames and Peak GPU Memory vs. Generated Frames.]
Abstract

Autoregressive video diffusion models enable streaming generation, opening the door to long-form synthesis, video world models, and interactive neural game engines. However, their core attention layers become a major bottleneck at inference time: as generation progresses, the KV cache grows, increasing both latency and GPU memory usage, which in turn restricts the usable temporal context and harms long-range consistency. In this work, we study redundancy in autoregressive video diffusion and identify three persistent sources: near-duplicate cached keys across frames, slowly evolving (largely semantic) queries/keys that make many attention computations redundant, and cross-attention over long prompts where only a small subset of tokens matters per frame.

Building on these observations, we propose a unified, training-free attention framework for autoregressive diffusion: TempCache compresses the KV cache via temporal correspondence to bound cache growth; AnnCA accelerates cross-attention by selecting frame-relevant prompt tokens using fast approximate nearest-neighbor (ANN) matching; and AnnSA sparsifies self-attention by restricting each query to semantically matched keys, also using a lightweight ANN. Together, these modules reduce attention compute and memory and are compatible with existing autoregressive diffusion backbones and world models. Experiments demonstrate up to ×5–×10 end-to-end speedups while preserving near-identical visual quality and, crucially, maintaining stable throughput and nearly constant peak GPU memory usage over long rollouts, where prior methods progressively slow down and consume ever more memory.

Motivation

Autoregressive video diffusion enables streaming generation: instead of denoising an entire video at once, the model repeatedly generates the next segment conditioned on the growing history. This unlocks multi-minute generation and video world models, but it also exposes a core bottleneck: the attention KV-cache grows with every generated frame. As context grows, attention becomes slower and memory consumption rises, eventually forcing short context windows that hurt long-range consistency.

What’s the root cause? We observe three recurring patterns.
Recall and Density patterns in temporal attention
Attention is naturally sparse. Even in the autoregressive setting, most attention mass concentrates on a small subset of entries, suggesting significant headroom for training-free sparsification.
Sparse Self-Attention visualization
Self-attention features repeat over time. Queries/keys are largely semantic (foreground vs. background clusters) and many keys persist across frames, motivating temporal KV-cache compression and semantic matching.
Cross-Attention visualization
Cross-attention is frame-selective. For a long prompt, only a few tokens dominate cross-attention for a given frame (e.g., “cat” early, “van” during occlusion, “dog” after transformation). This motivates selecting a frame-specific subset of prompt tokens.

These observations lead to our goal: keep the quality of dense attention while making long rollouts fast and memory-stable. We do this with a plug-and-play, training-free attention framework built around TempCache, AnnSA, and AnnCA.

Method

Our approach is a training-free, drop-in replacement for the attention modules inside autoregressive video diffusion and world models. The high-level idea is simple: don’t attend to everything, and don’t store everything. We reinterpret attention as a fast approximate nearest-neighbor (ANN) retrieval problem and use it to (1) bound KV-cache growth over time, (2) sparsify self-attention, and (3) prune cross-attention.

At a glance

Attention as ANN Search

Standard attention computes scores against all keys in the cache, which becomes expensive as the cache grows. Instead, for each query we first retrieve a small candidate set of keys that are likely to have the largest dot-products (via lightweight ANN such as locality-sensitive hashing or quantized similarity search), and then compute attention only over those candidates using sparse attention kernels.
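As a rough illustration (not the paper's implementation), the sketch below retrieves a per-query candidate set using a cheap stand-in for an ANN index (a random low-dimensional projection) and then runs exact attention only over those candidates. Tensor shapes are assumed to be single-head [tokens, dim], and names like ann_sparse_attention and num_candidates are illustrative.

```python
# Minimal sketch of attention-as-ANN-retrieval (assumptions noted above).
import torch
import torch.nn.functional as F

def ann_sparse_attention(q, k, v, num_candidates=64, proj_dim=16):
    """Attend each query only to its approximate nearest-neighbor keys."""
    d = q.shape[-1]
    # Cheap "ANN" proxy: rank keys per query in a random low-dim projection.
    # A real system would use LSH or a quantized similarity index instead.
    proj = torch.randn(d, proj_dim, device=q.device) / proj_dim ** 0.5
    approx_scores = (q @ proj) @ (k @ proj).T                    # [n_q, n_k]
    cand = approx_scores.topk(min(num_candidates, k.shape[0]), dim=-1).indices

    # Exact attention restricted to the retrieved candidate keys/values.
    k_cand = k[cand]                                             # [n_q, c, d]
    v_cand = v[cand]                                             # [n_q, c, d]
    logits = torch.einsum("qd,qcd->qc", q, k_cand) / d ** 0.5
    weights = F.softmax(logits, dim=-1)
    return torch.einsum("qc,qcd->qd", weights, v_cand)           # [n_q, d]
```

In practice the candidate gather and the restricted softmax would run inside sparse attention kernels; the dense gather here is only for readability.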

TempCache: Temporal KV-Cache Compression

TempCache targets the main scaling issue in autoregressive inference: the KV cache keeps growing. We observe that many key features persist across frames (near-duplicates), so we can merge them. To find which tokens correspond over time, we use attention-derived temporal correspondence: for each current-frame query, we retrieve its top-1 nearest neighbor key from previous frames using the same ANN machinery. Keys that track the same content are grouped, and we keep a single representative (typically the most recent) in the cache.

Temporal correspondence
Temporal correspondence → cache compression. Tokens that represent the same semantic region persist over time. TempCache matches them efficiently and merges redundant entries to bound KV-cache growth.
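A minimal sketch of the cache-update step, under simplifying assumptions: correspondence here is plain cosine similarity between new-frame keys and cached keys (the paper derives correspondence from attention via the same ANN machinery), matched near-duplicates overwrite their cached representative, and only unmatched tokens grow the cache. Names such as tempcache_update and sim_threshold are illustrative.

```python
# Minimal sketch of temporal KV-cache compression via nearest-neighbor matching.
import torch
import torch.nn.functional as F

def tempcache_update(cache_k, cache_v, new_k, new_v, sim_threshold=0.9):
    """Merge near-duplicate keys from the new frame into the cache instead of appending them."""
    # Cosine similarity between every new key and every cached key.
    sims = F.normalize(new_k, dim=-1) @ F.normalize(cache_k, dim=-1).T  # [n_new, n_cache]
    best_sim, best_idx = sims.max(dim=-1)
    duplicate = best_sim > sim_threshold

    # Duplicates overwrite their matched entry: keep the most recent representative.
    cache_k[best_idx[duplicate]] = new_k[duplicate]
    cache_v[best_idx[duplicate]] = new_v[duplicate]

    # Only genuinely new tokens grow the cache, bounding its size over time.
    cache_k = torch.cat([cache_k, new_k[~duplicate]], dim=0)
    cache_v = torch.cat([cache_v, new_v[~duplicate]], dim=0)
    return cache_k, cache_v
```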

AnnCA: Approximate Nearest-Neighbor Cross-Attention

In cross-attention, latent queries attend to all prompt tokens at every layer and denoising step—even though most tokens are irrelevant for the current frame. AnnCA prunes prompt tokens per frame: we project current-frame latent queries and prompt keys into a shared ANN space and keep only prompt tokens that have a nearby latent query. The result is cross-attention computed on a much smaller, frame-relevant prompt subset.
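A minimal sketch of the pruning step, assuming single-head [tokens, dim] tensors: each latent query nominates its top prompt-token matches, the union of nominated tokens is retained, and cross-attention is computed only over that subset. The exact selection rule and the keep_per_query parameter are illustrative, not the paper's.

```python
# Minimal sketch of per-frame prompt-token pruning for cross-attention.
import torch
import torch.nn.functional as F

def annca_cross_attention(latent_q, prompt_k, prompt_v, keep_per_query=2):
    """Cross-attention over a frame-relevant subset of prompt tokens."""
    d = latent_q.shape[-1]
    sims = F.normalize(latent_q, dim=-1) @ F.normalize(prompt_k, dim=-1).T  # [n_latent, n_prompt]

    # A prompt token is kept if any latent query ranks it among its top matches.
    top = sims.topk(min(keep_per_query, prompt_k.shape[0]), dim=-1).indices
    keep = torch.unique(top.flatten())

    # Cross-attention restricted to the retained prompt tokens.
    k_sub, v_sub = prompt_k[keep], prompt_v[keep]
    logits = latent_q @ k_sub.T / d ** 0.5
    weights = F.softmax(logits, dim=-1)
    return weights @ v_sub
```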

AnnSA: Approximate Nearest-Neighbor Self-Attention

Self-attention in video diffusion is highly structured: tokens mostly attend to semantically related regions. AnnSA exploits this by assigning tokens to ANN buckets (semantic neighborhoods) and restricting each query to attend only to keys within the same bucket(s). This yields a sparse attention pattern that remains high-recall while dramatically reducing dot-products and bandwidth.
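A minimal sketch of the bucketing idea using random-hyperplane (LSH-style) hashing: tokens whose projections share a sign pattern land in the same bucket, and each query only attends within its bucket. A real implementation would use several hash rounds and block-sparse kernels rather than masking a dense score matrix, and all names here are illustrative.

```python
# Minimal sketch of bucketed (LSH-style) sparse self-attention.
import torch
import torch.nn.functional as F

def annsa_self_attention(q, k, v, num_planes=4):
    """Self-attention restricted to keys that share the query's hash bucket."""
    n, d = q.shape
    planes = torch.randn(d, num_planes, device=q.device)
    powers = 2 ** torch.arange(num_planes, device=q.device)
    # Sign pattern of the projections -> integer bucket id per token.
    q_bucket = (((q @ planes) > 0).long() * powers).sum(dim=-1)   # [n]
    k_bucket = (((k @ planes) > 0).long() * powers).sum(dim=-1)   # [n]

    # Mask out keys outside the query's bucket; always keep the token itself
    # so no query is left with an empty candidate set.
    same_bucket = q_bucket.unsqueeze(1) == k_bucket.unsqueeze(0)  # [n, n]
    same_bucket |= torch.eye(n, dtype=torch.bool, device=q.device)

    # Dense logits here for clarity; a real kernel only computes scores inside buckets.
    logits = (q @ k.T) / d ** 0.5
    logits = logits.masked_fill(~same_bucket, float("-inf"))
    return F.softmax(logits, dim=-1) @ v
```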

Why it works in long rollouts

Because TempCache bounds the size of the KV cache and AnnSA/AnnCA restrict each query to a small retrieved candidate set, the per-frame attention cost and peak memory stop growing with rollout length: throughput stays stable and memory stays nearly constant, instead of degrading as the history accumulates.

Sparse-Attention Comparisons

Side-by-side comparison of different sparse-attention methods within an autoregressive video diffusion model.


Our approach preserves the visual fidelity and temporal consistency of dense attention (FlashAttention3) across diverse prompts, while existing sparsification baselines (SVG1/2) often exhibit artifacts and drift; RadialAttention is more stable but still degrades in challenging scenes.

KV-Cache Compression & Sparse Attention

Side-by-side comparison of different KV-cache compression methods combined with sparse attention.

📄 Citation

@article{samuel2026fastautoregressive,
  title={Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention},
  author={Samuel, Dvir and Tzachor, Issar and Levy, Matan and Green, Michael and Chechik, Gal and Ben-Ari, Rami},
  year={2026}
}