OmnimatteZero
Training-free Real-time Omnimatte with Pre-trained Video Diffusion Models

OriginAI, The Hebrew University of Jerusalem, Bar Ilan University, NVIDIA Research

OmnimatteZero is the first training-free generative approach for Omnimatte, leveraging pre-trained video diffusion models to achieve object removal, foreground extraction, and seamless layer composition in just 0.04 s/frame (on an A100 GPU).

Object Removal

OmnimatteZero enables a training-free approach to removing objects and their associated effects from videos using off-the-shelf video diffusion models.

Foreground Extraction

OmnimatteZero also facilitates the extraction of foreground objects along with their effects using simple video latent arithmetic.

Layer Composition

Extracted foreground layers can then be easily composited onto new background videos.


Abstract

Omnimatte aims to decompose a given video into semantically meaningful layers, including the background and individual objects along with their associated effects, such as shadows and reflections. Existing methods often require extensive training or costly self-supervised optimization.

In this paper, we present OmnimatteZero, a training-free approach that leverages off-the-shelf pre-trained video diffusion models for omnimatte. It can remove objects from videos, extract individual object layers along with their effects, and composite those objects onto new videos.

We accomplish this by adapting zero-shot image inpainting techniques to video object removal, a task they fail to handle effectively out of the box. We then show that self-attention maps capture information about the object and its footprints, and use them to inpaint the object's effects, leaving a clean background.
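To make the adaptation concrete, the core mechanism borrowed from image inpainting is RePaint-style latent blending: at every denoising step, the region outside the mask is replaced with a re-noised copy of the known content, so only the masked region is actually generated. Below is a minimal sketch of that step applied to video latents; it assumes a diffusers-style scheduler and 5D latents, and illustrates the idea rather than the exact implementation.

import torch

def repaint_blend_step(x_t, z_known, mask, scheduler, t):
    """RePaint-style blending on video latents of shape (B, C, F, H, W).

    x_t:       current noisy latent produced by the denoiser
    z_known:   clean latent of the known (background) content
    mask:      1 where content must be generated, 0 where the
               original video should be preserved
    scheduler: any diffusers-style scheduler exposing add_noise()
    """
    noise = torch.randn_like(z_known)
    # Re-noise the known content to the current timestep so its
    # noise level matches the freshly denoised latent.
    x_t_known = scheduler.add_noise(z_known, noise, t)
    # Keep generated content inside the mask, known content outside.
    return mask * x_t + (1.0 - mask) * x_t_known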

Additionally, through simple latent arithmetic, object layers can be isolated and recombined seamlessly with new video layers to produce new videos. Evaluations show that OmnimatteZero not only achieves superior background reconstruction but also sets a new record as the fastest Omnimatte approach, running in real time at 0.04 s per frame.
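The layer arithmetic amounts to one subtraction and one addition in the VAE latent space. A minimal sketch, assuming a diffusers-style video VAE; the function and variable names here are ours, not the paper's:

import torch

@torch.no_grad()
def extract_and_composite(vae, video, background, new_background):
    """Isolate a foreground layer by latent arithmetic and paste it
    onto a new background. All three clips share the same shape.
    """
    z_video = vae.encode(video).latent_dist.mode()           # original clip
    z_bg    = vae.encode(background).latent_dist.mode()      # clip without the object
    z_newbg = vae.encode(new_background).latent_dist.mode()  # target scene

    # The object layer (with its shadows/reflections) is approximately
    # the difference between the full clip and its object-free background.
    z_fg = z_video - z_bg

    # Adding the layer to a new background composites the object,
    # effects included, into the new scene.
    z_out = z_newbg + z_fg
    return vae.decode(z_out).sample

In practice a short denoising pass with the video diffusion model refines the composite and removes seams; the sketch omits that step.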


OmnimatteZero: Training-free object removal

The video and its object-masked version are encoded into the latent space to identify the object's spatio-temporal footprint. Self-attention maps refine the mask and capture object-related effects such as shadows and reflections. Two inpainting variants, Background-Preserving and Object-Removing, are blended using Attention-based Latent Blending. A final denoising pass with the video diffusion model yields a high-quality reconstruction with a well-preserved background and high PSNR.
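One plausible reading of the blending step, shown below as a hedged sketch, is a per-pixel convex combination of the two branches weighted by normalized attention responses. This weighting scheme is our assumption for illustration, not a formula from the paper.

import torch

def attention_latent_blend(z_bg, z_rm, attn, eps=1e-6):
    """Blend the Background-Preserving latent (z_bg) with the
    Object-Removing latent (z_rm), weighted by self-attention.

    z_bg, z_rm: latents of shape (B, C, F, H, W)
    attn:       attention response per latent pixel, (B, 1, F, H, W),
                high where the object and its effects were.
    """
    # Normalize attention responses to [0, 1].
    w = (attn - attn.amin()) / (attn.amax() - attn.amin() + eps)
    # Inside the object/effect region trust the object-removing branch;
    # elsewhere keep the background-preserving branch.
    return w * z_rm + (1.0 - w) * z_bg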


Removing associated object effects

(a) Pretrained video diffusion models associate objects with their effects, as revealed by the self-attention maps between query and key tokens related to the object of interest [Lee et al., 2025]. We propose to derive the masks directly from these attention maps, enabling a training-free approach for removing objects together with their associated effects (a sketch follows below). (b) Unlike video diffusion models, image models do not capture object effects from still images. This aligns with the Gestalt principle of common fate: an object and its shadow or reflection move together over time, a grouping cue available only to models trained on video.
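A hedged sketch of how such a mask can be read out of the attention tensors: average the self-attention over heads, measure how strongly every token interacts with the object's tokens, and threshold. The tensor shapes and the threshold value are assumptions.

import torch

def effect_mask_from_attention(attn, object_tokens, thresh=0.4):
    """Derive a mask covering the object AND its effects.

    attn:          self-attention weights, shape (heads, Q, K),
                   over the latent tokens of a video
    object_tokens: boolean tensor (Q,) marking tokens inside the
                   user-provided object mask
    """
    a = attn.mean(dim=0)                      # average heads -> (Q, K)
    # Mean attention that object-token queries pay to every key token:
    response = a[object_tokens].mean(dim=0)   # (K,)
    response = response / (response.max() + 1e-6)
    # High-response tokens belong to the object or its footprint
    # (shadow, reflection); thresholding yields the refined mask.
    return response > thresh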


Quantitative Comparison

Method                            Movie             Kubric            Average           Training (h)   Runtime (s/frame)
                                  PSNR↑    LPIPS↓   PSNR↑    LPIPS↓   PSNR↑    LPIPS↓
ObjectDrop                        28.05    0.124    34.22    0.083    31.14    0.104     -              -
Video RePaint [LTXVideo]*         20.13    0.252    21.15    0.289    20.64    0.271     0              0.04
Video RePaint [Wan2.1]*           21.44    0.244    24.16    0.261    22.80    0.253     0              3.2
Lumiere inpainting                26.62    0.148    31.46    0.157    29.04    0.153     -              9
ProPainter                        27.44    0.114    34.67    0.056    31.06    0.085     -              0.083
DiffuEraser                       29.51    0.105    35.19    0.048    32.35    0.077     -              0.8
Omnimatte                         21.76    0.239    26.81    0.207    24.29    0.223     3              2.5
D2NeRF                            -        -        34.99    0.113    -        -         3              2.2
LNA                               23.10    0.129    -        -        -        -         8.5            0.4
OmnimatteRF                       33.86    0.017    40.91    0.028    37.38    0.023     6              3.5
Generative Omnimatte              32.69    0.030    44.07    0.010    38.38    0.020     -              9
OmnimatteZero [LTXVideo] (Ours)   34.11    0.015    44.07    0.010    39.09    0.012     0              0.04
OmnimatteZero [Wan2.1] (Ours)     33.89    0.019    44.25    0.008    39.02    0.013     0              3.2
OmnimatteZero outperforms all omnimatte and video inpainting methods, achieving the best PSNR and LPIPS without training or per-video optimization. It also runs significantly faster, with OmnimatteZero [LTXVideo] at 0.04 s per frame. "-" denotes missing values due to unreported data or unavailable public code.

BibTeX

@inproceedings{samuel2025omnimattezero,
  author    = {Dvir Samuel and Matan Levy and Nir Darshan and Gal Chechik and Rami Ben-Ari},
  title     = {OmnimatteZero: Training-free Real-time Omnimatte with Pre-trained Video Diffusion Models},
  booktitle = {arXiv preprint},
  year      = {2025}
}