OmnimatteZero
Training-free Real-time Omnimatte with Pre-trained Video Diffusion Models

OriginAI, The Hebrew University of Jerusalem, Bar Ilan University, NVIDIA Research

OmnimatteZero is the first training-free generative approach for Omnimatte, leveraging pre-trained video diffusion models to achieve object removal, foreground extraction, and seamless layer composition in just 0.04 s/frame (on an A100 GPU).

Object Removal

OmnimatteZero enables a training-free approach to removing objects and their associated effects from videos using off-the-shelf video diffusion models.

Foreground Extraction

OmnimatteZero also facilitates the extraction of foreground objects along with their effects using simple video latent arithmetic.

Layer Composition

Extracted foreground layers can then be easily composited onto new background videos.


Abstract

Omnimatte aims to decompose a given video into semantically meaningful layers, including the background and individual objects along with their associated effects, such as shadows and reflections. Existing methods often require extensive training or costly self-supervised optimization.

In this paper, we present OmnimatteZero, a training-free approach that leverages off-the-shelf pre-trained video diffusion models for omnimatte. It can remove objects from videos, extract individual object layers along with their effects, and composite those objects onto new videos.

We accomplish this by adapting zero-shot image inpainting techniques to video object removal, a task they fail to handle effectively out of the box. We then show that self-attention maps capture information about the object and its footprints, and use them to inpaint the object's effects, leaving a clean background.
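To make the adaptation concrete, the core mechanism borrowed from image inpainting is RePaint-style latent blending: at every denoising step, the region outside the mask is replaced with a re-noised copy of the known content, so only the masked region is actually generated. Below is a minimal sketch of that step applied to video latents; it assumes a diffusers-style scheduler and 5D latents, and illustrates the idea rather than the exact implementation.

import torch

def repaint_blend_step(x_t, z_known, mask, scheduler, t):
    """RePaint-style blending on video latents of shape (B, C, F, H, W).

    x_t:       current noisy latent produced by the denoiser
    z_known:   clean latent of the known (background) content
    mask:      1 where content must be generated, 0 where the
               original video should be preserved
    scheduler: any diffusers-style scheduler exposing add_noise()
    """
    noise = torch.randn_like(z_known)
    # Re-noise the known content to the current timestep so its
    # noise level matches the freshly denoised latent.
    x_t_known = scheduler.add_noise(z_known, noise, t)
    # Keep generated content inside the mask, known content outside.
    return mask * x_t + (1.0 - mask) * x_t_known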

Additionally, through simple latent arithmetic, object layers can be isolated and recombined seamlessly with new video layers to produce new videos. Evaluations show that OmnimatteZero not only achieves superior background reconstruction but also sets a new record as the fastest Omnimatte approach, running in real time at 0.04 s per frame.
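The layer arithmetic amounts to one subtraction and one addition in the VAE latent space. A minimal sketch, assuming a diffusers-style video VAE; the function and variable names here are ours, not the paper's:

import torch

@torch.no_grad()
def extract_and_composite(vae, video, background, new_background):
    """Isolate a foreground layer by latent arithmetic and paste it
    onto a new background. All three clips share the same shape.
    """
    z_video = vae.encode(video).latent_dist.mode()           # original clip
    z_bg    = vae.encode(background).latent_dist.mode()      # clip without the object
    z_newbg = vae.encode(new_background).latent_dist.mode()  # target scene

    # The object layer (with its shadows/reflections) is approximately
    # the difference between the full clip and its object-free background.
    z_fg = z_video - z_bg

    # Adding the layer to a new background composites the object,
    # effects included, into the new scene.
    z_out = z_newbg + z_fg
    return vae.decode(z_out).sample

In practice a short denoising pass with the video diffusion model refines the composite and removes seams; the sketch omits that step.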


OmnimatteZero: Training-free object removal

The video and its object-masked version are encoded into the latent space to identify the object's spatio-temporal footprint. Self-attention maps refine the mask and capture object-related effects such as shadows and reflections. Two inpainting variants, Background-Preserving and Object-Removing, are blended using Attention-based Latent Blending. A final denoising pass with the video diffusion model yields a high-quality reconstruction with a well-preserved background and high PSNR.
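One plausible reading of the blending step, shown below as a hedged sketch, is a per-pixel convex combination of the two branches weighted by normalized attention responses. This weighting scheme is our assumption for illustration, not a formula from the paper.

import torch

def attention_latent_blend(z_bg, z_rm, attn, eps=1e-6):
    """Blend the Background-Preserving latent (z_bg) with the
    Object-Removing latent (z_rm), weighted by self-attention.

    z_bg, z_rm: latents of shape (B, C, F, H, W)
    attn:       attention response per latent pixel, (B, 1, F, H, W),
                high where the object and its effects were.
    """
    # Normalize attention responses to [0, 1].
    w = (attn - attn.amin()) / (attn.amax() - attn.amin() + eps)
    # Inside the object/effect region trust the object-removing branch;
    # elsewhere keep the background-preserving branch.
    return w * z_rm + (1.0 - w) * z_bg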


Removing associated object effects

(a) Pretrained video diffusion models associate objects with their effects, as revealed by the self-attention maps between query and key tokens related to the object of interest [Lee et al., 2025]. We propose to derive the masks directly from these attention maps, enabling a training-free approach for removing objects together with their associated effects (a sketch follows below). (b) Unlike video diffusion models, image models do not capture object effects from still images. This aligns with the Gestalt principle of common fate: an object and its shadow or reflection move together over time, a grouping cue available only to models trained on video.
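A hedged sketch of how such a mask can be read out of the attention tensors: average the self-attention over heads, measure how strongly every token interacts with the object's tokens, and threshold. The tensor shapes and the threshold value are assumptions.

import torch

def effect_mask_from_attention(attn, object_tokens, thresh=0.4):
    """Derive a mask covering the object AND its effects.

    attn:          self-attention weights, shape (heads, Q, K),
                   over the latent tokens of a video
    object_tokens: boolean tensor (Q,) marking tokens inside the
                   user-provided object mask
    """
    a = attn.mean(dim=0)                      # average heads -> (Q, K)
    # Mean attention that object-token queries pay to every key token:
    response = a[object_tokens].mean(dim=0)   # (K,)
    response = response / (response.max() + 1e-6)
    # High-response tokens belong to the object or its footprint
    # (shadow, reflection); thresholding yields the refined mask.
    return response > thresh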


Quantitative Comparison

Method                            Movie             Kubric            Average           Training (h)   Runtime (s/frame)
                                  PSNR↑    LPIPS↓   PSNR↑    LPIPS↓   PSNR↑    LPIPS↓
ObjectDrop                        28.05    0.124    34.22    0.083    31.14    0.104     -              -
Video RePaint [LTXVideo]*         20.13    0.252    21.15    0.289    20.64    0.271     0              0.04
Video RePaint [Wan2.1]*           21.44    0.244    24.16    0.261    22.80    0.253     0              3.2
Lumiere inpainting                26.62    0.148    31.46    0.157    29.04    0.153     -              9
ProPainter                        27.44    0.114    34.67    0.056    31.06    0.085     -              0.083
DiffuEraser                       29.51    0.105    35.19    0.048    32.35    0.077     -              0.8
Omnimatte                         21.76    0.239    26.81    0.207    24.29    0.223     3              2.5
D2NeRF                            -        -        34.99    0.113    -        -         3              2.2
LNA                               23.10    0.129    -        -        -        -         8.5            0.4
OmnimatteRF                       33.86    0.017    40.91    0.028    37.38    0.023     6              3.5
Generative Omnimatte              32.69    0.030    44.07    0.010    38.38    0.020     -              9
OmnimatteZero [LTXVideo] (Ours)   34.11    0.015    44.07    0.010    39.09    0.012     0              0.04
OmnimatteZero [Wan2.1] (Ours)     33.89    0.019    44.25    0.008    39.02    0.013     0              3.2
OmnimatteZero outperforms all omnimatte and video inpainting methods, achieving the best PSNR and LPIPS without training or per-video optimization. It also runs significantly faster, with OmnimatteZero [LTXVideo] at 0.04 s per frame. "-" denotes missing values due to unreported data or unavailable public code.

BibTeX

@inproceedings{samuel2025omnimattezero,
  author    = {Dvir Samuel and Matan Levy and Nir Darshan and Gal Chechik and Rami Ben-Ari},
  title     = {OmnimatteZero: Training-free Real-time Omnimatte with Pre-trained Video Diffusion Models},
  booktitle = {arXiv preprint},
  year      = {2025}
}