Where's Waldo: Diffusion Features For Personalized Segmentation and Retrieval

1Bar Ilan University, 2OriginAI, 3The Hebrew University of Jerusalem, 4NVIDIA Research
NeurIPS 2024

The personalized segmentation task involves segmenting a specific reference object in a new scene. Our method accurately identifies the specific reference instance in the target image, even when other objects from the same class are present. While other methods capture visually or semantically similar objects, ours successfully extracts the identical instance by using a new personalized feature map and fusing semantic and appearance cues. Red and green indicate incorrect and correct segmentations, respectively.


Abstract

Personalized retrieval and segmentation aim to locate specific instances within a dataset based on an input image and a short description of the reference instance. While supervised methods are effective, they require extensive labeled data for training. Recently, self-supervised foundation models have been applied to these tasks, showing results comparable to supervised methods. However, these models exhibit a significant flaw: they struggle to locate the desired instance when other instances of the same class are present. In this paper, we explore text-to-image diffusion models for these tasks. Specifically, we propose a novel approach called PDM, for Personalized Diffusion Features Matching, that leverages intermediate features of pre-trained text-to-image models for personalization tasks without any additional training. PDM demonstrates superior performance on popular retrieval and segmentation benchmarks, outperforming even supervised methods. We also highlight notable shortcomings in current instance retrieval and segmentation datasets and propose new benchmarks for these tasks.


Are instance features even encoded in a pre-trained text-to-image model?

(a) Appearance Features: We found that instance appearance features are encoded in the query (\(\mathcal{Q}^{SA}\)) and key (\(\mathcal{K}^{SA}\)) matrices of the self-attention block. Here we show a PCA visualization of features obtained from the first self-attention block in the last layer of the U-Net module, at various diffusion timesteps. Objects with similar textures and colors have similar features. The dog's color in \(I_1\) is similar to the colors of both the dog and the cat in \(I_2\), indicating textural similarity. Additionally, the localization is sharper at larger timesteps.
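As a rough illustration of how such appearance features can be probed, the sketch below hooks the query and key projections of a Stable Diffusion self-attention block (via the diffusers library) and reduces them with PCA. The module path, timestep, and preprocessing here are illustrative assumptions, not the paper's exact configuration.

import torch
from diffusers import StableDiffusionPipeline
from sklearn.decomposition import PCA

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

captured = {}
# first self-attention block in the last U-Net layer (assumed module path)
attn = pipe.unet.up_blocks[-1].attentions[0].transformer_blocks[0].attn1
attn.to_q.register_forward_hook(lambda m, i, o: captured.update(q=o))
attn.to_k.register_forward_hook(lambda m, i, o: captured.update(k=o))

@torch.no_grad()
def appearance_features(image, t=500):
    """image: (1, 3, 512, 512) tensor in [-1, 1]; returns per-location Q/K features."""
    latents = pipe.vae.encode(image.half().to("cuda")).latent_dist.mean
    latents = latents * pipe.vae.config.scaling_factor
    noise = torch.randn_like(latents)
    timestep = torch.tensor([t], device=latents.device)
    noisy = pipe.scheduler.add_noise(latents, noise, timestep)
    # a single U-Net pass with an empty prompt; we only read the hooked activations
    tokens = pipe.tokenizer("", padding="max_length",
                            max_length=pipe.tokenizer.model_max_length,
                            return_tensors="pt").input_ids.to("cuda")
    text_emb = pipe.text_encoder(tokens)[0]
    pipe.unet(noisy, timestep, encoder_hidden_states=text_emb)
    return torch.cat([captured["q"], captured["k"]], dim=-1)   # (1, H*W, 2*d)

def pca_visualization(features, h=64, w=64):
    """Project each spatial location onto 3 principal components for display."""
    x = features[0].float().cpu().numpy()
    return PCA(n_components=3).fit_transform(x).reshape(h, w, 3)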

(b) Semantic Features: Visualization of the cross-attention map for the prompt "dog". Note the higher correlation (brighter colors) in regions corresponding to the dog, while the cat in the bottom image is overlooked.
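A similar probe can be written for the semantic cue. The sketch below reuses the pipeline and captured dict from the previous snippet, hooks the cross-attention projections, and reads out the attention weights for the class-name token ("dog"). Treating the attention as single-head, as well as the chosen layer and timestep, are simplifications for illustration only.

# cross-attention block in the same transformer block (assumed module path)
xattn = pipe.unet.up_blocks[-1].attentions[0].transformer_blocks[0].attn2
xattn.to_q.register_forward_hook(lambda m, i, o: captured.update(xq=o))
xattn.to_k.register_forward_hook(lambda m, i, o: captured.update(xk=o))

@torch.no_grad()
def semantic_map(image, prompt="a photo of a dog", class_word="dog", t=500, h=64, w=64):
    """Cross-attention weights between image locations and the class-name token."""
    latents = pipe.vae.encode(image.half().to("cuda")).latent_dist.mean
    latents = latents * pipe.vae.config.scaling_factor
    noise = torch.randn_like(latents)
    timestep = torch.tensor([t], device=latents.device)
    noisy = pipe.scheduler.add_noise(latents, noise, timestep)
    tokens = pipe.tokenizer(prompt, padding="max_length",
                            max_length=pipe.tokenizer.model_max_length,
                            return_tensors="pt")
    text_emb = pipe.text_encoder(tokens.input_ids.to("cuda"))[0]
    pipe.unet(noisy, timestep, encoder_hidden_states=text_emb)
    q, k = captured["xq"].float(), captured["xk"].float()        # (1, H*W, d), (1, 77, d)
    attn_probs = (q @ k.transpose(1, 2) / q.shape[-1] ** 0.5).softmax(dim=-1)
    # locate the class-name token inside the tokenized prompt
    word_id = pipe.tokenizer(class_word, add_special_tokens=False).input_ids[0]
    token_idx = tokens.input_ids[0].tolist().index(word_id)
    return attn_probs[0, :, token_idx].reshape(h, w)             # brighter = stronger "dog" response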


PDM: Personalized Diffusion Features Matching

An overview of our Personalized Diffusion Features Matching approach. PDM combines semantic and appearance features for zero-shot personalized retrieval and segmentation. We first extract features from the reference image \(I_r\) and the target image \(I_t\). Appearance similarity is determined by the dot product between the cropped foreground features of the reference feature map \(F_r^{AM}\) and the target feature map \(F^A_t\). Semantic similarity is calculated as the product between the class-name token \(C\) and the target semantic feature map \(F^S_t\), yielding a semantic map. The final similarity map combines both maps by average pooling.
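For concreteness, a minimal sketch of this fusion step is given below, with tensor names mirroring the text (reference appearance map, target appearance map, target semantic map, and class token \(C\)). The max-over-reference-locations aggregation and the min-max normalization are assumptions made for illustration, not the paper's exact implementation.

import torch
import torch.nn.functional as F

def pdm_similarity(Fr_A, ref_mask, Ft_A, Ft_S, C):
    """
    Fr_A:     (H, W, d_a) reference appearance feature map
    ref_mask: (H, W) boolean foreground mask of the reference instance
    Ft_A:     (H, W, d_a) target appearance feature map
    Ft_S:     (H, W, d_s) target semantic feature map
    C:        (d_s,) text embedding of the class-name token
    Returns a (H, W) similarity map over the target image.
    """
    # appearance cue: dot product of every target location with the reference
    # foreground features, aggregated by a max over reference locations
    ref_fg = F.normalize(Fr_A[ref_mask], dim=-1)                  # (N_fg, d_a)
    tgt = F.normalize(Ft_A.flatten(0, 1), dim=-1)                 # (H*W, d_a)
    appearance = (tgt @ ref_fg.T).max(dim=-1).values.reshape(Ft_A.shape[:2])

    # semantic cue: correlation of every target location with the class token
    semantic = F.normalize(Ft_S, dim=-1) @ F.normalize(C, dim=-1)  # (H, W)

    # fuse the two cues by averaging the (min-max normalized) maps
    def norm01(x):
        return (x - x.min()) / (x.max() - x.min() + 1e-8)
    return 0.5 * (norm01(appearance) + norm01(semantic))

The resulting map can then be thresholded to obtain a personalized segmentation mask, or reduced to a single score (e.g., by taking its maximum) to rank images for retrieval.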


Personalized Segmentation

Red and green indicate incorrect and correct segmentations, respectively. Our method accurately recognizes the reference instance despite significant variations in view angle, pose, and scale, while other methods often capture false positives from the same category.



Personalized Retrieval

The top-1 retrieved image is shown for each method. Note how our model identifies images containing the same instance, despite the instance's small size and large appearance variations. Other methods tend to capture only semantic similarity.

BibTeX

@article{Samuel2024Waldo,
  title={Where's Waldo: Diffusion Features For Personalized Segmentation and Retrieval},
  author={Dvir Samuel and Rami Ben-Ari and Matan Levy and Nir Darshan and Gal Chechik},
  journal={NeurIPS},
  year={2024}
}