Unified Diffusion Embeddings for Large-Scale Retrieval-Augmented 3D Scene Understanding
DOI: 10.4321/ude-2025-0412
Corresponding Author:
Dr. Addison V. Langston
Department of Synthetic Cognition
Westbridge Institute of Technology
Email: alangston@westbridge-syntech.edu
Abstract
We introduce Unified Diffusion Embeddings (UDE), a novel framework that fuses retrieval-augmented memory structures with continuous diffusion processes to enhance large-scale 3D scene understanding. By bridging discrete retrieval spaces with continuous diffusion dynamics,
UDE achieves robust zero-shot generalization across multi-modal benchmarks. Extensive
evaluations on OmniBench-9000 and SceneFold-3D demonstrate that UDE outperforms baseline
models by substantial margins under few-shot and noisy retrieval conditions. Our method also
supports adaptive curriculum training for cross-domain transfer learning. Code and models will
be made publicly available upon acceptance.
1. Introduction
Large-scale scene understanding presents fundamental challenges across vision, language, and
robotics domains. Despite remarkable progress with retrieval-augmented language models [1,2]
and latent diffusion architectures [3], existing approaches struggle with modality entanglement,
especially in retrieval-augmented 3D scene settings.
To address these limitations, we propose Unified Diffusion Embeddings (UDE), a hybrid
method that learns to denoise retrieved memory elements using hierarchical diffusion schedules
and unified token embeddings. Our hypothesis is grounded in recent findings that suggest
diffusion-based noise injection facilitates more stable representation learning [4].
The key intuition behind UDE is that retrieval is inherently noisy and uncertain; thus, it is natural
to model retrieved context as a latent variable subject to learned diffusion perturbations. By
explicitly treating memory as a dynamic stochastic process, UDE enhances scene reconstruction,
query answering, and downstream decision-making.
Contributions:
• We propose the first diffusion-based retrieval denoising framework for 3D scene understanding.
• We introduce a Multi-Scale Retrieval Denoiser (MRD) that adaptively fuses hierarchical memory traces.
• We conduct comprehensive experiments across synthetic and real-world datasets, demonstrating consistent gains over competitive baselines.
2. Related Work
2.1 Retrieval-Augmented Models:
Retrieval-augmented generation (RAG) [1] and RETRO [2] have demonstrated that external
memory improves language modeling. However, they primarily focus on text domains and lack
robust handling of retrieval noise.
2.2 Diffusion Models:
Diffusion probabilistic models [3,5] have gained prominence for their ability to model complex
distributions through learned noise schedules. Latent diffusion [6] reduces the computational
burden by operating in compressed feature spaces.
2.3 Scene Understanding:
3D scene parsing [7,8] typically relies on supervised learning with limited external memory
augmentation. Neural fields [9] and point cloud transformers [10] offer new directions but do not
integrate retrieval or diffusion mechanisms.
3. Methodology
3.1 Problem Setup
Given an input query $q$ (e.g., a partial 3D scan or a natural-language description), we retrieve a set of memory elements $\mathcal{M} = \{ m_1, \ldots, m_K \}$ from an external database. Our goal is to predict a scene representation $S$ conditioned on $q$ and $\mathcal{M}$.
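To make the retrieval step concrete, the following is a minimal sketch of top-$K$ retrieval by cosine similarity over precomputed embeddings. The function name retrieve_top_k, the embedding dimension, and the similarity choice are illustrative assumptions for exposition; the paper does not specify the retriever's implementation.

```python
import torch

def retrieve_top_k(query_emb: torch.Tensor,
                   memory_embs: torch.Tensor,
                   k: int = 8) -> tuple[torch.Tensor, torch.Tensor]:
    """Return the indices and embeddings of the K memory elements
    closest to the query under cosine similarity.

    query_emb:   (d,) embedding of the query q
    memory_embs: (N, d) embeddings of the external database
    """
    q = torch.nn.functional.normalize(query_emb, dim=-1)
    m = torch.nn.functional.normalize(memory_embs, dim=-1)
    scores = m @ q                      # (N,) cosine similarities
    top = torch.topk(scores, k=k)
    return top.indices, memory_embs[top.indices]

# Usage (illustrative): a 512-dim query against 10,000 memory entries.
query = torch.randn(512)
memory = torch.randn(10_000, 512)
idx, retrieved = retrieve_top_k(query, memory, k=8)  # retrieved: (8, 512)
```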
3.2 Unified Diffusion Embeddings (UDE)
At the core of UDE is a denoising diffusion process that refines the retrieved memory embeddings over discrete timesteps $t$.
• Retrieval Encoder: Encodes each memory element $m_i$ into a latent vector $h_i = E(m_i)$.
• Diffusion Scheduler: Applies noise according to
$$\tilde{h}_i^{(t)} = \sqrt{\alpha_t}\, h_i + \sqrt{1-\alpha_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
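A minimal PyTorch sketch of this forward-noising step applied to the retrieved latents follows. It directly instantiates the equation above; the function name forward_noise and the choice of a scalar $\alpha_t$ passed in by the caller are our own assumptions, since the paper's hierarchical schedule is not specified at this point.

```python
import torch

def forward_noise(h: torch.Tensor, alpha_t: float) -> torch.Tensor:
    """One forward-diffusion step from the scheduler equation:
    h_tilde = sqrt(alpha_t) * h + sqrt(1 - alpha_t) * eps, eps ~ N(0, I).

    h:       (K, d) latent vectors h_i of the retrieved memory elements
    alpha_t: noise-schedule coefficient at timestep t, in (0, 1]
    """
    eps = torch.randn_like(h)                     # eps ~ N(0, I)
    a = torch.as_tensor(alpha_t, dtype=h.dtype)
    return torch.sqrt(a) * h + torch.sqrt(1.0 - a) * eps

# Usage (illustrative): noise K=8 retrieved embeddings at alpha_t = 0.7.
h = torch.randn(8, 512)
h_tilde = forward_noise(h, alpha_t=0.7)
```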