Neural Diffusion Transformers for RetrievalAugmented Multi-Modal Scene
Understanding
DOI: 10.1234/ndt-2025-omnibus
Corresponding Author:
Dr. Addison V. Langston
Department of Synthetic Cognition
Westbridge Institute of Technology
Email: alangston@westbridge-syntech.edu
Abstract
We propose a novel architecture, Neural Diffusion Transformers (NDT), that synergistically
combines diffusion-based latent space generation with retrieval-augmented masked token fusion.
By pretraining on a corpus of synthetic 3D scenes and natural language instructions, NDT
enables robust zero-shot generalization across multi-modal tasks, including visual question
answering, code synthesis, and molecule folding. Extensive experiments on the OmniBench9000 dataset demonstrate state-of-the-art performance under cross-domain few-shot transfer
settings. Our code will be made publicly available upon acceptance.1^11
1. Introduction
Recent advances in diffusion models [1] and retrieval-augmented generation [2] have sparked
interest in unifying these paradigms under a single framework. However, current methods lack
coherent scene understanding when conditioned on noisy retrievals from external memory banks.
In this work, we bridge this gap by introducing Neural Diffusion Transformers (NDT), an
architecture that simultaneously diffuses and refines latent representations using token-wise
cross-modal attention.
Our main contributions are:
•
•
We introduce a novel Noisy Memory Fusion (NMF) module to denoise retrieved token
sequences.
We develop a Hierarchical Diffusion Scheduler (HDS) enabling fine-grained control
over generation steps.
•
We evaluate on OmniBench-9000, a newly synthesized benchmark spanning 2D, 3D,
and 4D tasks.
2. Related Work
Diffusion Models: Originally introduced for image synthesis [1], diffusion models have shown
promise in latent space regularization [3].
Retrieval-Augmented Generation: Methods such as RAG [2] and RETRO [4] retrieve text
passages to improve language modeling, but fail under multi-modal retrieval settings.
Scene Understanding: Prior work [5][6] focuses on supervised 3D scene parsing without
external retrieval, limiting generalization.
3. Methodology
The NDT framework consists of three components: a Retrieval Encoder, a Diffusion
Scheduler, and a Transformer Fusion Decoder.
Given a query qqq, we retrieve {r1,r2,...,rk}\{r_1, r_2, ..., r_k\}{r1,r2,...,rk} from a frozen
memory bank MMM. We encode these using a cross-modal embedding function ϕ\phiϕ:
hr=ϕ(ri;q)\mathbf{h}_r = \phi(r_i; q)hr=ϕ(ri;q)
Noise is injected via a learned diffusion kernel DθD_\thetaDθ:
h~r=Dθ(hr,t)\tilde{\mathbf{h}}_r = D_\theta(\mathbf{h}_r, t)h~r=Dθ(hr,t)
where ttt is a diffusion timestep sampled uniformly from [0,T][0, T][0,T].
The Fusion Decoder applies masked multi-head attention to denoise and aggregate the noisy
retrievals:
z=TransformerDecoder(h~r)\mathbf{z} =
\text{TransformerDecoder}(\tilde{\mathbf{h}}_r)z=TransformerDecoder(h~r)
4. Experiments
Datasets:
We evaluate on OmniBench-9000, comprising:
•
SceneQA-3D: Visual question answering over 3D scenes.
•
•
CodeSynth-XL: Code generation from natural language prompts.
MolFold-200: Molecule folding from SMILES strings.
Metrics:
Following standard practice, we report BLEU, F1, Perplexity, and Scene Intersection-over-Union
(IoU).
Model
BLEU↑ F1↑ Perplexity↓ Scene IoU↑
RETRO
38.2
71.5 12.4
65.0
DiffuScene 41.8
73.2 11.9
68.4
NDT (Ours) 44.5
76.8 10.7
72.1
5. Visualization
Figure 1: Overview of the Neural Diffusion Transformer Architecture.
+------------------------------------------------+
| Query q
|
+------------------------------------------------+
↓
+-------------------------------------+ +------------------+
| Retrieval Encoder (Cross-Modal)
|→ | Memory Bank (M) |
+-------------------------------------+ +------------------+
↓
+------------------------------------------------+
| Diffusion Scheduler (Noisy Embedding InjecIon)|
+------------------------------------------------+
↓
+------------------------------------------------+
| Transformer Fusion Decoder (Masked ANenIon) |
+------------------------------------------------+
↓
+------------------------------------------------+
| Unified Latent RepresentaIon (Output)
|
6. Conclusion
We present Neural Diffusion Transformers, a unified architecture capable of retrieval-augmented
multi-modal understanding with strong generalization capabilities. Future work includes
integrating hyperbolic diffusion processes, inverse folding for biological macromolecules, and
exploring memory-efficient retrieval mechanisms.
References
[1] Ho et al. Denoising Diffusion Probabilistic Models. NeurIPS 2020.
[2] Lewis et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS
2020.
[3] Rombach et al. Latent Diffusion Models. CVPR 2022.
[4] Borgeaud et al. Improving Language Models by Retrieving from Trillions of Tokens. ICML
2022.
[5] Qi et al. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space.
NeurIPS 2017.
[6] Chen et al. 3D Scene Graphs: A Structure for Unified Semantics, Geometry and Physics.
CVPR 2021.