Paper deep dive
Adaptive Clinical-Aware Latent Diffusion for Multimodal Brain Image Generation and Missing Modality Imputation
Rong Zhou, Houliang Zhou, Yao Su, Brian Y. Chen, Yu Zhang, Lifang He, Alzheimer's Disease Neuroimaging Initiative
Abstract
Abstract: Multimodal neuroimaging provides complementary insights for Alzheimer's disease diagnosis, yet clinical datasets frequently suffer from missing modalities. We propose ACADiff, a framework that synthesizes missing brain imaging modalities through adaptive clinical-aware diffusion. ACADiff learns mappings between incomplete multimodal observations and target modalities by progressively denoising latent representations while attending to available imaging data and clinical metadata. The framework employs adaptive fusion that dynamically reconfigures based on input availability, coupled with semantic clinical guidance via GPT-4o-encoded prompts. Three specialized generators enable bidirectional synthesis among sMRI, FDG-PET, and AV45-PET. Evaluated on ADNI subjects, ACADiff achieves superior generation quality and maintains robust diagnostic performance even under extreme 80% missing scenarios, outperforming all existing baselines. To promote reproducibility, code is available at https://github.com/rongzhou7/ACADiff
Tags
Links
- Source: https://arxiv.org/abs/2603.09931v1
- Canonical: https://arxiv.org/abs/2603.09931v1
PDF not stored locally. Use the link above to view on the source site.
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 97%
Last extracted: 3/13/2026, 1:06:33 AM
Summary
ACADiff is a novel latent diffusion framework designed for multimodal brain image synthesis and missing modality imputation in Alzheimer's disease research. It utilizes adaptive multi-source fusion, semantic clinical guidance via GPT-4o-encoded prompts, and specialized generators to synthesize sMRI, FDG-PET, and AV45-PET modalities, demonstrating superior performance and diagnostic robustness even under 80% missing data scenarios.
Entities (7)
Relation Signals (4)
ACADiff → evaluated_on → ADNI
confidence 100% · Evaluated on ADNI subjects, ACADiff achieves superior generation quality
ACADiff → uses_guidance → GPT-4o
confidence 100% · coupled with semantic clinical guidance via GPT-4o-encoded prompts.
ACADiff → imputes_modality → sMRI
confidence 95% · Three specialized generators enable bidirectional synthesis among sMRI, FDG-PET, and AV45-PET.
ACADiff → diagnoses → Alzheimer's disease
confidence 90% · Multimodal neuroimaging provides complementary insights for Alzheimer's disease diagnosis
Cypher Suggestions (2)
Find all imaging modalities supported by the ACADiff framework · confidence 95% · unvalidated
MATCH (f:Framework {name: 'ACADiff'})-[:IMPUTES_MODALITY]->(m:Modality) RETURN m.name
Identify datasets used to evaluate the framework · confidence 95% · unvalidated
MATCH (f:Framework {name: 'ACADiff'})-[:EVALUATED_ON]->(d:Dataset) RETURN d.name
Full Text
20,260 characters extracted from source content.
ADAPTIVE CLINICAL-AWARE LATENT DIFFUSION FOR MULTIMODAL BRAIN IMAGE GENERATION AND MISSING MODALITY IMPUTATION

Rong Zhou 1, Houliang Zhou 1, Yao Su 2, Brian Y. Chen 1, Yu Zhang 3,4,5, Lifang He 1, Alzheimer's Disease Neuroimaging Initiative 6
1 Department of Computer Science and Engineering, Lehigh University, PA, USA; 2 Worcester Polytechnic Institute, MA, USA; 3 Department of Psychiatry and Behavioral Sciences, Stanford University School of Medicine, CA, USA; 4 Wu Tsai Neurosciences Institute, Stanford University, CA, USA; 5 Stanford Institute for Human-Centered AI, Stanford, CA, USA; 6 Alzheimer's Disease Neuroimaging Initiative

ABSTRACT

Multimodal neuroimaging provides complementary insights for Alzheimer's disease diagnosis, yet clinical datasets frequently suffer from missing modalities. We propose ACADiff, a framework that synthesizes missing brain imaging modalities through adaptive clinical-aware diffusion. ACADiff learns mappings between incomplete multimodal observations and target modalities by progressively denoising latent representations while attending to available imaging data and clinical metadata. The framework employs adaptive fusion that dynamically reconfigures based on input availability, coupled with semantic clinical guidance via GPT-4o-encoded prompts. Three specialized generators enable bidirectional synthesis among sMRI, FDG-PET, and AV45-PET. Evaluated on ADNI subjects, ACADiff achieves superior generation quality and maintains robust diagnostic performance even under extreme 80% missing scenarios, outperforming all existing baselines. To promote reproducibility, code is available at https://github.com/rongzhou7/ACADiff.

Index Terms: Latent diffusion, multimodal imaging, missing modality imputation, Alzheimer's disease

1. INTRODUCTION

Multimodal neuroimaging has become increasingly essential for understanding Alzheimer's disease (AD), as different modalities capture complementary pathological aspects [1].
For example, MRI quantifies structural brain atrophy, FDG-PET measures regional glucose metabolism, and AV45-PET reveals amyloid deposition [2]. Together, these modalities offer a more complete characterization of the underlying disease process than any single modality alone [3].

Unfortunately, this potential is limited in part because real-world datasets frequently suffer from incomplete modalities, where not all imaging scans are available for every subject due to prohibitive cost, acquisition protocol variability, or unexpected patient dropout [4]. This incompleteness limits both research insights and clinical decision-making.

Recent advances in generative modeling have shown promise for synthesizing missing modalities [5]. While conditional GANs like Pix2Pix [6] and DS-GAN [7] provide initial solutions, they often suffer from mode collapse and training instability. Recent diffusion models [8, 9, 10] offer improved stability and generation quality. However, existing approaches still face key limitations: (1) lack of adaptive fusion mechanisms for varying input combinations; (2) limited integration of clinical information beyond disease labels; (3) absence of semantic understanding of medical metadata.

To address these challenges, we propose ACADiff (Adaptive Clinical-Aware Diffusion), a framework that synthesizes missing brain imaging modalities through hierarchical conditional diffusion. ACADiff learns the complex mapping between incomplete multimodal observations and target modalities by progressively denoising latent representations while simultaneously attending to available imaging data and clinical metadata. The framework employs adaptive fusion strategies that dynamically reconfigure based on input availability, coupled with semantic clinical guidance that ensures disease-relevant patterns are preserved throughout the generation process.
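The availability-driven fusion just described (attend across two source modalities when both are present, project when only one is) can be sketched minimally as follows. This is a toy numpy stand-in, not the paper's implementation: the pooled-token shapes, the single-head attention, the fixed 0.5 projection, and the seed are all illustrative assumptions for the learned CrossAttn and Proj modules of Sec. 2.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 8  # toy feature dimension; real ACADiff latents are 3D volumes (20x22x20)

def cross_attn(zi, zj):
    # Toy stand-in for multi-head cross-attention between pooled features:
    # scaled dot-product attention from zi's tokens onto zj's tokens.
    scores = zi @ zj.T / np.sqrt(D)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ zj

def proj(zi):
    # Toy stand-in for the learnable 3D-conv projection used for a single input.
    return 0.5 * zi

def fuse(latents, z_avail):
    """Route conditioning through CrossAttn (2->1) or Proj (1->1) by availability."""
    avail = [latents[i] for i, a in enumerate(z_avail) if a]
    if len(avail) == 2:
        return cross_attn(avail[0], avail[1])
    if len(avail) == 1:
        return proj(avail[0])
    raise ValueError("expected exactly one or two available modalities")

# Latents for (sMRI, FDG-PET, AV45-PET), each as 4 tokens of dimension D.
latents = [rng.normal(size=(4, D)) for _ in range(3)]
fused_two = fuse(latents, [1, 1, 0])  # sMRI + FDG-PET available, AV45-PET missing
fused_one = fuse(latents, [0, 1, 0])  # only FDG-PET available
```

Either branch yields a conditioning tensor of the same shape, which is what lets one model serve both generation scenarios.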
Our key contributions are:
• Adaptive multi-source fusion: A single model seamlessly handles both 2→1 and 1→1 generation through dynamic conditioning that switches between cross-attention for multiple inputs and projection for a single modality.
• Clinical-aware synthesis: Integration of disease labels and continuous cognitive scores (MMSE, ADAS13, CDR-SOB) to guide generation, ensuring synthesized images preserve diagnostically relevant patterns.
• Comprehensive bidirectional synthesis: Three specialized generators enable all six translation directions among MRI, FDG-PET, and AV45-PET, each optimized for target-specific characteristics.
• Language model enhancement: GPT-4o encodes clinical data as structured prompts, providing semantic understanding beyond traditional embedding approaches.

Fig. 1. Overview of ACADiff. (a) Cross-modal latent diffusion with adaptive multi-source image conditioning and semantic clinical guidance via GPT-4o-encoded prompts (example prompt: "Generate [TARGET] from [AVAILABLE] for AD patient with MMSE=22, ADAS13=28, CDR-SOB=4.5"). (b) Missing modality imputation for downstream diagnosis.

Experiments on ADNI subjects demonstrate that ACADiff achieves superior generation quality and maintains robust diagnostic performance across all missing rates. Even with 80% missing data, our method consistently outperforms existing approaches, validating its clinical utility for multimodal brain image completion.
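The structured clinical prompt shown in Fig. 1 can be composed with a small helper. The template text follows the paper's own example; the function name, signature, and argument order are our illustrative choices, not part of the published code.

```python
def build_prompt(target, available, diagnosis, mmse, adas13, cdr_sob):
    """Compose the structured clinical prompt later fed to the text encoder.

    Template follows the example given in the paper; this helper itself is ours.
    """
    sources = " and ".join(available)
    return (f"Generate {target} from {sources} for {diagnosis} patient "
            f"with MMSE={mmse}, ADAS13={adas13}, CDR-SOB={cdr_sob}")

prompt = build_prompt("FDG-PET", ["sMRI", "AV45-PET"], "AD", 22, 28, 4.5)
# -> "Generate FDG-PET from sMRI and AV45-PET for AD patient
#     with MMSE=22, ADAS13=28, CDR-SOB=4.5"
```

Encoding scores inside natural-language prompts, rather than as raw embedding indices, is what lets the pretrained language model contribute semantic understanding of the clinical metadata.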
These results establish ACADiff as a robust solution for clinical multimodal completion.

2. METHODS

Fig. 1 illustrates our ACADiff framework, which employs latent diffusion to synthesize missing brain imaging modalities. The model learns to generate target modalities conditioned on available inputs through adaptive fusion and hierarchical conditioning, handling both 2→1 and 1→1 generation scenarios.

Latent Space Construction. Training diffusion models directly on 3D brain imaging is computationally demanding. We employ modality-specific 3D VAEs to map each modality X_M to a compact latent representation Z_M through encoders \mathcal{E}_M, with paired decoders \mathcal{D}_M for reconstruction.

Cross-Modal Diffusion. We employ denoising diffusion in the latent space to generate target modalities conditioned on available information. The forward process gradually perturbs a clean latent Z^0_M into Gaussian noise Z^T_M over T timesteps. The reverse denoising process is learned as

p_\theta(Z^{t-1}_M \mid Z^t_M, Z_{\neg M}, z_{\mathrm{text}}, z_{\mathrm{avail}}, t) = \mathcal{N}(\mu_\theta, \sigma_t^2 I),

where Z_{\neg M} denotes the latent representations of the available non-target modalities (e.g., when generating Z_C, this could be Z_A, Z_B for 2→1 generation or just Z_A for 1→1 generation), z_{\mathrm{text}} is the encoded text embedding, and z_{\mathrm{avail}} \in \{0, 1\}^3 is a binary vector indicating the availability of each modality (e.g., [1, 1, 0] means modalities A and B are available while C is missing). The mean \mu_\theta is parameterized using a 3D U-Net denoiser \epsilon_\theta:

\mu_\theta = \frac{1}{\sqrt{\alpha_t}} \Big( Z^t_M - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \, \epsilon_\theta(Z^t_M, Z_{\neg M}, z_{\mathrm{text}}, z_{\mathrm{avail}}, t) \Big).

We first minimize the noise prediction error:

\mathcal{L}_{\mathrm{diff}} = \mathbb{E} \, \| \epsilon - \epsilon_\theta(Z^t_M, Z_{\neg M}, z_{\mathrm{text}}, z_{\mathrm{avail}}, t) \|^2.

To ensure generated modalities align with real distributions, we incorporate a consistency regularization \mathcal{L}_{\mathrm{cons}} = \| \hat{Z}^0_M - Z^0_M \|. The final training objective becomes \mathcal{L} = \mathcal{L}_{\mathrm{diff}} + \lambda \mathcal{L}_{\mathrm{cons}}, where \lambda balances the denoising and reconstruction objectives.

Hierarchical Adaptive Conditioning.
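Before the conditioning details, the training objective \mathcal{L} = \mathcal{L}_{diff} + \lambda \mathcal{L}_{cons} defined above can be sketched numerically. This is a toy numpy check, not the paper's training loop: the latent values, an oracle stand-in for the denoiser \epsilon_\theta, the schedule value \bar{\alpha}_t = 0.5, and \lambda = 0.1 are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

z0 = rng.normal(size=(20, 22, 20))   # clean target latent Z^0_M (real latent shape)
eps = rng.normal(size=z0.shape)      # Gaussian noise added by the forward process
alpha_bar_t = 0.5                    # cumulative schedule value at step t (assumed)

# Forward process: Z^t_M = sqrt(abar_t) * Z^0_M + sqrt(1 - abar_t) * eps
zt = np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * eps

def oracle_denoiser(zt):
    # Stand-in for eps_theta(Z^t_M, Z_negM, z_text, z_avail, t); here it recovers
    # the true noise exactly, so both loss terms should be ~0.
    return (zt - np.sqrt(alpha_bar_t) * z0) / np.sqrt(1.0 - alpha_bar_t)

eps_hat = oracle_denoiser(zt)
l_diff = np.mean((eps - eps_hat) ** 2)                 # noise-prediction error

# Consistency term: estimate Z^0_M from the predicted noise, compare to the truth.
z0_hat = (zt - np.sqrt(1.0 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_bar_t)
l_cons = np.linalg.norm(z0_hat - z0)

lam = 0.1                                              # balancing weight (assumed)
loss = l_diff + lam * l_cons
```

With a perfect noise estimate both terms vanish (up to floating-point error), which is the sanity check the consistency term is designed to pass.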
The diffusion model integrates three complementary conditions that work hierarchically to enable adaptive generation:

(1) Adaptive Image Conditioning. Available modalities Z_{\neg M} are first fused according to their availability pattern:

Z_{\mathrm{fused}} = \begin{cases} \mathrm{CrossAttn}(Z_i, Z_j), & \text{if } \sum z_{\mathrm{avail}} = 2 \\ \mathrm{Proj}(Z_i), & \text{if } \sum z_{\mathrm{avail}} = 1 \end{cases}

where Z_i, Z_j \in Z_{\neg M} denote available modalities. CrossAttn applies multi-head attention between spatially pooled features to enable inter-modal information exchange, while Proj performs a learnable 3D convolution for single-modality adaptation. The fused features are concatenated with the noisy target latent Z^t_M as input to the U-Net denoiser, \epsilon_\theta([Z^t_M; Z_{\mathrm{fused}}], t, z_{\mathrm{text}}), where [;] denotes channel-wise concatenation, enabling early fusion of cross-modal information.

(2) Semantic Clinical Guidance. Clinical data, including the disease diagnosis and cognitive scores (MMSE, ADAS13, CDR-SOB), are composed into structured prompts: "Generate [TARGET] from [AVAILABLE] for AD patient with MMSE=22, ADAS13=28, CDR-SOB=4.5". These prompts are encoded by a pretrained language model, z_{\mathrm{text}} = E(\mathrm{prompt}), then fused into the decoder through cross-attention, F_l \leftarrow F_l + g_l(\mathrm{CrossAttn}(F_l, z_{\mathrm{text}})), enabling semantic guidance that aligns generation with disease-specific patterns.

Fig. 2. Image generation performance across methods (Pix2Pix, DS-GAN, LDM, PASTA, FICD, ACADiff-emb, ACADiff) on PSNR↑, SSIM↑, NMI↑, and MAE↓. Higher PSNR/SSIM/NMI and lower MAE indicate better quality.

(3) Temporal Modulation. The timestep t \in \{0, 1, \dots, T\} indicates the current noise level during denoising, where larger t corresponds to noisier inputs requiring stronger denoising. The timestep controls denoising dynamics across all layers:
F_l \leftarrow \gamma(t) F_l + \beta(t), enabling the model to adaptively adjust its denoising strength at each diffusion step.

Training and Inference. During training, modality dropout creates diverse scenarios: each sample randomly uses either two modalities (2→1) or one modality (1→1) to generate the target. At inference, the model adapts via z_{\mathrm{avail}}, applying cross-modal attention for 2→1 or projection for 1→1. After iterative denoising, the decoder produces the full-resolution output: \hat{X}_M = \mathcal{D}_M(\hat{Z}^0_M). We also randomly drop clinical information during training, enabling generation without clinical guidance when needed.

3. EXPERIMENTS AND RESULTS

Data Acquisition and Preprocessing. We utilized 1,028 subjects from the ADNI cohort [4], including 198 AD, 495 MCI, and 335 HC with sMRI, FDG-PET, and AV45-PET. sMRI underwent skull stripping, intensity normalization, and nonlinear registration to MNI space. PET scans were co-registered to MRI using the same transformation. All volumes were cropped to 160×180×160 voxels and normalized to [−1, 1]. To avoid data leakage, we split the 1,028 subjects into two independent sets: 600 for generator training (10% validation, 10% test) and 428 for classification experiments. From the classification set (428 subjects), we reserved 128 as a held-out test set and used the remaining 300 for classifier training (10% for validation). For classifier training, we simulated missing-modality scenarios in which 20%, 40%, 60%, or 80% of the 300 training subjects had 1-2 randomly selected modalities removed, then imputed with the trained generators.

Experimental Settings. For generation, we evaluate voxel-level synthesis using MAE, PSNR, SSIM, and NMI within a brain mask. For classification, we measure Accuracy (ACC), Sensitivity (SEN), Specificity (SPE), and AUC. Five methods are adapted for comparison. (1) Pix2Pix [6]: A conditional GAN with a 3D U-Net generator trained with adversarial and L1 losses, extended to handle multi-channel concatenation of available modalities.
(2) DS-GAN [7]: A disease-aware GAN that incorporates disease labels as auxiliary conditioning and employs spectral normalization to stabilize training. (3) LDM [8]: A latent diffusion model that encodes images into the same compact latent space (20×22×20) as our method but uses standard concatenation for multi-modal conditioning without adaptive fusion. (4) PASTA [10]: A pathology-aware diffusion model with a dual-arm architecture and cycle consistency, modified to handle multi-source inputs. (5) FICD [9]: A constrained diffusion model with voxel-wise functional alignment losses for metabolic consistency.

Table 1. Classification performance for AD vs. HC under different missing data strategies.

| Missing Rate | Method | ACC | AUC | SEN | SPE |
|---|---|---|---|---|---|
| 0% | Oracle (Real) | 0.920±0.029 | 0.943±0.023 | 0.889±0.031 | 0.908±0.028 |
| 20% | Drop | 0.825±0.042 | 0.852±0.040 | 0.788±0.043 | 0.806±0.041 |
| 20% | Mean | 0.798±0.045 | 0.826±0.043 | 0.761±0.046 | 0.779±0.044 |
| 20% | Pix2Pix | 0.865±0.035 | 0.893±0.032 | 0.821±0.035 | 0.848±0.035 |
| 20% | DS-GAN | 0.851±0.039 | 0.882±0.037 | 0.805±0.039 | 0.786±0.038 |
| 20% | LDM | 0.885±0.036 | 0.902±0.031 | 0.817±0.033 | 0.861±0.038 |
| 20% | PASTA | 0.883±0.036 | 0.900±0.032 | 0.820±0.035 | 0.850±0.031 |
| 20% | FICD | 0.859±0.040 | 0.898±0.031 | 0.807±0.037 | 0.848±0.040 |
| 20% | ACADiff-emb (ours) | 0.891±0.032 | 0.904±0.031 | 0.825±0.033 | 0.865±0.036 |
| 20% | ACADiff (ours) | 0.894±0.035 | 0.910±0.026 | 0.827±0.034 | 0.868±0.031 |
| 40% | Drop | 0.768±0.048 | 0.795±0.046 | 0.731±0.049 | 0.749±0.047 |
| 40% | Mean | 0.742±0.050 | 0.769±0.048 | 0.705±0.051 | 0.723±0.049 |
| 40% | Pix2Pix | 0.853±0.039 | 0.882±0.038 | 0.807±0.040 | 0.841±0.039 |
| 40% | DS-GAN | 0.847±0.036 | 0.874±0.037 | 0.794±0.038 | 0.776±0.037 |
| 40% | LDM | 0.877±0.032 | 0.892±0.032 | 0.809±0.037 | 0.850±0.036 |
| 40% | PASTA | 0.875±0.033 | 0.897±0.031 | 0.815±0.035 | 0.848±0.034 |
| 40% | FICD | 0.858±0.036 | 0.888±0.032 | 0.798±0.035 | 0.841±0.034 |
| 40% | ACADiff-emb (ours) | 0.886±0.031 | 0.902±0.030 | 0.821±0.037 | 0.851±0.035 |
| 40% | ACADiff (ours) | 0.889±0.031 | 0.906±0.025 | 0.823±0.032 | 0.854±0.037 |
| 60% | Drop | 0.695±0.053 | 0.722±0.051 | 0.658±0.054 | 0.676±0.052 |
| 60% | Mean | 0.663±0.054 | 0.690±0.052 | 0.626±0.055 | 0.644±0.053 |
| 60% | Pix2Pix | 0.804±0.037 | 0.856±0.039 | 0.780±0.038 | 0.809±0.038 |
| 60% | DS-GAN | 0.811±0.038 | 0.846±0.037 | 0.778±0.038 | 0.773±0.040 |
| 60% | LDM | 0.842±0.035 | 0.871±0.034 | 0.789±0.037 | 0.827±0.035 |
| 60% | PASTA | 0.851±0.035 | 0.880±0.033 | 0.791±0.035 | 0.829±0.033 |
| 60% | FICD | 0.839±0.037 | 0.859±0.037 | 0.788±0.035 | 0.831±0.035 |
| 60% | ACADiff-emb (ours) | 0.870±0.033 | 0.881±0.032 | 0.808±0.037 | 0.836±0.038 |
| 60% | ACADiff (ours) | 0.878±0.030 | 0.883±0.029 | 0.819±0.036 | 0.841±0.038 |
| 80% | Drop | 0.582±0.058 | 0.609±0.056 | 0.545±0.059 | 0.563±0.057 |
| 80% | Mean | 0.551±0.060 | 0.578±0.058 | 0.514±0.061 | 0.532±0.059 |
| 80% | Pix2Pix | 0.718±0.046 | 0.746±0.047 | 0.675±0.048 | 0.668±0.047 |
| 80% | DS-GAN | 0.724±0.045 | 0.700±0.052 | 0.670±0.048 | 0.657±0.048 |
| 80% | LDM | 0.764±0.049 | 0.757±0.049 | 0.683±0.048 | 0.702±0.049 |
| 80% | PASTA | 0.759±0.048 | 0.754±0.049 | 0.679±0.048 | 0.696±0.046 |
| 80% | FICD | 0.722±0.051 | 0.739±0.048 | 0.649±0.051 | 0.660±0.052 |
| 80% | ACADiff-emb (ours) | 0.768±0.049 | 0.757±0.052 | 0.711±0.048 | 0.704±0.047 |
| 80% | ACADiff (ours) | 0.775±0.046 | 0.763±0.046 | 0.719±0.041 | 0.713±0.045 |

Implementation details. The framework compresses brain volumes from 160×180×160 to 20×22×20 via pretrained 3D Autoencoder-KL models with frozen decoders. We implement three independent generators: Any→MRI, Any→FDG-PET, and Any→AV45-PET. The denoiser is a volumetric U-Net with GroupNorm and FiLM modulation, optimized via AdamW (lr = 1×10^−4, T = 1000). ACADiff uses GPT-4o's text encoder for clinical prompts, while ACADiff-emb uses learnable embeddings (dim 512) for comparison. During inference, we perform 10-fold Monte Carlo sampling. For classification, we use a 3D DenseNet-121 [11] on completed multimodal volumes. Generation metrics are averaged across the three generators; classification metrics over 10 independent generations. Experiments are conducted on 4 NVIDIA A100 GPUs.

Results. Fig. 2 shows generation performance. ACADiff consistently outperforms all baselines across all metrics (PSNR 27.9, SSIM 0.911, NMI 0.859, MAE 0.014), exceeding the best baseline, LDM. The gap between ACADiff and ACADiff-emb (∼1.8 PSNR) validates the benefit of semantic clinical encoding via GPT-4o. Table 1 presents AD vs. HC classification under varying missing rates.
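The simulated missing-rate protocol behind Table 1 (a fixed fraction of training subjects each lose 1-2 randomly selected modalities) can be sketched as follows; the seed and the binary-mask representation are illustrative assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed is illustrative
N_MODALITIES = 3  # sMRI, FDG-PET, AV45-PET

def simulate_missing(n_subjects, missing_rate):
    """Binary availability masks: `missing_rate` of subjects lose 1-2 modalities."""
    masks = np.ones((n_subjects, N_MODALITIES), dtype=int)  # 1 = available
    n_incomplete = int(round(n_subjects * missing_rate))
    chosen = rng.choice(n_subjects, size=n_incomplete, replace=False)
    for subj in chosen:
        n_drop = rng.integers(1, 3)  # drop 1 or 2 randomly selected modalities
        dropped = rng.choice(N_MODALITIES, size=n_drop, replace=False)
        masks[subj, dropped] = 0
    return masks

# 40% of the 300 classifier-training subjects become incomplete, as in the paper.
masks = simulate_missing(300, 0.40)
n_incomplete = int((masks.sum(axis=1) < N_MODALITIES).sum())
```

Because at most two of the three modalities are dropped, every subject retains at least one real scan, which is what makes imputation by the trained generators possible.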
ACADiff maintains robust performance across all scenarios, achieving 89.4% accuracy with 20% missing data (97.2% of the oracle trained on complete real data). The advantage becomes more pronounced under extreme conditions: at 80% missing data, ACADiff preserves 77.5% accuracy while simple imputation fails. Among baselines, LDM performs best (76.4%) but remains below our approach. The consistent superiority over ACADiff-emb confirms that language model encoding provides meaningful clinical guidance. These results validate the clinical utility of our framework for multimodal brain image completion.

4. CONCLUSION

We presented ACADiff, an adaptive clinical-aware diffusion framework for multimodal brain image synthesis in Alzheimer's disease analysis. By integrating adaptive fusion mechanisms, semantic clinical guidance, and specialized generators, our method effectively handles missing modalities while preserving diagnostic information. Experiments on 1,028 ADNI subjects demonstrate superior performance across all missing rates, with ACADiff maintaining 77.5% accuracy even at 80% missing data, outperforming existing methods. These results validate the feasibility of our framework: restoring missing modalities via ACADiff produces more complete multimodal representations, leading to improved diagnostic accuracy. The findings highlight the potential of disease-guided diffusion models to achieve clinically faithful and robust cross-modal synthesis for Alzheimer's disease.

5. COMPLIANCE WITH ETHICAL STANDARDS

This retrospective study used publicly available human subject data from the ADNI database [4]. Ethical approval was not required per the open access data license.

6. ACKNOWLEDGEMENTS

This work was supported by NIH (R01LM013519, RF1AG077820, R01MH129694, R21AG080425), NSF (IIS-2319451, MRI-2215789), DOE (DE-SC0025801), the Alzheimer's Association (AARG-22-972541), Lehigh University (CORE and RIG), and NSF ACCESS (CIS240554).
7. REFERENCES

[1] Clifford R Jack, David S Knopman, William J Jagust, et al., "Hypothetical model of dynamic biomarkers of the Alzheimer's pathological cascade," The Lancet Neurology, vol. 9, no. 1, pp. 119–128, 2010.
[2] William Jagust, "Imaging the evolution and pathophysiology of Alzheimer disease," Nature Reviews Neuroscience, vol. 19, no. 11, pp. 687–700, 2018.
[3] Xi Xu, Jianqiang Li, Zhichao Zhu, Linna Zhao, Huina Wang, Changwei Song, Yining Chen, Qing Zhao, Jijiang Yang, and Yan Pei, "A comprehensive review on synergy of multi-modal data and AI technologies in medical diagnosis," Bioengineering, vol. 11, no. 3, p. 219, 2024.
[4] Susanne G Mueller, Michael W Weiner, Leon J Thal, Ronald C Petersen, Clifford Jack, William Jagust, John Q Trojanowski, Arthur W Toga, and Laurel Beckett, "The Alzheimer's Disease Neuroimaging Initiative," Neuroimaging Clinics, vol. 15, no. 4, pp. 869–877, 2005.
[5] Peter Eigenschink, Thomas Reutterer, Stefan Vamosi, Ralf Vamosi, Chang Sun, and Klaudius Kalcher, "Deep generative models for synthetic data: A survey," IEEE Access, vol. 11, pp. 47304–47320, 2023.
[6] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros, "Image-to-image translation with conditional adversarial networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1125–1134.
[7] Yongsheng Pan, Mingxia Liu, Chunfeng Lian, Yong Xia, and Dinggang Shen, "Disease-image specific generative adversarial network for brain disease diagnosis with incomplete multi-modal neuroimages," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2019, pp. 137–145.
[8] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer, "High-resolution image synthesis with latent diffusion models," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 10684–10695.
[9] Minhui Yu, Mengqi Wu, Ling Yue, Andrea Bozoki, and Mingxia Liu, "Functional imaging constrained diffusion for brain PET synthesis from structural MRI," arXiv preprint arXiv:2405.02504, 2024.
[10] Yitong Li, Igor Yakushev, Dennis M Hedderich, and Christian Wachinger, "PASTA: Pathology-aware MRI to PET cross-modal translation with diffusion models," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2024, pp. 529–540.
[11] Braulio Solano-Rojas, Ricardo Villalón-Fonseca, and Gabriela Marín-Raventós, "Alzheimer's disease early detection using a low cost three-dimensional DenseNet-121 architecture," in International Conference on Smart Homes and Health Telematics. Springer, 2020, pp. 3–15.