Beyond CLIP: Knowledge-Enhanced Multimodal Transformers for Cross-Modal Alignment in Diabetic Retinopathy Diagnosis
[Submitted on 22 Dec 2025]
Authors: Argha Kamal Samanta and 4 other authors
Abstract: Diabetic retinopathy (DR) is a leading cause of preventable blindness worldwide, making accurate automated diagnostic systems essential. While general-domain vision-language models such as Contrastive Language-Image Pre-training (CLIP) perform well on natural-image tasks, they struggle in medical applications, particularly in multimodal retrieval of retinal images. To address this critical gap in medical image-text alignment, we propose a novel knowledge-enhanced joint framework that integrates retinal fundus images, clinical text, and structured patient data through a multimodal transformer architecture. Our approach uses a separate encoder for each modality: a Vision Transformer (ViT-B/16) for retinal images, Bio-ClinicalBERT for clinical narratives, and a multilayer perceptron for structured demographic and clinical features. These modality representations are fused through a transformer with modality-specific embeddings and trained with multiple objectives, including contrastive losses between modality pairs, reconstruction losses for images and text, and classification losses for DR severity under the International Clinical Diabetic Retinopathy (ICDR) and Scottish Diabetic Retinopathy Grading (SDRG) schemes. Experimental results on the Brazilian Multilabel Ophthalmological Dataset (BRSET) show significant improvements over baseline models: our framework achieves near-perfect text-to-image retrieval with a Recall@1 of 99.94%, versus 1.29% for fine-tuned CLIP, while maintaining state-of-the-art classification accuracy of 97.05% for SDRG and 97.97% for ICDR. Furthermore, zero-shot evaluation on the unseen DeepEyeNet dataset confirms strong generalization, with 93.95% Recall@1 versus 0.22% for fine-tuned CLIP. These results demonstrate that our knowledge-enhanced multimodal training approach effectively captures cross-modal relationships in the medical domain, yielding superior retrieval capabilities and robust diagnostic performance.
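To make the described pipeline concrete, below is a minimal PyTorch sketch of the fusion-and-loss setup, assuming pre-extracted per-modality embeddings (e.g., from ViT-B/16 and Bio-ClinicalBERT). This is not the authors' implementation: the class names, dimensions, pooling choice, and the omission of the reconstruction objectives are illustrative assumptions based only on the abstract.

```python
# Sketch (not the paper's code): fuse image / text / tabular embeddings
# with a transformer using modality-specific embeddings, then train with
# a CLIP-style contrastive loss between a modality pair plus a DR-grade
# classification loss. Reconstruction losses are omitted for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalFusion(nn.Module):
    def __init__(self, img_dim=768, txt_dim=768, tab_dim=32, d_model=512,
                 n_layers=4, n_heads=8, n_classes=5):
        super().__init__()
        # Project each modality into a shared fusion space.
        self.img_proj = nn.Linear(img_dim, d_model)
        self.txt_proj = nn.Linear(txt_dim, d_model)
        self.tab_proj = nn.Sequential(  # MLP encoder for structured data
            nn.Linear(tab_dim, d_model), nn.GELU(), nn.Linear(d_model, d_model))
        # Learned modality-specific embeddings so the fusion transformer
        # can distinguish token types.
        self.modality_emb = nn.Parameter(torch.zeros(3, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, n_layers)
        self.classifier = nn.Linear(d_model, n_classes)   # DR severity head
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ln(1/0.07), as in CLIP

    def forward(self, img_emb, txt_emb, tab_feats):
        tokens = torch.stack([
            self.img_proj(img_emb) + self.modality_emb[0],
            self.txt_proj(txt_emb) + self.modality_emb[1],
            self.tab_proj(tab_feats) + self.modality_emb[2],
        ], dim=1)                        # (B, 3, d_model)
        fused = self.fusion(tokens)      # joint cross-modal representation
        logits = self.classifier(fused.mean(dim=1))
        return fused, logits

def contrastive_loss(a, b, logit_scale):
    """Symmetric InfoNCE between two modalities over a batch."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    sims = logit_scale.exp() * a @ b.t()
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(sims, targets) +
                  F.cross_entropy(sims.t(), targets))

# Toy batch standing in for ViT / Bio-ClinicalBERT / tabular features.
B = 8
model = MultimodalFusion()
img, txt, tab = torch.randn(B, 768), torch.randn(B, 768), torch.randn(B, 32)
labels = torch.randint(0, 5, (B,))
fused, logits = model(img, txt, tab)
loss = (contrastive_loss(fused[:, 0], fused[:, 1], model.logit_scale)  # image-text pair
        + F.cross_entropy(logits, labels))                             # e.g., ICDR grade
loss.backward()
```

The symmetric InfoNCE term mirrors CLIP's objective; under the abstract's description, analogous contrastive terms over the remaining modality pairs and the image/text reconstruction losses would be summed into the same training objective.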
Submission history
From: Argha Kamal Samanta
[v1] Mon, 22 Dec 2025 18:41:45 UTC (2,596 KB)