A multimodal framework for explainable chest X-ray report generation
Abstract
Chest X-ray (CXR) interpretation remains a challenging task due to overlapping anatomical structures, variability in disease presentation, and increasing clinical workload. Existing automated report-generation models show promising results but often lack explicit interpretability, exhibit limited clinical alignment, and offer insufficient comparative evaluation against established baselines. This study proposes an explainable multimodal framework that combines a dual CNN encoder (ResNet-50 and EfficientNet-B0) with the Gemma-3 1B language model fine-tuned using low-rank adaptation (LoRA). Visual explanations are produced through Gradient-weighted Class Activation Mapping (Grad-CAM) to enhance transparency in the decision process. Unlike prior image-to-text pipelines, our approach follows a findings-guided paradigm and integrates both visual and textual cues during generation. Experiments conducted on public datasets demonstrate consistent improvements over representative vision-language baselines reported in recent literature, with notable gains in BLEU, ROUGE, METEOR, and BERTScore. Generated reports show improved factual completeness and clinically relevant region-level attention. Limitations include the absence of evaluation against emerging foundation models and the need for anatomical-level explainability metrics. Future work will extend benchmarking to models such as M2-Transformer, MedCLIP-GPT, and R2Gen, and will explore clinical validation in real-world workflows.
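The Grad-CAM step mentioned above can be illustrated with a minimal sketch. The abstract does not specify the implementation, so the function below is a generic, framework-agnostic illustration of the standard Grad-CAM computation: channel weights are obtained by global-average-pooling the gradients of the class score with respect to the last convolutional feature maps, and the heatmap is the ReLU of the weighted sum of those maps. The array shapes and the `grad_cam` name are assumptions for illustration, not the paper's code.

```python
import numpy as np

def grad_cam(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Generic Grad-CAM sketch (not the paper's implementation).

    activations: (K, H, W) feature maps from the final conv layer
    gradients:   (K, H, W) gradients of the target class score
                 w.r.t. those feature maps
    returns:     (H, W) heatmap normalised to [0, 1]
    """
    # Global-average-pool the gradients to get one weight per channel.
    weights = gradients.mean(axis=(1, 2))                      # shape (K,)
    # Weighted sum of feature maps, then ReLU to keep positive evidence.
    cam = np.maximum((weights[:, None, None] * activations).sum(axis=0), 0.0)
    if cam.max() > 0:
        cam = cam / cam.max()                                  # normalise
    return cam
```

In a full pipeline the resulting map would be upsampled to the input resolution and overlaid on the radiograph to show which regions drove the generated finding.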
DOI: http://doi.org/10.11591/ijeecs.v41.i3.pp1060-1069

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Indonesian Journal of Electrical Engineering and Computer Science (IJEECS)
p-ISSN: 2502-4752, e-ISSN: 2502-4760
This journal is published by the Institute of Advanced Engineering and Science (IAES).