Visual question answering method based on cross-modal adaptive feature fusion
Affiliation:

(School of Computer Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China)

CLC Number:

TP391.41; TP391.1

Abstract:

To enhance the accuracy of cross-modal fusion and interaction in visual question answering (VQA) and to mitigate the loss of multimodal feature information, we propose a cross-modal adaptive feature fusion method for VQA. First, we design a convolutional self-attention unit consisting of self-attention layers and dilated convolution layers: the self-attention layers capture global feature information, while the dilated convolution layers extract spatial relationships between visual objects. An adaptive feature fusion layer then integrates these global relationships with the spatial correlations, so that the model considers both global contextual information and inter-object spatial relationships when processing image features, addressing the tendency of traditional attention mechanisms to overlook spatial relationships. Furthermore, because different modal features contribute unequally to answer prediction, we construct a multimodal gated fusion module that adaptively combines features according to their relative importance, reducing cross-modal information loss without introducing additional computational overhead. Experimental results show that the method achieves overall accuracies of 71.58%, 72.00%, and 58.14% on the VQA2.0 test-dev, test-std, and GQA datasets respectively, significantly outperforming traditional self-attention approaches without requiring additional pre-training data. This work provides a useful reference for research on cross-modal feature fusion.
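The PyTorch sketch below illustrates how the two components described in the abstract could be realized. It is a minimal, hypothetical implementation assuming a sigmoid-gated weighting for both fusion steps and a 1-D dilated convolution over the sequence of region features; the class names, layer sizes, dilation rate, and gating formulation are illustrative assumptions, not details taken from the paper.

# Illustrative sketch only; exact layer sizes, dilation rates, and gating
# formulation are assumptions, since the abstract does not specify them.
import torch
import torch.nn as nn


class ConvSelfAttentionUnit(nn.Module):
    """Combines global self-attention with a dilated convolution over a
    sequence of visual region features, then fuses the two branches with
    a learned adaptive gate."""

    def __init__(self, dim: int, num_heads: int = 8, dilation: int = 2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # 1-D dilated convolution over the region sequence to model
        # spatial relationships between neighbouring objects.
        self.dconv = nn.Conv1d(dim, dim, kernel_size=3,
                               dilation=dilation, padding=dilation)
        self.fuse_gate = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_regions, dim)
        attn_out, _ = self.attn(x, x, x)                            # global context
        conv_out = self.dconv(x.transpose(1, 2)).transpose(1, 2)    # spatial cues
        gate = torch.sigmoid(self.fuse_gate(torch.cat([attn_out, conv_out], dim=-1)))
        return gate * attn_out + (1 - gate) * conv_out              # adaptive fusion


class GatedMultimodalFusion(nn.Module):
    """Weights the pooled visual and question features by their estimated
    contribution before answer prediction."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, v: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        # v, q: (batch, dim) pooled image / question representations
        g = torch.sigmoid(self.gate(torch.cat([v, q], dim=-1)))
        return g * v + (1 - g) * q


# Minimal usage example with random tensors.
if __name__ == "__main__":
    regions = torch.randn(4, 36, 512)       # 36 region features per image
    question = torch.randn(4, 512)          # pooled question encoding
    unit = ConvSelfAttentionUnit(512)
    fusion = GatedMultimodalFusion(512)
    visual = unit(regions).mean(dim=1)      # pool refined region features
    joint = fusion(visual, question)        # (4, 512) fused representation
    print(joint.shape)

In this sketch the sigmoid gate decides, element-wise, how much each branch (or each modality) contributes to the fused representation, mirroring the abstract's description of combining features according to their relative importance without adding heavy computation.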

History
  • Received: April 01, 2024
  • Online: April 07, 2025