Visual question answering method based on cross-modal adaptive feature fusion
Affiliation:

(School of Computer Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China)

CLC Number:

TP391.41; TP391.1

Abstract:

To enhance the accuracy of cross-modal fusion and interaction in visual question answering (VQA) and to mitigate the loss of multimodal feature information, we propose a cross-modal adaptive feature fusion method for VQA. First, we design a convolutional self-attention unit consisting of self-attention layers and dilated convolution layers: the self-attention layers capture global feature information, while the dilated convolution layers extract spatial relationships between visual objects. An adaptive feature fusion layer then integrates these global relationships with the spatial correlations, so that the model considers both global contextual information and inter-object spatial relationships when processing image features, addressing the tendency of traditional attention mechanisms to overlook spatial relationships. Furthermore, because different modal features contribute unequally to answer prediction, we construct a multimodal gated fusion module that adaptively combines features according to their relative importance, reducing cross-modal information loss without introducing additional computational overhead. Experimental results show that the method achieves overall accuracies of 71.58%, 72.00%, and 58.14% on the VQA2.0 test-dev, test-std, and GQA datasets respectively, significantly outperforming traditional self-attention approaches without requiring additional pre-training data. This work provides a useful reference for research on cross-modal feature fusion.
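The PyTorch sketch below illustrates how the two components described in the abstract could be realized. It is a minimal, hypothetical implementation assuming a sigmoid-gated weighting for both fusion steps and a 1-D dilated convolution over the sequence of region features; the class names, layer sizes, dilation rate, and gating formulation are illustrative assumptions, not details taken from the paper.

# Illustrative sketch only; exact layer sizes, dilation rates, and gating
# formulation are assumptions, since the abstract does not specify them.
import torch
import torch.nn as nn


class ConvSelfAttentionUnit(nn.Module):
    """Combines global self-attention with a dilated convolution over a
    sequence of visual region features, then fuses the two branches with
    a learned adaptive gate."""

    def __init__(self, dim: int, num_heads: int = 8, dilation: int = 2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # 1-D dilated convolution over the region sequence to model
        # spatial relationships between neighbouring objects.
        self.dconv = nn.Conv1d(dim, dim, kernel_size=3,
                               dilation=dilation, padding=dilation)
        self.fuse_gate = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_regions, dim)
        attn_out, _ = self.attn(x, x, x)                            # global context
        conv_out = self.dconv(x.transpose(1, 2)).transpose(1, 2)    # spatial cues
        gate = torch.sigmoid(self.fuse_gate(torch.cat([attn_out, conv_out], dim=-1)))
        return gate * attn_out + (1 - gate) * conv_out              # adaptive fusion


class GatedMultimodalFusion(nn.Module):
    """Weights the pooled visual and question features by their estimated
    contribution before answer prediction."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, v: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        # v, q: (batch, dim) pooled image / question representations
        g = torch.sigmoid(self.gate(torch.cat([v, q], dim=-1)))
        return g * v + (1 - g) * q


# Minimal usage example with random tensors.
if __name__ == "__main__":
    regions = torch.randn(4, 36, 512)       # 36 region features per image
    question = torch.randn(4, 512)          # pooled question encoding
    unit = ConvSelfAttentionUnit(512)
    fusion = GatedMultimodalFusion(512)
    visual = unit(regions).mean(dim=1)      # pool refined region features
    joint = fusion(visual, question)        # (4, 512) fused representation
    print(joint.shape)

In this sketch the sigmoid gate decides, element-wise, how much each branch (or each modality) contributes to the fused representation, mirroring the abstract's description of combining features according to their relative importance without adding heavy computation.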

History
  • Received: April 01, 2024
  • Online: April 07, 2025