Visual question answering method based on cross-modal adaptive feature fusion

Authors:

Chen Qiaohong, Xiang Shenxiang, Fang Xian, Sun Qi

Affiliation:

(School of Computer Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China)

Author biography:

Chen Qiaohong (b. 1978), female, Ph.D., professor

Corresponding author:

Fang Xian, xianfang@zstu.edu.cn

CLC number:

TP391.41; TP391.1

Fund project:

Natural Science Foundation of Zhejiang Province (LQ23F020021)



Abstract:

To enhance the accuracy of cross-modal fusion and interaction in visual question answering (VQA) while mitigating the loss of multimodal feature information, we propose a novel cross-modal adaptive feature fusion approach for VQA. First, the method designs a convolutional self-attention unit consisting of self-attention layers and dilated convolution layers: the former capture global feature information, while the latter extract spatial relationships between visual objects. Second, an adaptive feature fusion layer integrates these global relationships with the spatial correlations, enabling the model to consider both global context and inter-object spatial relationships when processing image features, thereby addressing the tendency of traditional attention mechanisms to overlook spatial relations. Finally, because different modal features contribute unequally to answer prediction, we construct a multimodal gated fusion module that adaptively combines features according to their relative importance, reducing information loss across modalities without introducing additional computational overhead. Without pre-training on additional datasets, the method achieves overall accuracies of 71.58%, 72.00%, and 58.14% on the VQA2.0 test-dev set, the VQA2.0 test-std set, and the GQA dataset, respectively, significantly outperforming traditional self-attention approaches. These results provide a useful reference for research on cross-modal feature fusion.
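The abstract outlines three components: a convolutional self-attention unit (self-attention plus dilated convolution), an adaptive feature fusion layer, and a multimodal gated fusion module. The PyTorch sketch below illustrates one plausible way these pieces could be wired together; the class names, dimensions, gating equations, residual connection, and the treatment of region features as a 1-D sequence are all assumptions made for illustration and are not taken from the paper.

import torch
import torch.nn as nn


class ConvSelfAttentionUnit(nn.Module):
    """Self-attention (global relations) combined with a dilated 1-D convolution
    (spatial relations over the sequence of detected objects) through a learned
    adaptive fusion gate. Illustrative sketch, not the paper's exact formulation."""

    def __init__(self, dim=512, heads=8, dilation=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.dconv = nn.Conv1d(dim, dim, kernel_size=3, padding=dilation, dilation=dilation)
        self.fuse_gate = nn.Linear(2 * dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, num_objects, dim) region features from an object detector
        global_feat, _ = self.attn(x, x, x)                            # global relations
        spatial_feat = self.dconv(x.transpose(1, 2)).transpose(1, 2)   # local spatial relations
        gate = torch.sigmoid(self.fuse_gate(torch.cat([global_feat, spatial_feat], dim=-1)))
        fused = gate * global_feat + (1.0 - gate) * spatial_feat       # adaptive feature fusion
        return self.norm(x + fused)                                    # residual link (assumption)


class MultimodalGatedFusion(nn.Module):
    """Gated fusion of pooled visual and question features: each modality is
    weighted by a learned importance gate before being combined."""

    def __init__(self, dim=512):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, v, q):
        # v, q: (batch, dim) pooled visual / question representations
        g = torch.sigmoid(self.gate(torch.cat([v, q], dim=-1)))
        return g * v + (1.0 - g) * q


if __name__ == "__main__":
    regions = torch.randn(2, 36, 512)    # e.g. 36 detected objects per image
    question = torch.randn(2, 512)       # pooled question embedding
    visual = ConvSelfAttentionUnit()(regions).mean(dim=1)   # mean-pool for the demo
    joint = MultimodalGatedFusion()(visual, question)
    print(joint.shape)                   # torch.Size([2, 512])

In this sketch the sigmoid gate weights the attention and convolution branches element-wise, mirroring the "adaptive" fusion described in the abstract, and the same gating pattern is reused across modalities in MultimodalGatedFusion.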

Cite this article:

Chen Qiaohong, Xiang Shenxiang, Fang Xian, Sun Qi. Visual question answering method based on cross-modal adaptive feature fusion[J]. Journal of Harbin Institute of Technology, 2025, 57(4): 94. DOI: 10.11918/202404002

History
  • Received: 2024-04-01
  • Published online: 2025-04-07