Cite this article: CHEN Qiaohong, XIANG Shenxiang, FANG Xian, SUN Qi. Visual question answering method based on cross-modal adaptive feature fusion[J]. Journal of Harbin Institute of Technology, 2025, 57(4): 94. DOI: 10.11918/202404002
DOI: 10.11918/202404002
CLC number: TP391.41; TP391.1
Document code: A
Fund: Natural Science Foundation of Zhejiang Province (LQ23F020021)
Visual question answering method based on cross-modal adaptive feature fusion

CHEN Qiaohong, XIANG Shenxiang, FANG Xian, SUN Qi

(School of Computer Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China)
Abstract:
To enhance the accuracy of cross-modal fusion and interaction in visual question answering (VQA) while mitigating the loss of multimodal feature information, we propose a novel cross-modal adaptive feature fusion approach for VQA. First, the method designs a convolutional self-attention unit consisting of a self-attention layer and a dilated convolution layer: the former captures global feature information, while the latter extracts spatial relationships between visual objects. Subsequently, an adaptive feature fusion layer effectively integrates global relationships with spatial correlations, enabling the model to consider both global contextual information and inter-object spatial relationships when processing image features, thereby addressing the tendency of traditional attention mechanisms to overlook spatial relationships. Furthermore, based on the varying contributions of different modal features to answer prediction, we construct a multimodal gated fusion module that adaptively combines features according to their relative importance, reducing information loss across modalities without introducing additional computational overhead. Experimental results demonstrate that, without pre-training on additional datasets, the proposed method achieves overall accuracies of 71.58%, 72.00%, and 58.14% on the VQA2.0 test-dev set, the VQA2.0 test-std set, and the GQA dataset, respectively, significantly outperforming traditional self-attention approaches. This research provides a valuable reference for studies on cross-modal feature fusion.
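The abstract specifies the convolutional self-attention unit only at a high level. As a rough sketch under assumed details (PyTorch-style code, grid-shaped image features, and a position-wise sigmoid gate standing in for the adaptive feature fusion layer, none of which are taken from the paper itself), one possible reading of the unit pairs a self-attention branch for global relations with a dilated-convolution branch for spatial relations:

```python
# Illustrative sketch only: the dimensions, grid-feature shape, and gate form
# are assumptions, not the authors' released implementation.
import torch
import torch.nn as nn


class ConvSelfAttentionUnit(nn.Module):
    """Self-attention branch plus dilated-convolution branch with gated fusion."""

    def __init__(self, dim=512, num_heads=8, dilation=2):
        super().__init__()
        # Global branch: self-attention over all grid positions.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Spatial branch: 3x3 dilated convolution with an enlarged receptive field.
        self.dilated_conv = nn.Conv2d(dim, dim, kernel_size=3,
                                      padding=dilation, dilation=dilation)
        # Assumed adaptive fusion: position-wise sigmoid gate over both branches.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                              # x: (B, C, H, W) grid features
        seq = x.flatten(2).transpose(1, 2)             # (B, H*W, C) token sequence
        global_feat, _ = self.attn(seq, seq, seq)      # global relations
        spatial = self.dilated_conv(x).flatten(2).transpose(1, 2)  # spatial relations
        g = self.gate(torch.cat([global_feat, spatial], dim=-1))   # fusion weights
        fused = g * global_feat + (1 - g) * spatial
        return self.norm(fused + seq)                  # residual connection


if __name__ == "__main__":
    unit = ConvSelfAttentionUnit(dim=512)
    grid = torch.randn(2, 512, 14, 14)                 # e.g. 14x14 CNN grid features
    print(unit(grid).shape)                            # torch.Size([2, 196, 512])
```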
Key words: visual question answering (VQA); feature fusion; multimodal; attention mechanisms; gating mechanisms
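The multimodal gated fusion module is likewise described only by its role of weighting visual and textual features by their contribution to answer prediction. The sketch below shows one lightweight way such a gate could look; the dimensions and the concrete gate form are illustrative assumptions rather than the paper's actual design, and the single linear layer is what keeps the added computational cost negligible, in line with the abstract's claim of no extra overhead.

```python
# Illustrative sketch only: gate form and feature dimensions are assumed.
import torch
import torch.nn as nn


class GatedMultimodalFusion(nn.Module):
    """Weights pooled visual and question features by estimated importance."""

    def __init__(self, dim=512):
        super().__init__()
        # One linear layer scoring the relative importance of the two modalities.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, img_feat, txt_feat):             # both: (B, dim)
        g = self.gate(torch.cat([img_feat, txt_feat], dim=-1))
        # Convex combination: g near 1 favors the image, near 0 favors the question.
        return g * img_feat + (1 - g) * txt_feat


if __name__ == "__main__":
    fusion = GatedMultimodalFusion(dim=512)
    v = torch.randn(4, 512)                            # pooled visual features
    q = torch.randn(4, 512)                            # pooled question features
    print(fusion(v, q).shape)                          # torch.Size([4, 512])
```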