English-language article


Document type
Journal article (JA)
Title
M³ixup: A multi-modal data augmentation approach for image captioning
Authors
Li, Yinan; Ji, Jiayi; Sun, Xiaoshuai; Zhou, Yiyi; Luo, Yunpeng; Ji, Rongrong
Author affiliations
[Li, Yinan; Ji, Jiayi; Sun, Xiaoshuai; Zhou, Yiyi; Luo, Yunpeng; Ji, Rongrong] Xiamen Univ, Key Lab Multimedia Trusted Percept & Efficient Com, Minist Educ China, Xiamen 361005, Peoples R China.
Corresponding author address
Xiamen Univ, Sch Informat, Room 110-2,Bldg 5, Xiamen, Peoples R China.
Email
xssun@xmu.edu.cn
ResearchID
ORCID
Journal
Pattern Recognition
Publisher
ELSEVIER SCI LTD
ISSN
0031-3203
Publication details
2025-02, Vol. 158.
JCR
Impact factor
ISBN
Funding
National Key R&D Program of China [2023YFB4502804]; National Science Fund for Distinguished Young Scholars [62025603]; National Natural Science Foundation of China [U21B2037, U22B2051, 62072389, 62302411, U21A20472]; China Postdoctoral Science Foundation [2023M732948]; Natural Science Foundation of Fujian Province of China [2021J01002, 2022J06001]
Conference name
Conference location
Conference start date
Conference end date
Keywords
Image captioning; Multi-modal mixup; Data augmentation; Discriminative captioning
Abstract
Despite their great success, most image captioning (IC) models are still stuck in the dilemma of generating simple and non-discriminative captions. In this paper, we study this problem from the perspective of data augmentation and propose a novel method called Multi-modal Mixup (M³ixup). Compared with the original Mixup strategy designed for image classification, the proposed M³ixup has three novel designs that mix IC samples in terms of visual features, sentence embeddings, and loss values, respectively. In practice, M³ixup not only enriches the diversity of IC training data but also forces the model to focus more on visual information for captioning, thereby alleviating the negative effect of dataset bias and addressing the issue of simple captioning. To validate M³ixup, we apply it to three baseline models and conduct extensive experiments on MS COCO. The experimental results demonstrate that M³ixup not only improves the discriminability and quality of the generated captions but also yields clear performance gains for the baseline models, i.e., improving the CIDEr score of the state-of-the-art model from 133.8 to 135.3 in offline testing and from 135.4 to 137.1 in online testing.
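The three mixing designs summarized in the abstract follow the general Mixup recipe of convex interpolation with a coefficient drawn from a Beta distribution. The sketch below is only an illustration of that idea under our own assumptions: the variable names, the captioning-model interface, the use of token-level cross-entropy, and the Beta prior are hypothetical and are not taken from the paper.

```python
import torch
import torch.nn.functional as F

def mix(a: torch.Tensor, b: torch.Tensor, lam: float) -> torch.Tensor:
    """Convex combination of two tensors, as in the original Mixup."""
    return lam * a + (1.0 - lam) * b

def multimodal_mixup_loss(model, vis_i, vis_j, emb_i, emb_j,
                          tgt_i, tgt_j, alpha: float = 0.2) -> torch.Tensor:
    """Illustrative training step that mixes visual features, sentence
    embeddings, and loss values (all names and interfaces are assumptions)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    mixed_vis = mix(vis_i, vis_j, lam)    # visual-feature mixing
    mixed_emb = mix(emb_i, emb_j, lam)    # sentence-embedding mixing
    logits = model(mixed_vis, mixed_emb)  # hypothetical captioning model
    # loss-value mixing: interpolate the losses w.r.t. both original targets
    return (lam * F.cross_entropy(logits, tgt_i)
            + (1.0 - lam) * F.cross_entropy(logits, tgt_j))

# Minimal usage with dummy tensors and a toy model (illustration only).
class ToyCaptioner(torch.nn.Module):
    def __init__(self, vis_dim=2048, emb_dim=512, vocab=1000):
        super().__init__()
        self.proj = torch.nn.Linear(vis_dim + emb_dim, vocab)
    def forward(self, vis, emb):
        return self.proj(torch.cat([vis, emb], dim=-1))

model = ToyCaptioner()
vis_i, vis_j = torch.randn(4, 2048), torch.randn(4, 2048)
emb_i, emb_j = torch.randn(4, 512), torch.randn(4, 512)
tgt_i, tgt_j = torch.randint(0, 1000, (4,)), torch.randint(0, 1000, (4,))
loss = multimodal_mixup_loss(model, vis_i, vis_j, emb_i, emb_j, tgt_i, tgt_j)
loss.backward()
```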
Research areas
Computer Science, Artificial Intelligence; Engineering, Electrical & Electronic
WOS accession number
WOS:001312631400001
EI accession number
20243617005938
DOI
10.1016/j.patcog.2024.110941
ESI
Indexed in
SCIE, EI
