English-language paper
-
Document type
-
Journal article (JA)
-
Title
-
M³ixup: A multi-modal data augmentation approach for image captioning
-
Authors
-
Li, Yinan; Ji, Jiayi; Sun, Xiaoshuai; Zhou, Yiyi; Luo, Yunpeng; Ji, Rongrong
-
Author affiliations
-
[Li, Yinan; Ji, Jiayi; Sun, Xiaoshuai; Zhou, Yiyi; Luo, Yunpeng; Ji, Rongrong] Xiamen Univ, Key Lab Multimedia Trusted Percept & Efficient Com, Minist Educ China, Xiamen 361005, Peoples R China.
-
Corresponding author address
-
Xiamen Univ, Sch Informat, Room 110-2,Bldg 5, Xiamen, Peoples R China.
-
Email
-
xssun@xmu.edu.cn
-
ResearchID
-
-
ORCID
-
-
Journal
-
Pattern Recognition
-
Publisher
-
ELSEVIER SCI LTD
-
ISSN
-
0031-3203
-
Publication info
-
2025-02, vol. 158
-
JCR
-
-
Impact factor
-
-
ISBN
-
-
Funding
-
National Key R&D Program of China [2023YFB4502804]; National Science Fund for Distinguished Young Scholars [62025603]; National Natural Science Foundation of China [U21B2037, U22B2051, 62072389, 62302411, U21A20472]; China Postdoctoral Science Foundation [2023M732948]; Natural Science Foundation of Fujian Province of China [2021J01002, 2022J06001]
-
Conference name
-
-
Conference location
-
-
Conference start date
-
-
Conference end date
-
-
Keywords
-
Image captioning; Multi-modal mixup; Data augmentation; Discriminate captioning
-
Abstract
-
Despite the great success, most models in image captioning (IC) are still stuck in the dilemma of generating simple and non-discriminative captions. In this paper, we study this problem from the perspective of data augmentation and propose a novel method called Multi-modal Mixup (M³ixup). Compared with the original Mixup strategy designed for image classification, the proposed M³ixup has three novel designs to mix IC samples from the aspects of visual features, sentence embeddings and loss values, respectively. In practice, M³ixup can not only enrich the diversity of IC training data, but also enforce the model to focus more on visual information for captioning, thereby alleviating the negative effect of dataset bias and addressing the issue of simple captioning. To validate M³ixup, we apply it to three baseline models and conduct extensive experiments on MS COCO. The experimental results demonstrate that our proposed M³ixup can not only improve the discriminability and quality of generated captions, but also help the baseline models obtain obvious performance gains, i.e., improving the CIDEr scores of the state-of-the-art model from 133.8 to 135.3 on off-line testing and 135.4 to 137.1 on online testing.
-
Primary subject category
-
Computer Science, Artificial Intelligence; Engineering, Electrical & Electronic
-
WOS accession number
-
WOS:001312631400001
-
EI accession number
-
20243617005938
-
DOI
-
10.1016/j.patcog.2024.110941
-
ESI
-
-
Indexed in
-
SCIE, EI
-