Adaptive Offline Quintuplet Loss for Image-Text Matching

Paper Title

Adaptive Offline Quintuplet Loss for Image-Text Matching

Authors

Tianlang Chen, Jiajun Deng, Jiebo Luo

Abstract

Existing image-text matching approaches typically train the model with a triplet loss over online hard negatives. For each image or text anchor in a training mini-batch, the model is trained to distinguish between a positive and the most confusing negative of the anchor mined from the mini-batch (i.e., the online hard negative). This strategy improves the model's capacity to discover fine-grained correspondences and non-correspondences between image and text inputs. However, the approach has the following drawbacks: (1) the negative selection strategy still gives the model limited chances to learn from very hard-to-distinguish cases; (2) the trained model generalizes weakly from the training set to the testing set; (3) the penalty lacks hierarchy and adaptiveness for hard negatives with different degrees of "hardness". In this paper, we propose solutions by sampling negatives offline from the whole training set. Offline sampling provides "harder" negatives than online hard negatives for the model to distinguish. Based on the offline hard negatives, a quintuplet loss is proposed to improve the model's generalization capability to distinguish positives and negatives. In addition, a novel loss function that combines the knowledge of positives, offline hard negatives and online hard negatives is created. It leverages offline hard negatives as the intermediary to adaptively penalize them based on their distance relations to the anchor. We evaluate the proposed training approach on three state-of-the-art image-text models on the MS-COCO and Flickr30K datasets. Significant performance improvements are observed for all the models, demonstrating the effectiveness and generality of our approach. Code is available at https://github.com/sunnychencool/AOQ.
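
The two negative-selection strategies contrasted in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and does not reproduce the quintuplet loss, whose exact form the abstract does not give. The function names, the margin value, and the simplifying assumption of exactly one matching text per image (so positives lie on the diagonal of the similarity matrix) are all hypothetical choices made for illustration.

```python
def online_hard_negative_triplet_loss(sim, margin=0.2):
    """Hinge-based triplet loss using the hardest negative inside one mini-batch.

    sim[i][j] is the similarity between image i and text j.
    Assumption: image i matches text i only (positives on the diagonal).
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        # Image anchor i: the most similar non-matching text in the batch.
        hardest_text = max(sim[i][j] for j in range(n) if j != i)
        # Text anchor i: the most similar non-matching image in the batch.
        hardest_image = max(sim[j][i] for j in range(n) if j != i)
        total += max(0.0, margin + hardest_text - pos)
        total += max(0.0, margin + hardest_image - pos)
    return total


def mine_offline_hard_negatives(full_sim, k=1):
    """For each image anchor, return the k most similar non-matching texts.

    full_sim covers the whole training set rather than a single mini-batch,
    so the mined negatives are at least as hard as any found online.
    """
    mined = []
    for i, row in enumerate(full_sim):
        candidates = [j for j in range(len(row)) if j != i]
        candidates.sort(key=lambda j: row[j], reverse=True)
        mined.append(candidates[:k])
    return mined


# Toy 3x3 similarity matrix (hypothetical values).
sim = [
    [0.9, 0.8, 0.1],
    [0.2, 0.7, 0.3],
    [0.4, 0.1, 0.6],
]
loss = online_hard_negative_triplet_loss(sim)
negatives = mine_offline_hard_negatives(sim, k=1)
```

In this sketch the online loss only sees negatives that happen to share a mini-batch with the anchor, whereas the offline step ranks every non-matching candidate in the similarity matrix it is given. That difference is the motivation the abstract states for sampling negatives from the whole training set.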
