
Scene Graph Knowledge Based Text-to-Image Person Re-identification

Most existing text-to-image person re-identification methods fine-tune vision-language models such as CLIP (Contrastive Language-Image Pretraining) to adapt them to the person re-identification task and inherit the strong joint vision-language representations of the pretrained model. However, these methods typically consider only task adaptation to the downstream re-identification task while ignoring the data-domain adaptation required by data differences, and they struggle to effectively capture structured knowledge (understanding object attributes and inter-object relationships). To address these problems, a scene graph knowledge based text-to-image person re-identification method built on CLIP-ReID is proposed, using a two-stage training strategy. In the first stage, the image encoder and text encoder of CLIP are frozen, and prompt learning is used to optimize learnable prompt tokens, adapting the downstream data domain to CLIP's original training data domain and thereby solving the data-domain adaptation problem. In the second stage, while fine-tuning CLIP, semantic negative sampling and a scene graph encoder module are introduced: first, semantically similar hard samples are generated from scene graphs and a triplet loss is added as an extra optimization objective; then a scene graph encoder takes the scene graph as input, strengthening CLIP's ability to acquire structured knowledge in this stage. The effectiveness of the proposed method is verified on three widely used datasets.
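The second training stage adds a triplet loss over scene-graph-generated hard negatives. A minimal illustrative sketch of such a loss (not the authors' implementation; the cosine-distance formulation and the margin value of 0.3 are assumptions):

```python
import math

def cos_dist(u, v):
    """Cosine distance between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge triplet loss: pull the matching caption (positive) toward the
    image embedding (anchor) and push a semantically similar hard negative
    away. The margin of 0.3 is illustrative, not the paper's setting."""
    return max(0.0, cos_dist(anchor, positive) - cos_dist(anchor, negative) + margin)

# Toy 2-D embeddings: identical anchor/positive and an orthogonal negative
# give zero loss, since the negative already lies beyond the margin.
print(triplet_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Because hard negatives produced by perturbing the scene graph (e.g., swapping one attribute in the caption) remain close to the anchor, they yield larger, more informative gradients than randomly sampled negatives.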

Scene Graph; Prompt Learning; Text-to-Image Person Re-identification (T2IReID); Contrastive Language-Image Pretraining (CLIP)

WANG Jinxi, LU Mingming


School of Computer Science and Engineering, Central South University, Changsha 410083, China


2024

模式识别与人工智能 (Pattern Recognition and Artificial Intelligence)
Sponsored by: Chinese Association of Automation; National Research Center for Intelligent Computing Systems; Institute of Intelligent Machines, Chinese Academy of Sciences


Indexed in: CSTPCD; Peking University Core Journal List (北大核心)
Impact factor: 0.954
ISSN:1003-6059
Year, Volume (Issue): 2024, 37(11)