Most existing text-to-image person re-identification methods adapt pre-trained vision-language models, such as contrastive language-image pretraining (CLIP), to the re-identification task by fine-tuning, thereby inheriting their strong joint vision-language representations. However, these methods consider only task adaptation for the downstream re-identification task while ignoring the data adaptation required by differences between data domains, and they still struggle to capture structured knowledge, such as object attributes and the relationships between objects. To address these problems, a text-to-image person re-identification method based on scene graph knowledge is proposed, which adopts a two-stage training strategy. In the first stage, the image encoder and text encoder of the CLIP model are frozen, and prompt learning is used to optimize learnable prompt tokens so that the downstream data domain adapts to CLIP's original pre-training data domain, effectively solving the domain-adaptation problem. In the second stage, while the CLIP model is fine-tuned, semantic negative sampling and a scene graph encoder module are introduced. First, hard samples with similar semantics are generated from scene graphs, and a triplet loss is added as an auxiliary optimization objective. Second, the scene graph encoder takes the scene graph as input, enhancing CLIP's ability to acquire structured knowledge during this stage. The effectiveness of the proposed method is verified on three widely used datasets.
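The two-stage strategy described above can be illustrated with a minimal, self-contained sketch. All names here (`CLIPBackbone`, `stage1_prompt_learning`, `stage2_finetune`, `triplet_loss`) are hypothetical assumptions for illustration only, not the authors' actual implementation; the "gradient step" is mocked, and real training would use a deep-learning framework.

```python
def triplet_loss(d_pos, d_neg, margin=0.2):
    """Triplet loss on precomputed distances: push the positive pair to be
    closer than the semantically similar hard negative by at least `margin`."""
    return max(0.0, d_pos - d_neg + margin)

class CLIPBackbone:
    """Stand-in for CLIP's image and text encoders (hypothetical class)."""
    def __init__(self):
        self.trainable = False  # stage 1: both encoders are frozen

    def stage1_prompt_learning(self, prompt_tokens, lr=0.01):
        # Only the learnable prompt tokens are optimized, adapting the
        # downstream ReID data domain to CLIP's pre-training domain.
        assert not self.trainable, "encoders must stay frozen in stage 1"
        return [t - lr for t in prompt_tokens]  # mock optimization step

    def stage2_finetune(self):
        # Stage 2: unfreeze CLIP and add the scene-graph-driven objectives.
        self.trainable = True

model = CLIPBackbone()
prompts = model.stage1_prompt_learning([0.5, 0.3])
model.stage2_finetune()
# A scene graph yields a hard-negative caption with similar semantics; the
# triplet loss keeps it farther from the image than the matching caption.
loss = triplet_loss(d_pos=0.4, d_neg=0.5, margin=0.2)
```

In this toy form, the hinge in `triplet_loss` only produces a nonzero gradient signal when the scene-graph-generated hard negative is not yet separated from the positive pair by the margin, which is exactly why such negatives are more informative than randomly sampled ones.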
Key words
Scene Graph / Prompt Learning / Text-to-Image Person Re-identification (T2IReID) / Contrastive Language-Image Pretraining (CLIP)