Scene Graph Knowledge Based Text-to-Image Person Re-identification
Most existing text-to-image person re-identification methods adapt vision-language pre-trained models, such as Contrastive Language-Image Pretraining (CLIP), to the re-identification task by fine-tuning, thereby inheriting their strong joint visual-language representations. However, these methods consider only task adaptation for the downstream re-identification task, ignoring the data adaptation required by differences between data domains, and they still struggle to capture structured knowledge, such as object attributes and the relationships between objects. To solve these problems, a scene graph knowledge based text-to-image person re-identification method is proposed, trained with a two-stage strategy. In the first stage, the image encoder and the text encoder of the CLIP model are frozen, and prompt learning optimizes learnable prompt tokens so that the downstream data domain adapts to the original training data domain of CLIP, effectively addressing the domain adaptation problem. In the second stage, the CLIP model is fine-tuned while two modules are introduced: semantic negative sampling and a scene graph encoder. First, hard samples with similar semantics are generated from the scene graph, and a triplet loss is introduced as an additional optimization objective. Second, the scene graph encoder takes the scene graph as input, enhancing CLIP's ability to acquire structured knowledge. The effectiveness of the proposed method is verified on three widely used datasets.
Scene Graph; Prompt Learning; Text-to-Image Person Re-identification (T2IReID); Contrastive Language-Image Pretraining (CLIP)
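The second training stage pairs each anchor with a semantically similar hard negative generated from the scene graph and adds a triplet loss on the embeddings. As a minimal illustration of that additional objective (not the authors' implementation; the function name, distance choice, and margin value are assumptions), a numpy sketch:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge-style triplet loss on embedding vectors.

    Pulls the anchor toward the positive and pushes it away from the
    hard negative until the gap exceeds `margin` (value illustrative).
    """
    d_ap = np.linalg.norm(anchor - positive)  # anchor-positive distance
    d_an = np.linalg.norm(anchor - negative)  # anchor-negative distance
    return max(0.0, d_ap - d_an + margin)

# Easy negative: already farther than the margin, so the loss is zero.
a = np.array([1.0, 0.0])
easy = triplet_loss(a, np.array([1.0, 0.0]), np.array([0.0, 1.0]))

# Scene-graph hard negative: nearly identical caption (e.g. only one
# attribute changed), so its embedding sits close to the anchor and
# the loss is positive, forcing the model to separate them.
hard = triplet_loss(a, np.array([0.0, 1.0]), np.array([1.0, 0.0]))
```

In batch training this scalar form would be applied per triplet and averaged; the semantic hard negatives are what make the margin constraint informative, since random negatives are usually already well separated.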