Ingredient-guided Multimodal Self-distillation for Food Image Segmentation

Hou Sujuan (侯素娟)¹, Sun Yuejuan (孙月娟)¹, Min Weiqing (闵巍庆)², Wang Ruiping (王瑞平)², Jiang Shuqiang (蒋树强)²

Author information

1. Shandong Normal University, Jinan 250358
2. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190

Abstract

Objectives: Accurately identifying and segmenting the different ingredient regions in food images is essential for food nutrition analysis and dietary health management. However, most existing image segmentation models rely on image input alone and often fail to capture the subtle distinguishing features of food images with small visual differences, which reduces segmentation accuracy. This paper addresses the limitations of single-modality segmentation by using text information to give the model richer contextual information and by applying self-distillation to guide it toward effective segmentation of food images. Methods: A multimodal self-distillation segmentation model guided by ingredient information is proposed. The model uses the Contrastive Language-Image Pre-training (CLIP) model to capture ingredient information, fuses it with image knowledge, and exploits the strengths of diffusion models in dense prediction to segment food images accurately. Results: On the benchmark dataset FoodSeg103, the proposed model achieves an mIoU of 47.93%, surpassing the current best-performing FoodSAM model by 1.51 percentage points. On the benchmark dataset UEC-FoodPIX Complete, it achieves an mIoU of 75.13%, 8.99 percentage points higher than FoodSAM. Conclusions: The proposed multimodal self-distillation network performs well on food image segmentation, confirming that ingredient information effectively guides the segmentation task and improves segmentation accuracy, and it offers a new solution for food image analysis.
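The reported metric, mIoU (mean intersection over union), can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `mean_iou`, the flattened label lists, and the choice to skip classes absent from both prediction and ground truth are assumptions made for illustration, and benchmark toolkits may accumulate counts over a whole dataset or treat absent classes differently.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that occur in the prediction or the ground truth.

    pred and target are equal-length sequences of integer class labels,
    one per pixel (a flattened segmentation mask). Hypothetical helper.
    """
    intersection = [0] * num_classes   # pixels where pred == target == c
    pred_count = [0] * num_classes     # pixels predicted as c
    target_count = [0] * num_classes   # pixels labelled as c

    for p, t in zip(pred, target):
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            intersection[p] += 1

    ious = []
    for c in range(num_classes):
        union = pred_count[c] + target_count[c] - intersection[c]
        if union > 0:                  # skip classes absent from both masks
            ious.append(intersection[c] / union)

    return sum(ious) / len(ious) if ious else 0.0


# Toy example with three classes and six pixels.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
```

In this toy case the per-class IoUs are 1/3, 2/3 and 1/2, so the mean is 0.5. Because mIoU is itself a percentage, the gaps to FoodSAM quoted in the abstract are differences in percentage points rather than relative improvements.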

Key words

food image/image segmentation/multimodal/self-distillation

Publication year

2024

Journal of Chinese Institute of Food Science and Technology

Chinese Institute of Food Science and Technology

Indexed in: CSTPCD, CSCD, Peking University Core Journals (北大核心), EI

Impact factor: 1.079

ISSN: 1009-7848
