Towards Comprehensive Multimodal Perception: Introducing the
Touch-Language-Vision Dataset
Original paper: arXiv
Tactility provides crucial support and enhancement for the perception and
interaction capabilities of both humans and robots. Nevertheless, the
multimodal research related to touch primarily focuses on visual and tactile
modalities, with limited exploration in the domain of language. Beyond
vocabulary, sentence-level descriptions contain richer semantics. Motivated by
this, we construct a touch-language-vision dataset named TLV
(Touch-Language-Vision) through human-machine cascade collaboration, featuring
sentence-level descriptions for multimodal alignment. The new dataset is used to
fine-tune our proposed lightweight training framework, STLV-Align (Synergistic
Touch-Language-Vision Alignment), achieving effective semantic alignment with
minimal parameter adjustments (1%). Project Page:
https://xiaoen0.github.io/touch.page/.
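
The abstract does not spell out how STLV-Align keeps the trainable parameter count around 1%. As a rough illustration only, the sketch below assumes a common recipe for parameter-efficient cross-modal alignment: freeze pretrained tactile and text encoders, train only small projection heads plus a temperature, and optimize a CLIP-style contrastive loss over paired tactile samples and sentence-level captions. The class names, dimensions, and loss choice are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of parameter-efficient touch-language alignment (assumed recipe,
# not the paper's STLV-Align implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrozenEncoder(nn.Module):
    """Stand-in for a pretrained tactile or text encoder; its weights stay frozen."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU())
        for p in self.backbone.parameters():
            p.requires_grad = False  # backbone is not updated during alignment

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.backbone(x)


class AlignmentModel(nn.Module):
    """Only the two projection heads and the temperature are trainable."""

    def __init__(self, touch_dim=512, text_dim=768, shared_dim=256):
        super().__init__()
        self.touch_encoder = FrozenEncoder(touch_dim, touch_dim)
        self.text_encoder = FrozenEncoder(text_dim, text_dim)
        self.touch_proj = nn.Linear(touch_dim, shared_dim)    # trainable
        self.text_proj = nn.Linear(text_dim, shared_dim)      # trainable
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~log(1/0.07)

    def forward(self, touch, text):
        t = F.normalize(self.touch_proj(self.touch_encoder(touch)), dim=-1)
        s = F.normalize(self.text_proj(self.text_encoder(text)), dim=-1)
        return t, s


def contrastive_loss(t, s, logit_scale):
    """Symmetric InfoNCE loss pairing each tactile sample with its caption."""
    logits = logit_scale.exp() * t @ s.t()
    labels = torch.arange(t.size(0), device=t.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2


if __name__ == "__main__":
    model = AlignmentModel()
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    # With real, much larger pretrained backbones this fraction would be far
    # smaller (the paper reports roughly 1%).
    print(f"trainable fraction: {trainable / total:.2%}")

    # Toy batch of 8 (tactile feature, sentence embedding) pairs.
    touch = torch.randn(8, 512)
    text = torch.randn(8, 768)
    t, s = model(touch, text)
    loss = contrastive_loss(t, s, model.logit_scale)
    loss.backward()  # gradients reach only the projection heads and the temperature
    print(f"loss: {loss.item():.3f}")
```

The point of the sketch is the freezing pattern: because gradients flow only into the projection heads, the optimizer touches a small fraction of the total parameters, which is the general idea behind "minimal parameter adjustments" in lightweight alignment frameworks.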