Visual Language Understanding in Construction Process Based on Visual Question Answering

Monitoring systems are widely deployed on construction sites and generate large volumes of data, but limited analysis methods mean that the informational value of these data is not fully realized. Natural language is the most direct form of expression and the easiest to use and understand in construction management. Multi-modal vision-language models that answer natural language questions about a site can greatly help in obtaining construction site information and supporting intelligent construction management; however, multi-modal research targeting construction sites remains insufficient. To address this, a construction visual question answering dataset is established; after data augmentation, it comprises over 19 000 question-answer pairs with corresponding images and is used to train visual question answering models suited to construction sites. A construction question answering model based on a multi-head attention mechanism and a pre-trained vision Transformer is proposed. The model achieves an accuracy of about 79.3% on the test set, indicating that multi-modal visual language understanding has considerable potential for acquiring construction information and can provide an effective information foundation for intelligent construction management.

visual question answering; computer vision; natural language; multi-modal; deep learning; management

张冰涵、杨彬、张其林

College of Civil Engineering, Tongji University, Shanghai 200092, China

National Key Research and Development Program of China (2022YFC3801702)

2024

施工技术(中英文)
亚太建设科技信息研究院 中国建筑设计研究院 中国建筑工程总公司 中国土木工程学会

Impact factor: 1.244
ISSN: 2097-0897
Year, volume (issue): 2024, 53(17)