Visual Language Understanding in Construction Process Based on Visual Question Answering
The widespread application of monitoring systems on construction sites has generated abundant data; however, constrained by limited analysis methods, the value of these data remains inadequately realized. Natural language, as the most direct means of expression, is the most convenient and understandable medium in construction management. Using multi-modal visual language models to answer natural language questions about construction scenes can greatly assist in obtaining construction site information and achieving intelligent construction management, yet current multi-modal research on construction sites remains insufficient. This paper establishes a construction visual question answering dataset comprising over 19 000 Q&A pairs with corresponding images for training visual question answering models applicable to construction sites. Furthermore, a construction question answering model based on multi-modal attention mechanisms and a pre-trained visual Transformer is proposed. The model achieves an accuracy of 79.3% on the test set, indicating the significant potential of multi-modal visual language understanding for acquiring construction information and providing an effective information foundation for intelligent construction management.