Transformer network for stereo matching of weak-texture objects
Objective In recent years, the use of neural networks for stereo matching has become a major topic in computer vision. Stereo matching is a classic and computationally intensive task, widely used in advanced visual processing applications such as 3D reconstruction, autonomous driving, and augmented reality. Given a pair of distortion-corrected stereo images, the goal of stereo matching is to match corresponding pixels along the epipolar lines and compute the horizontal displacement, also known as disparity. Many researchers have explored deep learning-based stereo matching methods, achieving promising results. Convolutional neural networks are often used to construct feature extractors for stereo matching. Although convolution-based feature extractors have yielded significant performance improvements, such networks remain constrained by their fundamental operating unit, the "convolution". By definition, convolution is a linear operator with a limited receptive field, so achieving a sufficiently broad contextual representation requires stacking many convolutional layers in deep architectures. This limitation becomes particularly pronounced in stereo matching, where captured stereo image pairs inevitably contain large weak-texture regions, and substantial computational resources are needed to obtain comprehensive global feature representations through repeated convolutional stacking. To address this issue, we build a dense feature extraction Transformer (FET) for stereo matching that combines Transformer and convolution blocks.

Method In the context of stereo matching, FET exhibits three key advantages. First, when processing high-resolution stereo image pairs, the pyramid pooling window inside each Transformer block allows us to maintain linear computational complexity while obtaining a sufficiently broad context representation, which addresses the feature scarcity caused by local weak textures. Second, we use convolution and transposed-convolution blocks to implement subsampling and upsampling with overlapping patch embeddings, which ensures that the features around every point are captured as comprehensively as possible to facilitate fine-grained matching. Third, we employ a skip-query strategy for feature fusion between the encoder and the decoder to transmit information efficiently. Finally, we adopt the attention-based pixel matching strategy of the stereo Transformer (STTR) to realize a purely Transformer-based architecture. This strategy truncates the summation of matching probabilities within fixed regions to output more reasonable occlusion confidence values.
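To make the pyramid pooling window concrete, below is a minimal sketch of attention computed over pyramid-pooled keys and values, assuming the block follows the common pooled-attention formulation; the class name, head count, and pooling ratios are illustrative assumptions rather than values from the paper.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PoolingAttention(nn.Module):
        # Attention whose keys/values come from pooled copies of the feature
        # map, so the K/V sequence is much shorter than the query sequence
        # and the overall cost stays roughly linear in the number of tokens.
        def __init__(self, dim, num_heads=4, pool_ratios=(4, 8, 16)):
            super().__init__()
            self.num_heads = num_heads
            self.scale = (dim // num_heads) ** -0.5
            self.q = nn.Linear(dim, dim)
            self.kv = nn.Linear(dim, 2 * dim)
            self.proj = nn.Linear(dim, dim)
            self.pool_ratios = pool_ratios

        def forward(self, x, H, W):
            # x: (B, H*W, C) tokens of an H x W feature map
            B, N, C = x.shape
            d = C // self.num_heads
            q = self.q(x).reshape(B, N, self.num_heads, d).transpose(1, 2)

            # Pool the map at several ratios; the total K/V length is about
            # N/16 + N/64 + N/256, far shorter than N itself.
            feat = x.transpose(1, 2).reshape(B, C, H, W)
            pooled = []
            for r in self.pool_ratios:
                p = F.adaptive_avg_pool2d(feat, (max(H // r, 1), max(W // r, 1)))
                pooled.append(p.flatten(2).transpose(1, 2))  # (B, h*w, C)
            k, v = self.kv(torch.cat(pooled, dim=1)).chunk(2, dim=-1)
            k = k.reshape(B, -1, self.num_heads, d).transpose(1, 2)
            v = v.reshape(B, -1, self.num_heads, d).transpose(1, 2)

            attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
            out = (attn @ v).transpose(1, 2).reshape(B, N, C)
            return self.proj(out)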
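The overlapping patch embeddings can likewise be sketched with a strided convolution whose kernel is larger than its stride, paired with a transposed convolution on the decoder side; the kernel sizes and normalization choice here are assumptions, not the paper's exact configuration.

    import torch.nn as nn

    class OverlapPatchEmbed(nn.Module):
        # Halves the resolution; kernel 3 with stride 2 makes adjacent
        # patches share pixels, so no point's neighborhood is dropped.
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)
            self.norm = nn.BatchNorm2d(out_ch)

        def forward(self, x):
            return self.norm(self.proj(x))

    class PatchExpand(nn.Module):
        # Transposed convolution that exactly doubles the resolution.
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.proj = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=3, stride=2,
                                           padding=1, output_padding=1)
            self.norm = nn.BatchNorm2d(out_ch)

        def forward(self, x):
            return self.norm(self.proj(x))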
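One plausible reading of the skip-query strategy is cross-attention in which decoder tokens act as queries against the encoder's skip features at the same resolution; the sketch below reflects that interpretation only, and the module and argument names are hypothetical.

    import torch.nn as nn

    class SkipQueryFusion(nn.Module):
        def __init__(self, dim, num_heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm_q = nn.LayerNorm(dim)
            self.norm_kv = nn.LayerNorm(dim)

        def forward(self, dec_tokens, enc_tokens):
            # dec_tokens, enc_tokens: (B, N, C) features at matching scales.
            # The decoder queries the encoder skip connection instead of
            # concatenating it, pulling across only the content it requests.
            q = self.norm_q(dec_tokens)
            kv = self.norm_kv(enc_tokens)
            fused, _ = self.attn(q, kv, kv)
            return dec_tokens + fused  # residual keeps the decoder content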
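The truncated matching-probability rule adopted from STTR can be illustrated as follows: for each left-image pixel, the matching probability is summed only inside a small window around the matched location, and a low sum is read as occlusion. The window half-width w and the tensor layout are assumptions for illustration.

    import torch

    def occlusion_confidence(prob, match_idx, w=2):
        # prob: (B, W_left, W_right) matching probabilities along one
        # epipolar line; match_idx: (B, W_left) long tensor holding the
        # index of each pixel's matched right-image location.
        B, Wl, Wr = prob.shape
        offsets = torch.arange(-w, w + 1, device=prob.device)       # (2w+1,)
        idx = (match_idx.unsqueeze(-1) + offsets).clamp(0, Wr - 1)  # (B, Wl, 2w+1)
        conf = prob.gather(2, idx).sum(dim=-1)  # truncated probability sum
        return conf  # values near 0 flag likely occluded pixels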
Result We implemented our model in the PyTorch framework and trained it on an NVIDIA RTX 3090 GPU. We employed mixed-precision training to reduce GPU memory consumption and improve training speed. However, training a pure Transformer architecture in mixed precision proved unstable: the loss diverged after only a few iterations. To address this issue, we modified the order of computation of the attention scores to suppress the associated overflows, and we restructured the attention calculation based on the additive invariance of the softmax operation (its output is unchanged when a constant is subtracted from all of its inputs). Ablation experiments were conducted on the Scene Flow dataset. Results show that the proposed network achieves an absolute pixel distance of 0.33, an outlier pixel ratio of 0.92%, and a 98% overlap-prediction intersection over union (IoU). Additional comparative experiments were conducted on the KITTI-2015 dataset to validate the effectiveness of the model in real-world driving scenarios. In these experiments, the proposed method achieved an average outlier percentage of 1.78%, outperforming mainstream methods such as STTR. Moreover, in tests on the KITTI-2015, MPI-Sintel, and Middlebury-2014 datasets, the proposed model demonstrated strong generalization capability. Subsequently, considering the limited definition of weak-texture levels in currently available public datasets, we used a clustering approach to filter images from the Scene Flow test set. Each pixel in an image was treated as a sample, with its RGB values serving as the feature dimensions. This clustering quantifies the number of distinct pixel categories within each image, which provides a measure of how strong or weak its texture is. The images were then categorized into "difficult", "moderate", and "easy" cases according to the number of clusters. In a comparative analysis, our approach consistently outperformed existing methods across all three categories, with a particularly notable improvement in the "difficult" category.

Conclusion For the stereo matching task, we propose a feature extractor based on the Transformer architecture. First, we transplant the encoder-decoder architecture of the Transformer into the feature extractor, which effectively combines the inductive bias of convolutions with the global modeling capability of the Transformer. In addition, the Transformer-based feature extractor captures a broader range of contextual representations, which partially alleviates the region ambiguity caused by local weak textures. Furthermore, we introduce a skip-query strategy between the encoder and decoder to achieve efficient information transfer, which mitigates the semantic discrepancy between them. We also design a spatial pooling window strategy to reduce the significant computational burden resulting from overlapping patch embeddings, which keeps the attention computation of the model within linear complexity. Experimental results demonstrate significant improvements in weak-texture region prediction, occluded-region prediction, and domain generalization compared with related methods.
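For reference, the two mixed-precision fixes described in the Result section correspond to standard numerical tricks, sketched here under that assumption: scale the queries before the matrix product so the intermediate scores stay inside the float16 range, and subtract the row-wise maximum before the softmax, which leaves the result unchanged.

    import torch

    def stable_attention(q, k, v, scale):
        # q, k, v: (B, heads, N, d), possibly float16 under autocast.
        q = q * scale                     # scale first, not after the matmul
        scores = q @ k.transpose(-2, -1)  # (B, heads, N, N)
        # softmax(x) == softmax(x - c): shifting by the row max prevents
        # exp() from overflowing in half precision.
        scores = scores - scores.amax(dim=-1, keepdim=True)
        return scores.softmax(dim=-1) @ v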
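The clustering-based texture grading from the Result section can be reconstructed roughly as below: every pixel is a 3-D RGB sample, and the number of clusters that cover a meaningful share of the image serves as the texture score. The cluster count k, the share threshold, and the category cutoffs are all illustrative; the paper does not list them here.

    import numpy as np
    from sklearn.cluster import KMeans

    def texture_level(image, k=32, min_share=0.01):
        # image: (H, W, 3) uint8 RGB array
        pixels = image.reshape(-1, 3).astype(np.float32)
        labels = KMeans(n_clusters=k, n_init=4, random_state=0).fit_predict(pixels)
        # Count only clusters covering a meaningful fraction of the pixels.
        shares = np.bincount(labels, minlength=k) / len(labels)
        n_effective = int((shares > min_share).sum())
        if n_effective <= 10:
            return "difficult"  # few dominant colors -> weak texture
        if n_effective <= 30:
            return "moderate"
        return "easy"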