Transformer network for stereo matching of weak-texture objects
Objective In recent years, the use of neural networks for stereo matching has become a major topic in computer vision. Stereo matching is a classic and computationally intensive task, widely used in advanced visual processing applications such as 3D reconstruction, autonomous driving, and augmented reality. Given a pair of distortion-corrected stereo images, the goal of stereo matching is to match corresponding pixels along the epipolar lines and compute the horizontal displacement, also known as disparity. Many researchers have explored deep learning-based stereo matching methods, achieving promising results. Convolutional neural networks are often used to construct feature extractors for stereo matching. Although convolution-based feature extractors have yielded significant performance improvements, such networks remain constrained by their fundamental operating unit, the "convolution". By definition, convolution is a linear operator with a limited receptive field, so achieving a sufficiently broad contextual representation requires stacking many convolutional layers in deep architectures. This limitation becomes particularly pronounced in stereo matching, where captured stereo image pairs inevitably contain large weak-texture regions, and substantial computational resources are needed to obtain comprehensive global feature representations through repeated convolutional stacking. To address this issue, we build a dense feature extraction Transformer (FET) for stereo matching that combines Transformer and convolution blocks.

Method In the context of stereo matching, FET exhibits three key advantages. First, when processing high-resolution stereo image pairs, the pyramid pooling window inside each Transformer block allows us to maintain linear computational complexity while obtaining a sufficiently broad context representation, which addresses the feature scarcity caused by local weak textures. Second, we use convolution and transposed-convolution blocks to implement subsampling and upsampling with overlapping patch embeddings, which ensures that the features around every point are captured as comprehensively as possible to facilitate fine-grained matching. Third, we employ a skip-query strategy for feature fusion between the encoder and the decoder to transmit information efficiently. Finally, we adopt the attention-based pixel matching strategy of the stereo Transformer (STTR) to realize a purely Transformer-based architecture. This strategy truncates the summation of matching probabilities within fixed regions to output more reasonable occlusion confidence values.
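To make the pyramid pooling window concrete, below is a minimal sketch of attention computed over pyramid-pooled keys and values, assuming the block follows the common pooled-attention formulation; the class name, head count, and pooling ratios are illustrative assumptions rather than values from the paper.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PoolingAttention(nn.Module):
        # Attention whose keys/values come from pooled copies of the feature
        # map, so the K/V sequence is much shorter than the query sequence
        # and the overall cost stays roughly linear in the number of tokens.
        def __init__(self, dim, num_heads=4, pool_ratios=(4, 8, 16)):
            super().__init__()
            self.num_heads = num_heads
            self.scale = (dim // num_heads) ** -0.5
            self.q = nn.Linear(dim, dim)
            self.kv = nn.Linear(dim, 2 * dim)
            self.proj = nn.Linear(dim, dim)
            self.pool_ratios = pool_ratios

        def forward(self, x, H, W):
            # x: (B, H*W, C) tokens of an H x W feature map
            B, N, C = x.shape
            d = C // self.num_heads
            q = self.q(x).reshape(B, N, self.num_heads, d).transpose(1, 2)

            # Pool the map at several ratios; the total K/V length is about
            # N/16 + N/64 + N/256, far shorter than N itself.
            feat = x.transpose(1, 2).reshape(B, C, H, W)
            pooled = []
            for r in self.pool_ratios:
                p = F.adaptive_avg_pool2d(feat, (max(H // r, 1), max(W // r, 1)))
                pooled.append(p.flatten(2).transpose(1, 2))  # (B, h*w, C)
            k, v = self.kv(torch.cat(pooled, dim=1)).chunk(2, dim=-1)
            k = k.reshape(B, -1, self.num_heads, d).transpose(1, 2)
            v = v.reshape(B, -1, self.num_heads, d).transpose(1, 2)

            attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
            out = (attn @ v).transpose(1, 2).reshape(B, N, C)
            return self.proj(out)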
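The overlapping patch embeddings can likewise be sketched with a strided convolution whose kernel is larger than its stride, paired with a transposed convolution on the decoder side; the kernel sizes and normalization choice here are assumptions, not the paper's exact configuration.

    import torch.nn as nn

    class OverlapPatchEmbed(nn.Module):
        # Halves the resolution; kernel 3 with stride 2 makes adjacent
        # patches share pixels, so no point's neighborhood is dropped.
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)
            self.norm = nn.BatchNorm2d(out_ch)

        def forward(self, x):
            return self.norm(self.proj(x))

    class PatchExpand(nn.Module):
        # Transposed convolution that exactly doubles the resolution.
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.proj = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=3, stride=2,
                                           padding=1, output_padding=1)
            self.norm = nn.BatchNorm2d(out_ch)

        def forward(self, x):
            return self.norm(self.proj(x))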
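One plausible reading of the skip-query strategy is cross-attention in which decoder tokens act as queries against the encoder's skip features at the same resolution; the sketch below reflects that interpretation only, and the module and argument names are hypothetical.

    import torch.nn as nn

    class SkipQueryFusion(nn.Module):
        def __init__(self, dim, num_heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm_q = nn.LayerNorm(dim)
            self.norm_kv = nn.LayerNorm(dim)

        def forward(self, dec_tokens, enc_tokens):
            # dec_tokens, enc_tokens: (B, N, C) features at matching scales.
            # The decoder queries the encoder skip connection instead of
            # concatenating it, pulling across only the content it requests.
            q = self.norm_q(dec_tokens)
            kv = self.norm_kv(enc_tokens)
            fused, _ = self.attn(q, kv, kv)
            return dec_tokens + fused  # residual keeps the decoder content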
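The truncated matching-probability rule adopted from STTR can be illustrated as follows: for each left-image pixel, the matching probability is summed only inside a small window around the matched location, and a low sum is read as occlusion. The window half-width w and the tensor layout are assumptions for illustration.

    import torch

    def occlusion_confidence(prob, match_idx, w=2):
        # prob: (B, W_left, W_right) matching probabilities along one
        # epipolar line; match_idx: (B, W_left) long tensor holding the
        # index of each pixel's matched right-image location.
        B, Wl, Wr = prob.shape
        offsets = torch.arange(-w, w + 1, device=prob.device)       # (2w+1,)
        idx = (match_idx.unsqueeze(-1) + offsets).clamp(0, Wr - 1)  # (B, Wl, 2w+1)
        conf = prob.gather(2, idx).sum(dim=-1)  # truncated probability sum
        return conf  # values near 0 flag likely occluded pixels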
Result We implemented our model in the PyTorch framework and trained it on an NVIDIA RTX 3090 GPU. We employed mixed-precision training to reduce GPU memory consumption and improve training speed. However, training a pure Transformer architecture in mixed precision proved unstable: the loss diverged after only a few iterations. To address this issue, we modified the order of computation of the attention scores to suppress the associated overflows, and we restructured the attention calculation based on the additive invariance of the softmax operation (its output is unchanged when a constant is subtracted from all of its inputs). Ablation experiments were conducted on the Scene Flow dataset. Results show that the proposed network achieves an absolute pixel distance of 0.33, an outlier pixel ratio of 0.92%, and a 98% overlap-prediction intersection over union (IoU). Additional comparative experiments were conducted on the KITTI-2015 dataset to validate the effectiveness of the model in real-world driving scenarios. In these experiments, the proposed method achieved an average outlier percentage of 1.78%, outperforming mainstream methods such as STTR. Moreover, in tests on the KITTI-2015, MPI-Sintel, and Middlebury-2014 datasets, the proposed model demonstrated strong generalization capability. Subsequently, considering the limited definition of weak-texture levels in currently available public datasets, we used a clustering approach to filter images from the Scene Flow test set. Each pixel in an image was treated as a sample, with its RGB values serving as the feature dimensions. This clustering quantifies the number of distinct pixel categories within each image, which provides a measure of how strong or weak its texture is. The images were then categorized into "difficult", "moderate", and "easy" cases according to the number of clusters. In a comparative analysis, our approach consistently outperformed existing methods across all three categories, with a particularly notable improvement in the "difficult" category.

Conclusion For the stereo matching task, we propose a feature extractor based on the Transformer architecture. First, we transplant the encoder-decoder architecture of the Transformer into the feature extractor, which effectively combines the inductive bias of convolutions with the global modeling capability of the Transformer. In addition, the Transformer-based feature extractor captures a broader range of contextual representations, which partially alleviates the region ambiguity caused by local weak textures. Furthermore, we introduce a skip-query strategy between the encoder and decoder to achieve efficient information transfer, which mitigates the semantic discrepancy between them. We also design a spatial pooling window strategy to reduce the significant computational burden resulting from overlapping patch embeddings, which keeps the attention computation of the model within linear complexity. Experimental results demonstrate significant improvements in weak-texture region prediction, occluded-region prediction, and domain generalization compared with related methods.
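For reference, the two mixed-precision fixes described in the Result section correspond to standard numerical tricks, sketched here under that assumption: scale the queries before the matrix product so the intermediate scores stay inside the float16 range, and subtract the row-wise maximum before the softmax, which leaves the result unchanged.

    import torch

    def stable_attention(q, k, v, scale):
        # q, k, v: (B, heads, N, d), possibly float16 under autocast.
        q = q * scale                     # scale first, not after the matmul
        scores = q @ k.transpose(-2, -1)  # (B, heads, N, N)
        # softmax(x) == softmax(x - c): shifting by the row max prevents
        # exp() from overflowing in half precision.
        scores = scores - scores.amax(dim=-1, keepdim=True)
        return scores.softmax(dim=-1) @ v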
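The clustering-based texture grading from the Result section can be reconstructed roughly as below: every pixel is a 3-D RGB sample, and the number of clusters that cover a meaningful share of the image serves as the texture score. The cluster count k, the share threshold, and the category cutoffs are all illustrative; the paper does not list them here.

    import numpy as np
    from sklearn.cluster import KMeans

    def texture_level(image, k=32, min_share=0.01):
        # image: (H, W, 3) uint8 RGB array
        pixels = image.reshape(-1, 3).astype(np.float32)
        labels = KMeans(n_clusters=k, n_init=4, random_state=0).fit_predict(pixels)
        # Count only clusters covering a meaningful fraction of the pixels.
        shares = np.bincount(labels, minlength=k) / len(labels)
        n_effective = int((shares > min_share).sum())
        if n_effective <= 10:
            return "difficult"  # few dominant colors -> weak texture
        if n_effective <= 30:
            return "moderate"
        return "easy"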