Deep embedded Transformer network with spatial-spectral information for unmixing of hyperspectral remote sensing images
Objective In hyperspectral remote sensing, mixed pixels often exist due to the complex surfaces of natural objects and the limited spatial resolution of imaging instruments. A mixed pixel is a pixel in a hyperspectral image that contains the spectral signatures of multiple materials, which hinders the application of hyperspectral images in fields such as target detection, image classification, and environmental monitoring. Therefore, the decomposition (unmixing) of mixed pixels is a central concern in the processing of hyperspectral remote sensing images. Spectral unmixing aims to overcome the limitation of image spatial resolution by extracting the pure spectral signals (endmembers) representing each land cover class and their respective proportions (abundances) within each pixel, based on a spectral mixing model at the sub-pixel level. The rise of deep learning has brought many advanced modeling theories and architectural tools to the field of hyperspectral mixed pixel decomposition and has spawned many deep learning-based unmixing methods. Although these methods have advantages over traditional methods in information mining and generalization performance, deep networks often need to stack multiple network layers to achieve optimal learning outcomes. As a result, deep networks may damage the internal structure of the data during training, which leads to the loss of important information in hyperspectral data and reduces unmixing accuracy. In addition, most existing deep learning-based unmixing methods focus only on spectral information, while the exploitation of spatial information remains limited to shallow processing stages such as filtering and convolution. In recent years, the autoencoder has been one of the research hotspots in deep learning, and many variant networks based on autoencoders have emerged. The Transformer is a novel deep learning network with an autoencoder-like structure. It
has garnered considerable attention in fields such as natural language processing, computer vision, and time series analysis due to its powerful feature representation capability. As a neural network built primarily on the self-attention mechanism, the Transformer can better explore the underlying relationships among different features and more comprehensively aggregate the spectral and spatial correlations of pixels, which enhances abundance learning and improves unmixing accuracy. Although Transformer networks have recently been used to design unmixing methods, directly using unsupervised Transformer models to obtain features can lose many local details and makes it difficult to exploit the long-range dependency properties of Transformers effectively. Method To address these limitations, this study proposes a deep embedded Transformer network (DETN) based on the Transformer-in-Transformer architecture. The network adopts an autoencoder framework consisting of two main parts: node embedding (NE) and blind signal separation. In the first part, the input hyperspectral image is uniformly divided twice, and the resulting image patches are mapped into sub-patch sequences and patch sequences through linear transformations. The sub-patch sequences are then processed by an inner Transformer structure to obtain pixel spectral information and local spatial correlations, which are aggregated into the patch sequences for parameter and information sharing. Finally, with the local detail information in the patch sequences retained, the patch sequences are processed by an outer Transformer structure to output pixel spectral information and global spatial correlations that incorporate the local information. In the second part, the input NE is first reconstructed into an abundance map, which is smoothed with a single 2D convolution layer to suppress noise. A SoftMax layer is used to ensure
the physical meaning of the abundances (non-negativity and sum-to-one). Finally, a single 2D convolution layer reconstructs the image, and the endmembers are optimized and estimated within this convolution layer. Result To evaluate the effectiveness of the proposed method, experiments are conducted on simulated datasets and several real hyperspectral datasets, including the Samson dataset, the Jasper Ridge dataset, and part of the real hyperspectral farmland data of Nanchang City, Jiangxi Province, acquired by the Gaofen-5 satellite and provided by Beijing Shengshi Huayao Technology Co., Ltd. In addition, data from the ZY1E satellite, also provided by Beijing Shengshi Huayao Technology Co., Ltd., are used to obtain partial hyperspectral data of the Port of Marseille, France, for comparative experiments with different methods. The experimental results are quantitatively analyzed using spectral angle distance (SAD) and root mean square error (RMSE). The proposed DETN is compared with several representative unmixing algorithms: fully constrained least squares (FCLS), deep autoencoder networks for hyperspectral unmixing (DAEN), the autoencoder network for hyperspectral unmixing with adaptive abundance smoothing (AAS), the untied denoising autoencoder with sparsity (uDAS), hyperspectral unmixing using deep image prior (UnDIP), and hyperspectral unmixing using Transformer network (DeepTrans-HSU). The results demonstrate that the proposed method outperforms the compared methods in terms of SAD, RMSE, and other evaluation metrics. Conclusion The proposed method effectively captures and preserves the spectral information of pixels at both local and global levels, as well as the spatial correlations among pixels, which enables accurate extraction of endmembers that match the ground-truth spectral features. Moreover, the method produces smooth abundance maps with high spatial consistency, even in regions with hidden details in the image. These
findings validate that the DETN method provides new technical support and theoretical references for addressing the challenges posed by mixed pixels in hyperspectral image unmixing.
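The sub-pixel spectral mixing model referred to in the Objective is most commonly the linear mixing model, in which each pixel is a convex combination of endmember spectra plus noise. A minimal illustrative sketch (all names and dimensions are ours, not taken from the paper):

```python
import numpy as np

# Linear mixing model: x = E @ a + n, with a >= 0 and sum(a) = 1.
rng = np.random.default_rng(0)
n_bands, n_endmembers = 50, 3
E = rng.random((n_bands, n_endmembers))          # endmember spectra (one per column)
a = np.array([0.6, 0.3, 0.1])                    # abundances: non-negative, sum to one
x = E @ a + 0.01 * rng.standard_normal(n_bands)  # observed mixed pixel with noise

# Abundance constraints that any physically meaningful unmixing must respect:
assert (a >= 0).all() and np.isclose(a.sum(), 1.0)
```

Unmixing inverts this model: given many observed pixels x, it estimates both E and the per-pixel abundances a.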
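The two-stage uniform division in the NE part of the Method can be sketched with array reshapes; the image, patch, and sub-patch sizes below are illustrative assumptions, not the paper's settings:

```python
import numpy as np

# Illustrative sizes: a 64 x 64 image with 50 bands, 16 x 16 patches, 4 x 4 sub-patches.
H, W, B = 64, 64, 50
P, S = 16, 4
img = np.random.rand(H, W, B)

# First division: non-overlapping patches -> patch sequence.
patches = (img.reshape(H // P, P, W // P, P, B)
              .transpose(0, 2, 1, 3, 4)
              .reshape(-1, P, P, B))                   # (16, 16, 16, 50): 16 patches

# Second division: each patch -> sub-patch sequence, flattened for linear embedding.
subs = (patches.reshape(-1, P // S, S, P // S, S, B)
               .transpose(0, 1, 3, 2, 4, 5)
               .reshape(len(patches), -1, S * S * B))  # (16, 16, 800): 16 tokens per patch
```

Under these assumed sizes, the inner Transformer would operate on the 16 sub-patch tokens of each patch, and the outer Transformer on the sequence of 16 patch tokens after aggregation.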
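The two evaluation metrics used in the Result section, SAD for endmember accuracy and RMSE for abundance accuracy, can be sketched as follows (the implementations are ours):

```python
import numpy as np

def sad(e_ref, e_est):
    """Spectral angle distance (radians) between reference and estimated endmembers."""
    cos = np.dot(e_ref, e_est) / (np.linalg.norm(e_ref) * np.linalg.norm(e_est))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def rmse(a_ref, a_est):
    """Root mean square error between reference and estimated abundance maps."""
    a_ref, a_est = np.asarray(a_ref, float), np.asarray(a_est, float)
    return float(np.sqrt(np.mean((a_ref - a_est) ** 2)))

spectrum = np.array([0.2, 0.5, 0.9])
scaled = 2.0 * spectrum                   # SAD is invariant to spectral scaling
print(round(sad(spectrum, scaled), 6))    # 0.0
print(round(rmse([[0.6, 0.4]], [[0.5, 0.5]]), 6))  # 0.1
```

SAD compares spectral shapes independently of magnitude, which is why it suits endmember evaluation, while RMSE directly penalizes per-pixel abundance deviations.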