Because of the limitations of single-source data, joint classification of Earth observations using multi-source remote sensing data is a promising but challenging approach. However, gaps in imaging mechanisms and imbalances in information content between multi-source data mean that existing methods still fall short in both feature extraction from single-source data and feature fusion across multi-source data. In this paper, we propose a classification method for hyperspectral image (HSI) and synthetic aperture radar (SAR) data based on multi-scale heterogeneous feature extraction, named the multi-scale heterogeneous cross-modal attention network (MSHCNet). Specifically, a multi-scale heterogeneous feature extraction module is designed to extract joint spatial-spectral features from HSI and SAR images. Within this module, depthwise separable convolution reduces the number of parameters, and residual connections enhance the fusion of features at different levels. In addition, a cross-modal fusion attention module is designed to further exploit the channel features of HSI and the spatial features of SAR images. To achieve effective alignment and fusion of complementary features, the mapping relationship between cross-modal features at each position is considered. Compared with other methods, MSHCNet improves overall accuracy (OA) by at least 2.91% and 6.81% on the Augsburg and Berlin datasets, respectively, demonstrating excellent classification performance.
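The parameter saving that motivates the use of depthwise separable convolution can be illustrated with a quick count. The sketch below is generic, not the paper's exact layer configuration; the channel counts and kernel size are illustrative assumptions:

```python
def standard_conv_params(c_in, c_out, k):
    # Standard convolution: one k x k filter per (input, output) channel pair.
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    # Depthwise step: one k x k filter per input channel;
    # pointwise step: a 1 x 1 convolution that mixes channels.
    return c_in * k * k + c_in * c_out

# Illustrative sizes only: 64 -> 128 channels with 3 x 3 kernels.
std = standard_conv_params(64, 128, 3)        # 73728 weights
sep = depthwise_separable_params(64, 128, 3)  # 8768 weights
print(std, sep, round(std / sep, 1))          # roughly 8x fewer parameters
```

The factorization trades a small loss in expressiveness for a large reduction in weights, which is why it is a common choice when fusing several feature-extraction branches.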