In the field of underwater acoustic target recognition, existing methods primarily rely on time-domain or time-frequency-domain techniques, which are highly sensitive to environmental noise and interference. This sensitivity is particularly problematic in complex multipath underwater environments, where signals are prone to disturbances, and relying solely on time-domain and time-frequency-domain features often fails to accurately describe the critical attributes of complex or similar targets. To address this, we propose a Multimodal Cross-Feature Fusion Network (MCFNet) for underwater acoustic target recognition. First, coarse features are extracted from the underwater acoustic data by a one-dimensional sequence feature extraction module and a ResNet-based two-dimensional image feature extraction module. Second, to accurately capture temporal information in the time domain, a Temporal Attention Module (TAM) is designed to extract features across different time steps. Next, a proposed Cross-Attention Feature Fusion Module (CAFM) integrates the multimodal features, enhancing the network's ability to extract and express features. Finally, MCFNet is validated on the DeepShip dataset, achieving an accuracy of 98.80%, which surpasses competing methods. These results confirm the effectiveness of the proposed method.
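To make the cross-attention fusion idea concrete, the following is a minimal, dependency-free sketch of scaled dot-product cross-attention between two modalities: queries from the 1-D sequence branch attend over keys/values from the 2-D image branch. All names, shapes, and the toy inputs are illustrative assumptions, not the paper's actual CAFM implementation (which operates on learned ResNet/TAM embeddings).

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of scores
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def matmul(A, B):
    # plain nested-list matrix product
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def cross_attention(Q, K, V):
    """Cross-attention: softmax(Q K^T / sqrt(d)) V.

    Q comes from one modality (e.g. time-step features from the 1-D
    branch); K and V come from the other (e.g. patch features from the
    2-D spectrogram branch). Returns (fused features, attention weights).
    """
    d = len(Q[0])                                   # shared embedding dim
    Kt = [list(col) for col in zip(*K)]             # transpose K
    scores = matmul(Q, Kt)
    scores = [[s / math.sqrt(d) for s in row] for row in scores]
    weights = [softmax(row) for row in scores]      # one row per query
    return matmul(weights, V), weights

# Toy example (hypothetical sizes): 2 time-step queries, 3 image-patch
# keys/values, embedding dimension d = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
fused, w = cross_attention(Q, K, V)
```

Each fused row is a convex combination of the other modality's features, weighted by query-key similarity, which is what lets the fused representation emphasize the spectrogram regions most relevant to each time step.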