Target speaker extraction aims to extract the speech of a specific speaker from mixed audio, usually using enrolled audio of the target speaker as auxiliary information. Existing approaches have two main limitations: the auxiliary speaker-recognition network fails to capture the critical information in the enrolled audio, and there is no interactive learning mechanism between the mixed-audio and enrolled-audio embeddings. These limitations lead to speaker confusion when the enrolled audio differs significantly from the target speech. To address this, a speaker-aware cross-attention speaker extraction network (SACAN) is proposed. First, SACAN introduces an attention-based speaker aggregation module into the speaker-recognition auxiliary network, which effectively aggregates the critical information characterizing the target speaker; it then uses the mixed audio to enhance the target speaker embedding. Next, to promote the integration of the speaker embedding with the mixed-audio embedding, SACAN builds an interactive learning mechanism based on cross-attention, strengthening the model's speaker-awareness. Experimental results show that SACAN improves STOI by 0.0133 and SI-SDRi by 1.0695 over the benchmark model, and speaker-confusion assessments and ablation experiments validate the effectiveness of the proposed modules.
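The cross-attention fusion described above can be sketched as follows. This is a minimal single-head example in NumPy in which the speaker embedding queries the mixed-audio embedding; the dimensions, random projection weights, and function names are illustrative assumptions standing in for learned parameters, not the paper's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(speaker_emb, mix_emb, d_k=64, seed=0):
    """Single-head cross-attention: the target speaker embedding (query)
    attends over the mixed-audio frame embeddings (keys/values).
    Projection matrices are random placeholders for learned weights."""
    rng = np.random.default_rng(seed)
    d_s = speaker_emb.shape[-1]
    d_m = mix_emb.shape[-1]
    W_q = rng.standard_normal((d_s, d_k)) / np.sqrt(d_s)
    W_k = rng.standard_normal((d_m, d_k)) / np.sqrt(d_m)
    W_v = rng.standard_normal((d_m, d_k)) / np.sqrt(d_m)
    Q = speaker_emb @ W_q                             # (1, d_k)
    K = mix_emb @ W_k                                 # (T, d_k)
    V = mix_emb @ W_v                                 # (T, d_k)
    attn = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)   # (1, T) weights over frames
    return attn @ V                                   # (1, d_k) speaker-aware context

# Example: one 256-d speaker embedding attends over 100 mixture frames of dim 512
spk = np.ones((1, 256))
mix = np.random.default_rng(1).standard_normal((100, 512))
ctx = cross_attention(spk, mix)
print(ctx.shape)  # (1, 64)
```

In a full model this context vector would be fused back into the separation network, letting the mixture representation and the speaker representation condition each other rather than concatenating a fixed embedding.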