The continuous advancement of speech synthesis and voice conversion technologies poses a major threat to voiceprint recognition systems. To address the difficulty that existing voice spoofing detection methods have in adapting to multiple spoofing types, and their limited ability to detect unknown spoofing attacks, a spoofed speech detection model combining a Convolutional Neural Network (CNN) and a Transformer is proposed. A position-aware feature sequence extraction network based on SE-ResNet18 with embedded Coordinate Attention (CA) is designed; it maps the local time-frequency representation of the speech signal into a high-dimensional feature sequence and introduces two-Dimensional Positional Encoding (2D-PE) to preserve the relative positional relationships between features. A multi-scale self-attention mechanism is proposed to model long-range dependencies between feature sequences at multiple scales, addressing the Transformer's difficulty in capturing local dependencies. Feature Sequence Pooling (SeqPool) is introduced to extract utterance-level features while retaining the correlation information among the frame-level feature sequences output by the Transformer layers. Experimental results on the official Logical Access (LA) dataset of the ASVspoof 2019 challenge show that, compared with current state-of-the-art spoofed speech detection systems, the proposed method reduces the Equal Error Rate (EER) by an average of 12.83% and the tandem Detection Cost Function (t-DCF) by an average of 7.81%.
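To make the overall pipeline concrete, the sketch below shows a minimal PyTorch arrangement of the stages named in the abstract: a CNN front end over the time-frequency input, a Transformer encoder over the resulting frame-level sequence, SeqPool to obtain an utterance-level vector, and a binary classifier. The SE-ResNet18+CA front end, the 2D positional encoding, and the multi-scale self-attention blocks are the paper's own designs and are not reproduced here; a plain convolutional stub and a standard `nn.TransformerEncoder` stand in for them, and all layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SeqPool(nn.Module):
    """Attention-weighted pooling of frame-level features into one utterance-level vector."""
    def __init__(self, d_model: int):
        super().__init__()
        self.score = nn.Linear(d_model, 1)

    def forward(self, x):                            # x: (batch, frames, d_model)
        w = torch.softmax(self.score(x), dim=1)      # per-frame weights, (batch, frames, 1)
        return (w.transpose(1, 2) @ x).squeeze(1)    # weighted sum, (batch, d_model)

class SpoofDetector(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4, n_layers: int = 4):
        super().__init__()
        # Stand-in for the SE-ResNet18 + Coordinate Attention front end:
        # maps a (batch, 1, freq, time) time-frequency input to a frame-level feature map.
        self.frontend = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=2, padding=1), nn.BatchNorm2d(d_model), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),         # collapse the frequency axis
        )
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)   # stand-in for the multi-scale attention blocks
        self.pool = SeqPool(d_model)
        self.classifier = nn.Linear(d_model, 2)      # bona fide vs. spoofed

    def forward(self, spec):                         # spec: (batch, 1, freq, time)
        f = self.frontend(spec).squeeze(2)           # (batch, d_model, frames)
        f = f.transpose(1, 2)                        # (batch, frames, d_model)
        f = self.encoder(f)                          # frame-level sequence after self-attention
        return self.classifier(self.pool(f))         # utterance-level logits

# Example: a batch of 8 log-spectrogram "utterances" with 60 frequency bins and 400 frames.
logits = SpoofDetector()(torch.randn(8, 1, 60, 400))
print(logits.shape)                                  # torch.Size([8, 2])
```

The SeqPool module follows the usual attention-weighted pooling formulation: a learned linear score per frame, softmax over the time axis, and a weighted sum, which keeps the correlation structure of the Transformer's frame-level outputs instead of discarding it with simple mean pooling.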