To address the low recognition accuracy and complex training of existing speech recognition modules in complex environments, this paper improves the acoustic model for speech recognition by combining deep feedforward sequential memory networks (DFSMN) with end-to-end connectionist temporal classification (CTC). In addition, to address the poor representation ability of existing acoustic features in deep neural networks, this paper builds on the log Mel filter bank (Fbank) feature extraction method and applies a convolutional neural network (CNN) to perform a second stage of acoustic feature extraction. On the THCHS-30 data set, the character error rate (CER) of the improved CNN-DFSMN-CTC model on the test set is 6.83% and 7.96% lower than that of the CNN model and the LSTM model, respectively.
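The CNN-based secondary feature extraction over Fbank features can be sketched as follows. This is a minimal illustration, not the paper's exact architecture: the kernel size, input shape, and activation are illustrative assumptions, and the convolution is written naively in NumPy for clarity.

```python
import numpy as np

def conv2d_valid(fbank, kernel):
    """Naive 'valid' 2D cross-correlation over a time-frequency feature map."""
    T, F = fbank.shape
    kt, kf = kernel.shape
    out = np.zeros((T - kt + 1, F - kf + 1))
    for t in range(out.shape[0]):
        for f in range(out.shape[1]):
            out[t, f] = np.sum(fbank[t:t + kt, f:f + kf] * kernel)
    return out

# Illustrative input: 100 frames x 40 Mel bins of log filter-bank energies.
rng = np.random.default_rng(0)
fbank = rng.standard_normal((100, 40))
kernel = rng.standard_normal((3, 3))  # assumed 3x3 learnable kernel

# One convolution + ReLU, standing in for the CNN front end that feeds
# the refined features to the DFSMN-CTC acoustic model.
features = np.maximum(conv2d_valid(fbank, kernel), 0.0)
print(features.shape)  # (98, 38)
```

In a full model the kernel weights would be learned jointly with the DFSMN-CTC back end rather than fixed, and multiple channels and layers would be stacked; a single hand-rolled convolution is shown only to make the "extract the acoustic features twice" idea concrete.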