Breast cancer is one of the most common cancers. Predicting 5-year survival from patient genomics data is a common task in breast cancer research. To address the noise, heterogeneity, long sequences, and imbalance between positive and negative samples in genomics data from breast cancer patients, this paper proposes MLBSP, a multi-modal learning model for 5-year survival prediction in breast cancer prognosis. The model uses a single-modal module to extract effective information from four data modalities: gene expression data, the cumulative number of gene mutations, single nucleotide variations, and copy number variations. To reduce the impact of the heterogeneity of single-modal data on global features, the multi-modal module uses depthwise separable convolution and a multi-head self-attention mechanism to fuse the data features and capture the global information in patients' multi-modal genome data. Focal Loss is used to address the imbalance between positive and negative samples and to guide the 5-year survival prediction. Experimental results show that the Area Under the Curve (AUC) of the MLBSP model on BRCA Cell, METABRIC, and PanCancer Atlas, which are real data sets from breast cancer patients, reached 91.18%, 71.49%, and 77.37%, respectively. The AUC of the MLBSP model is on average 17.69%, 6.51%, and 10.24% higher, respectively, than the AUCs of XGBoost, random forest, and other mainstream cancer survival prediction methods. Pathway analysis identified several biomarkers, such as SLC8A3 and TP53, further demonstrating the novelty and effectiveness of multi-modal research.
breast cancer; genomics; deep learning; depthwise separable convolution; multi-head self-attention; multi-modal learning
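The abstract names Focal Loss as the remedy for the imbalance between positive and negative samples. As a rough illustration of why it helps, the following is a minimal sketch of the standard binary focal loss, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t). The function name `binary_focal_loss` and the values `gamma=2.0` and `alpha=0.25` are assumptions chosen for illustration (they are commonly used defaults for this loss); the abstract does not state which settings MLBSP uses.

```python
import math


def binary_focal_loss(probs, labels, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss over a batch.

    probs  : predicted probabilities of the positive class, each in [0, 1]
    labels : ground-truth labels, each 0 or 1
    gamma  : focusing parameter (assumed value, not taken from the paper)
    alpha  : weight for the positive class (assumed value, not from the paper)
    """
    total = 0.0
    for p, y in zip(probs, labels):
        # Clip to avoid log(0).
        p = min(max(p, eps), 1.0 - eps)
        # p_t is the probability the model assigns to the true class.
        p_t = p if y == 1 else 1.0 - p
        alpha_t = alpha if y == 1 else 1.0 - alpha
        # (1 - p_t) ** gamma shrinks the loss of well-classified samples,
        # so hard samples dominate the gradient.
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


# A confidently correct positive contributes far less than a misclassified one.
easy = binary_focal_loss([0.9], [1])
hard = binary_focal_loss([0.1], [1])
```

With `gamma=0` and `alpha=0.5` the expression reduces to half the ordinary binary cross-entropy, which makes the focusing term easy to isolate when comparing the two losses.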