Benchmarking text-integrated protein language model embeddings and embedding fusion on diverse downstream tasks

By a News Reporter-Staff News Editor at Robotics & Machine Learning Daily News - According to news reporting based on a preprint abstract, our journalists obtained the following quote sourced from biorxiv.org:

"Protein language models (pLMs) have traditionally been trained in an unsupervised manner using large protein sequence databases with an autoregressive or masked-language modeling training paradigm. Recent methods have attempted to enhance pLMs by integrating additional information, in the form of text, which are referred to as 'text+protein' language models (tpLMs). We evaluate and compare six tpLMs (OntoProtein, ProteinDT, ProtST, ProteinCLIP, ProTrek, and ESM3) against ESM2, a baseline text-free pLM, across six downstream tasks designed to assess the learned protein representations. We find that while tpLMs outperform ESM2 in five out of six benchmarks, no tpLM was consistently the best. Thus, we additionally investigate the potential of embedding fusion, exploring whether combinations of tpLM embeddings can improve performance on the benchmarks by exploiting the strengths of multiple tpLMs. We find that combinations of tpLM embeddings outperform single tpLM embeddings in five out of six benchmarks, highlighting its potential as a useful strategy in the field of machine learning for proteins."
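The embedding fusion described in the abstract can be sketched as simple concatenation of per-protein feature vectors produced by different tpLMs. The sketch below is a minimal illustration, not the authors' implementation: the embedding dimensions and the random placeholder embeddings are assumptions standing in for real model outputs.

```python
import numpy as np

# Placeholder per-protein embeddings from two hypothetical tpLMs.
# The dimensions (512 and 640) are illustrative, not from the paper.
rng = np.random.default_rng(0)
n_proteins = 4
emb_model_a = rng.normal(size=(n_proteins, 512))  # e.g. one tpLM's embeddings
emb_model_b = rng.normal(size=(n_proteins, 640))  # e.g. another tpLM's embeddings

def fuse_embeddings(*embedding_sets):
    """Fuse embeddings by concatenating along the feature axis,
    yielding one combined vector per protein for a downstream model."""
    return np.concatenate(embedding_sets, axis=1)

fused = fuse_embeddings(emb_model_a, emb_model_b)
print(fused.shape)  # (4, 1152)
```

The fused vectors would then feed a downstream predictor (e.g. a linear probe), letting it exploit complementary strengths of the individual models.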

Keywords: Bioinformatics, Biotechnology, Biotechnology - Bioinformatics, Cyborgs, Emerging Technologies, Information Technology, Machine Learning

2024

Robotics & Machine Learning Daily News

ISSN:
Year, Volume (Issue): 2024 (Sep. 5)