
Improved Transformer Based on Global Adaptive Width Attention

Abstract: Transformer is widely used in natural language processing, but overly long input text causes the input information to be truncated and the GPU memory footprint to grow too large. An existing remedy lets the model dynamically determine the attention width of each layer, which attends over a near-optimal sequence length while keeping computation and memory overhead under control; its drawback is that the per-layer optimal attention width does not reach the optimal attention width for the model as a whole. To address this, a global adaptive width attention (GAA) model is proposed: the attention range of each layer is associated with the global range, so that the model achieves a globally optimal attention range, and the feed-forward layer is replaced with a feed-forward layer with gated linear units (FFNGLU). Experiments on the enwiki8 and text-8 datasets show that the method achieves better performance than the baseline using only 25% of the training computation cost.
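The abstract does not give the exact formulation, so the sketch below is only an interpretation of the two components it names: the soft attention-span mask of Sukhbaatar et al. (2019), here with a single learnable span shared across all layers to stand in for a "global" attention width, and the GLU-gated feed-forward layer (FFNGLU) described by Shazeer (2020). The class names, shapes, and the shared-span assumption are illustrative and are not the authors' code.

# Minimal PyTorch sketch, assuming a shared (global) learnable span and a
# GLU-gated feed-forward layer; not the paper's actual implementation.
import torch
import torch.nn as nn


class GlobalSpanMask(nn.Module):
    """Soft span mask m(d) = clamp((R + z - d) / R, 0, 1) over key distance d.

    `z` is a learnable span length; sharing one `z` across every layer is an
    assumption used here to illustrate a globally adapted attention width.
    """

    def __init__(self, max_span: int, ramp: int = 32):
        super().__init__()
        self.ramp = ramp
        self.z = nn.Parameter(torch.tensor(float(max_span) / 2))

    def forward(self, attn_scores: torch.Tensor) -> torch.Tensor:
        # attn_scores: (batch, heads, key_len) for a single query step;
        # keys are assumed ordered from oldest to most recent.
        key_len = attn_scores.size(-1)
        distance = torch.arange(key_len - 1, -1, -1, device=attn_scores.device)
        mask = torch.clamp((self.ramp + self.z - distance) / self.ramp, 0.0, 1.0)
        masked = attn_scores.softmax(dim=-1) * mask
        # Renormalize so the masked weights still sum to one.
        return masked / masked.sum(dim=-1, keepdim=True).clamp_min(1e-8)


class FFNGLU(nn.Module):
    """Feed-forward layer with a gated linear unit: (sigmoid(xW) * xV) W2."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w = nn.Linear(d_model, d_ff)
        self.v = nn.Linear(d_model, d_ff)
        self.w2 = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(torch.sigmoid(self.w(x)) * self.v(x))


# Usage sketch with illustrative sizes.
scores = torch.randn(2, 4, 128)                    # (batch, heads, key_len)
probs = GlobalSpanMask(max_span=128)(scores)       # masked attention weights
y = FFNGLU(d_model=64, d_ff=256)(torch.randn(2, 10, 64))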

Keywords: Transformer; Global adaptive width attention; FFNGLU

曾庆威、张建、张鸿昌、谭雨阳、沈文枫


Shanghai University, Shanghai 210000


Funding: Shanghai Engineering Research Center of Intelligent Computing System (19DZ2252600); National Key Research and Development Program of China (2017YFB0701600); Science and Technology Commission of Shanghai Municipality (19511121002)

2024

Computer Applications and Software
Shanghai Institute of Computing Technology; Shanghai Development Center of Computer Software Technology


Indexed in: CSTPCD; Peking University Core Journals (北大核心)
Impact factor: 0.615
ISSN: 1000-386X
Year, Volume (Issue): 2024, 41(7)