IMPROVED TRANSFORMER BASED ON GLOBAL ADAPTIVE WIDTH ATTENTION
The Transformer is widely used in natural language processing, but long input texts force the input to be truncated and cause excessive GPU memory consumption. An existing solution lets the model dynamically determine the attention width of each layer, so that it can attend over the optimal sequence length while keeping computation and memory overhead under control. However, it has the drawback that the per-layer optimal attention widths do not add up to the optimal attention width of the model as a whole. For this reason, we propose global adaptive width attention (GAA): the attention range of each layer is tied to a global range, so that the model reaches a globally optimal attention range, and the feed-forward layer of the model is replaced with a gated-unit feed-forward layer. Experiments on the enwik8 and text8 datasets show that this method achieves better performance than the baseline using only 25% of the training computation cost.
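To make the two ingredients named in the abstract concrete, the following is a minimal PyTorch sketch, not the paper's exact formulation: a soft span mask in the style of adaptive attention span (a learnable width with a linear ramp), a gated-unit feed-forward block (GLU-style; the sigmoid gate is an assumption), and a hypothetical global coupling in which every layer's span is derived from one shared global width parameter. All class and parameter names here are illustrative.

```python
# Sketch only: the coupling between per-layer spans and the global width is a
# hypothetical illustration of the GAA idea, not the paper's published method.

import torch
import torch.nn as nn


def soft_span_mask(distances: torch.Tensor, span: torch.Tensor, ramp: float = 32.0) -> torch.Tensor:
    """Soft mask m(d) = clamp((span + ramp - d) / ramp, 0, 1).

    `distances` holds non-negative query-key distances; `span` is a learnable
    width. Positions within `span` are fully attended, then the mask decays
    linearly to zero over `ramp` positions.
    """
    return torch.clamp((span + ramp - distances) / ramp, min=0.0, max=1.0)


class GatedFeedForward(nn.Module):
    """GLU-style feed-forward block: (x W1 * sigmoid(x V)) W2.

    The abstract only states that the FFN is replaced by a gated-unit variant;
    the specific gating activation used here is an assumption.
    """

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_hidden)
        self.gate = nn.Linear(d_model, d_hidden)
        self.w2 = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(self.w1(x) * torch.sigmoid(self.gate(x)))


class GlobalAdaptiveSpan(nn.Module):
    """Hypothetical global coupling: each layer's span is the product of a
    shared global width and a per-layer scale, so all layers are tied to a
    single global attention range."""

    def __init__(self, n_layers: int, max_span: float = 512.0):
        super().__init__()
        self.global_width = nn.Parameter(torch.tensor(0.5))    # shared across layers
        self.layer_scale = nn.Parameter(torch.ones(n_layers))  # per-layer adjustment
        self.max_span = max_span

    def span_for_layer(self, layer: int) -> torch.Tensor:
        return self.max_span * torch.sigmoid(self.global_width) * torch.sigmoid(self.layer_scale[layer])
```

In adaptive-span training schemes the learned spans are typically regularised (e.g. an L1 penalty added to the loss) so that wider attention is only paid for when it improves the model; a global formulation like the sketch above would apply the same pressure to the shared width parameter.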