A Heterogeneity-Aware Distributed Neural Network Training Model

Xian Lintao (咸琳涛)¹, Liu Xiaolan (刘晓兰)¹, Wang Gan (王淦)¹, Liu Jianming (刘建明)¹

Author information

1. Intelligent Medical Engineering Laboratory, Weifang Medical University, Weifang, Shandong 261053, China

Abstract

To address the slow training speed and low resource utilization of distributed neural network training in heterogeneous environments, a heterogeneity-aware parameter server model for distributed neural network training (H-PS) is proposed. Training tasks are scheduled dynamically according to the current state of each worker so that all workers finish their tasks at the same time, which improves resource utilization. A strategy that overlaps communication with computation is proposed: workers keep computing on the model while parameters are transferred between the parameter server and the workers, which further improves resource utilization. A flexible quantization scheme compresses the neural network model parameters to reduce the communication overhead between the parameter server and the workers. Experiments on a container cluster show that, compared with existing methods, H-PS shortens overall training time by a factor of 1.4 to 3.5.
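The abstract does not give the scheduling or quantization formulas, so the following Python sketch is only an illustration of two of the ideas it names, under stated assumptions. It assumes that a worker's "current state" can be summarized by a measured throughput in samples per second, and that quantization is plain uniform (linear) quantization with a caller-chosen bit width. The function names `allocate_samples`, `quantize`, and `dequantize` are hypothetical and are not taken from H-PS.

```python
from typing import List, Tuple


def allocate_samples(throughputs: List[float], total_samples: int) -> List[int]:
    """Split total_samples across workers in proportion to throughput.

    Assumption (not from the paper): each worker reports samples/second,
    so proportional shares give roughly equal estimated finish times.
    """
    if not throughputs or any(t <= 0 for t in throughputs):
        raise ValueError("throughputs must be non-empty and positive")
    total = sum(throughputs)
    shares, assigned, cumulative = [], 0, 0.0
    for t in throughputs:
        # Round the cumulative target so the shares always sum to total_samples.
        cumulative += t
        target = round(total_samples * cumulative / total)
        shares.append(target - assigned)
        assigned = target
    return shares


def quantize(values: List[float], bits: int) -> Tuple[List[int], float, float]:
    """Uniformly quantize values to integer codes in [0, 2**bits - 1].

    Returns (codes, offset, scale); fewer bits mean a smaller payload
    but a larger reconstruction error.
    """
    if not values or not 1 <= bits <= 16:
        raise ValueError("need at least one value and 1 <= bits <= 16")
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, lo, scale


def dequantize(codes: List[int], offset: float, scale: float) -> List[float]:
    """Reconstruct approximate values from integer codes."""
    return [offset + c * scale for c in codes]


if __name__ == "__main__":
    # A worker twice as fast as the others receives twice the samples.
    print(allocate_samples([2.0, 1.0, 1.0], 400))
    codes, offset, scale = quantize([-1.0, 0.0, 1.0], bits=8)
    print(codes, dequantize(codes, offset, scale))
```

In this sketch a "flexible" scheme would correspond to choosing `bits` per round or per layer; how H-PS actually selects the bit width is not stated in the abstract.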

Keywords

distributed machine learning (DML) / heterogeneous environments / dynamic task scheduling / parallel communication and computation / dynamic parameter quantization / deep neural networks / container cluster

Funding

Weifang Medical University 2023 university-level research project fund (2023YBD005)

Publication year

2024

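Computer Engineering and Design (计算机工程与设计)

Institute 706, Second Academy of China Aerospace Science and Industry Corporation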

Indexed in: CSTPCD, Peking University Core Journals

Impact factor: 0.617

ISSN: 1000-7024
