Big Data Research (大数据), 2024, Vol. 10, Issue 1: 1-8. DOI: 10.11959/j.issn.2096-0271.2024016

Four issues to consider in building a computer system supporting large model training

Zheng Weimin (郑纬民)

Author information

  • 1. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China

Abstract

Three types of computer systems support large model training. Among them, systems based on domestic AI chips suffer from a weak software ecosystem; changing this requires developing 10 key pieces of software, such as AI compilers and parallel acceleration. Systems based on supercomputers require careful hardware-software co-design to better serve large model training. On the question of how to build the infrastructure for large models, this article proposes four points of balanced design to ensure system performance, reliability, and scalability.

Keywords

large model training / computer system / supercomputing system / large model infrastructure

Publication year

2024
Big Data Research (大数据)
Posts & Telecom Press (人民邮电出版社)

CSTPCD
ISSN:2096-0271