With the advent and popularity of ChatGPT, the implications and impacts of generative artificial intelligence have rapidly become focal points of attention in both academic and industrial circles. Within this wave of unsupervised deep learning led by large language models, a central issue revolves around training data. The pursuit of scale and quality in training data epitomizes the dictum of "data as king" amid the landscape of the "model war". Behind the values, functions, and misconceptions of training data lie a rewriting of the concept of data, a superstition regarding data affordances, and a struggle over data ownership. The specific structure and internal mechanisms of training data have triggered a reconstruction of the intelligent communication ecosystem and the formation of a new order of information production. This transformation also harbors a digital crisis induced by large language models, manifested in the reproduction of bias under distilled communication, the concretization of information under filtered communication, and the dissipation of meaning under stochastic communication. Both training data and large language models urgently need to dispel the myth of scale and focus instead on how data can be effectively integrated into socio-technical systems.
Keywords: large language model; training data; generative AI; ChatGPT; intelligent communication