Time Series Modeling of Distribution-valued Data Based on Centered Log-ratio Transformation
With advances in information technology and the arrival of the digital era, data acquisition has become much easier, and data sets with very large numbers of observations are emerging in many fields of natural and social science. Symbolic data analysis is an efficient tool for handling such large-scale data sets. This paper studies a common type of symbolic data, distribution-valued data, also known in symbolic data analysis as numerical modal data. Each observation is characterized by a probability distribution, which makes this type particularly suitable for mining information from massive observations; it includes interval-valued data, histogram-valued data, and general distribution-valued data. In recent years, distribution-valued data analysis has produced many notable results, and both the theory and the practical application of its statistical methods have received extensive attention. However, because effective representations and reasonable algebraic operations are lacking, existing methods are often subject to constraints and may introduce analytical errors in calculation, which makes statistical modeling difficult. To address the fact that distribution-valued data are not closed under linear operations, the centered log-ratio (clr) transformation is applied to the representation and modeling of distribution-valued data. The clr transformation maps a probability density function to a general function, so that addition, subtraction, and multiplication in the function space can be used. The rules of calculation in the transformed function space and the sample statistics of a distribution-valued time series are defined, and the rationale for these definitions is explained. Because the numerical characteristics of variables play an important role in identifying and estimating time series models, and in order to extend classic time series models under the Box-Jenkins framework to distribution-valued data, the numerical characteristics of distribution-valued data are first defined through linear operations and inner products of functions. Based on these definitions, Distributional-AR, Distributional-MA, and Distributional-ARMA models are proposed for distribution-valued time series, and a modeling process is provided that covers model specification, parameter estimation, and model diagnostics. The proposed method is referred to as the clr-DTS method. A synthetic distribution-valued time series is constructed to demonstrate the modeling process of the clr-DTS method, and the effectiveness of its parameter estimation is illustrated through Monte Carlo experiments. Finally, the clr-DTS method is applied to model and forecast air quality index (AQI) monitoring data from Beijing, and it is compared with two existing methods in terms of model fit and out-of-sample prediction. The results show that the proposed method achieves better model fit, higher accuracy, and more stable predictions.
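The clr idea summarized above can be illustrated with a minimal numerical sketch. This is not the paper's implementation: it assumes densities that are strictly positive and sampled on a common bounded grid, it approximates integrals with the trapezoidal rule, and the names `integrate`, `clr`, `clr_inverse`, and `gaussian_density` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np

def integrate(values, grid):
    """Trapezoidal rule on the grid (an illustrative choice, not from the paper)."""
    return float(np.sum(0.5 * (values[1:] + values[:-1]) * np.diff(grid)))

def clr(density, grid):
    """Centered log-ratio: log-density minus its mean over the support."""
    log_f = np.log(density)  # assumes density > 0 everywhere on the grid
    mean_log = integrate(log_f, grid) / (grid[-1] - grid[0])
    return log_f - mean_log

def clr_inverse(g, grid):
    """Map a function back to a density: exponentiate, then renormalize."""
    f = np.exp(g)
    return f / integrate(f, grid)

def gaussian_density(grid, mean, sd):
    """Hypothetical example density, renormalized on the bounded grid."""
    f = np.exp(-0.5 * ((grid - mean) / sd) ** 2)
    return f / integrate(f, grid)

grid = np.linspace(-5.0, 5.0, 501)
f1 = gaussian_density(grid, -1.0, 1.0)
f2 = gaussian_density(grid, 1.5, 0.8)

# Ordinary linear operations are applied to the clr-transformed functions ...
g_mix = 0.3 * clr(f1, grid) + 0.7 * clr(f2, grid)
# ... and the result is mapped back to a density.
f_mix = clr_inverse(g_mix, grid)

print(abs(integrate(g_mix, grid)) < 1e-9)                 # clr functions integrate to zero
print(np.isclose(integrate(f_mix, grid), 1.0))            # back-transformed function has unit mass
print(np.allclose(clr_inverse(clr(f1, grid), grid), f1))  # round trip recovers f1
```

The point of the sketch is that a linear combination computed on the transformed functions still corresponds to a valid density after back-transformation, whereas the same weighted combination applied directly to densities with arbitrary real coefficients need not be nonnegative or integrate to one.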
symbolic data; distribution-valued data; centered log-ratio transformation; time series; Bayesian space