Coagulant dosing prediction based on K-Means clustering and ensemble learning with small-sample data
A polyaluminum chloride (PAC) dosage prediction method based on K-Means clustering and ensemble learning was proposed to address the small-sample problem in coagulant dosage prediction. First, raw water quality was divided into three categories by K-Means clustering on raw water turbidity and water temperature, and the training and test sets were then drawn from the data by stratified sampling. Second, a PAC dosage ensemble prediction model (KM-Bagging) was constructed using the Bagging ensemble learning algorithm. The model comprised seven learners: Support Vector Machine, Random Forest, AdaBoost, Gradient Boosting Decision Tree, CatBoost, XGBoost, and LightGBM. The method was validated using operational data from a water supply plant in Yinchuan City from 2021 to 2022. The results showed that the KM-Bagging model achieved high prediction accuracy on small samples, with R² exceeding 0.8 and MAPE below 5%. When 6 or 9 months of daily monitoring data were used to predict PAC dosing, the model was suitable for cases where the monitoring period was short and high accuracy was not required; the predictions can serve as a reference for adjusting the PAC dosage when raw water quality changes suddenly. When one year of daily monitoring data was used, the prediction accuracy met the requirements of engineering applications and could provide auxiliary guidance for actual PAC dosing in water treatment plants. These results provide a reference for modeling coagulant dosage prediction with small-sample data.
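The data preparation and evaluation steps described above (three-way K-Means clustering on turbidity and temperature, stratified sampling by cluster, and scoring with R² and MAPE) can be illustrated with a minimal standard-library sketch. It is not the authors' implementation: the synthetic data, the feature standardization, the 20% test fraction, and all function names are assumptions made for illustration, and the seven-learner Bagging model itself is not reproduced here.

```python
import random

def standardize(rows):
    """Scale each column to zero mean, unit variance (assumed; the source does not say)."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [(sum((v - m) ** 2 for v in c) / len(c)) ** 0.5 or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]

def kmeans(points, k=3, iters=100, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels

def stratified_split(labels, test_frac=0.2, seed=0):
    """Draw the same fraction of test samples from every cluster."""
    rng = random.Random(seed)
    train, test = [], []
    for lab in sorted(set(labels)):
        idx = [i for i, l in enumerate(labels) if l == lab]
        rng.shuffle(idx)
        n_test = max(1, round(len(idx) * test_frac))
        test += idx[:n_test]
        train += idx[n_test:]
    return sorted(train), sorted(test)

def r2(y_true, y_pred):
    """Coefficient of determination."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

# Synthetic (turbidity in NTU, temperature in deg C) pairs; NOT the plant data.
data_rng = random.Random(42)
raw = [[data_rng.gauss(mu_turb, 2.0), data_rng.gauss(mu_temp, 1.5)]
       for mu_turb, mu_temp in [(5, 4), (20, 15), (60, 25)]
       for _ in range(40)]

labels = kmeans(standardize(raw), k=3)
train_idx, test_idx = stratified_split(labels, test_frac=0.2)
```

Splitting by cluster label keeps every water quality category represented in both sets, which matters most when the sample is small and a purely random split could leave a category out of the test set.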