A comparative study of machine learning algorithm models for predicting carbon emissions of residential buildings in cold zones
[Objective] Machine learning algorithms provide valuable data support for designing and optimizing low-carbon residential buildings. However, when used directly for carbon emission prediction and analysis, these models often lack proper parameter tuning and optimization. The different impacts of various independent variable datasets on predictive performance also remain to be clarified. In China's cold zones, where residential buildings share similar architectural structures, energy-saving designs, and spatial layouts, carbon emissions primarily come from the operational phase and the production stages of building materials, with heating emissions being a significant component. This study aims to elucidate the effectiveness of different machine learning algorithm models in guiding low-carbon residential design in these cold zones, offering architects criteria for selecting proper algorithms. This study focuses on automatic parameter tuning and optimization for several commonly used algorithms in the context of low-carbon design of buildings, including multiple linear regression, classification and regression tree, random forest, adaptive boosting, gradient boosting regression tree, and multilayer perceptron. The study compares and analyzes the performance limits and applicability of these algorithms and independent variable datasets in predicting carbon emissions during building material production and heating stages. [Methods] This paper elaborates on the target boundaries, parameter ranges, optimization processes, and validation methods for optimizing machine learning algorithm models. Through comprehensive research and simulation analysis of 37 reinforced concrete shear wall residential buildings and their derivative schemes in cold zones, multiple independent variable datasets suitable for establishing predictive models are identified. Cross-validation and grid search techniques are employed to optimize the predictive performance limits of different machine learning algorithms and
independent variable datasets. Subsequently, 120 models for predicting carbon emissions from building materials and 60 models for transforming steady-state heating consumption into dynamic heating consumption are established using the six algorithms mentioned above. [Results] A horizontal comparison of the models reveals that algorithms such as multiple linear regression, random forest, and gradient boosting regression tree exhibit relatively good performance (R² over 0.900) in carbon emission prediction after hyperparameter tuning across different independent variable datasets. Random forest and gradient boosting regression tree models excel in error control and offer similar predictive accuracy to multiple linear regression but lack interpretability. In contrast, multiple linear regression models provide clearer equations and stronger guidance for low-carbon design and optimization, focusing on carbon emission reduction during building material production or winter heating stages. Models based on the total residential building area exhibit optimal performance in predicting building material carbon emissions. Predictive models built on parameters such as the number of above-ground and underground floors, building width and depth, total number of households, number of bedrooms per standard floor, and total number of bathrooms in the residence also demonstrate strong predictive capabilities for building material carbon emissions. For predicting the conversion coefficient during the heating stage, including the number of households and bedrooms per standard floor as independent variables significantly enhances predictive performance. [Conclusions] Although various machine learning models are useful for predicting residential building carbon emissions, the multiple linear regression model stands out owing to its excellent predictive performance and its intuitive representation of how design parameters affect carbon emissions. By utilizing different and appropriate independent variable
datasets, such as the total number of floors, floor height, building dimensions, number of households and bedrooms on a floor, and corrected coefficients for urban meteorological parameters (including outdoor average temperature during the heating season, actual heating days, and roof and wall heat transfer coefficients), or by adopting the finally determined total building area, the multiple linear regression algorithm can deliver timely and multi-faceted guidance. These results are crucial for low-carbon design and optimization during the primary stages of the residential lifecycle in China's cold zones.
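
The combination of grid search and cross-validation described in the Methods can be illustrated with a minimal sketch. It is not the study's implementation: the data are synthetic (37 hypothetical buildings with made-up floor areas and emissions), the model is a one-variable linear regression on total building area, and the tuned hyperparameter is a ridge penalty `lam`, swapped in here only so that there is something to search over. All names (`fit_ridge`, `cv_score`, `LAMBDA_GRID`) are assumptions introduced for illustration.

```python
from statistics import mean

# Hypothetical data: total floor area (in 1000 m^2) and building material
# emissions (tCO2e) for 37 buildings. Synthetic values, not from the study.
areas = [2.0 + 0.4 * i for i in range(37)]
emissions = [450.0 * a + 150.0 + 20.0 * ((7 * i) % 11 - 5)
             for i, a in enumerate(areas)]


def fit_ridge(xs, ys, lam):
    """Fit y = intercept + slope * x with an L2 penalty `lam` on the slope."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / (sxx + lam)  # lam = 0 gives ordinary least squares
    return y_bar - slope * x_bar, slope


def r2_score(ys, preds):
    """Coefficient of determination on a held-out fold."""
    y_bar = mean(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def cv_score(xs, ys, lam, k=5):
    """Mean R^2 over k interleaved folds (sample i belongs to fold i % k)."""
    fold_scores = []
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        test = [i for i in range(len(xs)) if i % k == fold]
        intercept, slope = fit_ridge([xs[i] for i in train],
                                     [ys[i] for i in train], lam)
        preds = [intercept + slope * xs[i] for i in test]
        fold_scores.append(r2_score([ys[i] for i in test], preds))
    return mean(fold_scores)


# Grid search: score every candidate penalty, keep the best mean CV R^2.
LAMBDA_GRID = [0.0, 1.0, 10.0, 100.0, 1000.0]
scores = {lam: cv_score(areas, emissions, lam) for lam in LAMBDA_GRID}
best_lam = max(scores, key=scores.get)
print(f"best lambda = {best_lam}, mean CV R^2 = {scores[best_lam]:.3f}")
```

The same loop structure would apply to tree-based or neural models, with the penalty grid replaced by the relevant hyperparameter ranges; in practice a library routine such as scikit-learn's `GridSearchCV` would likely be used instead of hand-written folds.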