Performance Comparison and Analysis of Classification Algorithms Based on Spark Platform (基于Spark平台的分类算法性能比较分析)

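Chinese abstract (translated): Against the backdrop of rapid progress in big data and machine learning, three machine learning algorithms, a feedforward artificial neural network, a support vector machine (SVM), and a random forest, are implemented with the MLlib machine learning library on the Spark platform, and their runtime and classification performance on a big data platform are analyzed and evaluated. The experimental results show that the running time of all three algorithms decreases steadily as the number of nodes increases. When the dataset is smaller than 100 MB, the neural network and the SVM achieve higher speedup; when the dataset is larger than 1 GB, the random forest achieves better speedup than the other two algorithms. The neural network has its lowest scalability on the 100 MB dataset, and the SVM has its lowest scalability on the 500 MB dataset. For datasets larger than 1 GB, the random forest shows better sizeup than the other two algorithms. Comparing time efficiency and accuracy across the three classifiers, the SVM takes the least time but has the lowest classification accuracy; the neural network takes the longest time and is less accurate than the random forest; the random forest has the highest classification accuracy but runs longer than the SVM. The ensemble classification algorithm delivers good time performance and classification accuracy on the big data platform.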

English title: Performance Comparison and Analysis of Classification Algorithms Based on Spark Platform

English abstract: In view of the rapid development of big data and machine learning technology, the MLlib machine learning library on the Spark platform is used to implement three machine learning algorithms: a feedforward artificial neural network, a support vector machine (SVM), and a random forest. The runtime and classification performance of the three algorithms on the big data platform are analyzed and evaluated. The experimental results show that, as the number of nodes increases, the time consumed by all three algorithms gradually decreases. When the dataset is smaller than 100 MB, the speedup of the neural network and the SVM is higher; when the dataset is larger than 1 GB, the speedup of the random forest is better than that of the other two algorithms. The neural network has the lowest scalability when the dataset is 100 MB, and the SVM has the lowest scalability when the dataset is 500 MB. The random forest has better sizeup than the other two algorithms when the dataset is larger than 1 GB. Comparing the time efficiency and accuracy of the three classification algorithms, the SVM consumes the least time but has the lowest classification accuracy. The neural network consumes the most time, and its classification accuracy is lower than that of the random forest. The random forest has the highest classification accuracy, but its running time is longer than that of the SVM. The ensemble classification algorithm shows good time performance and classification accuracy on the big data platform.
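The abstract reports speedup, scalability, and sizeup without giving formulas in this record. The sketch below uses the definitions commonly applied when evaluating parallel algorithms; whether the paper uses exactly these forms is an assumption. The function names and all timing values are hypothetical and are not taken from the paper.

```python
# Minimal sketch of three common parallel-performance metrics.
# Assumption: these standard definitions match the paper's usage.

def speedup(t_one_node: float, t_m_nodes: float) -> float:
    """Same dataset: runtime on 1 node divided by runtime on m nodes."""
    return t_one_node / t_m_nodes

def scaleup(t_base: float, t_scaled: float) -> float:
    """Runtime for the base dataset on 1 node divided by the runtime
    for an m-times larger dataset on m nodes (1.0 is ideal)."""
    return t_base / t_scaled

def sizeup(t_base_data: float, t_larger_data: float) -> float:
    """Same cluster: runtime on an n-times larger dataset divided by
    runtime on the base dataset (smaller growth is better)."""
    return t_larger_data / t_base_data

# Hypothetical runtimes in seconds, keyed by node count (illustrative only).
timings = {1: 120.0, 2: 66.0, 4: 40.0}
speedups = {m: speedup(timings[1], t) for m, t in timings.items()}

for m, s in sorted(speedups.items()):
    print(f"{m} node(s): speedup = {s:.2f}")
```

With these made-up numbers, four nodes give a speedup of 3.00, below the ideal value of 4, which is the kind of gap the abstract's comparison across dataset sizes is describing.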

English keywords:

big data; Hadoop framework; Spark framework; machine learning; performance evaluation

Authors:

Zhao Lei (赵蕾), Xia Ji'an (夏吉安), Wu Yang (吴洋), Cui Hui (崔辉)

Author affiliation:

School of Computer and Software, Nanjing Vocational University of Industry Technology, Nanjing 210023

Chinese keywords (translated):

big data; Hadoop framework; Spark framework; machine learning; performance evaluation

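Funding:

China University Industry-University-Research Innovation Fund (2020); Jiangsu Province Industry-University-Research Cooperation Project (2022); Jiangsu Province Industrial Software Engineering Technology Research Project (2020)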

Project numbers:

2020HYB02005; BY2022560; ZK20-04-12

Publication year:

2024

DOI:

10.3969/j.issn.1672-9722.2024.03.009

Computer & Digital Engineering (计算机与数字工程)

709th Research Institute of China Shipbuilding Industry Corporation

CSTPCD

Impact factor: 0.355

ISSN: 1672-9722

Year, volume (issue): 2024, 52(3)

Number of references: 4