Research on Apache Spark for Big Data Processing
Li Wenyang (黎文阳)1
Author information
- 1. College of Computer Science, Sichuan University, Chengdu 610065
Abstract
Apache Spark is a currently popular model for large-scale data processing that is fast, general-purpose, and easy to use. Spark is an in-memory computing framework proposed to address the inefficiency of MapReduce on iterative machine learning algorithms and interactive data mining. It retains the scalability, fault tolerance, and compatibility of MapReduce while making up for its shortcomings on these workloads. Because it uses memory-based cluster computing, Spark can run such applications up to 100x faster than Hadoop MapReduce. This paper introduces the basic concepts, components, and deployment modes of Spark, analyzes its core abstraction and programming model, and gives related programming examples.
Key words
Spark/Hadoop/MapReduce/Big Data/Data Analysis
Publication year
2015