Spark parallel K-means large-scale remote sensing image segmentation based on optimized RDD partitioning
Segmenting large-scale remote sensing images on a single computer is a great challenge. The Spark platform makes it possible to build a distributed computing environment for big data processing on a single machine. When the K-means algorithm built into the Spark platform is used for digital image processing, the Spark shuffle resilient distributed dataset (RDD) partitioning generally adopts the default setting. Although this default is simple and convenient, it easily causes the phenomenon of "too many partitions, too little data per partition" in large-scale image segmentation tasks, which greatly reduces segmentation speed. Therefore, this paper applies the built-in K-means algorithm to segment WorldView-3 images covering part of Shanghai, setting the RDD partition parameter spark.sql.shuffle.partitions appropriately during the cluster-center initialization stage and adaptively calling the coalesce() operator to adjust the number of RDD partitions during iteration. Comparison with the serial K-means algorithm verifies the feasibility and effectiveness of processing big data on a single computer. Compared with the Spark parallel K-means algorithm under the default parameter settings, the proposed algorithm completes large-scale image segmentation faster. The experimental results show that, when the number of RDD partitions is set to 1-10 times the number of CPU cores in both the initialization and iterative-computation stages of the K-means cluster centers, the total runtime drops from 145 s before optimization to 97 s. In particular, for the cluster-center initialization stage, the time efficiency after optimization is 500-1000 times that before optimization.
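The partitioning rule the abstract reports (keep the RDD partition count within 1-10 times the CPU core count, rather than Spark's default of 200 shuffle partitions) can be sketched as a small helper. This is an illustrative sketch, not the paper's implementation; the function name `tuned_partition_count` and the 4x default factor are assumptions.

```python
def tuned_partition_count(current_parts: int, cores: int, factor: int = 4) -> int:
    """Clamp an RDD partition count into the 1x-10x core-count band
    that the experiments found effective; `factor` (4x here) is an
    illustrative choice within that band, not a value from the paper."""
    if current_parts > cores * 10 or current_parts < cores:
        return cores * factor  # out of band: re-partition to factor * cores
    return current_parts       # already within 1x-10x the core count

# In Spark, the target would be applied in two places, as the paper describes:
#   - at initialization, via the config key "spark.sql.shuffle.partitions";
#   - during iteration, via rdd.coalesce(target), which (unlike repartition)
#     can shrink the partition count without a full shuffle.
```

For example, with 8 cores, the Spark default of 200 shuffle partitions falls outside the 8-80 band and would be reduced, while a count already inside the band is left untouched.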
Spark platform; single-computer big data processing; large-scale remote sensing images; RDD optimization; image segmentation; parallel K-means algorithm