Spark parallel K-means large-scale remote sensing image segmentation based on optimized RDD partitioning
Segmenting large-scale remote sensing images on a single computer is a great challenge. The Spark platform makes it possible to build a distributed computing environment for big data processing on a single machine. When the K-means algorithm built into the Spark platform is used for digital image processing, the Spark shuffle resilient distributed dataset (RDD) partitioning generally adopts the default setting. Although this default is simple and convenient, it easily causes the phenomenon of "too many partitions, too little data per partition" in large-scale image segmentation tasks, which greatly reduces segmentation speed. Therefore, this paper applies the built-in K-means algorithm to segment WorldView-3 images covering part of Shanghai, setting the RDD partition parameter spark.sql.shuffle.partitions appropriately during the cluster-center initialization stage and adaptively calling the coalesce() operator to adjust the number of RDD partitions during iteration. Comparison with the serial K-means algorithm verifies the feasibility and effectiveness of processing big data on a single computer. Compared with the Spark parallel K-means algorithm under the default parameter settings, the proposed algorithm completes large-scale image segmentation faster. The experimental results show that, when the number of RDD partitions is set to 1-10 times the number of CPU cores in both the initialization and iterative-computation stages of the K-means cluster centers, the total runtime drops from 145 s before optimization to 97 s. In particular, for the cluster-center initialization stage, the time efficiency after optimization is 500-1000 times that before optimization.
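The partitioning rule the abstract reports (keep the RDD partition count within 1-10 times the CPU core count, rather than Spark's default of 200 shuffle partitions) can be sketched as a small helper. This is an illustrative sketch, not the paper's implementation; the function name `tuned_partition_count` and the 4x default factor are assumptions.

```python
def tuned_partition_count(current_parts: int, cores: int, factor: int = 4) -> int:
    """Clamp an RDD partition count into the 1x-10x core-count band
    that the experiments found effective; `factor` (4x here) is an
    illustrative choice within that band, not a value from the paper."""
    if current_parts > cores * 10 or current_parts < cores:
        return cores * factor  # out of band: re-partition to factor * cores
    return current_parts       # already within 1x-10x the core count

# In Spark, the target would be applied in two places, as the paper describes:
#   - at initialization, via the config key "spark.sql.shuffle.partitions";
#   - during iteration, via rdd.coalesce(target), which (unlike repartition)
#     can shrink the partition count without a full shuffle.
```

For example, with 8 cores, the Spark default of 200 shuffle partitions falls outside the 8-80 band and would be reduced, while a count already inside the band is left untouched.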
Spark platform; single-computer big data processing; large-scale remote sensing images; RDD optimization; image segmentation; parallel K-means algorithm