Research on Data Parallel Processing Technology for Data Lake Access Performance Optimization

Original title: 面向数据湖存取性能优化的数据并行处理技术研究


Abstract: This paper addresses the need for high-performance access to massive data in data lake applications, targeting a new data storage model together with a distributed storage and caching mechanism. Based on an analysis of the storage structure, data access patterns, and data processing methods of data lakes, it studies how to optimize data lake access performance. First, building on the file system storage used in data lake systems, a columnar data storage structure is designed, and index optimization is applied to speed up data access. Second, for the two access modes commonly used in data lakes, batch processing and stream processing, an access optimization scheme based on data partitioning and caching is proposed to improve the efficiency and stability of data access. Finally, to address the long computation times of updates and incremental computation on large-scale data, a data processing scheme based on Spark parallel computing and the Hadoop Distributed File System (HDFS) is proposed to improve the speed and reliability of data processing. Experimental results show that, compared with existing methods, the proposed techniques effectively improve the efficiency of data storage, access, and processing.

Keywords:

data lake; access performance; data partitioning; parallel computing; index optimization

Authors:

Zhao Zhuofeng (赵卓峰), Chen Yuan (陈元), Mei Yusheng (梅宇生)


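Author affiliations:

School of Information Science and Technology, North China University of Technology, Beijing 100144, China

Beijing Key Laboratory on Integration and Analysis of Large-scale Stream Data, Beijing 100144, China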


Publication year:

2024

Journal of North China University of Technology

North China University of Technology


Impact factor: 0.368

ISSN: 1001-5477

Year, volume (issue): 2024, 36(3)