Sequential Testing Method and Its Application in Big Data
The consistency test of distributions has been widely applied in many fields and is a fundamental topic in statistics. With the advent of the big data era, however, rich data are being collected and stored across scientific domains; these data are characterized by large scale, diverse types, complex structures, and rapid update rates. Owing to data scale and storage constraints, traditional methods for distribution consistency testing face significant challenges in processing and analyzing such data. The divide-and-conquer strategy is currently the primary approach to this problem: within a distributed framework, the computation results from each node's data are integrated to obtain the final result. For large-scale distribution consistency testing, however, integrating the test results from all nodes is inefficient, especially when the data distributions differ substantially across nodes, which often raises the cost of testing. To address this, a distributed sequential testing method is proposed based on the idea of sequential testing; it optimizes the existing divide-and-conquer strategy by appropriately setting the "error region" of the testing problem. The method sequentially compares the test statistic with predetermined thresholds, so that the test level and power can be maintained without using the data from all nodes. Simulation experiments and case studies show that, compared with traditional divide-and-conquer testing methods, the proposed distributed sequential testing method can reach correct testing decisions using data from fewer nodes, thereby improving the computational efficiency of distributed testing and providing methodological support for reducing the high testing costs of large-scale data in fields such as clinical trials and industrial inspection.
Keywords: divide-and-conquer strategy; big data; sequential testing; distributed framework
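The abstract does not specify the exact test statistic or stopping thresholds, so the following is only a minimal sketch of the sequential idea it describes: data are pulled in node by node, a two-sample Kolmogorov-Smirnov statistic is recomputed on the accumulated data, and testing stops as soon as the evidence crosses a threshold in either direction. The function name distributed_sequential_test, the choice of the KS statistic, and both thresholds (alpha and accept_p) are illustrative assumptions; the paper's actual method calibrates the "error region" so that level and power are preserved under early stopping, which this toy stopping rule does not attempt to do.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def distributed_sequential_test(node_pairs, alpha=0.05, accept_p=0.5):
    """Sequentially test H0: F = G over nodes, stopping early once
    the accumulated evidence crosses a stopping threshold.

    node_pairs : list of (x_k, y_k) sample pairs held on node k
    alpha      : p-value threshold below which we stop and reject H0
    accept_p   : p-value threshold above which we stop and accept H0
                 (both thresholds are hypothetical placeholders)
    """
    x_seen = np.empty(0)
    y_seen = np.empty(0)
    for k, (x_k, y_k) in enumerate(node_pairs, start=1):
        # Pull in the k-th node's data and recompute the two-sample
        # KS statistic on everything accumulated so far.
        x_seen = np.concatenate([x_seen, x_k])
        y_seen = np.concatenate([y_seen, y_k])
        stat, pval = stats.ks_2samp(x_seen, y_seen)
        if pval < alpha:        # strong evidence against H0: stop, reject
            return "reject H0", k, pval
        if pval > accept_p:     # evidence comfortably consistent with H0: stop, accept
            return "accept H0", k, pval
    return "accept H0", k, pval  # all nodes consumed without a rejection

# 20 nodes with 500 observations each; the second sample is shifted,
# so H0 is false and the sequential rule should stop early.
nodes = [(rng.normal(0.0, 1.0, 500), rng.normal(0.3, 1.0, 500))
         for _ in range(20)]
print(distributed_sequential_test(nodes))
```

In this example the distributional shift is large enough that the rule usually stops after the first node or two rather than scanning all twenty, which is the early-stopping efficiency gain the abstract contrasts with integrating results from every node.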