Sequential Testing Method and Its Application in Big Data
The consistency test of distributions has been widely applied in many fields and is a fundamental topic in statistics. With the advent of the big data era, however, rich data are being collected and stored across scientific domains; these data are characterized by large scale, diverse types, complex structures, and rapid update rates. Owing to data scale and storage constraints, traditional methods for distribution consistency testing face significant challenges in processing and analyzing such data. The divide-and-conquer strategy is currently the primary approach to this problem: within a distributed framework, the computation results from each node's data are integrated to obtain the final result. For large-scale distribution consistency testing, however, integrating the test results from all nodes is inefficient, especially when the data distributions differ substantially across nodes, which often raises the cost of testing. To address this, a distributed sequential testing method is proposed based on the idea of sequential testing; it optimizes the existing divide-and-conquer strategy by appropriately setting the "error region" of the testing problem. The method sequentially compares the test statistic with predetermined thresholds, so that the test level and power can be maintained without using the data from all nodes. Simulation experiments and case studies show that, compared with traditional divide-and-conquer testing methods, the proposed distributed sequential testing method can reach correct testing decisions using data from fewer nodes, thereby improving the computational efficiency of distributed testing and providing methodological support for reducing the high testing costs of large-scale data in fields such as clinical trials and industrial inspection.
Keywords: divide-and-conquer strategy; big data; sequential testing; distributed framework
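The abstract does not specify the exact test statistic or stopping thresholds, so the following is only a minimal sketch of the sequential idea it describes: data are pulled in node by node, a two-sample Kolmogorov-Smirnov statistic is recomputed on the accumulated data, and testing stops as soon as the evidence crosses a threshold in either direction. The function name distributed_sequential_test, the choice of the KS statistic, and both thresholds (alpha and accept_p) are illustrative assumptions; the paper's actual method calibrates the "error region" so that level and power are preserved under early stopping, which this toy stopping rule does not attempt to do.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def distributed_sequential_test(node_pairs, alpha=0.05, accept_p=0.5):
    """Sequentially test H0: F = G over nodes, stopping early once
    the accumulated evidence crosses a stopping threshold.

    node_pairs : list of (x_k, y_k) sample pairs held on node k
    alpha      : p-value threshold below which we stop and reject H0
    accept_p   : p-value threshold above which we stop and accept H0
                 (both thresholds are hypothetical placeholders)
    """
    x_seen = np.empty(0)
    y_seen = np.empty(0)
    for k, (x_k, y_k) in enumerate(node_pairs, start=1):
        # Pull in the k-th node's data and recompute the two-sample
        # KS statistic on everything accumulated so far.
        x_seen = np.concatenate([x_seen, x_k])
        y_seen = np.concatenate([y_seen, y_k])
        stat, pval = stats.ks_2samp(x_seen, y_seen)
        if pval < alpha:        # strong evidence against H0: stop, reject
            return "reject H0", k, pval
        if pval > accept_p:     # evidence comfortably consistent with H0: stop, accept
            return "accept H0", k, pval
    return "accept H0", k, pval  # all nodes consumed without a rejection

# 20 nodes with 500 observations each; the second sample is shifted,
# so H0 is false and the sequential rule should stop early.
nodes = [(rng.normal(0.0, 1.0, 500), rng.normal(0.3, 1.0, 500))
         for _ in range(20)]
print(distributed_sequential_test(nodes))
```

In this example the distributional shift is large enough that the rule usually stops after the first node or two rather than scanning all twenty, which is the early-stopping efficiency gain the abstract contrasts with integrating results from every node.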