Abstract: Statistical foundations are without question at the core of modern innovation. In today's economy, a common phrase is "data is the new gold." Certainly, we live in an age where data is large, ubiquitous, and comes in many forms. The contributions of the statistical sciences go beyond data. We are emerging from a pandemic in which statisticians around the globe saved lives by contributing critical understanding to vaccines, treatments, pandemic policies, and management. The contributions of statistical foundations are universal: from self-driving cars to Mars rovers, to sustainable and improved infrastructure, to clean energy and environmental stewardship, to financial markets and investing, to advances in medicine and medical practice, and toward a better understanding of the communities in which we live, work, learn, and play. This article summarizes my 2022 Presidential speech given at the Joint Statistical Meetings in Washington, DC. I highlight these important contributions and the innovations they made possible, and I look to new opportunities on the horizon. Further, I reflect on steps the ASA has taken as an organization: advances in data science and AI, the establishment of the ASA Leadership Institute, and a specific focus on community analytics.
Abstract: Molecular sequence variation at a locus is informative about the evolutionary history of the sample and about past population size dynamics. The Kingman coalescent is used in a generative model of molecular sequence variation to infer evolutionary parameters. However, it is well understood that inference under this model does not scale well with sample size. Here, we build on recent work based on a lower-resolution coalescent process, the Tajima coalescent, to model longitudinal samples. While the Kingman coalescent models the ancestry of labeled individuals, we model the ancestry of individuals labeled by their sampling time. We propose a new inference scheme for the reconstruction of effective population size trajectories based on this model and the infinite-sites mutation model. Modeling of longitudinal samples is necessary for applications (e.g., ancient DNA and RNA from rapidly evolving pathogens like viruses) and statistically desirable (variance reduction and parameter identifiability). We propose an efficient algorithm to calculate the likelihood and employ a Bayesian nonparametric procedure to infer the population size trajectory. We provide a new MCMC sampler to explore the space of heterochronous Tajima's genealogies and model parameters. We compare our procedure with state-of-the-art methodologies in simulations and in an application to ancient bison DNA sequences. Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.
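As a rough illustration of the generative process underlying such models, the sketch below simulates a heterochronous Kingman coalescent (the standard process the Tajima coalescent coarsens) under a constant effective population size Ne. The function name and parameterization are illustrative assumptions; the paper's Tajima-coalescent likelihood and MCMC are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def sim_heterochronous_coalescent(sample_times, n_per_time, Ne=1.0):
    """Simulate coalescent event times (pastward) for serially sampled
    lineages under a constant-Ne Kingman coalescent. Lineages enter the
    process at their sampling times."""
    events = []
    t, k, i = 0.0, 0, 0
    pending = sorted(zip(sample_times, n_per_time))
    while i < len(pending) or k > 1:
        # with k extant lineages, the next coalescence has rate k(k-1)/(2 Ne)
        rate = k * (k - 1) / (2.0 * Ne)
        wait = rng.exponential(1.0 / rate) if rate > 0 else np.inf
        if i < len(pending) and t + wait > pending[i][0]:
            # a sampling time arrives first; by memorylessness we may
            # discard the drawn waiting time and restart from there
            t, n_new = pending[i]
            k += n_new
            i += 1
        else:
            t += wait
            k -= 1
            events.append(t)
    return np.array(events)

# Two sampling batches: 5 sequences at time 0, 3 more at time 0.5 (pastward).
times = sim_heterochronous_coalescent([0.0, 0.5], [5, 3], Ne=1.0)
print(times)  # 7 coalescent events reduce 8 lineages to their MRCA
```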
Jingyi Jessica Li, Heather J. Zhou, Peter J. Bickel, Xin Tong, et al.
pp. 2450-2463
Abstract: Motivated by the pressing need to dissect heterogeneous relationships in gene expression data, we generalize the squared Pearson correlation to capture a mixture of linear dependences between two real-valued variables, with or without an index variable that specifies the line memberships. We construct the generalized Pearson correlation squares by focusing on three aspects: variable exchangeability, no parametric model assumptions, and inference of population-level parameters. To compute the generalized Pearson correlation square from a sample without a line-membership specification, we develop a K-lines clustering algorithm to find K clusters that exhibit distinct linear dependences, where K can be chosen in a data-adaptive way. To infer the population-level generalized Pearson correlation squares, we derive the asymptotic distributions of the sample-level statistics to enable efficient statistical inference. Simulation studies verify the theoretical results and show the power advantage of the generalized Pearson correlation squares in capturing mixtures of linear dependences. Gene expression data analyses demonstrate the effectiveness of the generalized Pearson correlation squares and the K-lines clustering algorithm in dissecting complex but interpretable relationships. The estimation and inference procedures are implemented in the R package gR2. Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.
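The K-lines algorithm is analogous to K-means with line-shaped centers. The sketch below is a minimal Lloyd-style variant of that idea (my own construction, not the gR2 implementation): lines are fit by total least squares so the two variables remain exchangeable, and points are reassigned by perpendicular distance.

```python
import numpy as np

def fit_line(pts):
    """Total-least-squares line through pts: returns (center, unit direction)."""
    c = pts.mean(axis=0)
    # first right singular vector = direction minimizing perpendicular error
    _, _, vt = np.linalg.svd(pts - c, full_matrices=False)
    return c, vt[0]

def k_lines(X, K, n_iter=50, seed=0):
    """Lloyd-style alternation: assign each point to its nearest line,
    then refit each line to its assigned points."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, K, size=len(X))
    lines = []
    for _ in range(n_iter):
        lines = []
        for k in range(K):
            pts = X[labels == k]
            if len(pts) < 2:  # guard against (near-)empty clusters
                pts = X[rng.choice(len(X), size=2, replace=False)]
            lines.append(fit_line(pts))
        # perpendicular distance of every point to every line
        d = np.stack([
            np.linalg.norm((X - c) - np.outer((X - c) @ v, v), axis=1)
            for c, v in lines
        ], axis=1)
        new = d.argmin(axis=1)
        if np.array_equal(new, labels):
            break
        labels = new
    return labels, lines

# Toy usage: a mixture of two noisy lines.
rng = np.random.default_rng(1)
t = rng.uniform(-1, 1, size=200)
X = np.vstack([
    np.column_stack([t[:100], 2 * t[:100]]),      # points near y = 2x
    np.column_stack([t[100:], -t[100:] + 0.5]),   # points near y = -x + 0.5
]) + 0.05 * rng.standard_normal((200, 2))
labels, lines = k_lines(X, K=2)
```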
Abstract: Emerging single-cell technologies that simultaneously capture long-range interactions of genomic loci together with their DNA methylation levels are advancing our understanding of three-dimensional genome structure and its interplay with the epigenome at the single-cell level. While methods to analyze data from single-cell high-throughput chromatin conformation capture (scHi-C) experiments are maturing, methods that can jointly analyze multiple single-cell modalities with scHi-C data are lacking. Here, we introduce Muscle, a semi-nonnegative joint decomposition of Multiple single-cell tensors, to jointly analyze 3D conformation and DNA methylation data at the single-cell level. Muscle takes advantage of the inherent tensor structure of the scHi-C data and integrates this modality with DNA methylation. We develop an alternating least squares algorithm for estimating Muscle parameters and establish its optimality properties. Parameters estimated by Muscle directly align with the key components of the downstream analysis of scHi-C data in a cell-type-specific manner. Evaluations with data-driven experiments and simulations demonstrate the advantages of the joint modeling framework of Muscle over single-modality modeling and a baseline multi-modality model for cell type delineation and for elucidating associations between modalities. Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.
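For intuition about the alternating least squares (ALS) estimation mentioned above, here is a generic semi-nonnegative CP decomposition sketch for a single three-way tensor. This is an assumption-laden simplification: Muscle's joint decomposition across modalities and its optimality results are not reproduced, and projecting one factor to be nonnegative is only a crude stand-in for proper semi-nonnegative updates.

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Kronecker product; row index runs over (row of U, row of V)."""
    r = U.shape[1]
    return (U[:, None, :] * V[None, :, :]).reshape(-1, r)

def semi_nn_cp_als(T, R, n_iter=100, seed=0):
    """Rank-R CP model T[i,j,k] ~ sum_r A[i,r] B[j,r] C[k,r],
    with A (e.g., a cell factor) projected to be nonnegative."""
    I, J, K = T.shape
    rng = np.random.default_rng(seed)
    A = np.abs(rng.standard_normal((I, R)))
    B = rng.standard_normal((J, R))
    C = rng.standard_normal((K, R))
    T1 = T.reshape(I, J * K)                      # mode-1 unfolding
    T2 = T.transpose(1, 0, 2).reshape(J, I * K)   # mode-2 unfolding
    T3 = T.transpose(2, 0, 1).reshape(K, I * J)   # mode-3 unfolding
    for _ in range(n_iter):
        A = T1 @ khatri_rao(B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        A = np.maximum(A, 0)                      # semi-nonnegativity on A only
        B = T2 @ khatri_rao(A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = T3 @ khatri_rao(A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C
```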
Abstract: In many clinical settings, an active-controlled trial design (e.g., a non-inferiority or superiority design) is often used to compare an experimental medicine to an active control (e.g., an FDA-approved, standard therapy). One prominent example is a recent phase 3 efficacy trial, HIV Prevention Trials Network Study 084 (HPTN 084), comparing long-acting cabotegravir, a new HIV pre-exposure prophylaxis (PrEP) agent, to the FDA-approved daily oral tenofovir disoproxil fumarate plus emtricitabine (TDF/FTC) in a population of heterosexual women in 7 African countries. One key complication in interpreting the results of an active-controlled trial like HPTN 084 is that no placebo arm is present, so the efficacy of the active control (and hence of the experimental drug) relative to placebo can only be inferred by leveraging other data sources. In this article, we study statistical inference for the intention-to-treat (ITT) effect of the active control using data from relevant historical placebo-controlled trials under the potential outcomes (PO) framework. We highlight the role of adherence and unmeasured confounding, discuss in detail identification assumptions and two modes of inference (point vs. partial identification), propose estimators under identification assumptions permitting point identification, and lay out the sensitivity analyses needed to relax those assumptions. We apply our framework to estimate the ITT effect of daily oral TDF/FTC versus placebo in HPTN 084 using data from an earlier phase 3, placebo-controlled trial of daily oral TDF/FTC (Partners PrEP). Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.
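A schematic of the bridging logic only, on one simple effect scale: under a strong constancy assumption, the experimental-versus-placebo rate ratio factors into the active-controlled comparison times the historical active-versus-placebo comparison. The counts below are hypothetical, and none of the paper's adherence, confounding, or sensitivity machinery appears here.

```python
# Hypothetical (events, person-years) pairs, for illustration only.
hist = {"active": (12, 1000.0), "placebo": (40, 1000.0)}   # historical trial
curr = {"exp": (4, 1000.0), "active": (15, 1000.0)}        # active-controlled trial

def rate(events, py):
    return events / py

# Active control vs placebo, from the historical placebo-controlled trial.
rr_active_placebo = rate(*hist["active"]) / rate(*hist["placebo"])

# Experimental vs active control, from the active-controlled trial.
rr_exp_active = rate(*curr["exp"]) / rate(*curr["active"])

# Under constancy, the two ratios multiply.
rr_exp_placebo = rr_exp_active * rr_active_placebo
print(f"inferred RR (experimental vs placebo): {rr_exp_placebo:.3f}")
```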
Abstract: Earth system models (ESMs) are fundamental for understanding Earth's complex climate system. However, the computational demands and storage requirements of ESM simulations limit their utility. For the newly published CESM2-LENS2 data, which suffer from this issue, we propose a novel stochastic generator (SG) as a practical complement to the CESM2, capable of rapidly producing emulations that closely mirror the training simulations. Our SG leverages the spherical harmonic transformation (SHT) to shift from the spatial to the spectral domain, enabling efficient low-rank approximations that significantly reduce computational and storage costs. By accounting for axial symmetry and retaining distinct ranks for land and ocean regions, our SG captures intricate nonstationary spatial dependencies. Additionally, a modified Tukey g-and-h (TGH) transformation accommodates non-Gaussianity in high-temporal-resolution data. We apply the proposed SG to generate emulations of surface temperature simulations from the CESM2-LENS2 data across various scales, marking the first attempt at reproducing daily data. These emulations are then meticulously validated against the training simulations. This work offers a promising complementary pathway for efficient climate modeling and analysis while overcoming computational and storage limitations. Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.
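For reference, the standard Tukey g-and-h transformation (the paper uses a modified version, not shown) maps standard-normal draws z to skewed, heavy-tailed values, with g controlling skewness and h controlling tail weight; a minimal sketch:

```python
import numpy as np

def tukey_gh(z, g=0.5, h=0.1, xi=0.0, omega=1.0):
    """Standard Tukey g-and-h transform of standard-normal draws z,
    with location xi and scale omega."""
    if abs(g) < 1e-12:
        core = z                                  # g -> 0 limit
    else:
        core = (np.exp(g * z) - 1.0) / g          # skewness component
    return xi + omega * core * np.exp(h * z**2 / 2.0)  # tail component

rng = np.random.default_rng(1)
z = rng.standard_normal(100_000)
x = tukey_gh(z, g=0.5, h=0.1)   # right-skewed, heavy-tailed surrogate
print(x.mean(), x.std())
```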
Zihang Wang, Irina Gaynanova, Aleksandr Aravkin, Benjamin B. Risk, et al.
pp. 2508-2520
Abstract: Independent component analysis (ICA) is widely used to estimate spatial resting-state networks and their time courses in neuroimaging studies. It is thought that independent components correspond to sparse patterns of co-activating brain locations. Previous approaches for introducing sparsity to ICA replace the non-smooth objective function with smooth approximations, resulting in components that do not achieve exact zeros. We propose a novel Sparse ICA method that enables sparse estimation of independent source components by solving a non-smooth, non-convex optimization problem via the relax-and-split framework. The proposed Sparse ICA method balances statistical independence and sparsity simultaneously and is computationally fast. In simulations, we demonstrate improved estimation accuracy for both source signals and signal time courses compared to existing approaches. We apply Sparse ICA to cortical-surface resting-state fMRI in school-aged autistic children. Our analysis reveals differences in brain activity in certain regions between autistic children and children without autism. Sparse ICA selects coactivating locations, which we argue is more interpretable than the dense components produced by popular approaches. Sparse ICA is fast and easy to apply to big data. Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.
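A relax-and-split iteration for a problem of this shape might alternate a sparse proximal step on the sources with an orthogonal-Procrustes update of the unmixing matrix. The sketch below is a generic construction under that assumption, not the authors' objective or code; X is assumed already whitened.

```python
import numpy as np

def soft_threshold(a, lam):
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def sparse_ica_sketch(X, lam=0.1, n_iter=200, seed=0):
    """Schematic relax-and-split alternation.
    X : (r, n) whitened data with r components and n samples."""
    r, n = X.shape
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((r, r)))  # orthogonal unmixing matrix
    for _ in range(n_iter):
        S = U @ X                              # current source estimates
        Y = soft_threshold(S, lam)             # relaxed sparse copy (prox step)
        # orthogonal U minimizing ||Y - U X||_F (Procrustes via SVD)
        P, _, Qt = np.linalg.svd(Y @ X.T)
        U = P @ Qt
    return U, soft_threshold(U @ X, lam)
```

The Procrustes step is what keeps the update cheap: with U constrained to be orthogonal, the matrix minimizing the Frobenius mismatch has a closed form from one small SVD.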
Abstract: There is growing interest in cell-type-specific analysis from bulk samples containing a mixture of different cell types. A critical first step in such analyses is the accurate estimation of the cell-type proportions in a bulk sample. Although many methods have been proposed recently, quantifying the uncertainties associated with the estimated cell-type proportions has not been well studied. Failure to consider these uncertainties can lead to missed or false findings in downstream analyses. In this article, we introduce a flexible statistical deconvolution framework that allows a general and subject-specific covariance of bulk gene expressions. Under this framework, we propose a decorrelated constrained least squares method called DECALS that estimates cell-type proportions as well as the sampling distribution of the estimates. Simulation studies demonstrate that DECALS can accurately quantify the uncertainties in the estimated proportions whereas other methods fail. Applying DECALS to bulk gene expression data from postmortem brain samples in the ROSMAP and GTEx projects, we show that taking into account the uncertainties in the estimated cell-type proportions can lead to more accurate identification of cell-type-specific differentially expressed genes and transcripts between subject groups, such as between Alzheimer's disease patients and controls and between males and females. Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.
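The constrained least squares core of such deconvolution problems is simple to state: minimize ||y - Sp||^2 over proportions p that are nonnegative and sum to one. A minimal sketch of that core alone (DECALS's decorrelation and uncertainty quantification are not shown; names are mine):

```python
import numpy as np
from scipy.optimize import minimize

def deconvolve(y, S):
    """Constrained least squares for cell-type proportions.
    y : (genes,) bulk expression; S : (genes, cell_types) signature matrix.
    Solves min ||y - S p||^2 subject to p >= 0 and sum(p) = 1."""
    k = S.shape[1]
    p0 = np.full(k, 1.0 / k)                     # start at uniform proportions
    res = minimize(
        lambda p: np.sum((y - S @ p) ** 2),
        p0,
        method="SLSQP",
        bounds=[(0.0, 1.0)] * k,
        constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1.0}],
    )
    return res.x
```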
Rupam Bhattacharyya, Nicholas C. Henderson, Veerabhadran Baladandayuthapani
pp. 2533-2547
Abstract: Rapid advancements in the collection and dissemination of multi-platform molecular and genomics data have created enormous opportunities to aggregate such data in order to understand, prevent, and treat human diseases. While significant improvements have been made in multi-omic data integration methods to discover biological markers and mechanisms underlying both prognosis and treatment, the precise cellular functions governing these complex mechanisms still need detailed, data-driven de novo evaluation. We propose a framework called Functional Integrative Bayesian Analysis of High-dimensional Multiplatform Genomic Data (fiBAG) that allows simultaneous identification of upstream functional evidence for proteogenomic biomarkers and the incorporation of such knowledge into Bayesian variable selection models to improve signal detection. fiBAG employs a conflation of Gaussian process models to quantify (possibly nonlinear) functional evidence via Bayes factors, which are then mapped to a novel calibrated spike-and-slab prior, thus guiding selection and providing functional relevance to the associations with patient outcomes. Using simulations, we illustrate how integrative methods with functional calibration have higher power to detect disease-related markers than non-integrative approaches. We demonstrate the profitability of fiBAG via a pan-cancer analysis of 14 cancer types, identifying and assessing the cellular mechanisms of proteogenomic markers associated with cancer stemness and patient survival. Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.
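One generic way to turn per-marker Bayes factors into prior inclusion probabilities for a spike-and-slab prior is a bounded monotone map, as sketched below. fiBAG's calibrated prior is its own construction, so treat this purely as intuition for how functional evidence can steer variable selection.

```python
import numpy as np

def calibrate_inclusion_probs(bayes_factors, lo=0.05, hi=0.95):
    """Map functional-evidence Bayes factors to prior inclusion
    probabilities via a generic monotone transform (illustrative only)."""
    bf = np.asarray(bayes_factors, dtype=float)
    raw = bf / (1.0 + bf)            # BF -> probability-like scale
    return lo + (hi - lo) * raw      # bound probabilities away from 0 and 1

bf = np.array([0.1, 1.0, 10.0, 100.0])
print(calibrate_inclusion_probs(bf))  # stronger evidence -> higher prior inclusion
```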
Gregor Zens, Sylvia Fruehwirth-Schnatter, Helga Wagner
pp. 2548-2559
Abstract: Modeling binary and categorical data is one of the most common tasks of applied statisticians and econometricians. While Bayesian methods in this context have been available for decades, they often require a high level of familiarity with Bayesian statistics or suffer from issues such as low sampling efficiency. To contribute to the accessibility of Bayesian models for binary and categorical data, we introduce novel latent variable representations based on Polya-Gamma random variables for a range of commonly encountered logistic regression models. From these latent variable representations, new Gibbs sampling algorithms for binary, binomial, and multinomial logit models are derived. All models allow for a conditionally Gaussian likelihood representation, rendering extensions to more complex modeling frameworks such as state space models straightforward. However, sampling efficiency may still be an issue in these data-augmentation-based estimation frameworks. To counteract this, novel marginal data augmentation strategies are developed and discussed in detail. The merits of our approach are illustrated through extensive simulations and real data applications. Supplementary materials for this article are available online.
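The classic Polya-Gamma scheme for the binary logit model (following Polson, Scott, and Windle) alternates drawing latent omega_i ~ PG(1, x_i'beta) and a conditionally Gaussian beta. A minimal sketch, with a truncated sum-of-gammas approximation standing in for a proper PG sampler (use a dedicated sampler in real work), and no burn-in handling:

```python
import numpy as np

rng = np.random.default_rng(7)

def sample_pg1(c, trunc=200):
    """Approximate PG(1, c) draws via the truncated sum-of-gammas
    representation: (1/(2 pi^2)) sum_k g_k / ((k - 1/2)^2 + c^2/(4 pi^2))."""
    k = np.arange(1, trunc + 1)
    g = rng.exponential(size=(c.size, trunc))          # Gamma(1,1) draws
    denom = (k - 0.5) ** 2 + (c[:, None] / (2 * np.pi)) ** 2
    return (g / denom).sum(axis=1) / (2 * np.pi ** 2)

def gibbs_logit(X, y, n_draws=2000, tau2=100.0):
    """Polya-Gamma Gibbs sampler for Bayesian logistic regression
    with a N(0, tau2 I) prior on the coefficients."""
    n, p = X.shape
    beta = np.zeros(p)
    kappa = y - 0.5                                    # PG augmentation pseudo-data
    B0inv = np.eye(p) / tau2
    draws = np.empty((n_draws, p))
    for t in range(n_draws):
        omega = sample_pg1(np.abs(X @ beta))           # 1) omega | beta, y
        V = np.linalg.inv(X.T @ (omega[:, None] * X) + B0inv)
        m = V @ (X.T @ kappa)                          # 2) beta | omega, y
        beta = rng.multivariate_normal(m, V)
        draws[t] = beta
    return draws
```

The appeal of the augmentation is visible in step 2: conditional on omega, the logit model behaves like a Gaussian linear model, so beta is drawn from a closed-form multivariate normal.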