Computational Statistics & Data Analysis: Latest Articles

A Frisch-Waugh-Lovell theorem for empirical likelihood
IF 1.5; Zone 3 (Mathematics); Q3, Computer Science, Interdisciplinary Applications; Pub Date: 2025-11-01; Epub Date: 2025-05-23; DOI: 10.1016/j.csda.2025.108208
Yichun Song
A Frisch-Waugh-Lovell-type (FWL) theorem for empirical likelihood estimation with instrumental variables is presented, which resembles the standard FWL theorem in ordinary least squares (OLS), but its partitioning procedure employs the empirical likelihood weights at the solution rather than the original sample distribution. This result is leveraged to simplify the computational process through an iterative algorithm, where exogenous variables are partitioned out using weighted least squares, and the weights are updated between iterations. Furthermore, it is demonstrated that iterations converge locally to the original empirical likelihood estimate at a stochastically super-linear rate. A feasible iterative constrained optimization algorithm for calculating empirical-likelihood-based confidence intervals is provided, along with a discussion of its properties. Monte Carlo simulations indicate that the iterative algorithm is robust and produces results within the numerical tolerance of the original empirical likelihood estimator in finite samples, while significantly improving computation in large-scale problems. Additionally, the algorithm performs effectively in an illustrative application using the return to education framework.
Volume 211, Article 108208
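The classical partitioning result that this abstract builds on can be checked numerically. The sketch below is an illustration only and is not the paper's empirical likelihood algorithm: it uses a toy two-regressor model without intercept, and the helper names (`wdot`, `residual`, `full_wls_coef`, `fwl_coef`) and both weight vectors are assumptions made for the example. For any fixed positive weights, partialling out `x2` by weighted least squares reproduces the coefficient on `x1` from the full weighted regression. The paper's iterative scheme additionally updates the weights between iterations, which this sketch does not attempt.

```python
import random

def wdot(a, b, w):
    """Weighted inner product: sum_i w_i * a_i * b_i."""
    return sum(wi * ai * bi for ai, bi, wi in zip(a, b, w))

def residual(v, z, w):
    """Residual of v after a weighted least-squares projection on z."""
    coef = wdot(v, z, w) / wdot(z, z, w)
    return [vi - coef * zi for vi, zi in zip(v, z)]

def full_wls_coef(y, x1, x2, w):
    """Coefficient on x1 from the weighted regression of y on (x1, x2),
    obtained by solving the 2x2 normal equations with Cramer's rule."""
    a11, a12, a22 = wdot(x1, x1, w), wdot(x1, x2, w), wdot(x2, x2, w)
    b1, b2 = wdot(x1, y, w), wdot(x2, y, w)
    det = a11 * a22 - a12 * a12
    return (b1 * a22 - a12 * b2) / det

def fwl_coef(y, x1, x2, w):
    """Same coefficient via partialling out x2 from both y and x1."""
    ry = residual(y, x2, w)
    rx = residual(x1, x2, w)
    return wdot(rx, ry, w) / wdot(rx, rx, w)

# Simulated toy data (hypothetical, for illustration only).
rng = random.Random(0)
n = 200
x2 = [rng.gauss(0, 1) for _ in range(n)]
x1 = [0.5 * z + rng.gauss(0, 1) for z in x2]
y = [2.0 * a - 1.0 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]

uniform = [1.0 / n] * n                          # the original sample distribution
uneven = [rng.uniform(0.5, 1.5) for _ in range(n)]  # arbitrary fixed weights

gap_uniform = abs(full_wls_coef(y, x1, x2, uniform) - fwl_coef(y, x1, x2, uniform))
gap_uneven = abs(full_wls_coef(y, x1, x2, uneven) - fwl_coef(y, x1, x2, uneven))
print(gap_uniform, gap_uneven)  # both gaps are zero up to floating-point rounding
```

The uniform weights correspond to ordinary least squares; the uneven weights stand in for any non-uniform reweighting of the sample.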
Heavy-tailed matrix-variate hidden Markov models
IF 1.5; Zone 3 (Mathematics); Q3, Computer Science, Interdisciplinary Applications; Pub Date: 2025-11-01; Epub Date: 2025-05-12; DOI: 10.1016/j.csda.2025.108198
Salvatore D. Tomarchio
The matrix-variate framework for hidden Markov models (HMMs) is expanded with two families of models using matrix-variate t and contaminated normal distributions. These models improve the handling of tail behavior and clustering, and address challenges in identifying outlying matrices in matrix-variate data. Two Expectation-Conditional Maximization (ECM) algorithms are implemented in the R package MatrixHMM for parameter estimation. Simulations assess parameter recovery, robustness, and anomaly detection, and show the advantages over alternative approaches. The models are applied to real-world data to analyze labor market dynamics across Italian provinces.
Volume 211, Article 108198
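To see how a contaminated normal distribution can flag outlying observations, a univariate simplification is sketched below. It is not the matrix-variate model of the paper and does not use the MatrixHMM package; the function names and the parameter values (`alpha`, the proportion of typical points, and `eta` > 1, the variance inflation factor) are assumptions for illustration. The posterior probability of belonging to the non-inflated component drops sharply for points far from the mean.

```python
import math

def normal_pdf(x, mu, var):
    """Density of a univariate normal with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def contaminated_normal_pdf(x, mu, var, alpha, eta):
    """Two-component mixture with a common mean: a typical component with
    variance var (weight alpha) and an inflated one with variance eta * var."""
    return (alpha * normal_pdf(x, mu, var)
            + (1.0 - alpha) * normal_pdf(x, mu, eta * var))

def prob_typical(x, mu, var, alpha, eta):
    """Posterior probability that x was generated by the typical component."""
    return alpha * normal_pdf(x, mu, var) / contaminated_normal_pdf(x, mu, var, alpha, eta)

# Hypothetical parameter values chosen for the illustration.
for x in (0.0, 2.0, 5.0):
    print(x, prob_typical(x, mu=0.0, var=1.0, alpha=0.9, eta=10.0))
```

A point would typically be labeled as outlying when this probability falls below one half.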
Exact statistical analysis for response-adaptive clinical trials: A general and computationally tractable approach
IF 1.5; Zone 3 (Mathematics); Q3, Computer Science, Interdisciplinary Applications; Pub Date: 2025-11-01; Epub Date: 2025-05-19; DOI: 10.1016/j.csda.2025.108207
Stef Baas, Peter Jacko, Sofía S. Villar
Response-adaptive clinical trial designs allow targeting a given objective by skewing the allocation of participants to treatments based on observed outcomes. Response-adaptive designs face greater regulatory scrutiny due to potential type I error rate inflation, which limits their uptake in practice. Existing approaches for type I error control either only work for specific designs, have a risk of Monte Carlo/approximation error, are conservative, or are computationally intractable. To this end, a general and computationally tractable approach is developed for exact analysis in two-arm response-adaptive designs with binary outcomes. This approach can construct exact tests for designs using either a randomized or deterministic response-adaptive procedure. The constructed conditional and unconditional exact tests generalize Fisher's and Barnard's exact tests, respectively. Furthermore, the approach allows for complexities such as delayed outcomes, early stopping, or allocation of participants in blocks. The efficient implementation of forward recursion allows for testing of two-arm trials with 1,000 participants on a standard computer. Through an illustrative computational study of trials using randomized dynamic programming, it is shown that, contrary to what is known for equal allocation, the conditional exact Wald test based on total successes has, almost uniformly, higher power than the unconditional exact Wald test. Two real-world trials with the above-mentioned complexities are re-analyzed to demonstrate the value of the new approach in controlling type I errors and/or improving the statistical power.
Volume 211, Article 108207
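The idea of computing exact outcome distributions by forward recursion can be sketched for a small two-arm trial with binary outcomes. This is a minimal illustration under stated assumptions, not the authors' implementation: the allocation rule `alloc_prob_a` is a hypothetical randomized rule invented for the example, and outcomes are assumed to be observed immediately, with no delays, blocks, or early stopping. The recursion propagates the probability of every state (successes and failures per arm) one participant at a time, so no Monte Carlo error is involved.

```python
from collections import defaultdict

def alloc_prob_a(state):
    """Hypothetical randomized rule: favor arm A after successes on A
    or failures on B. Returns the probability of allocating to arm A."""
    sa, fa, sb, fb = state
    return (sa + fb + 1) / (sa + fa + sb + fb + 2)

def forward_recursion(n, p_a, p_b, alloc=alloc_prob_a):
    """Exact distribution over states (sa, fa, sb, fb) after n participants,
    given success probabilities p_a and p_b for the two arms."""
    dist = {(0, 0, 0, 0): 1.0}
    for _ in range(n):
        nxt = defaultdict(float)
        for (sa, fa, sb, fb), pr in dist.items():
            pa = alloc((sa, fa, sb, fb))
            nxt[(sa + 1, fa, sb, fb)] += pr * pa * p_a
            nxt[(sa, fa + 1, sb, fb)] += pr * pa * (1.0 - p_a)
            nxt[(sa, fa, sb + 1, fb)] += pr * (1.0 - pa) * p_b
            nxt[(sa, fa, sb, fb + 1)] += pr * (1.0 - pa) * (1.0 - p_b)
        dist = dict(nxt)
    return dist

def expected_on_a(dist):
    """Expected number of participants allocated to arm A."""
    return sum(pr * (sa + fa) for (sa, fa, sb, fb), pr in dist.items())

n = 20
null_dist = forward_recursion(n, 0.5, 0.5)  # equal success probabilities
alt_dist = forward_recursion(n, 0.8, 0.3)   # arm A is better
print(len(null_dist), expected_on_a(null_dist), expected_on_a(alt_dist))
```

Given such a distribution, the exact rejection probability of any test is the total probability of the states in its rejection region, which is how exact type I error rates can be evaluated without simulation.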
Distributed variable screening for generalized linear models
IF 1.5; Zone 3 (Mathematics); Q3, Computer Science, Interdisciplinary Applications; Pub Date: 2025-11-01; Epub Date: 2025-05-12; DOI: 10.1016/j.csda.2025.108203
Tianbo Diao, Bo Li, Lianqiang Qu, Liuquan Sun
In this article, we develop a distributed variable screening method for generalized linear models. This method is designed to handle situations where both the sample size and the number of covariates are large. Specifically, the proposed method selects relevant covariates by using a sparsity-restricted surrogate likelihood estimator. It takes into account the joint effects of the covariates rather than just their marginal effects, which enhances the reliability of the screening results. We establish the sure screening property of the proposed method, which ensures that, with high probability, the true model is included in the selected model. Simulation studies are conducted to evaluate the finite sample performance of the proposed method, and an application to a real dataset showcases its practical utility.
Volume 211, Article 108203
A simultaneous confidence-bounded true discovery proportion perspective on localizing differences in smooth terms in regression models
IF 1.5; Zone 3 (Mathematics); Q3, Computer Science, Interdisciplinary Applications; Pub Date: 2025-11-01; Epub Date: 2025-04-23; DOI: 10.1016/j.csda.2025.108197
David Swanson
A method is demonstrated for localizing where two spline terms, or smooths, differ using a true discovery proportion (TDP)-based interpretation. The procedure yields a statement on the proportion of some region where true differences exist between two smooths. The methodology avoids ad hoc approaches to making such statements, like subsetting the data and performing hypothesis tests on the truncated spline terms. TDP estimates are 1-α confidence-bounded simultaneously, which means that a region's TDP estimate is a lower bound on the proportion of actual differences, or true discoveries, in that region, with high confidence regardless of the number of estimates made. The procedure is based on closed testing using the Simes local test. This local test requires that the multivariate χ² test statistics of generalized Wishart type underlying the method be positive regression dependent on subsets (PRDS), and evidence is presented suggesting that this condition holds. Consistency of the procedure is demonstrated for generalized additive models with the tuning parameter chosen by REML or GCV, and the achievement of confidence-bounded TDP is shown in simulation and in an analysis of walking gait.
Volume 211, Article 108197
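The Simes local test mentioned in this abstract is simple to state: with m p-values sorted in increasing order, the intersection hypothesis is rejected at level α when some ordered p-value p_(i) is at most iα/m. The sketch below shows only this local test, not the closed-testing procedure or the TDP bounds built on it; the function names and the example p-values are assumptions for illustration, and a Bonferroni test is included for contrast.

```python
def simes_rejects(p_values, alpha=0.05):
    """Simes test of the intersection hypothesis: reject when any ordered
    p-value p_(i) satisfies p_(i) <= i * alpha / m."""
    m = len(p_values)
    ordered = sorted(p_values)
    return any(p <= (i * alpha) / m for i, p in enumerate(ordered, start=1))

def bonferroni_rejects(p_values, alpha=0.05):
    """Bonferroni test of the same hypothesis: reject when min p <= alpha / m."""
    return min(p_values) <= alpha / len(p_values)

# Hypothetical p-values: the smallest one misses the Bonferroni threshold
# (0.05 / 3), but the second smallest meets the Simes threshold (2 * 0.05 / 3).
example = [0.02, 0.024, 0.5]
print(simes_rejects(example), bonferroni_rejects(example))
```

Any rejection by Bonferroni is also a rejection by Simes, since the Bonferroni threshold is the first Simes threshold; the price is that Simes requires a dependence condition such as PRDS for validity.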
Flexible modeling of left-truncated and interval-censored competing risks data with missing event types
IF 1.5; Zone 3 (Mathematics); Q3, Computer Science, Interdisciplinary Applications; Pub Date: 2025-11-01; Epub Date: 2025-06-05; DOI: 10.1016/j.csda.2025.108229
Yichen Lou, Yuqing Ma, Liming Xiang, Jianguo Sun
Interval-censored competing risks data arise in many cohort studies in clinical research, where multiple types of events subject to interval censoring are included and the occurrence of the primary event of interest may be censored by the occurrence of other events. The presence of missing event types and left truncation poses challenges to the regression analysis of such data. We propose a new two-stage estimation procedure under a class of semiparametric generalized odds rate transformation models to overcome these challenges. Our method first facilitates the estimation of both the probability of response and the probability of occurrence of each type of event under the missing at random assumption, using either parametric or non-parametric methods. An augmented inverse probability weighting likelihood based on the complete-case likelihood and data from subjects with missing type of event is then maximized for estimating regression parameters. We provide desirable asymptotic properties and construct a concordance index to evaluate the model's discriminative ability. The proposed method is demonstrated through extensive simulations and the analysis of data from the Amsterdam cohort study on HIV infection and AIDS.
Volume 211, Article 108229
Small area prediction of counts under machine learning-type mixed models
IF 1.5; Zone 3 (Mathematics); Q3, Computer Science, Interdisciplinary Applications; Pub Date: 2025-11-01; Epub Date: 2025-05-30; DOI: 10.1016/j.csda.2025.108218
Nicolas Frink, Timo Schmid
Small area estimation methods are proposed that use generalized tree-based machine learning techniques to improve the estimation of disaggregated means in small areas using discrete survey data. Specifically, two existing approaches based on random forests - the Generalized Mixed Effects Random Forest (GMERF) and a Mixed Effects Random Forest (MERF) - are extended to accommodate count outcomes, addressing key challenges such as overdispersion. Additionally, three bootstrap methodologies designed to assess the reliability of point estimators for area-level means are evaluated. The numerical analysis shows that the MERF, which does not assume a Poisson distribution to model the mean behavior of count data, excels in scenarios of severe overdispersion. Conversely, the GMERF performs best under conditions where Poisson distribution assumptions are moderately met. In a case study using real-world data from the state of Guerrero, Mexico, the proposed methods effectively estimate area-level means while capturing the uncertainty inherent in overdispersed count data. These findings highlight their practical applicability for small area estimation.
Volume 211, Article 108218
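Overdispersion, the key challenge named in this abstract, can be diagnosed with a simple variance-to-mean ratio: a Poisson distribution has variance equal to its mean, so a ratio well above 1 suggests that a Poisson assumption is doubtful. The snippet below is a minimal sketch with made-up counts; the function name `dispersion_index` and the data are assumptions for illustration and are unrelated to the paper's estimators or its Guerrero case study.

```python
import statistics

def dispersion_index(counts):
    """Sample variance divided by sample mean; close to 1 for Poisson-like data."""
    return statistics.variance(counts) / statistics.mean(counts)

# Hypothetical count data with a few large values.
counts = [0, 0, 1, 1, 2, 3, 8, 15]
ratio = dispersion_index(counts)
print(ratio)  # a value well above 1 indicates overdispersion
```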
Penalized maximum likelihood estimation with nonparametric Gaussian scale mixture errors
IF 1.5; Zone 3 (Mathematics); Q3, Computer Science, Interdisciplinary Applications; Pub Date: 2025-11-01; Epub Date: 2025-05-16; DOI: 10.1016/j.csda.2025.108206
Seo-Young Park, Byungtae Seo
The penalized least squares and maximum likelihood methods have been successfully employed for simultaneous parameter estimation and variable selection. However, outlying observations can severely affect the quality of the estimator and selection performance. Although some robust methods for variable selection have been proposed in the literature, they often lose substantial efficiency. This is primarily attributed to the excessive dependence on choosing additional tuning parameters or modifying the original objective functions as tools to enhance robustness. In response to these challenges, we use a nonparametric Gaussian scale mixture distribution for the regression error distribution. This approach allows the error distributions in the model to achieve great flexibility and provides data-adaptive robustness. Our proposed estimator exhibits desirable theoretical properties, including sparsity and oracle properties. In the estimation process, we employ a combination of expectation-maximization and gradient-based algorithms for the parametric and nonparametric components, respectively. Through comprehensive numerical studies, encompassing simulation studies and real data analysis, we substantiate the robust performance of the proposed method.
Volume 211, Article 108206
Quantile Super Learning for independent and online settings with application to solar power forecasting
IF 1.5; Zone 3 (Mathematics); Q3, Computer Science, Interdisciplinary Applications; Pub Date: 2025-11-01; Epub Date: 2025-05-09; DOI: 10.1016/j.csda.2025.108202
Herbert Susmann, Antoine Chambaz
Estimating quantiles of an outcome conditional on covariates is of fundamental interest in statistics with broad application in probabilistic prediction and forecasting. An ensemble method for conditional quantile estimation is proposed, Quantile Super Learning, that combines predictions from multiple candidate algorithms based on their empirical performance measured with respect to a cross-validated empirical risk of the quantile loss function. Theoretical guarantees for both i.i.d. and online data scenarios are presented. The performance of this approach for quantile estimation and in forming prediction intervals is tested in simulation studies. Two case studies related to solar energy are used to illustrate Quantile Super Learning: in an i.i.d. setting, we predict the physical properties of perovskite materials for photovoltaic cells, and in an online setting we forecast ground solar irradiance based on output from dynamic weather ensemble models.
估计以协变量为条件的结果的分位数是统计学中的一个基本问题,在概率预测和预测中有着广泛的应用。提出了一种条件分位数估计的集成方法,即分位数超级学习,该方法结合了来自多个候选算法的预测,这些算法基于分位数损失函数的交叉验证的经验风险测量的经验性能。本文给出了对i.i.d和在线数据场景的理论保证。仿真研究验证了该方法在分位数估计和预测区间形成方面的性能。两个与太阳能相关的案例研究用于说明分位数超级学习:在i.i.d设置中,我们预测光伏电池的钙钛矿材料的物理性质;在在线设置中,我们根据动态天气集合模型的输出预测地面太阳辐照度。
{"title":"Quantile Super Learning for independent and online settings with application to solar power forecasting","authors":"Herbert Susmann ,&nbsp;Antoine Chambaz","doi":"10.1016/j.csda.2025.108202","DOIUrl":"10.1016/j.csda.2025.108202","url":null,"abstract":"<div><div>Estimating quantiles of an outcome conditional on covariates is of fundamental interest in statistics with broad application in probabilistic prediction and forecasting. An ensemble method for conditional quantile estimation is proposed, Quantile Super Learning, that combines predictions from multiple candidate algorithms based on their empirical performance measured with respect to a cross-validated empirical risk of the quantile loss function. Theoretical guarantees for both i.i.d. and online data scenarios are presented. The performance of <em>this</em> approach for quantile estimation and in forming prediction intervals is tested in simulation studies. Two case studies related to solar energy are used to illustrate Quantile Super Learning: in an i.i.d. setting, we predict the physical properties of perovskite materials for photovoltaic cells, and in an online setting we forecast ground solar irradiance based on output from dynamic weather ensemble models.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"211 ","pages":"Article 108202"},"PeriodicalIF":1.5,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143942605","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Distributed iterative hard thresholding for variable selection in Tobit models
IF 1.5 Zone 3 (Mathematics) Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2025-11-01 Epub Date: 2025-06-03 DOI: 10.1016/j.csda.2025.108227
Changxin Yang, Zhongyi Zhu, Hongmei Lin, Zengyan Fan, Heng Lian
While there is a substantial body of research on high-dimensional regression with left-censored responses, few methods address this problem in a distributed manner. Due to data transmission limitations and privacy concerns, centralizing all data is often impractical, necessitating a method for collaborative learning with distributed data. In this paper, we employ the Iterative Hard Thresholding (IHT) method for the Tobit model to address this challenge, allowing one to directly specify the desired sparsity and offering an alternative estimation and variable selection approach. Theoretical analysis shows that our estimator achieves a nearly minimax-optimal convergence rate using only a few rounds of communication. Its practical performance is evaluated under both the pooled and the distributed setting. The former highlights its competitive estimation efficiency and variable selection performance compared to existing approaches, while the latter demonstrates that the decentralized estimator closely matches the performance of its centralized counterpart. When applied to high-dimensional left-censored HIV viral load data, our method also demonstrates comparable performance.
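The core idea of iterative hard thresholding mentioned in the abstract can be illustrated with a minimal sketch: take a gradient step, then keep only the `s` largest-magnitude coefficients. The code below is an assumption-laden illustration, not the authors' method: it uses a plain least-squares loss as a stand-in for the Tobit likelihood, runs on a single machine rather than in a distributed setting, and the names `hard_threshold`, `iht`, and the simulated data are hypothetical.

```python
import numpy as np

def hard_threshold(beta, s):
    """Keep the s largest-magnitude entries of beta and set the rest to zero."""
    out = np.zeros_like(beta)
    if s > 0:
        keep = np.argsort(np.abs(beta))[-s:]
        out[keep] = beta[keep]
    return out

def iht(grad, beta0, s, step, n_iter=200):
    """Generic IHT: gradient step, then projection onto s-sparse vectors."""
    beta = hard_threshold(np.asarray(beta0, dtype=float), s)
    for _ in range(n_iter):
        beta = hard_threshold(beta - step * grad(beta), s)
    return beta

# Illustration on simulated data with a least-squares loss
# (a stand-in assumption; the paper works with the Tobit likelihood).
rng = np.random.default_rng(0)
n, p, s = 200, 20, 3
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:s] = [2.0, -1.5, 1.0]
y = X @ beta_true + 0.1 * rng.standard_normal(n)

def grad(b):
    """Gradient of the average squared-error loss."""
    return X.T @ (X @ b - y) / n

# Step size 1/L, where L is the largest eigenvalue of X'X/n.
step = 1.0 / np.linalg.eigvalsh(X.T @ X / n).max()
beta_hat = iht(grad, np.zeros(p), s, step)
```

Because the sparsity level `s` is passed directly to the projection step, the number of selected variables is fixed in advance instead of being controlled indirectly through a penalty parameter, which is the property the abstract highlights.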
Citations: 0