Subsampling Winner Algorithm from the Feature Space for Feature Selection in Large Regression Data: A Paradigm Shift

Research output: Contribution to conference › Paper

Abstract

Feature selection from a large number p of covariates in regression analysis challenges data science, especially in scaling to ever-growing data and finding scientifically important features. Modern approaches to feature selection in large-p data use a penalized likelihood or shrinkage estimation, such as the LASSO, SCAD, Elastic Net, or MCP procedures. The randomForest procedure is another alternative. We present a different approach based on a new subsampling method, the Subsampling Winner algorithm (SWA), which subsamples from the p features (not from the n observations). Owing to its subsampling nature, SWA can in principle scale to data of any dimension. In a linear regression setting, SWA has the best-controlled false discovery rate among the aforementioned procedures while maintaining a competitive true-feature discovery rate. We investigate the reasons behind its good performance, provide practical strategies to doubly assure an SWA selection, and discuss its extension to more general settings. We shall also discuss computational improvements and SWA's relation to some machine learning algorithms.
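To make the core idea concrete, below is a minimal illustrative sketch of feature-space subsampling for selection in linear regression. This is not the published SWA; the subset size, the number of rounds, the "winner" score (absolute coefficient scaled by column norm), and the final vote-count cutoff are all assumptions made for illustration. The one point it does follow from the abstract is that random subsets are drawn from the p features, not from the n observations.

```python
import numpy as np

def subsample_winner_sketch(X, y, subset_size=5, n_rounds=200, n_select=3, seed=0):
    """Illustrative feature-subsampling selector (a sketch, not the published SWA).

    Each round draws a random subset of features, fits OLS on that subset,
    and declares the feature with the largest scaled coefficient the round's
    "winner". Features with the most wins across rounds are selected.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    wins = np.zeros(p, dtype=int)
    for _ in range(n_rounds):
        # subsample from the p features, not from the n observations
        cols = rng.choice(p, size=subset_size, replace=False)
        Xs = X[:, cols]
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        # crude standardization: |coefficient| times column norm
        score = np.abs(beta) * np.linalg.norm(Xs, axis=0)
        wins[cols[np.argmax(score)]] += 1
    # return the n_select most frequent winners
    return np.argsort(wins)[::-1][:n_select]

# toy data: only features 0 and 1 drive y among 20 candidates
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 20))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)
print(sorted(subsample_winner_sketch(X, y, n_select=2)))
```

Because each round fits only `subset_size` columns, the per-round cost does not grow with p, which is the intuition behind the abstract's claim that a feature-subsampling scheme can scale to data of any dimension.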
Original language: English
State: Published - 2022
Event: 2022 Joint Statistical Meeting - Washington, DC
Duration: Jan 1 2022 → …

Conference

Conference: 2022 Joint Statistical Meeting
Period: 01/1/22 → …