Abstract
Feature selection from a large number p of covariates in a regression analysis is a central challenge in data science, particularly for scaling to ever-growing data and for identifying scientifically important features. The modern approach to feature selection in large-p data uses penalized likelihood or shrinkage estimation, such as the LASSO, SCAD, Elastic Net, or MCP procedures. The randomForest procedure is another alternative. We present a different approach based on a new subsampling method, called the Subsampling Winner algorithm (SWA), which subsamples from the p features (rather than from the n observations). Due to its subsampling nature, SWA can in principle scale to data of any dimension. In a linear regression setting, SWA has the best-controlled false discovery rate among the aforementioned procedures while maintaining a competitive true-feature discovery rate. We investigate the reasons behind its good performance, provide practical strategies to doubly assure an SWA selection, and discuss its extension to more general settings. We shall also discuss computational improvements and SWA's relation to some machine learning algorithms.
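The abstract does not spell out SWA's steps, but its core idea, repeatedly subsampling small sets of the p features and promoting the "winners" of each subsample, can be illustrated with a minimal sketch. Everything below is a hypothetical illustration, not the authors' algorithm: the subsample size `s`, the number of rounds, the use of ordinary least squares on each feature subset, and the averaged absolute-coefficient score are all assumptions chosen for simplicity.

```python
import numpy as np

def swa_sketch(X, y, s=5, n_rounds=200, n_semifinalists=10, rng_seed=0):
    """Hypothetical sketch of a subsampling-winner style selection.

    Repeatedly draws random subsets of s features (subsampling the p
    features, not the n observations), fits OLS on each subset, and
    tallies a per-feature score from the fitted coefficients. The
    highest-scoring features are returned as "semifinalists".
    """
    rng = np.random.default_rng(rng_seed)
    n, p = X.shape
    scores = np.zeros(p)   # accumulated |coefficient| per feature
    counts = np.zeros(p)   # how often each feature was subsampled
    for _ in range(n_rounds):
        idx = rng.choice(p, size=s, replace=False)
        # least-squares fit on the feature subsample
        beta, *_ = np.linalg.lstsq(X[:, idx], y, rcond=None)
        scores[idx] += np.abs(beta)
        counts[idx] += 1
    # average score; features never drawn get score 0
    avg = np.where(counts > 0, scores / np.maximum(counts, 1), 0.0)
    return np.argsort(avg)[::-1][:n_semifinalists]
```

In this toy version, strongly predictive features accumulate large average coefficients across the rounds they appear in, so they surface among the semifinalists; the actual SWA as presented at the meeting may score, screen, and finalize features quite differently.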
| Original language | English |
|---|---|
| State | Published - 2022 |
| Event | 2022 Joint Statistical Meeting - Washington, DC |
| Duration | Jan 1 2022 → … |