Abstract
Feature (or variable) selection from a large number p of features continuously challenges data science, especially for ever-enlarging data and for discovering scientifically important features in a regression setting. For example, to develop valid drug targets for ovarian cancer, we must control the false-discovery rate (FDR) of a selection procedure. The popular approach to feature selection in large-p regression uses a penalized likelihood or a shrinkage estimation, such as a LASSO, SCAD, Elastic Net, or MCP procedure. We present a different approach called the Subsampling Winner algorithm (SWA), which subsamples from the p features. The idea of SWA is analogous to selecting US National Merit Scholars: semifinalists are chosen based on students' performance in tests at local schools (a.k.a. subsample analyses), and the finalists (a.k.a. winning features) are then determined from the semifinalists. Due to its subsampling nature, SWA can scale to data of any dimension. SWA also has the best-controlled FDR compared to the penalized and Random Forest procedures, while maintaining a competitive true-feature discovery rate. Our application of SWA to an ovarian cancer dataset revealed functionally important genes and pathways.
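The two-stage idea in the abstract (subsample analyses producing semifinalists, then a final round selecting winners) can be sketched as follows. This is an illustrative sketch only, not the paper's actual algorithm: the subsample size `s`, number of rounds, semifinalist count, and coefficient-magnitude scoring here are all assumptions, and the real SWA's scoring rule and FDR control are not reproduced.

```python
import numpy as np

def subsampling_winner_sketch(X, y, s=5, n_rounds=200, n_semifinalists=10, seed=0):
    """Illustrative two-stage feature screening in the spirit of SWA.

    Stage 1 ("local tests"): repeatedly draw s of the p features, fit
    least squares on that subsample, and credit each drawn feature with
    the magnitude of its fitted coefficient.
    Stage 2 ("finals"): refit on the top-scoring semifinalists and rank
    them by absolute coefficient.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    scores = np.zeros(p)   # accumulated |coefficient| per feature
    counts = np.zeros(p)   # how often each feature was subsampled
    for _ in range(n_rounds):
        idx = rng.choice(p, size=s, replace=False)
        beta, *_ = np.linalg.lstsq(X[:, idx], y, rcond=None)
        scores[idx] += np.abs(beta)
        counts[idx] += 1
    avg = scores / np.maximum(counts, 1)
    semifinalists = np.argsort(avg)[::-1][:n_semifinalists]
    # Final round: one regression on the semifinalists only.
    beta_final, *_ = np.linalg.lstsq(X[:, semifinalists], y, rcond=None)
    order = np.argsort(np.abs(beta_final))[::-1]
    return semifinalists[order]  # semifinalists ranked by final-stage strength
```

Because each fit touches only `s` features at a time, the per-round cost is independent of `p`, which is the property that lets a subsampling scheme scale to data of any dimension.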
| Original language | English |
|---|---|
| Pages (from-to) | 168-184 |
| Number of pages | 17 |
| Journal | Statistical Analysis and Data Mining |
| Volume | 14 |
| Issue number | 2 |
| DOIs | |
| State | Published - Apr 1 2021 |
UN SDGs
This output contributes to the following UN Sustainable Development Goals (SDGs)
- SDG 3 Good Health and Well-being
Keywords
- feature selection
- high dimensions
- subsampling winner algorithm (SWA)
- variable selection