
Subsampling from features in large regression to find “winning features”

Research output: Contribution to journal › Article › peer-review

2 Scopus citations

Abstract

Feature (or variable) selection from a large number p of features is a persistent challenge in data science, especially for ever-growing data and for discovering scientifically important features in a regression setting. For example, to develop valid drug targets for ovarian cancer, we must control the false-discovery rate (FDR) of the selection procedure. The popular approach to feature selection in large-p regression uses a penalized likelihood or shrinkage estimation, such as LASSO, SCAD, Elastic Net, or MCP. We present a different approach, the Subsampling Winner algorithm (SWA), which subsamples from the p features. The idea of SWA is analogous to the selection of US National Merit Scholars: semifinalists are chosen based on students' performance in tests at local schools (a.k.a. the subsample analyses), and the finalists (a.k.a. the winning features) are then determined from the semifinalists. Owing to its subsampling nature, SWA can scale to data of any dimension. SWA also has the best-controlled FDR compared with the penalized and Random Forest procedures, while maintaining a competitive true-feature discovery rate. Our application of SWA to an ovarian cancer dataset revealed functionally important genes and pathways.
Original language: English
Pages (from-to): 168-184
Number of pages: 17
Journal: Statistical Analysis and Data Mining
Volume: 14
Issue number: 2
DOIs
State: Published - Apr 1 2021

UN SDGs

This output contributes to the following UN Sustainable Development Goals (SDGs)

  1. SDG 3 - Good Health and Well-being

Keywords

  • feature selection
  • high dimensions
  • subsampling winner algorithm (SWA)
  • variable selection
