- Machine Learning
- Causal Discovery
- Feature or Variable Selection
- Molecular Signatures
Epilogeas Project Abstract
Variable selection is a well-studied, solved problem in machine learning and data mining. Or is it? While significant progress has been made when variable selection is employed to increase classification performance, novel and important directions need to be explored when variable selection is employed for understanding the system under study. This is particularly the case when analyzing high-dimensional data, such as omics data (e.g., transcriptomics, next-generation sequencing, methylation).
We propose an intense research program to address current pressing needs, particularly in medicine and biology, but also in any high-dimensional data-analysis setting. We set forth new variable selection problems with deep connections to causality and causal theories.
We propose to study (a) variable selection for repeated-measurement (longitudinal) data; (b) variable selection for predicting the effect of interventions, such as knocking out a gene; (c) variable selection simultaneously from several heterogeneous datasets, together with any available prior knowledge; (d) variable selection that identifies all optimal variable sets, not just one, which is particularly important for low sample sizes and in the presence of collinearities; and (e) variable selection for hard distributions in which pairwise associations for important variables disappear (e.g., exclusive-OR functions).
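As a minimal illustration of case (e), the sketch below builds the truth table of an exclusive-OR target and checks that each input, taken alone, has zero correlation with the target even though the two inputs together determine it exactly. The variable names and the use of Pearson correlation as the pairwise measure are assumptions made for this example only; it does not represent any of the algorithms proposed in the project.

```python
import itertools

# Truth table for y = x1 XOR x2, with every input combination equally likely.
rows = [(x1, x2, x1 ^ x2) for x1, x2 in itertools.product([0, 1], repeat=2)]


def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5


x1 = [r[0] for r in rows]
x2 = [r[1] for r in rows]
y = [r[2] for r in rows]

# Pairwise (univariate) association of each input with the target.
corr_x1_y = pearson(x1, y)
corr_x2_y = pearson(x2, y)

# Joint view: every (x1, x2) pair maps to exactly one value of y.
outputs_per_pair = {}
for a, b, out in rows:
    outputs_per_pair.setdefault((a, b), set()).add(out)
jointly_deterministic = all(len(v) == 1 for v in outputs_per_pair.values())

print("corr(x1, y) =", corr_x1_y)
print("corr(x2, y) =", corr_x2_y)
print("y determined by (x1, x2) jointly:", jointly_deterministic)
```

A filter that ranks variables one at a time by their correlation with the target would discard both inputs here, which is why such distributions call for methods that consider variables jointly.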
The algorithms will co-evolve with three important biological applications to maximize the potential impact on human health and to ensure that they are practical and useful: (I) mesothelioma and lung cancer, (II) chronic lung diseases, and (III) DNA-damage-related aging. The applications will be supervised by our national and international biology collaborators.
To ensure rapid uptake of results, we will encapsulate the algorithms in easy-to-use tools targeting non-expert users.