Features Ranking Techniques for Single Nucleotide Polymorphism Data
Abstract
Identifying biomarkers such as single nucleotide polymorphisms (SNPs) is an important topic in biomedical applications. Such SNPs can be associated with an individual’s metabolism of drugs, which makes them targets for drug therapy and useful in personalized medicine. SNPs can also be associated with an individual’s genetic predisposition to develop a disease; identifying these associations allows proactive steps to be taken to hinder, delay, or eliminate the disease. However, the problem is challenging: the data are high dimensional and incomplete, and the features (SNPs) are correlated. The goal of this thesis is to propose feature ranking methods that reduce both the number of selected features and the computational cost of selecting them in a binary classification task.
The underlying hypothesis is that specific values within a feature might be useful in predicting specific classes, while other values are not. Based on this hypothesis, three heuristic methods are applied to select the best features. The methods are applied to the Wellcome Trust Case Control Consortium (WTCCC1) dataset and evaluated on Texas A&M University Qatar’s High Performance Computing platform.
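The value-level idea can be illustrated with a minimal sketch. It is not one of the three heuristics of the thesis, which the abstract does not specify. It assumes genotypes are encoded as 0, 1, or 2 (with None for a missing call) and labels as 0 (control) or 1 (case). The function names, the `min_count` parameter, and the scoring rule (how far the case rate among carriers of a value departs from the overall case rate) are all hypothetical choices made for illustration.

```python
from collections import Counter


def value_level_scores(column, labels, min_count=2):
    """Score each genotype value of one SNP column (hypothetical rule).

    The score of a value is the absolute difference between the case
    rate among samples carrying that value and the overall case rate.
    Missing genotypes (None) are skipped; values seen fewer than
    min_count times are dropped as too sparse to trust.
    """
    overall = sum(labels) / len(labels)  # case rate over all samples
    counts, cases = Counter(), Counter()
    for value, label in zip(column, labels):
        if value is None:  # incomplete data: ignore missing calls
            continue
        counts[value] += 1
        cases[value] += label
    return {
        value: abs(cases[value] / n - overall)
        for value, n in counts.items()
        if n >= min_count
    }


def rank_features(X, labels, min_count=2):
    """Rank feature indices by their single most informative value."""
    scored = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        per_value = value_level_scores(column, labels, min_count)
        best = max(per_value.values(), default=0.0)
        scored.append((best, j))
    # Highest score first; ties are broken by feature index.
    return [j for best, j in sorted(scored, key=lambda t: (-t[0], t[1]))]


# Toy data: 6 samples x 3 SNPs; only the first SNP separates the classes.
X = [
    [2, 0, 1],
    [2, 1, 1],
    [2, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
]
labels = [1, 1, 1, 0, 0, 0]
ranking = rank_features(X, labels)
```

In this toy example, value 2 of the first SNP occurs only in cases, so that SNP ranks first even though its value 1 is too rare to be scored. A feature is judged by its best individual value rather than by an aggregate over all of its values.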
The results show that the classification accuracy achieved by the proposed methods is comparable to the baseline. In addition, one of the proposed methods reduced the execution time of feature selection by 40% and the number of features required to reach accuracy similar to the baseline by 47%.
DOI/handle
http://hdl.handle.net/10576/5800