CMStatistics 2015
B0694
Title: Variable selection in high-dimensional data sets using GPU
Authors: Witold Rudnicki - University of Bialystok (Poland) [presenting]
Szymon Migacz - NVIDIA Inc (United States)
Krzysztof Mnich - University of Bialystok (Poland)
Antoni Rosciszewski - University of Warsaw (Poland)
Andrzej Sulecki - University of Warsaw (Poland)
Pawel Tabaszewski - University of Warsaw (Poland)
Abstract: The aim is to describe new algorithms for knowledge discovery that are made possible by the high computational power of GPUs. We focus on the algorithmic side and provide illustrations from biomedical applications. The algorithms are general and can be applied to any problem described with millions of features. The ideas originate from problems in modern molecular biology, which generates datasets with thousands and even millions of features. Finding the relevant features is crucial for building explanatory models. The huge number of features, combined with a relatively small number of objects and a weak signal hidden behind inevitable noise, makes this a really hard problem. We have developed a feature selection scheme in an information system based on exhaustive search in two or more dimensions. We generate all possible pairs, triplets, etc. of variables and compute the information gain for the decision variable due to each $n$-tuple of variables, in reference to the a priori distribution. The variables belonging to $n$-tuples that are statistically significant are deemed relevant and selected for further analysis. The GPU implementation of the algorithm is capable of analysing billions of $n$-tuples per second, thereby enabling an exhaustive search for synergistic effects in large multi-dimensional datasets.
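The exhaustive search described in the abstract can be illustrated with a minimal CPU sketch in plain Python. It is not the authors' GPU implementation: the function names (`entropy`, `information_gain`, `exhaustive_search`) and the toy data are hypothetical, the variables are assumed to be already discretised, information gain is taken here as $H(Y) - H(Y \mid X_1, \dots, X_n)$, and the statistical significance test mentioned in the abstract is omitted.

```python
from collections import Counter
from itertools import combinations
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(columns, decision):
    """H(Y) - H(Y | X_1..X_n) for a tuple of discrete columns (assumed definition)."""
    n = len(decision)
    groups = {}
    # Group the decision values by the joint value of the n-tuple of variables.
    for row, y in zip(zip(*columns), decision):
        groups.setdefault(row, []).append(y)
    conditional = sum(len(ys) / n * entropy(ys) for ys in groups.values())
    return entropy(decision) - conditional


def exhaustive_search(data, decision, dim=2):
    """Score every dim-tuple of variables; data maps variable name -> column."""
    return {
        names: information_gain([data[v] for v in names], decision)
        for names in combinations(sorted(data), dim)
    }


# Hypothetical toy data: the decision is the XOR of "a" and "b"; "c" is irrelevant.
data = {
    "a": [0, 0, 1, 1, 0, 0, 1, 1],
    "b": [0, 1, 0, 1, 0, 1, 0, 1],
    "c": [0, 0, 0, 0, 1, 1, 1, 1],
}
decision = [x ^ y for x, y in zip(data["a"], data["b"])]

single_scores = exhaustive_search(data, decision, dim=1)
pair_scores = exhaustive_search(data, decision, dim=2)
```

In this toy case no single variable carries information about the decision, while the pair ("a", "b") determines it completely. This is the kind of synergistic effect that a one-dimensional filter would miss. The number of $n$-tuples grows combinatorially with the number of variables, which is why the abstract turns to GPUs for the exhaustive search.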