CMStatistics 2015
B1567
Topic: Contributed on Machine learning, approximation and robustness
Title: Positive unlabelled feature selection using constrained affinity graph embedding
Authors: Yufei Han - Symantec Research Lab (France) [presenting]
Yun Shen - Symantec Research Lab (United Kingdom)
Abstract: In real-world binary classification scenarios, such as network intrusion detection and web page categorisation, labelling samples from the negative class usually incurs prohibitive overhead. Only a small proportion of the positive data can be labelled explicitly by trusted oracles. Selecting relevant and non-redundant features from limited positively labelled data and unlabelled data is therefore highly desirable for accurate classification. To the best of our knowledge, no previous study has addressed this problem. The proposed positive-unlabelled feature selection method learns an $L_1$-norm regularised robust linear regression of a constrained spectral graph embedding of the training data on the feature representation. A set of noisy but informative must-link and cannot-link constraints is extracted using the given positively labelled samples and the affinity graph of the training data. These constraints are used to generate a constrained spectral graph embedding of the training data, injecting partial supervision information into the feature selection procedure. The robust $L_1$-norm regularised regression model originates from correntropy theory. It is designed to simultaneously suppress the impact of noise in the pairwise constraints and identify the most informative features, which correspond to the non-zero regression coefficients. Experiments on two public benchmark data sets and one real-world network intrusion data set validate the method.
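The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how the constraints are extracted or how the regression is solved, so the Gaussian affinity, the must-link and cannot-link heuristics, the half-quadratic reweighting with proximal gradient steps, and all function and parameter names below are assumptions chosen for illustration only.

```python
import numpy as np

def gaussian_affinity(X):
    """Dense Gaussian affinity graph (bandwidth = median squared distance; an assumption)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    W = np.exp(-d2 / np.median(d2[d2 > 0]))
    np.fill_diagonal(W, 0.0)
    return W

def constrained_embedding(W, pos_idx, dim=2, n_far=5):
    """Spectral embedding of an affinity graph modified by pairwise constraints."""
    W = W.copy()
    # Must-link (assumed rule): all pairs of labelled positives get maximal affinity.
    W[np.ix_(pos_idx, pos_idx)] = 1.0
    np.fill_diagonal(W, 0.0)
    # Cannot-link (assumed, noisy rule): cut edges between the positives and
    # the n_far unlabelled samples that are least similar to them.
    score = W[pos_idx].sum(axis=0)
    score[pos_idx] = np.inf
    far = np.argsort(score)[:n_far]
    W[np.ix_(pos_idx, far)] = 0.0
    W[np.ix_(far, pos_idx)] = 0.0
    # Normalised graph Laplacian; its low eigenvectors give the embedding.
    d = W.sum(axis=1) + 1e-12
    L = np.eye(len(W)) - W / np.sqrt(np.outer(d, d))
    _, vecs = np.linalg.eigh(L)
    return vecs[:, 1:dim + 1]  # skip the first (trivial) eigenvector

def soft_threshold(A, t):
    """Proximal operator of the L1 norm."""
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

def correntropy_lasso(X, Y, lam=0.05, sigma=1.0, outer=20, inner=50):
    """L1-regularised regression with a correntropy-style loss.

    Half-quadratic scheme: samples with large residuals receive small
    weights p_i, then a weighted lasso is solved by proximal gradient steps.
    """
    B = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(outer):
        r2 = np.sum((Y - X @ B) ** 2, axis=1)
        p = np.exp(-r2 / (2.0 * sigma ** 2))  # per-sample robustness weights
        Xp = X * p[:, None]
        step = 1.0 / (np.linalg.norm(X.T @ Xp, 2) + 1e-12)  # 1 / Lipschitz constant
        for _ in range(inner):
            grad = Xp.T @ (X @ B - Y)
            B = soft_threshold(B - step * grad, step * lam)
    return B

# Synthetic demonstration (hypothetical data, not from the experiments).
rng = np.random.default_rng(0)
n = 60
X = rng.normal(size=(n, 10))
X[: n // 2, :2] += 3.0   # the two classes differ only in features 0 and 1
X[n // 2 :, :2] -= 3.0
X = (X - X.mean(axis=0)) / X.std(axis=0)
pos_idx = np.arange(5)   # five labelled positives; all other samples are unlabelled

Y = constrained_embedding(gaussian_affinity(X), pos_idx, dim=2)
B = correntropy_lasso(X, Y, lam=0.05)
scores = np.linalg.norm(B, axis=1)  # per-feature relevance; zero rows are discarded features
```

In this sketch a feature is selected when its row of `B` is non-zero, mirroring the abstract's statement that informative features correspond to non-zero regression coefficients. The regularisation weight `lam` and the kernel width `sigma` are free parameters here and would need tuning on real data.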