A0277

Title: Training a classifier via semi-supervised learning
Authors: Geoffrey McLachlan - University of Queensland (Australia) [presenting]
Abstract: Semi-supervised learning (SSL) approaches have attracted increasing attention in machine learning for forming a classifier when the training data consist of a limited number of classified observations and a much larger number of unclassified observations, whose class-of-origin labels are unknown. A surprising result of a prior study is considered further: a classifier formed from a partially classified sample can actually have a smaller expected error rate than if the sample were completely classified. This rather paradoxical outcome can be achieved by introducing a framework with a missingness mechanism for the missing labels of the unclassified observations. Within this framework, the conditional probability q(y) that an observation with feature vector y has a missing label is modelled by a logistic model with a covariate equal to an entropy-based measure e(y). An extension of this model for q(y) to the two-component mixture model c + (1-c) q(y) is considered, where c is the probability that an observation's label is missing completely at random (MCAR). The asymptotic relative efficiency of the estimated Bayes' classifier is derived, and results are presented to show how this relative efficiency falls away as c increases. The focus is on the case of two classes, in each of which y has a multivariate normal distribution.
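The missingness mechanism described in the abstract can be illustrated with a minimal sketch. It makes several assumptions that the abstract does not state: the two normal classes share a common covariance matrix, e(y) is the Shannon entropy of the posterior class probabilities, and the logistic model is linear in e(y) with hypothetical coefficients xi0 and xi1. The function names and all numerical values are illustrative only and are not taken from the study.

```python
import numpy as np


def posterior_probs(y, pi1, mu1, mu2, sigma):
    """Posterior class probabilities for two normal classes.

    Assumption: both classes share the covariance matrix `sigma`,
    so the log posterior odds are linear in y.
    """
    y = np.atleast_2d(y)
    beta = np.linalg.solve(sigma, mu1 - mu2)
    beta0 = np.log(pi1 / (1.0 - pi1)) - 0.5 * (mu1 + mu2) @ beta
    log_odds = beta0 + y @ beta
    tau1 = 1.0 / (1.0 + np.exp(-log_odds))
    return np.column_stack([tau1, 1.0 - tau1])


def entropy(tau):
    """Shannon entropy e(y) of each row of posterior probabilities."""
    tau = np.clip(tau, 1e-300, 1.0)  # avoid log(0)
    return -np.sum(tau * np.log(tau), axis=1)


def missing_prob(e, c, xi0, xi1):
    """Probability of a missing label under the mixture c + (1 - c) q(y).

    Assumption: q(y) is logistic in e(y) with hypothetical
    intercept xi0 and slope xi1.
    """
    q = 1.0 / (1.0 + np.exp(-(xi0 + xi1 * e)))
    return c + (1.0 - c) * q


if __name__ == "__main__":
    # Illustrative parameters, not values from the study.
    mu1, mu2 = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
    sigma = np.eye(2)
    rng = np.random.default_rng(0)
    labels = rng.integers(0, 2, size=500)
    y = np.where(labels[:, None] == 0, mu1, mu2) + rng.standard_normal((500, 2))

    e = entropy(posterior_probs(y, 0.5, mu1, mu2, sigma))
    p_missing = missing_prob(e, c=0.2, xi0=-2.0, xi1=4.0)
    is_missing = rng.random(500) < p_missing  # True where the label is dropped
    print("fraction of labels missing:", is_missing.mean())
```

With a positive slope xi1, observations near the decision boundary, where the entropy is highest, are the ones most likely to lose their labels. Setting c = 0 recovers the purely entropy-driven mechanism q(y), while c = 1 makes every label missing regardless of y.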