A0507
Title: The two-sample problem in high dimension: A ranking-based method
Authors: Myrto Limnios - University of Copenhagen (Denmark) [presenting]
Stephan Clemencon - Telecom ParisTech (France)
Nicolas Vayatis - ENS Paris-Saclay (France)
Abstract: A general framework is proposed for testing the equality of two unknown probability distributions based on two independent i.i.d. random samples taking values in the same measurable multivariate space. While there is a long-standing literature for the univariate setting, the problem remains a subject of research in the multivariate and nonparametric settings. Indeed, the increasing ability to collect large datasets of various structures, possibly biased by the collection process, has strongly challenged classical modeling approaches, particularly in applied fields such as biomedicine (clinical trials, genomics) and marketing (A/B testing). The method generalizes a particular class of permutation statistics, known as two-sample linear rank statistics, to multivariate spaces. A real-valued scoring function maps the observations to univariate images, and comparing these images induces an order relation on the observations. The testing procedure has two steps. 1) Maximization of the rank statistic: on the first half of each sample, we optimize a tailored version of the two-sample rank statistic over the class of scoring functions using ranking-based algorithms. 2) Two-sample homogeneity test: we perform the univariate rank test at a fixed risk level on the remaining observations, scored with the optimal scoring function from step 1. Nonasymptotic theoretical guarantees are derived, and numerical experiments modeling complex data structures compare and critically assess both existing tests and the proposed one.
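
The two-step procedure described in the abstract can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the class of linear scoring functions, the random search standing in for the ranking-based algorithms, the Mann-Whitney-Wilcoxon statistic as the rank statistic, the normal approximation for the p-value, and all function names are hypothetical choices made for the example.

```python
import math
import random


def mann_whitney_u(scores_x, scores_y):
    """Count pairs (x, y) with score(x) > score(y); ties count 1/2."""
    u = 0.0
    for a in scores_x:
        for b in scores_y:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


def linear_score(w):
    """Return the scoring function s(x) = <w, x> (assumed scoring class)."""
    return lambda x: sum(wi * xi for wi, xi in zip(w, x))


def fit_scoring_function(x_train, y_train, n_candidates=200, seed=0):
    """Step 1 (stand-in): random search over linear scorers maximizing U."""
    rng = random.Random(seed)
    dim = len(x_train[0])
    best_w, best_u = None, -1.0
    for _ in range(n_candidates):
        w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        s = linear_score(w)
        u = mann_whitney_u([s(x) for x in x_train], [s(y) for y in y_train])
        if u > best_u:
            best_w, best_u = w, u
    return linear_score(best_w)


def rank_test_p_value(scores_x, scores_y):
    """One-sided p-value for U via normal approximation (no tie correction)."""
    n, m = len(scores_x), len(scores_y)
    u = mann_whitney_u(scores_x, scores_y)
    mean = n * m / 2.0
    sd = math.sqrt(n * m * (n + m + 1) / 12.0)
    z = (u - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def two_stage_test(xs, ys, alpha=0.05, seed=0):
    """Learn the scorer on the first halves, test on the remaining halves."""
    hx, hy = len(xs) // 2, len(ys) // 2
    s = fit_scoring_function(xs[:hx], ys[:hy], seed=seed)
    # Step 2: univariate rank test on held-out observations scored by s.
    p = rank_test_p_value([s(x) for x in xs[hx:]], [s(y) for y in ys[hy:]])
    return p, p < alpha
```

Calling `two_stage_test(xs, ys)` on two lists of equal-length numeric vectors returns the p-value and the decision at level `alpha`. The test is one-sided because step 1 selects the scorer that ranks the first sample above the second. Because the scorer is learned on observations that are not reused in step 2, the rank test on the held-out halves keeps its usual null distribution.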