COMPSTAT 2022: Start Registration
View Submission - COMPSTAT2022
A0568
Title: On edge significance in large phylogenetic (spanning) trees Authors:  Alexandre Francisco - INESC-ID and IST, Universidade de Lisboa (Portugal) [presenting]
Pedro Monteiro - INESC-ID and IST Universidade de Lisboa (Portugal)
Guilherme Ribeiro - INESC-ID and IST Universidade de Lisboa (Portugal)
Andreia Sofia Teixeira - LASIGE FCUL and INESC-ID Universidade de Lisboa (Portugal)
Abstract: Tree-like structures are common in phylogenetic studies and, in intra-species studies with thousands of strains, minimum spanning trees (MST) and their generalizations are often computed. The underlying model assumes that edge weights (or just a total order) represent evolutionary distance. Given that MST are not unique in general, and some edges can be equally selected, it is important to compute their probability of being in an MST, i.e., their spanning edge betweenness (SEB). SEB can be computed exactly in $O(m n^{1.5})$ time (for $n$ strains/vertices and $m$ edges) through an extension of the well-known Kirchhoff's matrix tree theorem. The running time is however prohibitive for large datasets. But it turns out that SEB is equal to the effective resistance in electrical resistive networks and, for unweighted graphs, it can be provably approximated in $O~(m \log^2(n) \log(1/\varepsilon))$, by relying on provably fast linear system solvers for symmetric diagonally dominant (SDD) matrices. This result can be extended to weighted graphs when weights are probabilities and the MST weight is defined as the weight product, but its extension to the general case is not trivial. We propose then to address this problem by edge layering and combining partial results (as we did in the exact case), but controlling the approximation error through sampling and using the jackknife estimate of derived values.