View Submission - COMPSTAT2024
A0456
Title: Approximate learning of parsimonious Bayesian context trees
Authors: Daniyar Ghani - Imperial College London (United Kingdom) [presenting]
Nick Heard - Imperial College London (United Kingdom)
Francesco Sanna Passino - Imperial College London (United Kingdom)
Abstract: Models for categorical sequences typically assume exchangeable or first-order dependent sequence elements. These are common assumptions, for example, in models of computer malware traces and protein sequences. Although such simplifying assumptions lead to computational tractability, the models often fail to capture long-range, complex dependence structures that may be harnessed for greater predictive power. To this end, a Bayesian modelling framework is proposed to capture rich dependence structures in categorical sequences, with memory efficiency suitable for real-time processing of data streams. Parsimonious Bayesian context trees are introduced as a form of variable-order Markov model with conjugate prior distributions. The novel framework requires fewer parameters than fixed-order Markov models by dropping redundant dependencies and clustering sequential contexts. An approximate inference algorithm is developed using model-based agglomerative clustering, and results are demonstrated on synthetic and real-world data examples. The proposed model outperforms existing sequence models when fitted to real protein sequences and honeypot command-line sessions.
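The abstract does not specify the model or the inference algorithm in detail, so the following is only an illustrative sketch of the general idea of clustering sequential contexts under a conjugate prior. It is not the authors' method. It assumes fixed-length contexts, a symmetric Dirichlet prior with a hypothetical concentration parameter `alpha`, and a greedy rule that merges the pair of context clusters giving the largest increase in Dirichlet-multinomial log marginal likelihood. The function names `context_counts`, `log_marginal` and `agglomerate` are invented for this example.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import lgamma


def context_counts(seq, order):
    """Count next-symbol occurrences following each length-`order` context."""
    counts = defaultdict(Counter)
    for i in range(order, len(seq)):
        counts[tuple(seq[i - order:i])][seq[i]] += 1
    return counts


def log_marginal(counts, alphabet, alpha=1.0):
    """Log marginal likelihood of the observed next symbols in one cluster,
    with the categorical probabilities integrated out under a symmetric
    Dirichlet(alpha) prior (assumed here, not taken from the abstract)."""
    n = sum(counts.values())
    k = len(alphabet)
    out = lgamma(k * alpha) - lgamma(k * alpha + n)
    for symbol in alphabet:
        out += lgamma(alpha + counts[symbol]) - lgamma(alpha)
    return out


def agglomerate(counts, alphabet, alpha=1.0):
    """Greedily merge context clusters while a merge increases the
    total log marginal likelihood. Returns a list of
    (set_of_contexts, pooled_counts) pairs."""
    clusters = [({ctx}, Counter(c)) for ctx, c in counts.items()]
    while len(clusters) > 1:
        best_gain, best_pair = 0.0, None
        for i, j in combinations(range(len(clusters)), 2):
            pooled = clusters[i][1] + clusters[j][1]
            gain = (log_marginal(pooled, alphabet, alpha)
                    - log_marginal(clusters[i][1], alphabet, alpha)
                    - log_marginal(clusters[j][1], alphabet, alpha))
            if gain > best_gain:
                best_gain, best_pair = gain, (i, j)
        if best_pair is None:  # no merge improves the score
            break
        i, j = best_pair
        merged = (clusters[i][0] | clusters[j][0],
                  clusters[i][1] + clusters[j][1])
        clusters = [c for idx, c in enumerate(clusters)
                    if idx not in (i, j)] + [merged]
    return clusters


if __name__ == "__main__":
    # Toy sequence over a three-letter alphabet (made up for illustration).
    seq = "abcabcabcabdabcabcabd"
    counts = context_counts(seq, order=2)
    for contexts, pooled in agglomerate(counts, alphabet="abcd"):
        print(sorted(contexts), dict(pooled))
```

Contexts that end up in the same cluster share one next-symbol distribution, which is how a scheme of this kind can need fewer parameters than a fixed-order Markov model with one distribution per context. The pairwise search is quadratic in the number of clusters at each step and is kept deliberately naive here.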