A0763
Title: Streaming prediction with hash function based methods
Authors: Aleena Chanda - University of Nebraska - Lincoln (United States) [presenting]
Abstract: The traditional empirical distribution function (EDF) becomes computationally and memory-intensive for large data streams, limiting its utility in real-time applications. A novel method built on the count-min sketch algorithm is proposed for estimating a distribution function from streaming data. The resulting estimated empirical distribution function (EEDF) overcomes these limitations by using probabilistic hash functions to approximate frequency distributions effectively with limited memory. By dynamically adjusting histogram interval lengths, the method provides fine-grained approximations of the empirical distribution without storing all data points. The algorithm operates in a single pass and maintains computational efficiency, making it well-suited for streaming settings. While the count-min sketch exhibits a slight upward bias, particularly for low-frequency elements, the effect is minor relative to its scalability advantages. In predictive contexts, the median of the EEDF typically performed better than most other predictors tested and roughly tied with Gaussian process prior (GPP) predictors that included a bias term. It is concluded that Bayesian and Bayes-like methods are typically among the most effective approaches for prediction in M-open settings.
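Illustration: the idea of estimating a distribution function from count-min sketch counts can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the author's implementation: the class names `CountMinSketch` and `SketchedEDF`, the fixed `bin_width` (the abstract describes dynamically adjusted interval lengths, which are not reproduced here), the table dimensions, and the affine hash family are all hypothetical choices made for illustration.

```python
import math
import random


class CountMinSketch:
    """Count-min sketch over integer keys (assumed hash family: (a*k + b) mod p mod width)."""

    def __init__(self, width=512, depth=4, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.prime = (1 << 61) - 1  # a Mersenne prime larger than any bin index used here
        self.params = [(rng.randrange(1, self.prime), rng.randrange(self.prime))
                       for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, key):
        a, b = self.params[row]
        return ((a * key + b) % self.prime) % self.width

    def add(self, key):
        for row in range(len(self.table)):
            self.table[row][self._index(row, key)] += 1

    def estimate(self, key):
        # Taking the minimum over rows never underestimates the true count,
        # which is the source of the upward bias mentioned in the abstract.
        return min(self.table[row][self._index(row, key)]
                   for row in range(len(self.table)))


class SketchedEDF:
    """Single-pass distribution-function estimate built from sketched histogram counts."""

    def __init__(self, bin_width=1.0, **sketch_kwargs):
        self.bin_width = bin_width  # assumption: fixed interval length
        self.sketch = CountMinSketch(**sketch_kwargs)
        self.n = 0
        self.lo = None  # smallest bin index seen so far
        self.hi = None  # largest bin index seen so far

    def update(self, x):
        k = math.floor(x / self.bin_width)
        self.sketch.add(k)
        self.n += 1
        self.lo = k if self.lo is None else min(self.lo, k)
        self.hi = k if self.hi is None else max(self.hi, k)

    def cdf(self, x):
        if self.n == 0:
            return 0.0
        k = min(math.floor(x / self.bin_width), self.hi)
        total = sum(self.sketch.estimate(j) for j in range(self.lo, k + 1))
        return min(1.0, total / self.n)  # clip, since sketched counts can overshoot

    def median(self):
        if self.n == 0:
            raise ValueError("no observations yet")
        running = 0
        for j in range(self.lo, self.hi + 1):
            running += self.sketch.estimate(j)
            if running >= self.n / 2:
                return (j + 0.5) * self.bin_width  # midpoint of the crossing bin
        return (self.hi + 0.5) * self.bin_width


edf = SketchedEDF(bin_width=1.0)
for _ in range(10):
    for value in range(100):
        edf.update(float(value))
print(edf.cdf(25.0), edf.median())
```

Memory use is governed by the sketch dimensions (`width` times `depth` counters) plus three scalars, rather than by the number of observations. Because the sketched counts can only overshoot the true counts, the cumulative sums are clipped at 1 and the reported median can only sit at or below the bin that the exact histogram would select.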