CMStatistics 2015
B1263
Title: Using sampling methods to estimate rare stats on Twitter's graph
Authors: Antoine Rebecq - Universite Paris X (France) [presenting]
Abstract: Many computer science or social science studies of Twitter analyze tweets alone, without knowing the number or characteristics of the accounts that wrote them. In many cases, gaining access to the whole Twitter graph is very costly. The alternatives are the "REST API", which allows only a few queries per hour, and the "Streaming API", which outputs only a fraction (1 percent) of the tweets published in real time that match a given search query. Many researchers choose the latter, mostly because it outputs far more tweets than the former. However, the Streaming API has an additional drawback: Twitter does not disclose the sampling method used to select the 1 percent of tweets it outputs, which means classic unbiased estimators cannot be used. We propose to use the "REST API" along with an adaptive sampling method that focuses on the estimation of rare quantities. This approach is well suited to estimating statistics about the accounts that produced a set of tweets matching a given query, because the Twitter graph is so large that most search queries are matched by only a very small fraction of vertices. We use our method to assess the number of accounts behind the 1 million tweets posted in less than 3 hours about the Pluto flyby of July 14th, 2015.
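
To illustrate why an undisclosed sampling design rules out classic unbiased estimators, the sketch below uses the Horvitz-Thompson estimator of a total, which weights each sampled value by the inverse of its inclusion probability. This is an assumption about what "classic unbiased estimators" refers to, not the adaptive method proposed in the talk. The toy population, the sample size, and the function name `horvitz_thompson_total` are all hypothetical.

```python
from itertools import combinations


def horvitz_thompson_total(values, inclusion_probs):
    """Estimate a population total from sampled values and their
    known inclusion probabilities (sum of y_i / pi_i)."""
    return sum(y / p for y, p in zip(values, inclusion_probs))


# Hypothetical toy population: 1 marks an account that tweeted on the topic.
# Such accounts are rare, as in the setting described in the abstract.
population = [0, 0, 1, 0, 0, 0, 1, 0]
N = len(population)
n = 3  # sample size (illustrative)

# Under simple random sampling without replacement, every unit has the
# same, known inclusion probability n / N.
pi = n / N

# Enumerate every possible sample to check unbiasedness exactly.
estimates = [
    horvitz_thompson_total([population[i] for i in sample], [pi] * n)
    for sample in combinations(range(N), n)
]
mean_estimate = sum(estimates) / len(estimates)
true_total = sum(population)

print(f"true total: {true_total}, mean of estimates: {mean_estimate:.6f}")
```

The average of the estimates over all possible samples matches the true total only because `pi` is known. When the inclusion probabilities are not disclosed, as with the Streaming API's 1 percent sample, the weights `1 / pi` cannot be computed and this guarantee is lost.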