View Submission - HiTECCoDES2024
A0216
Title: Matching with text data: An experimental evaluation of methods for matching documents and of measuring match quality
Authors: Reagan Mozer - Bentley University (United States) [presenting]
Luke Miratrix - Harvard University (United States)
Abstract: Matching for causal inference is a well-studied problem, but standard methods fail when the units to match are text documents: the high-dimensional and rich nature of the data renders exact matching infeasible, causes propensity scores to produce incomparable matches, and makes assessing match quality difficult. A framework for matching text documents is characterized that decomposes existing methods into (1) the choice of text representation and (2) the choice of distance metric. A systematic multifactor evaluation experiment using human subjects investigates how different choices within this framework affect both the quantity and quality of the matches identified. Altogether, over 100 unique text-matching methods are evaluated, along with five comparison methods taken from the literature. The experimental results identify methods that generate matches with higher subjective match quality than current state-of-the-art techniques. The precision of these results is enhanced by developing a predictive model to estimate the match quality of pairs of text documents as a function of the various distance scores. The model is found to successfully mimic human judgment and also allows for approximate and unsupervised evaluation of new procedures in this context. The identified best methods are then employed to illustrate the utility of text matching in two applications.
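To make the two-part decomposition concrete, the following minimal sketch pairs one possible text representation (bag-of-words term counts) with one possible distance metric (cosine distance) and then matches documents greedily. It is an illustration only, not the authors' procedure: the function names, the whitespace tokenizer, the greedy one-to-one matching rule, and the `caliper` threshold of 0.5 are all assumptions introduced here.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Representation choice (assumed): lowercase whitespace tokens as term counts."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """Distance choice (assumed): 1 minus cosine similarity of two count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty document as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def match_documents(treated, control, caliper=0.5):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    Returns (treated_index, control_index, distance) triples. A treated
    document whose nearest available control lies beyond the hypothetical
    caliper is left unmatched.
    """
    control_vecs = [bag_of_words(doc) for doc in control]
    available = set(range(len(control)))
    matches = []
    for i, doc in enumerate(treated):
        if not available:
            break
        vec = bag_of_words(doc)
        # Ties on distance are broken by the lower control index.
        j = min(available, key=lambda k: (cosine_distance(vec, control_vecs[k]), k))
        dist = cosine_distance(vec, control_vecs[j])
        if dist <= caliper:
            matches.append((i, j, dist))
            available.remove(j)
    return matches
```

Swapping `bag_of_words` for another representation, or `cosine_distance` for another metric, yields a different method within the same framework, which is the kind of variation the evaluation experiment compares. A tighter caliper trades match quantity for match quality.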