E-VIEW-Alation – a Large-Scale Evaluation Study of Association Measures for Collocation Identification

Evert S, Uhrig P, Bartsch S, Proisl T (2017)


Publication Language: English

Publication Type: Conference contribution, Original article

Publication year: 2017

Publisher: Lexical Computing

City/Town: Brno

Page Range: 531–549

Conference Proceedings Title: Electronic Lexicography in the 21st Century. Proceedings of the eLex 2017 Conference

Event location: Leiden NL

URI: https://elex.link/elex2017/wp-content/uploads/2017/09/paper32.pdf

Open Access Link: https://elex.link/elex2017/wp-content/uploads/2017/09/paper32.pdf

Abstract

Statistical association measures (AM) play an important role in the automatic extraction of collocations and multiword expressions from corpora, but many parameters governing their performance are still poorly understood. Systematic evaluation studies have produced conflicting recommendations for an optimal AM, and little attention has been paid to other parameters such as the underlying corpus, the size of the co-occurrence context, or the application of a frequency threshold.

Our paper presents the results of a large-scale evaluation study covering 13 corpora, eight context sizes, four frequency thresholds, and 20 AMs against two different gold standards of lexical collocations. While the optimal choice of an AM depends strongly on the particular gold standard used, other parameters prove much more robust: (i) small co-occurrence contexts are better than larger spans, and the best results are usually obtained from syntactic dependencies; (ii) corpus quality is more important than sheer size, but large Web corpora prove to be a valid substitute for the British National Corpus; (iii) frequency thresholds seem to be unnecessary in most situations, as the statistical AMs successfully weed out rare and unreliable candidates; (iv) there is little interaction between the choice of AM and the other parameters.

In order to provide complete evidence for our observations to readers, we created an interactive Web-based application that allows users to manipulate all evaluation parameters and dynamically updates evaluation graphs and summaries.

How to cite

APA:

Evert, S., Uhrig, P., Bartsch, S., & Proisl, T. (2017). E-VIEW-Alation – a Large-Scale Evaluation Study of Association Measures for Collocation Identification. In Iztok K, Carole T, Miloš J, Jelena K, Simon K, and Vít B (Eds.), Electronic Lexicography in the 21st Century. Proceedings of the eLex 2017 Conference (pp. 531–549). Brno: Lexical Computing.

MLA:

Evert, Stephanie, et al. "E-VIEW-Alation – a Large-Scale Evaluation Study of Association Measures for Collocation Identification." Electronic Lexicography in the 21st Century. Proceedings of the eLex 2017 Conference, Leiden. Ed. Iztok K, Carole T, Miloš J, Jelena K, Simon K, and Vít B. Brno: Lexical Computing, 2017. 531–549.
