E-VIEW-Alation – a Large-Scale Evaluation Study of Association Measures for Collocation Identification

Conference contribution
(Original article)


Publication Details

Author(s): Evert S, Uhrig P, Bartsch S, Proisl T
Editor(s): Kosem I, Tiberius C, Jakubíček M, Kallas J, Krek S, Baisa V
Publisher: Lexical Computing
Publishing place: Brno
Publication year: 2017
Conference Proceedings Title: Electronic Lexicography in the 21st Century. Proceedings of the eLex 2017 Conference
Page range: 531–549
ISSN: 2533-5626
Language: English


Abstract


Statistical association measures (AM) play an important role in the automatic extraction of collocations and multiword expressions from corpora, but many parameters governing their performance are still poorly understood. Systematic evaluation studies have produced conflicting recommendations for an optimal AM, and little attention has been paid to other parameters such as the underlying corpus, the size of the co-occurrence context, or the application of a frequency threshold.



Our paper presents the results of a large-scale evaluation study covering 13 corpora, eight context sizes, four frequency thresholds, and 20 AMs against two different gold standards of lexical collocations. While the optimal choice of an AM depends strongly on the particular gold standard used, other parameters prove much more robust: (i) small co-occurrence contexts are better than larger spans, and the best results are usually obtained from syntactic dependencies; (ii) corpus quality is more important than sheer size, but large Web corpora prove to be a valid substitute for the British National Corpus; (iii) frequency thresholds seem to be unnecessary in most situations, as the statistical AMs successfully weed out rare and unreliable candidates; (iv) there is little interaction between the choice of AM and the other parameters.



In order to provide complete evidence for our observations to readers, we created an interactive Web-based application that allows users to manipulate all evaluation parameters and dynamically updates evaluation graphs and summaries.
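
The evaluation setup described in the abstract (score candidate word pairs with an association measure, optionally apply a frequency threshold, then compare the ranking against a gold standard) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the counts, the word pairs, the gold standard, and the function names are hypothetical, and pointwise mutual information is used only because it is compact, not because the paper recommends it.

```python
import math

def pmi(o11, f1, f2, n):
    """Pointwise mutual information: log2 of observed over expected co-occurrence."""
    expected = f1 * f2 / n
    return math.log2(o11 / expected)

def rank_candidates(pairs, n, measure=pmi, min_freq=1):
    """Score each candidate pair and sort from strongest to weakest association.

    pairs maps (w1, w2) -> (co-occurrence freq, freq of w1, freq of w2).
    Candidates with a co-occurrence frequency below min_freq are dropped.
    """
    scored = [
        (measure(o11, f1, f2, n), pair)
        for pair, (o11, f1, f2) in pairs.items()
        if o11 >= min_freq
    ]
    return [pair for _, pair in sorted(scored, reverse=True)]

def precision_at(ranking, gold, k):
    """Share of the top-k ranked candidates that appear in the gold standard."""
    top = ranking[:k]
    if not top:
        return 0.0
    return sum(1 for pair in top if pair in gold) / len(top)

# Hypothetical toy data, not taken from the study.
N = 100_000  # assumed sample size (total number of co-occurrence tokens)
PAIRS = {
    ("strong", "tea"): (30, 500, 200),
    ("heavy", "rain"): (40, 400, 300),
    ("the", "house"): (50, 6000, 800),
    ("big", "tea"): (2, 900, 200),
}
GOLD = {("strong", "tea"), ("heavy", "rain")}

ranking = rank_candidates(PAIRS, N)
print(ranking)
print(precision_at(ranking, GOLD, 2))
```

Passing a different `measure` function or a higher `min_freq` to `rank_candidates` varies the same parameters the study varies, on a toy scale.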



FAU Authors / FAU Editors

Evert, Stefan Prof. Dr.
Lehrstuhl für Korpus- und Computerlinguistik
Proisl, Thomas
Lehrstuhl für Korpus- und Computerlinguistik
Uhrig, Peter Dr.
Lehrstuhl für Anglistik, insbesondere Linguistik


How to cite

APA:
Evert, S., Uhrig, P., Bartsch, S., & Proisl, T. (2017). E-VIEW-Alation – a Large-Scale Evaluation Study of Association Measures for Collocation Identification. In I. Kosem, C. Tiberius, M. Jakubíček, J. Kallas, S. Krek, & V. Baisa (Eds.), Electronic Lexicography in the 21st Century. Proceedings of the eLex 2017 Conference (pp. 531–549). Brno: Lexical Computing.

MLA:
Evert, Stefan, et al. "E-VIEW-Alation – a Large-Scale Evaluation Study of Association Measures for Collocation Identification." Electronic Lexicography in the 21st Century. Proceedings of the eLex 2017 Conference, Leiden. Ed. Iztok Kosem, Carole Tiberius, Miloš Jakubíček, Jelena Kallas, Simon Krek, and Vít Baisa. Brno: Lexical Computing, 2017. 531–549.

Last updated on 2018-11-08 at 00:08