Collocation Candidate Extraction from Dependency-Annotated Corpora: Exploring Differences across Parsers and Dependency Annotation Schemes

Uhrig P, Evert S, Proisl T (2018)

Publication Language: English

Publication Type: Book chapter / Article in edited volumes

Publication year: 2018

Publisher: Springer International Publishing

Edited Volumes: Lexical Collocation Analysis: Advances and Applications

City/Town: Cham

Page Range: 111–140

ISBN: 978-3-319-92582-0

DOI: 10.1007/978-3-319-92582-0_6

Abstract

Collocation candidate extraction from dependency-annotated corpora has become more and more mainstream in collocation research over the past years. In most studies, however, the results of one parser are compared to those of relatively “dumb” window-based approaches only. To date, the impact of the parser used and its parsing scheme has not been studied systematically to the best of our knowledge. This chapter evaluates a total of 8 parsers on 2 corpora with 20 different association measures plus several frequency thresholds for 6 different types of collocations against the Oxford Collocations Dictionary for Students of English (2nd edition; 2009). We find that the parser and parsing scheme both play a role in the quality of the collocation candidate extraction. The performance of different parsers can differ substantially across different collocation types. The filters used to extract different types of collocations from the corpora also play an important role in the trade-off between precision and recall we can observe. Furthermore, we find that carefully sampled and balanced corpora (such as the BNC) seem to have considerable advantages in precision, but of course for total coverage, larger, less balanced corpora (such as the web corpus used in this study) take the lead. Overall, log-likelihood is the best association measure, but for some specific types of collocation (such as adjective-noun or verb-adverb), other measures perform even better.
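
The evaluation described in the abstract can be sketched as follows: score each candidate pair with an association measure, rank the candidates, and compare the top of the ranking against a gold standard such as a collocations dictionary. This is a minimal illustration, not the chapter's implementation. The function names, the parameterisation by co-occurrence and marginal frequencies, and the toy word pairs are all assumptions made for this example. The measure shown is the standard two-sided log-likelihood ratio (G²) for a 2×2 contingency table.

```python
import math


def log_likelihood(o11, f1, f2, n):
    """Log-likelihood ratio (G2) for one candidate pair.

    o11: co-occurrence frequency of the pair
    f1, f2: marginal frequencies of the two words in the relation
    n: total number of pair tokens in the sample
    """
    # Observed cells of the 2x2 contingency table.
    observed = [o11, f1 - o11, f2 - o11, n - f1 - f2 + o11]
    # Expected cells under the independence assumption.
    expected = [
        f1 * f2 / n,
        f1 * (n - f2) / n,
        (n - f1) * f2 / n,
        (n - f1) * (n - f2) / n,
    ]
    # Cells with an observed count of zero contribute nothing to the sum.
    return 2 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected) if o > 0
    )


def precision_recall_at(ranked, gold, k):
    """Precision and recall of the top-k candidates against a gold set.

    Assumes k >= 1 and a non-empty gold set.
    """
    top = ranked[:k]
    hits = sum(1 for pair in top if pair in gold)
    return hits / len(top), hits / len(gold)


# Hypothetical toy data: (co-occurrence, marginal 1, marginal 2) per pair.
counts = {
    ("strong", "tea"): (50, 100, 100),
    ("big", "tea"): (10, 100, 100),
    ("heavy", "rain"): (40, 80, 120),
}
sample_size = 1000

scores = {
    pair: log_likelihood(o11, f1, f2, sample_size)
    for pair, (o11, f1, f2) in counts.items()
}
ranked = sorted(scores, key=scores.get, reverse=True)

# Hypothetical gold standard standing in for dictionary entries.
gold = {("strong", "tea"), ("heavy", "rain"), ("bitter", "cold")}
precision, recall = precision_recall_at(ranked, gold, k=2)
```

Raising a frequency threshold or shortening the ranked list generally trades recall for precision, which is the trade-off the abstract refers to.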

Authors with CRIS profile

Peter Uhrig (Lehrstuhl für Anglistik, insbesondere Linguistik)

Stephanie Evert (Lehrstuhl für Korpus- und Computerlinguistik)

Thomas Proisl (Lehrstuhl für Korpus- und Computerlinguistik)

How to cite

APA:

Uhrig, P., Evert, S., & Proisl, T. (2018). Collocation Candidate Extraction from Dependency-Annotated Corpora: Exploring Differences across Parsers and Dependency Annotation Schemes. In P. Cantos-Gómez & M. Almela-Sánchez (Eds.), Lexical Collocation Analysis: Advances and Applications (pp. 111–140). Cham: Springer International Publishing.

MLA:

Uhrig, Peter, Stephanie Evert, and Thomas Proisl. "Collocation Candidate Extraction from Dependency-Annotated Corpora: Exploring Differences across Parsers and Dependency Annotation Schemes." Lexical Collocation Analysis: Advances and Applications. Ed. P. Cantos-Gómez and M. Almela-Sánchez. Cham: Springer International Publishing, 2018. 111–140.
