EmpiriST 2015: A Shared Task on the Automatic Linguistic Annotation of Computer-Mediated Communication and Web Corpora

Evert S, Beißwenger M, Bartsch S, Würzner KM (2016)

Publication Language: English

Publication Type: Conference contribution

Publication year: 2016

City/Town: Berlin, Germany

Pages Range: 44-56

Conference Proceedings Title: Proceedings of the 10th Web as Corpus Workshop (WAC-X) and the EmpiriST Shared Task

Event location: Berlin

URI: https://sites.google.com/site/empirist2015/

Open Access Link: https://aclweb.org/anthology/W/W16/W16-2606.pdf

Abstract

This paper describes the goals, design and results of a shared task on the automatic linguistic annotation of German language data from genres of computer-mediated communication (CMC), social media interactions and Web corpora. The two subtasks of tokenization and part-of-speech tagging were performed on two data sets: (i) a genuine CMC data set with samples from several CMC genres, and (ii) a Web corpora data set of CC-licensed Web pages which represents the type of data found in large corpora crawled from the Web. The teams participating in the shared task achieved a substantial improvement over current off-the-shelf tools for German. The best tokenizer reached an F1-score of 99.57% (vs. 98.95% off-the-shelf baseline), while the best tagger reached an accuracy of 90.44% (vs. 84.86% baseline). The gold standard (more than 20,000 tokens of training and test data) is freely available online together with detailed annotation guidelines.
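The size of the reported improvement can be made concrete as a relative error reduction. The following is a minimal sketch using only the four figures quoted in the abstract. It assumes that "100 minus the score" can be read as an error rate, which is a simplification for the F1-score. The function name `relative_error_reduction` is illustrative and is not taken from the paper.

```python
def relative_error_reduction(baseline: float, system: float) -> float:
    """Fraction of the baseline's errors removed by the system.

    Both scores are given in percent; the error is approximated
    as 100 minus the score (an assumption, see the lead-in).
    """
    baseline_error = 100.0 - baseline
    system_error = 100.0 - system
    return (baseline_error - system_error) / baseline_error


# Figures as reported in the abstract (percent).
tokenizer_rer = relative_error_reduction(baseline=98.95, system=99.57)
tagger_rer = relative_error_reduction(baseline=84.86, system=90.44)

print(f"tokenization: {tokenizer_rer:.1%} of baseline errors removed")
print(f"tagging:      {tagger_rer:.1%} of baseline errors removed")
```

Under this reading, the best tokenizer removes roughly 59% of the baseline's errors and the best tagger roughly 37%, even though the absolute gains are 0.62 and 5.58 percentage points.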

Authors with CRIS profile

Stephanie Evert (Lehrstuhl für Korpus- und Computerlinguistik)

How to cite

APA:

Evert, S., Beißwenger, M., Bartsch, S., & Würzner, K.-M. (2016). EmpiriST 2015: A Shared Task on the Automatic Linguistic Annotation of Computer-Mediated Communication and Web Corpora. In Proceedings of the 10th Web as Corpus Workshop (WAC-X) and the EmpiriST Shared Task (pp. 44-56). Berlin, Germany.

MLA:

Evert, Stephanie, et al. "EmpiriST 2015: A Shared Task on the Automatic Linguistic Annotation of Computer-Mediated Communication and Web Corpora." Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), Berlin, Germany, 2016. 44-56.
