Evert S, Beißwenger M, Bartsch S, Würzner KM (2016)
Publication Language: English
Publication Type: Conference contribution
Publication year: 2016
City/Town: Berlin, Germany
Page Range: 44-56
Conference Proceedings Title: Proceedings of the 10th Web as Corpus Workshop (WAC-X) and the EmpiriST Shared Task
URI: https://sites.google.com/site/empirist2015/
Open Access Link: https://aclweb.org/anthology/W/W16/W16-2606.pdf
This paper describes the goals, design and results of a shared task on the automatic linguistic annotation of German language data from genres of computer-mediated communication (CMC), social media interactions and Web corpora. The two subtasks of tokenization and part-of-speech tagging were performed on two data sets: (i) a genuine CMC data set with samples from several CMC genres, and (ii) a Web corpora data set of CC-licensed Web pages which represents the type of data found in large corpora crawled from the Web. The teams participating in the shared task achieved a substantial improvement over current off-the-shelf tools for German. The best tokenizer reached an F1-score of 99.57% (vs. 98.95% off-the-shelf baseline), while the best tagger reached an accuracy of 90.44% (vs. 84.86% baseline). The gold standard (more than 20,000 tokens of training and test data) is freely available online together with detailed annotation guidelines.
APA:
Evert, S., Beißwenger, M., Bartsch, S., & Würzner, K.-M. (2016). EmpiriST 2015: A Shared Task on the Automatic Linguistic Annotation of Computer-Mediated Communication and Web Corpora. In Proceedings of the 10th Web as Corpus Workshop (WAC-X) and the EmpiriST Shared Task (pp. 44-56). Berlin, Germany.
MLA:
Evert, Stephanie, et al. "EmpiriST 2015: A Shared Task on the Automatic Linguistic Annotation of Computer-Mediated Communication and Web Corpora." Proceedings of the 10th Web as Corpus Workshop (WAC-X) and the EmpiriST Shared Task, Berlin, Germany, 2016. 44-56.