EmpiriST 2015: A Shared Task on the Automatic Linguistic Annotation of Computer-Mediated Communication and Web Corpora

Contribution to a conference
(Conference paper)


Publication details

Authors: Evert S, Beißwenger M, Bartsch S, Würzner KM
Place of publication: Berlin, Germany
Year of publication: 2016
Proceedings: Proceedings of the 10th Web as Corpus Workshop (WAC-X) and the EmpiriST Shared Task
Page range: 44-56
Language: English


Abstract


This paper describes the goals, design and results of a shared task on the automatic linguistic annotation of German language data from genres of computer-mediated communication (CMC), social media interactions and Web corpora. The two subtasks of tokenization and part-of-speech tagging were performed on two data sets: (i) a genuine CMC data set with samples from several CMC genres, and (ii) a Web corpora data set of CC-licensed Web pages which represents the type of data found in large corpora crawled from the Web. The teams participating in the shared task achieved a substantial improvement over current off-the-shelf tools for German. The best tokenizer reached an F1-score of 99.57% (vs. 98.95% off-the-shelf baseline), while the best tagger reached an accuracy of 90.44% (vs. 84.86% baseline). The gold standard (more than 20,000 tokens of training and test data) is freely available online together with detailed annotation guidelines.
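
The two metrics reported in the abstract can be illustrated with a minimal sketch. It assumes that tokenizer F1 is computed over character spans of tokens (whitespace ignored) and that tagger accuracy is the share of tokens with the correct tag; the function names, the sample sentence and the tag labels are hypothetical, and this is not the official evaluation script of the shared task.

```python
def token_spans(tokens):
    """Map a token list to a set of (start, end) character spans, ignoring whitespace."""
    spans, pos = set(), 0
    for tok in tokens:
        spans.add((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def tokenization_f1(gold_tokens, system_tokens):
    """F1 over token spans: a system token counts as correct only if both boundaries match."""
    gold, system = token_spans(gold_tokens), token_spans(system_tokens)
    correct = len(gold & system)
    if correct == 0:
        return 0.0
    precision = correct / len(system)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


def tagging_accuracy(gold_tags, system_tags):
    """Fraction of tokens whose predicted tag equals the gold tag (same tokenization assumed)."""
    if len(gold_tags) != len(system_tags):
        raise ValueError("gold and system tag sequences must have the same length")
    if not gold_tags:
        return 0.0
    matches = sum(g == s for g, s in zip(gold_tags, system_tags))
    return matches / len(gold_tags)


# Hypothetical example: the system splits the emoticon into three tokens.
gold = ["Das", "ist", "super", ":-)"]
system = ["Das", "ist", "super", ":", "-", ")"]
f1 = tokenization_f1(gold, system)

# Illustrative tag labels only; one of four tokens is tagged incorrectly.
acc = tagging_accuracy(["PDS", "VAFIN", "ADJD", "EMOASC"],
                       ["PDS", "VAFIN", "ADJD", "XY"])
```

In this toy case three of six system tokens match three of four gold tokens, which shows why over-splitting emoticons lowers both precision and recall.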

 



FAU authors / FAU editors

Evert, Stefan Prof. Dr.
Lehrstuhl für Korpus- und Computerlinguistik


Research areas

Corpus tools and language technology applications
Lehrstuhl für Korpus- und Computerlinguistik


How to cite

APA:
Evert, S., Beißwenger, M., Bartsch, S., & Würzner, K.-M. (2016). EmpiriST 2015: A Shared Task on the Automatic Linguistic Annotation of Computer-Mediated Communication and Web Corpora. In Proceedings of the 10th Web as Corpus Workshop (WAC-X) and the EmpiriST Shared Task (pp. 44-56). Berlin, Germany.

MLA:
Evert, Stefan, et al. "EmpiriST 2015: A Shared Task on the Automatic Linguistic Annotation of Computer-Mediated Communication and Web Corpora." Proceedings of the 10th Web as Corpus Workshop (WAC-X) and the EmpiriST Shared Task, Berlin, Germany, 2016. 44-56.

Last updated 2018-11-08 at 00:10