EmpiriST Corpus 2.0: Adding Manual Normalization, Lemmatization and Semantic Tagging to a German Web and CMC Corpus

Proisl T, Dykes N, Heinrich P, Kabashi B, Blombach A, Evert S (2020)


Publication Type: Conference contribution, Original article

Publication year: 2020

Publisher: European Language Resources Association (ELRA)

Pages Range: 6142-6148

Conference Proceedings Title: LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings

Event location: Marseille FR

ISBN: 9791095546344

URI: https://www.aclweb.org/anthology/2020.lrec-1.754

Open Access Link: https://www.aclweb.org/anthology/2020.lrec-1.754

Abstract

The EmpiriST corpus (Beißwenger et al., 2016) is a manually tokenized and part-of-speech tagged corpus of approximately 23,000 tokens of German Web and CMC (computer-mediated communication) data. We extend the corpus with manually created annotation layers for word form normalization, lemmatization and lexical semantics. All annotations have been independently performed by multiple human annotators. We report inter-annotator agreements and results of baseline systems and state-of-the-art off-the-shelf tools.

Authors with CRIS profile

How to cite

APA:

Proisl, T., Dykes, N., Heinrich, P., Kabashi, B., Blombach, A., & Evert, S. (2020). EmpiriST Corpus 2.0: Adding Manual Normalization, Lemmatization and Semantic Tagging to a German Web and CMC Corpus. In Nicoletta Calzolari, Frederic Bechet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis (Eds.), LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings (pp. 6142-6148). Marseille, FR: European Language Resources Association (ELRA).

MLA:

Proisl, Thomas, et al. "EmpiriST Corpus 2.0: Adding Manual Normalization, Lemmatization and Semantic Tagging to a German Web and CMC Corpus." Proceedings of the 12th International Conference on Language Resources and Evaluation, LREC 2020, Marseille Ed. Nicoletta Calzolari, Frederic Bechet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, European Language Resources Association (ELRA), 2020. 6142-6148.

BibTeX: Download