SoMaJo: State-of-the-art tokenization for German web and social media texts

Conference contribution
(Original article)


Publication Details

Author(s): Proisl T, Uhrig P
Editor(s): Cook P, Evert S, Schäfer R, Stemle E
Publisher: Association for Computational Linguistics (ACL)
Publishing place: Berlin
Publication year: 2016
Conference Proceedings Title: Proceedings of the 10th Web as Corpus Workshop (WAC-X) and the EmpiriST Shared Task
Pages range: 57-62
Language: English


Abstract


In this paper we describe SoMaJo, a rule-based tokenizer for German web and social media texts that was the best-performing system in the EmpiriST 2015 shared task with an average F1-score of 99.57. We give an overview of the system and the phenomena its rules cover, as well as a detailed error analysis. The tokenizer is available as free software.



FAU Authors / FAU Editors

Proisl, Thomas
Professur für Korpuslinguistik
Uhrig, Peter Dr.
Lehrstuhl für Anglistik, insbesondere Linguistik


How to cite

APA:
Proisl, T., & Uhrig, P. (2016). SoMaJo: State-of-the-art tokenization for German web and social media texts. In Cook P, Evert S, Schäfer R, Stemle E (Eds.), Proceedings of the 10th Web as Corpus Workshop (WAC-X) and the EmpiriST Shared Task (pp. 57-62). Berlin, DE: Association for Computational Linguistics (ACL).

MLA:
Proisl, Thomas, and Peter Uhrig. "SoMaJo: State-of-the-art tokenization for German web and social media texts." Proceedings of the 10th Web as Corpus Workshop (WAC-X) and the EmpiriST Shared Task, Berlin. Ed. Cook P, Evert S, Schäfer R, Stemle E. Berlin: Association for Computational Linguistics (ACL), 2016. 57-62.
