SoMaJo: State-of-the-art tokenization for German web and social media texts

Proisl T, Uhrig P (2016)


Publication Language: English

Publication Type: Conference contribution, Original article

Publication year: 2016

Publisher: Association for Computational Linguistics (ACL)

City/Town: Berlin

Pages Range: 57-62

Conference Proceedings Title: Proceedings of the 10th Web as Corpus Workshop (WAC-X) and the EmpiriST Shared Task

Event location: Berlin DE

URI: http://aclweb.org/anthology/W16-26

DOI: 10.18653/v1/W16-2607

Open Access Link: http://aclweb.org/anthology/W16-2607

Abstract

In this paper we describe SoMaJo, a rule-based tokenizer for German web and social media texts that was the best-performing system in the EmpiriST 2015 shared task with an average F1-score of 99.57. We give an overview of the system and the phenomena its rules cover, as well as a detailed error analysis. The tokenizer is available as free software.

How to cite

APA:

Proisl, T., & Uhrig, P. (2016). SoMaJo: State-of-the-art tokenization for German web and social media texts. In P. Cook, S. Evert, R. Schäfer, & E. Stemle (Eds.), Proceedings of the 10th Web as Corpus Workshop (WAC-X) and the EmpiriST Shared Task (pp. 57-62). Berlin, DE: Association for Computational Linguistics (ACL).

MLA:

Proisl, Thomas, and Peter Uhrig. "SoMaJo: State-of-the-art tokenization for German web and social media texts." Proceedings of the 10th Web as Corpus Workshop (WAC-X) and the EmpiriST Shared Task, Berlin. Ed. Cook P, Evert S, Schäfer R, Stemle E. Berlin: Association for Computational Linguistics (ACL), 2016. 57-62.
