SoMeWeTa: A Part-of-Speech Tagger for German Social Media and Web Texts

Conference contribution
(Original article)


Publication Details

Author(s): Proisl T
Editor(s): Calzolari N, Choukri K, Cieri C, Declerck T, Goggi S, Hasida K, Isahara H, Maegaard B, Mariani J, Mazo H, Moreno A, Odijk J, Piperidis S, Tokunaga T
Publisher: European Language Resources Association
Publishing place: Miyazaki
Publication year: 2018
Conference Proceedings Title: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
Page range: 665–670
ISBN: 979-10-95546-00-9
Language: English


Abstract

Off-the-shelf part-of-speech taggers typically perform relatively poorly
on web and social media texts since those domains are quite different
from the newspaper articles on which most tagger models are trained. In
this paper, we describe SoMeWeTa, a part-of-speech tagger based on the
averaged structured perceptron that is capable of domain adaptation and
that can use various external resources. We train the tagger on the
German web and social media data of the EmpiriST 2015 shared task. Using
the TIGER corpus as background data and adding external information
about word classes and Brown clusters, we substantially improve on the
state of the art for both the web and the social media data sets. The
tagger is available as free software.



FAU Authors / FAU Editors

Proisl, Thomas
Lehrstuhl für Korpus- und Computerlinguistik


How to cite

APA:
Proisl, T. (2018). SoMeWeTa: A Part-of-Speech Tagger for German Social Media and Web Texts. In Calzolari N, Choukri K, Cieri C, Declerck T, Goggi S, Hasida K, Isahara H, Maegaard B, Mariani J, Mazo H, Moreno A, Odijk J, Piperidis S, Tokunaga T (Eds.), Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018) (pp. 665–670). Miyazaki, JP: European Language Resources Association.

MLA:
Proisl, Thomas. "SoMeWeTa: A Part-of-Speech Tagger for German Social Media and Web Texts." Proceedings of the 11th Language Resources and Evaluation Conference, Miyazaki. Ed. Calzolari N, Choukri K, Cieri C, Declerck T, Goggi S, Hasida K, Isahara H, Maegaard B, Mariani J, Mazo H, Moreno A, Odijk J, Piperidis S, Tokunaga T. Miyazaki: European Language Resources Association, 2018. 665–670.

Last updated on 2018-11-08 at 02:59
