SoMeWeTa: A Part-of-Speech Tagger for German Social Media and Web Texts

Conference contribution
(Original article)


Publication details

Authors: Proisl T
Editors: Calzolari N, Choukri K, Cieri C, Declerck T, Goggi S, Hasida K, Isahara H, Maegaard B, Mariani J, Mazo H, Moreno A, Odijk J, Piperidis S, Tokunaga T
Publisher: European Language Resources Association
Place of publication: Miyazaki
Year of publication: 2018
Proceedings: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
Pages: 665–670
ISBN: 979-10-95546-00-9
Language: English


Abstract

Off-the-shelf part-of-speech taggers typically perform relatively poorly
on web and social media texts since those domains are quite different
from the newspaper articles on which most tagger models are trained. In
this paper, we describe SoMeWeTa, a part-of-speech tagger based on the
averaged structured perceptron that is capable of domain adaptation and
that can use various external resources. We train the tagger on the
German web and social media data of the EmpiriST 2015 shared task. Using
the TIGER corpus as background data and adding external information
about word classes and Brown clusters, we substantially improve on the
state of the art for both the web and the social media data sets. The
tagger is available as free software.



FAU authors / FAU editors

Proisl, Thomas
Lehrstuhl für Korpus- und Computerlinguistik


How to cite

APA:
Proisl, T. (2018). SoMeWeTa: A Part-of-Speech Tagger for German Social Media and Web Texts. In Calzolari N, Choukri K, Cieri C, Declerck T, Goggi S, Hasida K, Isahara H, Maegaard B, Mariani J, Mazo H, Moreno A, Odijk J, Piperidis S, Tokunaga T (Eds.), Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018) (pp. 665–670). Miyazaki, JP: European Language Resources Association.

MLA:
Proisl, Thomas. "SoMeWeTa: A Part-of-Speech Tagger for German Social Media and Web Texts." Proceedings of the 11th Language Resources and Evaluation Conference, Miyazaki. Ed. Calzolari N, Choukri K, Cieri C, Declerck T, Goggi S, Hasida K, Isahara H, Maegaard B, Mariani J, Mazo H, Moreno A, Odijk J, Piperidis S, Tokunaga T. Miyazaki: European Language Resources Association, 2018. 665–670.

Last updated 2018-11-08 at 02:59