Part-of-Speech Tagging – A Solved Task? An evaluation of POS taggers for the Web as corpus

Conference contribution


Publication Details

Author(s): Giesbrecht E, Evert S
Editor(s): Alegria I, Leturia I, Sharoff S
Publishing place: San Sebastian, Spain
Publication year: 2009
Conference Proceedings Title: Proceedings of the 5th Web as Corpus Workshop (WAC5)
Page range: 27-35
Language: English


Abstract


Part-of-speech (POS) tagging is an important preprocessing step in natural language processing. It is often considered to be a “solved task”, with published tagging accuracies around 97%. Our evaluation of five state-of-the-art POS taggers on German Web texts shows that such high accuracies can only be achieved under artificial cross-validation conditions. In a real-life scenario, accuracy drops below 93% with enormous variation between different text genres, making the taggers unsuitable for fully automatic processing. We find that HMM taggers are more robust and much faster than advanced machine-learning approaches such as MaxEnt. Promising directions for future research are unsupervised learning of a tagger lexicon from large unannotated corpora, as well as developing adaptive tagging models.



FAU Authors / FAU Editors

Evert, Stefan Prof. Dr.
Lehrstuhl für Korpus- und Computerlinguistik


Research Fields

Corpus tools and language technology
Lehrstuhl für Korpus- und Computerlinguistik


How to cite

APA:
Giesbrecht, E., & Evert, S. (2009). Part-of-Speech Tagging – A Solved Task? An evaluation of POS taggers for the Web as corpus. In I. Alegria, I. Leturia, & S. Sharoff (Eds.), Proceedings of the 5th Web as Corpus Workshop (WAC5) (pp. 27-35). San Sebastian, Spain.

MLA:
Giesbrecht, Eugenie, and Stefan Evert. "Part-of-Speech Tagging – A Solved Task? An evaluation of POS taggers for the Web as corpus." Proceedings of the 5th Web as Corpus Workshop (WAC5). Ed. Alegria I, Leturia I, Sharoff S. San Sebastian, Spain, 2009. 27-35.


Last updated on 2018-11-08 at 00:09