Russian language and corpus diversity РУССКИЙ ЯЗЫК И КОРПУСНОЕ РАЗНООБРАЗИЕ

Piperski A (2020)

Publication Status: Published

Publication Type: Conference contribution

Publication year: 2020

Publisher: ABBYY PRODUCTION LLC

Page Range: 615-627

DOI: 10.28995/2075-7182-2020-19-615-627

Abstract

This paper discusses the use of the most widely known Russian corpora, namely the Russian National Corpus, ruTenTen, the General Internet Corpus of Russian, and Araneum Russicum Maximum, for the theoretical study of the Russian language. Based on a sample of papers from 2019, I demonstrate that scholars, especially theoretical linguists, tend to ignore the opportunities provided by a wide range of Web corpora, even though these resources are well known to the NLP community. I present a selection of case studies to show that data from “non-classical” corpora can be used for studying various linguistic phenomena, such as 1) variation in morphology and syntax; 2) word formation and lexical change; 3) construction grammar. I also claim that the underuse of non-classical corpora is partly due to the fact that they are (perceived as) not quite user-friendly.

Authors with CRIS profile

Aleksandr Piperski, Lehrstuhl für Korpus- und Computerlinguistik

How to cite

APA:

Piperski, A. (2020). Russian language and corpus diversity РУССКИЙ ЯЗЫК И КОРПУСНОЕ РАЗНООБРАЗИЕ. In Proceedings of the 2020 Annual International Conference on Computational Linguistics and Intellectual Technologies, Dialogue 2020 (pp. 615-627). ABBYY PRODUCTION LLC.

MLA:

Piperski, Aleksandr. "Russian language and corpus diversity РУССКИЙ ЯЗЫК И КОРПУСНОЕ РАЗНООБРАЗИЕ." Proceedings of the 2020 Annual International Conference on Computational Linguistics and Intellectual Technologies, Dialogue 2020, ABBYY PRODUCTION LLC, 2020, pp. 615-627.
