A Corpus of German Reddit Exchanges (GeRedE)

Blombach A, Dykes N, Heinrich P, Kabashi B, Proisl T (2020)


Publication Language: English

Publication Type: Conference contribution

Publication year: 2020

Publisher: European Language Resources Association (ELRA)

Pages Range: 6310-6316

Conference Proceedings Title: LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings

Event location: Marseille, FR

ISBN: 9791095546344

URI: https://www.aclweb.org/anthology/2020.lrec-1.774

Open Access Link: https://www.aclweb.org/anthology/2020.lrec-1.774

Abstract

GeRedE is a 270 million token German CMC corpus containing approximately 380,000 submissions and 6,800,000 comments posted on Reddit between 2010 and 2018. Reddit is a popular online platform combining social news aggregation, discussion and micro-blogging. Starting from a large, freely available data set, the paper describes our approach to filter out German data and further pre-processing steps, as well as which metadata and annotation layers have been included so far. We explore the Reddit sphere, what makes the German data linguistically peculiar, and how some of the communities within Reddit differ from one another. The CWB-indexed version of our final corpus is available via CQPweb, and all our processing scripts as well as all manual annotation and automatic language classification can be downloaded from GitHub.

How to cite

APA:

Blombach, A., Dykes, N., Heinrich, P., Kabashi, B., & Proisl, T. (2020). A Corpus of German Reddit Exchanges (GeRedE). In Nicoletta Calzolari, Frederic Bechet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis (Eds.), LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings (pp. 6310-6316). Marseille, FR: European Language Resources Association (ELRA).

MLA:

Blombach, Andreas, et al. "A Corpus of German Reddit Exchanges (GeRedE)." Proceedings of the 12th International Conference on Language Resources and Evaluation, LREC 2020, Marseille. Ed. Nicoletta Calzolari, Frederic Bechet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis. European Language Resources Association (ELRA), 2020. 6310-6316.