A Pipeline for the Creation of Multimodal Corpora from YouTube Videos

Dykes N, Wilson A, Uhrig P (2023)


Publication Language: English

Publication Type: Conference contribution

Publication year: 2023

Publisher: Association for Computational Linguistics

City/Town: Ingolstadt

Pages Range: 1-5

Conference Proceedings Title: Proceedings of the 1st Workshop on Linguistic Insights from and for Multimodal Language Processing (LIMO 2023)

Event location: Ingolstadt DE

ISBN: 9798891760318

URI: https://aclanthology.org/2023.limo-1.1

Open Access Link: https://aclanthology.org/2023.limo-1.1.pdf

Abstract

This paper introduces an open-source pipeline for the creation of multimodal corpora from YouTube videos. It minimizes storage and bandwidth requirements, because the videos themselves need not be downloaded and can remain on YouTube’s servers. It also minimizes processing requirements by using YouTube’s automatically generated subtitles, thus avoiding a computationally expensive automatic speech recognition processing step. The pipeline combines standard tools and provides as its output a corpus file in the industry-standard vertical format used by many corpus managers. It is straightforwardly extensible with the addition of further levels of annotation and can be adapted to languages other than English.
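
To illustrate the vertical format mentioned in the abstract, here is a minimal sketch in Python. It is not the authors' pipeline: the function name `to_vertical`, the `<text id="...">` structure, and the choice of columns (word form, start time, end time) are assumptions made for illustration only. The format itself places one token per line, with tab-separated annotation columns and XML-like tags marking structural units.

```python
from xml.sax.saxutils import escape, quoteattr


def to_vertical(video_id, tokens):
    """Render (word, start_seconds, end_seconds) triples as vertical-format text.

    Hypothetical helper: the column layout is an assumption, not the
    layout used by the pipeline described in the paper.
    """
    # Structural tag opening the text; quoteattr adds the surrounding quotes.
    lines = [f"<text id={quoteattr(video_id)}>"]
    for word, start, end in tokens:
        # One token per line; escape characters that would clash with the tags.
        lines.append(f"{escape(word)}\t{start:.2f}\t{end:.2f}")
    lines.append("</text>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = [("hello", 0.0, 0.4), ("world", 0.4, 1.0)]
    print(to_vertical("abc123", sample), end="")
```

Further annotation levels, such as part-of-speech tags or lemmas, would be added as extra tab-separated columns on each token line.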

How to cite

APA:

Dykes, N., Wilson, A., & Uhrig, P. (2023). A Pipeline for the Creation of Multimodal Corpora from YouTube Videos. In Piush Aggarwal, Özge Alaçam, Carina Silberer, Sina Zarrieß, Torsten Zesch (Eds.), Proceedings of the 1st Workshop on Linguistic Insights from and for Multimodal Language Processing (LIMO 2023) (pp. 1-5). Ingolstadt, DE: Association for Computational Linguistics.

MLA:

Dykes, Nathan, Anna Wilson, and Peter Uhrig. "A Pipeline for the Creation of Multimodal Corpora from YouTube Videos." Proceedings of the 1st Workshop on Linguistic Insights from and for Multimodal Language Processing (LIMO 2023), Ingolstadt. Ed. Piush Aggarwal, Özge Alaçam, Carina Silberer, Sina Zarrieß, and Torsten Zesch. Ingolstadt: Association for Computational Linguistics, 2023. 1-5.
