Zilio L, Kabashi B (2024)
Publication Type: Conference contribution
Publication year: 2024
Publisher: European Association for Lexicography
Page Range: 783-795
Conference Proceedings Title: EURALEX Proceedings
Event location: Cavtat, HRV
ISBN: 9789537967772
Working with historical documents presents many challenges, not only because some sources are poorly preserved, but also because grammar and spelling rules in earlier periods were not always consistent. Still, these texts remain a rich source of information about our history, and much could be gained from extracting it. At the same time, the lack of spelling and grammatical consistency hampers the application of computational tools, so most analysis is done manually. To overcome this inconsistency, researchers have started normalising the spelling of historical documents, as this improves the performance of modern tools. Spelling normalisation is, however, also carried out manually most of the time. In this paper, we present experiments on automatically normalising historical documents in two languages: Portuguese and Albanian. Leveraging state-of-the-art large language models pre-trained for translation, we used carefully curated, manually normalised corpora to train new computational models. These models can automatically normalise documents in these languages, achieving new state-of-the-art BLEU scores above 90 for Portuguese and up to 59 for Albanian, beating the task baselines.
APA:
Zilio, L., & Kabashi, B. (2024). USING NEURAL MACHINE TRANSLATION FOR NORMALISING HISTORICAL DOCUMENTS. In Kristina Štrkalj Despot, Ana Ostroški Anić, Ivana Brač (Eds.), EURALEX Proceedings (pp. 783-795). Cavtat, HRV: European Association for Lexicography.
MLA:
Zilio, Leonardo, and Besim Kabashi. "USING NEURAL MACHINE TRANSLATION FOR NORMALISING HISTORICAL DOCUMENTS." Proceedings of the 21st EURALEX International Congress on Lexicography and Semantics, 2024, Cavtat, HRV. Ed. Kristina Štrkalj Despot, Ana Ostroški Anić, Ivana Brač, European Association for Lexicography, 2024. 783-795.