Keßler F (2024)
Publication Language: English
Publication Type: Conference contribution, Original article
Publication year: 2024
Publisher: Association for Computational Linguistics
Pages Range: 141–151
Conference Proceedings Title: Proceedings of the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024)
Event location: Hybrid in Bangkok, Thailand and online
ISBN: 9798891761445
URI: https://aclanthology.org/2024.ml4al-1.15/
DOI: 10.18653/v1/2024.ml4al-1.15
Open Access Link: https://aclanthology.org/2024.ml4al-1.15/
For the automatic processing of Classical Chinese texts, it is highly desirable to normalize variant characters, i.e., characters with different visual forms that represent the same morpheme, into a single form. However, some variant characters are used interchangeably by some writers but deliberately employed by others to distinguish between different meanings. Hence, to avoid losing information during normalization by conflating meaningful distinctions between variants, an intelligent normalization system that takes context into account is needed. Towards the goal of developing such a system, we describe in this study how a dataset with usage samples of variant characters can be extracted from a corpus of paired editions of multiple texts. Using the dataset, we conduct two experiments testing whether models can be trained with contextual word embeddings to predict variant characters. The results show that while this is often possible for single texts, most conventions learned do not transfer well between documents.
APA:
Keßler, F. (2024). Towards Context-aware Normalization of Variant Characters in Classical Chinese Using Parallel Editions and BERT. In John Pavlopoulos, Thea Sommerschield, Yannis Assael, Shai Gordin, Kyunghyun Cho, Marco Passarotti, Rachele Sprugnoli, Yudong Liu, Bin Li, Adam Anderson (Eds.), Proceedings of the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024) (pp. 141–151). Hybrid in Bangkok, Thailand and online: Association for Computational Linguistics.
MLA:
Keßler, Florian. "Towards Context-aware Normalization of Variant Characters in Classical Chinese Using Parallel Editions and BERT." Proceedings of the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024), Hybrid in Bangkok, Thailand and online. Ed. John Pavlopoulos, Thea Sommerschield, Yannis Assael, Shai Gordin, Kyunghyun Cho, Marco Passarotti, Rachele Sprugnoli, Yudong Liu, Bin Li, Adam Anderson. Association for Computational Linguistics, 2024. 141–151.