Font Group Recognition for Improved OCR

Third party funded individual grant

Start date : 01.08.2021

End date : 01.08.2023

Overview

External partners

(1)

Project details

Scientific Abstract

Although OCR-D made huge progress in the last project phase in providing OCR for early printed books, it still faces two major problems: The huge variety of the material makes it extremely challenging to use generic OCR-models. Yet, selecting specific models is not possible as the sheer amount of material prevents a fully automatic workflow. This situation is further complicated by the lack of appropriate OCR training data. Current data sets consist overwhelmingly of texts in Fraktur, especially from the 19th century. This completely neglects the large typographic variety displayed by printing in the three previous centuries. Therefore, and in response to the demand from SLUB Dresden and ULB Halle, we propose to improve the current situation significantly1) fine tuning our font group recognition system to such a degree that it can be used at character level;2) transcribing more specific OCR training data for the 16th-18th century, which includes popular fonts such as Schwabacher, other bastards and old Fraktur styles; 3) training font-specific OCR models as well as integrated models that recognise both typeface and text simultaneously. This approach has ensured in other contexts that the network performs better on both individual tasks, as we can thus reduce overfitting during training. This project will improve OCR quality significantly, especially for books in non-Fraktur fonts. It will also provide a training data set of very high quality that can be reused in long term. Finally, the project will provide a more fine-grained font recognition tool that, beyond enabling font-specific OCR, also has important applications in text attribute recognition and layout analysis.

Involved:

Vincent Christlein Project Leader

Contributing FAU Organisations:

Lehrstuhl für Informatik 5 (Mustererkennung)

Funding Source

DFG-Einzelförderung / Sachbeihilfe (EIN-SBH) LIS

Research Areas

Computer Vision Field of Research