De-Identifying GRASCCO - A Pilot Study for the De-Identification of the German Medical Text Project (GeMTeX) Corpus

Lohr C, Matthies F, Faller J, Modersohn L, Riedel A, Hahn U, Kiser R, Boeker M, Meineke F (2024)


Publication Type: Conference contribution

Publication year: 2024

Publisher: IOS Press BV

Book Volume: 317

Page Range: 171-179

Conference Proceedings Title: Studies in Health Technology and Informatics

Event location: Dresden DE

ISBN: 9781643685366

DOI: 10.3233/SHTI240853

Abstract

Introduction: The German Medical Text Project (GeMTeX) is one of the largest infrastructure efforts targeting German-language clinical documents. Here we introduce the architecture of the GeMTeX de-identification pipeline. Methods: This pipeline comprises the export of raw clinical documents from the local hospital information system, their import into the annotation platform INCEpTION, fully automatic pre-tagging of protected health information (PHI) items by the Averbis Health Discovery pipeline, a manual curation step for these pre-annotated data, and, finally, the automatic replacement of PHI items with type-conformant substitutes. This design was implemented in a pilot study involving six annotators and two curators each at the Data Integration Centers of the University Hospitals Leipzig and Erlangen. Results: As a proof of concept, the publicly available Graz Synthetic Text Clinical Corpus (GRASCCO) was enhanced with PHI annotations in an annotation campaign for which reasonable inter-annotator agreement values of Krippendorff's α ≈ 0.97 can be reported. Conclusion: The 1.4K curated PHI annotations are released as open-source data, constituting the first publicly available German-language clinical text corpus with PHI metadata.

How to cite

APA:

Lohr, C., Matthies, F., Faller, J., Modersohn, L., Riedel, A., Hahn, U., ... Meineke, F. (2024). De-Identifying GRASCCO - A Pilot Study for the De-Identification of the German Medical Text Project (GeMTeX) Corpus. In Rainer Röhrig, Niels Grabe, Ursula Hertha Hübner, Klaus Jung, Ulrich Sax, Carsten Oliver Schmidt, Martin Sedlmayr, Antonia Zapf (Eds.), Studies in Health Technology and Informatics (pp. 171-179). Dresden, DE: IOS Press BV.

MLA:

Lohr, Christina, et al. "De-Identifying GRASCCO - A Pilot Study for the De-Identification of the German Medical Text Project (GeMTeX) Corpus." Proceedings of the 69th Annual Meeting of the German Association of Medical Informatics, Biometry and Epidemiology, GMDS 2024, Dresden. Ed. Rainer Röhrig, Niels Grabe, Ursula Hertha Hübner, Klaus Jung, Ulrich Sax, Carsten Oliver Schmidt, Martin Sedlmayr, Antonia Zapf. IOS Press BV, 2024. 171-179.
