Seuß H, Dankerl P, Ihle M, Grandjean A, Hammon R, Kaestle N, Fasching P, Maier C, Christoph J, Sedlmayr M, Uder M, Hammon M, Cavallaro AJ (2017)
Publication Type: Journal article
Publication year: 2017
Volume: 189
Page Range: 661-671
Journal Issue: 7
Purpose: Projects involving collaborations between different institutions require data security via selective de-identification of words or phrases. A semi-automated de-identification tool was developed and evaluated on different types of medical reports natively and after adapting the algorithm to the text structure.

Materials and Methods: A semi-automated de-identification tool was developed and evaluated for its sensitivity and specificity in detecting sensitive content in written reports. Data from 4671 pathology reports (4105 + 566 in two different formats), 2804 medical reports, 1008 operation reports, and 6223 radiology reports of 1167 patients suffering from breast cancer were de-identified. The content was itemized into four categories: direct identifiers (name, address), indirect identifiers (date of birth/operation, medical ID, etc.), medical terms, and filler words. The software was first tested natively (without training) to establish a baseline. The reports were then manually edited and the model was re-trained for the next test set. Re-training was applied after 25, 50, 100, 250, 500 and, where applicable, 1000 reports of each type had been manually edited.

Results: In the native test, 61.3 % of the direct and 80.8 % of the indirect identifiers were detected. The performance (P) increased to 91.4 % (P25), 96.7 % (P50), 99.5 % (P100), 99.6 % (P250), 99.7 % (P500) and 100 % (P1000) for direct identifiers and to 93.2 % (P25), 97.9 % (P50), 97.2 % (P100), 98.9 % (P250), 99.0 % (P500) and 99.3 % (P1000) for indirect identifiers. Without training, 5.3 % of medical terms were falsely flagged as critical data. After training, this false-flag rate was 4.0 % (P25), 3.6 % (P50), 4.0 % (P100), 3.7 % (P250), 4.3 % (P500), and 3.1 % (P1000). Roughly 0.1 % of filler words were falsely flagged.

Conclusion: Training of the developed de-identification tool continuously improved its performance. Training with roughly 100 edited reports enables reliable detection and labeling of sensitive data in different types of medical reports.

Key Points:
· Collaborations between different institutions require de-identification of patients' data.
· Software-based de-identification of content-sensitive reports grows in importance as a result of 'Big data'.
· De-identification software was developed and tested natively and after training.
· The proposed de-identification software worked quite reliably following training with roughly 100 edited reports.
· A final check of the texts by an authorized person remains necessary.

Citation Format:
· Seuss H, Dankerl P, Ihle M et al. Semi-automated De-identification of German Content Sensitive Reports for Big Data Analytics. Fortschr Röntgenstr 2017; 189: 661-671.
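The percentages in the abstract are per-category flag rates: for direct and indirect identifiers the rate is the share of identifiers detected (sensitivity), while for medical terms and filler words it is the share falsely flagged. A minimal Python sketch of that arithmetic follows. The abstract does not report token counts, so the counts below are invented purely for illustration, and the names `CategoryCounts` and `flagged_rate` are hypothetical, not part of the authors' tool.

```python
from dataclasses import dataclass


@dataclass
class CategoryCounts:
    total: int    # tokens of this category in the manually edited reference
    flagged: int  # tokens of this category that the tool marked as sensitive


def flagged_rate(counts: CategoryCounts) -> float:
    """Return the percentage of tokens in a category that were flagged."""
    if counts.total == 0:
        raise ValueError("category has no tokens")
    return 100.0 * counts.flagged / counts.total


# Hypothetical counts, NOT taken from the study.
counts = {
    "direct identifier": CategoryCounts(total=200, flagged=199),
    "indirect identifier": CategoryCounts(total=500, flagged=486),
    "medical term": CategoryCounts(total=1000, flagged=40),
    "filler word": CategoryCounts(total=5000, flagged=5),
}

rates = {name: flagged_rate(c) for name, c in counts.items()}

for name, rate in rates.items():
    # For identifiers this is the detection rate (sensitivity);
    # for medical terms and filler words it is the false-flag rate.
    print(f"{name}: {rate:.1f} % flagged")
```

Read this way, a high rate is desirable for the two identifier categories and a low rate is desirable for medical terms and filler words, which is why the two groups of figures in the results move in opposite directions.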
APA:
Seuß, H., Dankerl, P., Ihle, M., Grandjean, A., Hammon, R., Kaestle, N.,... Cavallaro, A.J. (2017). Semi-automated De-identification of German Content Sensitive Reports for Big Data Analytics. Röfo: Fortschritte auf dem Gebiet der Röntgenstrahlen und der bildgebenden Verfahren, 189(7), 661-671. https://doi.org/10.1055/s-0043-102939
MLA:
Seuß, Hannes, et al. "Semi-automated De-identification of German Content Sensitive Reports for Big Data Analytics." Röfo: Fortschritte auf dem Gebiet der Röntgenstrahlen und der bildgebenden Verfahren 189.7 (2017): 661-671.