Re-imagen: Generating coherent background activity in synthetic scenario-based forensic datasets using large language models

Voigt L, Freiling F, Hargreaves CJ (2024)


Publication Type: Journal article

Publication year: 2024

Journal: Forensic Science International: Digital Investigation

Book Volume: 50

Article Number: 301805

DOI: 10.1016/j.fsidi.2024.301805

Open Access Link: https://www.sciencedirect.com/science/article/pii/S266628172400129X

Abstract

Due to legal and privacy-related restrictions, the generation of synthetic data is recommended for creating datasets for digital forensic education and training. One challenge when synthesizing scenario-based forensic data is the creation of coherent background activity besides evidential actions. This work leverages the creative writing abilities of large language models (LLMs) to generate personas and actions that describe the background usage of a device consistent with the created persona. These actions are subsequently converted into a machine-readable format and executed on a virtualized device using VM control automation. We introduce Re-imagen, a framework that combines state-of-the-art LLMs and a recent unintrusive GUI automation tool to produce synthetic disk images that contain arguably coherent “wear-and-tear” artifacts that current synthesis platforms lack. While, for now, the focus is on the coherence of the generated background activity, we believe that the proposed approach is a step toward more realistic synthetic disk image generation.

How to cite

APA:

Voigt, L., Freiling, F., & Hargreaves, C. J. (2024). Re-imagen: Generating coherent background activity in synthetic scenario-based forensic datasets using large language models. Forensic Science International: Digital Investigation, 50, 301805. https://doi.org/10.1016/j.fsidi.2024.301805

MLA:

Voigt, Lena, Felix Freiling, and Christopher J. Hargreaves. "Re-imagen: Generating coherent background activity in synthetic scenario-based forensic datasets using large language models." Forensic Science International: Digital Investigation 50 (2024): 301805.
