Voigt L, Freiling F, Hargreaves CJ (2024)
Publication Type: Journal article
Publication year: 2024
Book Volume: 50
Article Number: 301805
DOI: 10.1016/j.fsidi.2024.301805
Open Access Link: https://www.sciencedirect.com/science/article/pii/S266628172400129X
Due to legal and privacy-related restrictions, the generation of synthetic data is recommended for creating datasets for digital forensic education and training. One challenge when synthesizing scenario-based forensic data is the creation of coherent background activity besides evidential actions. This work leverages the creative writing abilities of large language models (LLMs) to generate personas and actions that describe the background usage of a device consistent with the created persona. These actions are subsequently converted into a machine-readable format and executed on a virtualized device using VM control automation. We introduce Re-imagen, a framework that combines state-of-the-art LLMs and a recent unintrusive GUI automation tool to produce synthetic disk images that contain arguably coherent “wear-and-tear” artifacts that current synthesis platforms lack. While, for now, the focus is on the coherence of the generated background activity, we believe that the proposed approach is a step toward more realistic synthetic disk image generation.
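The step in which generated actions are "converted into a machine-readable format" and then executed can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from Re-imagen: the JSON layout, the field names `kind`, `target` and `delay_s`, the set of allowed action kinds, and the `backend` callable that stands in for a VM control layer.

```python
import json
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical action vocabulary; the real framework may use a different one.
ALLOWED_KINDS = {"browse", "open_app", "type_text", "save_file"}


@dataclass(frozen=True)
class Action:
    """One background activity step in a machine-readable form (assumed schema)."""
    kind: str
    target: str
    delay_s: float = 0.0


def parse_actions(llm_output: str) -> List[Action]:
    """Validate a JSON list produced by an LLM and convert it to Action objects."""
    actions = []
    for item in json.loads(llm_output):
        kind = item["kind"]
        if kind not in ALLOWED_KINDS:
            # Reject anything outside the vocabulary instead of guessing.
            raise ValueError(f"unsupported action kind: {kind!r}")
        actions.append(Action(kind, item["target"], float(item.get("delay_s", 0.0))))
    return actions


def run(actions: List[Action], backend: Callable[[Action], None]) -> None:
    """Hand each action, in order, to a backend (a stand-in for VM automation)."""
    for action in actions:
        backend(action)
```

Validating the LLM output against a fixed vocabulary before execution keeps free-form text away from the automation layer; in this sketch the backend can be as simple as `print` or a list's `append` method for a dry run.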
APA:
Voigt, L., Freiling, F., & Hargreaves, C. J. (2024). Re-imagen: Generating coherent background activity in synthetic scenario-based forensic datasets using large language models. Forensic Science International: Digital Investigation, 50, 301805. https://doi.org/10.1016/j.fsidi.2024.301805
MLA:
Voigt, Lena, Felix Freiling, and Christopher J. Hargreaves. "Re-imagen: Generating coherent background activity in synthetic scenario-based forensic datasets using large language models." Forensic Science International: Digital Investigation 50 (2024): 301805.