Model Collapse in the Age of Synthetic Data: Risks and Consequences

Author(s): Bozhidar Bahov
Subject(s): Social Sciences, Economy, Business Economy / Management, Sociology, Evaluation research, Social Informatics, ICT Information and Communications Technologies, Socio-Economic Research
Published by: Университет за национално и световно стопанство (УНСС) [University of National and World Economy (UNWE)]
Keywords: model collapse; Model Autophagy Disorder (MAD); synthetic data; generative models; data contamination
Summary/Abstract: This paper presents a literature review on model collapse, sometimes termed Model Autophagy Disorder (MAD), which is observed when generative models are recursively trained on synthetic data produced by earlier versions of themselves. Drawing on evidence from recent studies in language modeling and computer vision, we highlight how reliance on model-generated content can gradually degrade performance metrics and reduce expressive diversity, impacting downstream tasks such as text generation, image classification, and captioning. We examine the theoretical basis of model collapse, including the mechanisms that drive distributional drift and the loss of tail events, and survey the empirical findings demonstrating its effects on large language models and modern image generators. Finally, we discuss the broader risks and consequences, ranging from ethical concerns to long-term threats to AI system reliability, and propose strategies for mitigating collapse, including the inclusion of fresh real-world data, rigorous data curation, and detection techniques such as watermarking.
