There was a lot of press ~2 years ago about this paper, and the term "model collapse":
"Training on Generated Data Makes Models Forget"
There was concern that AI models had already slurped up the Whole Internet and needed more data to get any smarter. Generated "synthetic data" was mooted as a possible solution. Meanwhile, the Internet itself increasingly contains AI-generated content, so future training runs would ingest model output whether anyone planned to or not.
As so often happens (and happens fast in AI), research and industry moved on, but the flashy news item remains in people's minds. To this day I see posts from people who misguidedly think this is still a problem (and, as such, one more reason the whole AI house of cards is about to fall).
In fact, the big frontier models today (GPT, Gemini, Llama, Phi, etc.) are all trained partly on synthetic data.
As it turns out, data quality is what really matters, not whether the data is synthetic; see "Textbooks Are All You Need".
And then some folks figured out how to use an AI verifier to automatically curate that quality data: "Escaping Model Collapse via Synthetic Data Verification".
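The core loop is easy to sketch. Here's a minimal, hypothetical version (generate, verify, prompts, and the threshold are all illustrative names I'm inventing, not the paper's actual code): oversample generations, score each with a separate verifier model, and keep only what passes.

```python
# Sketch of verifier-based curation of synthetic data. All names here are
# illustrative; the paper's actual pipeline and interfaces will differ.

def curate_synthetic_data(generate, verify, prompts, threshold=0.8):
    """Keep only the synthetic samples a separate verifier accepts.

    generate(prompt) -> str              hypothetical: one synthetic sample
    verify(sample)   -> float in [0, 1]  hypothetical: estimated quality score
    """
    curated = []
    for prompt in prompts:
        sample = generate(prompt)
        # The verifier acts as an automatic quality gate: low-scoring
        # generations never enter the training set.
        if verify(sample) >= threshold:
            curated.append(sample)
    return curated
```

The point is that the filter breaks the degenerative feedback loop: the model only ever trains on the best slice of its own output, not on the whole distribution.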
And people used clever math to make the synthetic data really high quality: "How to Synthesize Text Data without Model Collapse?"
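My loose reading of that paper's trick (token-level editing, yielding "semi-synthetic" data): instead of generating whole documents, keep human text as the backbone and let the model resample only individual tokens, so the training distribution stays anchored to the human one. A hedged sketch, with a made-up model interface and a threshold rule I'm assuming, not asserting:

```python
# Sketch of token-level editing ("semi-synthetic" data). The model interface
# (prob, sample) and the threshold logic are assumptions for illustration.

def token_edit(human_tokens, model, threshold=0.99):
    """Resample only the tokens the model already finds highly predictable.

    model.prob(prefix, token) -> float  hypothetical: P(token | prefix)
    model.sample(prefix)      -> str    hypothetical: one token from the model
    """
    edited = []
    for token in human_tokens:
        if model.prob(edited, token) >= threshold:
            # Highly predictable token: let the model substitute its own.
            edited.append(model.sample(edited))
        else:
            # Keep the human token; most of the text stays human-written.
            edited.append(token)
    return edited
```

Because the edits are sparse, the theoretical payoff (as I understand it) is that repeated rounds of training on this data can't drift arbitrarily far from the human distribution.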
Summary:
"Model collapse" from AI-generated content is largely a Solved Problem.
There may be reasons the whole AI thing will collapse, but this is not one of them.