OpenAI just admitted it can’t identify AI-generated text. That’s bad for the internet and it could be really bad for AI models.

In January, OpenAI launched a system for identifying AI-generated text. This month, the company scrapped it.

Text written before 2023 is going to be exceptionally valuable, because that way we can be reasonably sure it wasn’t contaminated by an LLM.

This reminds me of some research institutions pulling up sunken ships so that they can harvest the steel and use it to build sensitive instruments. You see, before the nuclear tests there was hardly any radiation anywhere. However, after America and the Soviet Union started nuking stuff like there’s no tomorrow, pretty much all steel on Earth has been a little bit contaminated. Not a big issue for normal people, but scientists building super sensitive equipment certainly notice the difference between pre-nuclear and post-nuclear steel.

@lily33@lemmy.world

Not really. If it’s truly impossible to tell the text apart, then it doesn’t really pose a problem for training AI. Otherwise, next-gen AI will be able to tell apart text generated by current-gen AI, and it will get filtered out. So only the most recent data will have unfiltered shitty AI-generated stuff, but they don’t train AI on super-recent text anyway.

@Womble@lemmy.world

This is not the case. Model collapse is a studied phenomenon for LLMs and leads to deteriorating quality when models are trained on data that comes from themselves. It might not be an issue if there were thousands of models out there, but IIRC there are only 3-5 base models that all the others are derivatives of.
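To make the collapse idea concrete, here is a minimal toy sketch (not from the thread or the cited research; the Gaussian-fitting setup and the `sample_size` choice are illustrative assumptions). A “model” here is just a Gaussian fit, and each generation is trained only on samples drawn from the previous fit. Because every generation sees a finite sample, estimation noise compounds and the fitted distribution narrows, loosely analogous to an LLM losing the diversity of real language when trained on its own output.

```python
# Toy illustration of model collapse: repeatedly fit a distribution to
# samples drawn from the previous fit and watch its diversity shrink.
import numpy as np

rng = np.random.default_rng(42)
mean, std = 0.0, 1.0          # generation 0: the "real" data distribution
sample_size = 20              # small on purpose, to make the drift visible

for generation in range(1, 21):
    synthetic = rng.normal(mean, std, size=sample_size)  # model's own output
    mean, std = synthetic.mean(), synthetic.std()        # next model fits it
    if generation % 5 == 0:
        print(f"generation {generation:2d}: mean={mean:+.2f}, std={std:.2f}")

# Typical output shows std drifting well below 1.0: each generation still
# looks "plausible" on its own, but the tails of the original data are lost.
```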

@volodymyr@lemmy.world

People still tap into the real world, while AI does not do that yet. Once AI is able to actively learn from real-world sensors, the problem might disappear, no?

@lily33@lemmy.world

I don’t see how that affects my point.

  • Today’s AI detectors can’t tell the output of today’s LLMs apart from human text.
  • A future AI detector WILL be able to tell apart the output of today’s LLMs.
  • Of course, a future AI detector won’t be able to tell apart the output of a future LLM.

So at any point in time, only recent text could be “contaminated”. The claim that “all text after 2023 is forever contaminated” just isn’t true. Researchers would simply have to be a bit more careful when including it.

There is not enough entropy in text to even detect current model output. It’s game over.

@Womble@lemmy.world

Your assertion that a future AI detector will be able to detect current LLM output is dubious. If I give you the sentence “Yesterday I went to the shop and bought some milk and eggs,” there is no way for you or any detection system to tell with any significant degree of certainty whether it was AI-generated. What can be done is statistical analysis of large data sets to see how they “smell”, but saying that around 30% of a dataset is likely LLM-generated does not get you very far in creating a training set.

I’m not saying that there is no solution to this problem, but blithely waving away the problem saying future AI will be able to spot old AI is not a serious take.
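As a rough sketch of what the “corpus-level smell test” mentioned above could look like (the word-frequency feature, the `kl_divergence` helper, and the tiny example corpora below are hypothetical illustrations, not a method anyone in the thread describes): compare the word-frequency profile of a suspect corpus against a trusted human-written reference. The result is a single score for the whole corpus; it says nothing about any individual sentence.

```python
# Corpus-level "smell test" sketch: compare word-frequency distributions of a
# suspect corpus against a human-written reference via KL divergence.
from collections import Counter
import math

def word_distribution(texts):
    counts = Counter(word for text in texts for word in text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def kl_divergence(p, q, smoothing=1e-6):
    # Sum over the union of vocabularies, with tiny smoothing so that words
    # missing from one corpus don't cause log(0) or division by zero.
    vocab = set(p) | set(q)
    return sum(
        p.get(w, smoothing) * math.log(p.get(w, smoothing) / q.get(w, smoothing))
        for w in vocab
    )

human_reference = ["yesterday i went to the shop and bought some milk and eggs",
                   "the weather was awful so we stayed in and played cards"]
suspect_corpus = ["in conclusion, it is important to note that milk and eggs",
                  "furthermore, the weather can be described as suboptimal"]

score = kl_divergence(word_distribution(suspect_corpus),
                      word_distribution(human_reference))
print(f"corpus-level divergence: {score:.2f}")  # one number for the whole set
```

The design point this illustrates: a divergence score over a large sample can hint that “some fraction of this looks off”, but it cannot label a single ordinary sentence as AI-written or human-written.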

@lily33@lemmy.world

If you give me several paragraphs instead of a single sentence, do you still think it’s impossible to tell?

@steakmeout@lemmy.world

“If you zoom further out you can definitely tell it’s been shopped because you can see more pixels.”

@steveman_ha@lemmy.world

What they’re getting towards (one thing, anyways) is that “indistinguishable to the model” and “the same” are two very different things.

IIRC, one possibility is that LLMs which learn from one another will make such incremental changes to what’s considered “acceptable” or “normal” language structuring that, over time, more noticeable linguistic changes begin to emerge that go unnoticed by the models.

As it continues, this phenomenon creates a “positive feedback loop” in which the gap progressively widens (still undetected, because the quality of the training data is going down) to the point where models basically “collapse” in their effectiveness.

So even if their output is indistinguishable now, how the tech is used (I guess?) will determine whether or not a self-destructive LLM echo chamber is produced.

@Eheran@lemmy.world

The background radiation did go up, but saying “there was hardly any radiation anywhere” is wrong. Today’s steel (and background radiation) is pretty much back to pre-nuke levels. See: Low-background steel; Background radiation.
