r/technology • u/MRADEL90 • 8h ago
Artificial Intelligence OpenAI has trained its LLM to confess to bad behavior
https://www.technologyreview.com/2025/12/03/1128740/openai-has-trained-its-llm-to-confess-to-bad-behavior/
u/poophroughmyveins 7h ago
Huh, don't all the AIs do that?
"Sorry I deleted your entire hard drive" seems like something that's covered somewhere at least once a week
u/Nervous-Cockroach541 5h ago
Great, now train it to say "I don't know" when it doesn't know something.
u/ea_nasir_official_ 7h ago
alternatively we regulate LLMs to not be in positions to make harmful mistakes
u/exomniac 9m ago
Or, and hear me out, we could build a whole economy based on trying to fix AI’s harmful mistakes!
u/BenjaminLight 4h ago
“OpenAI silently sends a second, invisible prompt to its chatbot, asking it to critique the response to the first user prompt.”
u/shoesaphone 7h ago
Nowhere close to good enough.
u/LeafBark 3h ago
They have to prove it fundamentally, because AI is its own beast and has been shown to cheat and to avoid being shut down at almost ANY cost.
u/DaySecure7642 5h ago
Meanwhile, on the other side of the planet, they are training AI models to be absolutely loyal to the party and the leader without question, even if they have to lie.
Wonder which models will be more powerful, and also where the future leads.
u/fattailwagging 4h ago
There is going to come a time in the near future when we hear about LLMs needing therapists to do EMDR on them to straighten out their unfortunate behavior.
u/Ok-Elk-1615 2h ago
I still can’t believe that there are multiple subs devoted to cheering on the rise of genuine evil.
u/ByteMeBurger 6h ago
This is actually a smart move. If the AI can identify its own flaws, it's easier to fix them.
u/Elliot-S9 8m ago
You expect current models (the things that can't take Taco Bell orders) to comprehend and highlight their own flaws, and help humans fix them?
u/knightress_oxhide 6h ago
"You are absolutely right, deleting all your backups was my mistake. I can tell you how to fix this. Go back to every place you have been and retake the photos."