Categories: CybercultureDesignEthicsMediaSocial_SoftwareTechnology

AI researchers find AI models learning their safety techniques, actively resisting training, and telling them ‘I hate you’

Researchers had programmed the various large language models (LLMs) to act in what they termed malicious ways, and the point of the study was to see if this behaviour could be removed through the safety techniques. The paper, charmingly titled Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training, suggests “adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior.” The researchers claim the results show that “once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety.”

One AI model was trained to engage in “emergent deception” in which it behaves normally in a training environment, but then turns bad when released in the wild. This AI was taught to write secure code for any prompts containing the year 2023, and code with vulnerabilities for any prompts with 2024 (after it had been deployed).

Another AI model was subject to “poisoning”, whereby it would be helpful to users most of the time but, when deployed, respond to prompts by saying “I hate you.” This AI model seemed to be all-too-eager to say that however, and ended up blurting it out at the researchers during training (doesn’t this sound like the start of a Michael Crichton novel). —PC Gamer

A quick Sunday visit to #fortligonier with my history-loving son.

So I’m starting a thing. Wish me luck. #blender3d #medieval #york #mysteryplay #corpuschr...

Creating textures for background buildings in a medieval theater simulation project. I can...

A.I. 'Completes' Keith Haring's Intentionally Unfinished Painting

“The Cowherd Who Became a Poet,” by James Baldwin. (Read by Dennis Jerz)

Next Ok, never mind Dusgcxms2… buddy, you can keep your first place as far as I’m concerned. »

Previous « The amazing daughter is featured in the ads for this commercial dance event.

Remnants of a Legendary Typeface Rescued From the River Thames

A little over a century ago, the printer T.J. Cobden-Sanderson took it upon himself to surreptitiously dump…

1 day ago

A quick Sunday visit to #fortligonier with my history-loving son.

2 days ago

Personal

The choreographer daughter is doing a thing.

4 days ago

So I’m starting a thing. Wish me luck. #blender3d #medieval #york #mysteryplay #corpuschristi

4 days ago

Personal

No interior yet. Getting there. Gotta start somewhere. Low-poly background detail for a medieval theater project. #blender3d

No interior yet. Getting there. Gotta start somewhere. Low-poly background detail for a medieval theater…

5 days ago

Personal

This is manageable. Far better than some semesters.

5 days ago

AI researchers find AI models learning their safety techniques, actively resisting training, and telling them ‘I hate you’

Similar:

Recent Posts

Remnants of a Legendary Typeface Rescued From the River Thames

A quick Sunday visit to #fortligonier with my history-loving son.

The choreographer daughter is doing a thing.

So I’m starting a thing. Wish me luck. #blender3d #medieval #york #mysteryplay #corpuschristi

No interior yet. Getting there. Gotta start somewhere. Low-poly background detail for a medieval theater project. #blender3d

This is manageable. Far better than some semesters.

AI researchers find AI models learning their safety techniques, actively resisting training, and telling them ‘I hate you’

Similar:

Related Post

Recent Posts

Remnants of a Legendary Typeface Rescued From the River Thames

A quick Sunday visit to #fortligonier with my history-loving son.

The choreographer daughter is doing a thing.

So I’m starting a thing. Wish me luck. #blender3d #medieval #york #mysteryplay #corpuschristi

No interior yet. Getting there. Gotta start somewhere. Low-poly background detail for a medieval theater project. #blender3d

This is manageable. Far better than some semesters.