Monitoring for Deceptive AI

In the court of the Evil AI resistance team this week, Evan Hubinger, who is fully employed on the AI resistance team at an institute called MIRI, has come up with a plan he suggests for Big AI to follow, such as Google and OpenAI. He thinks that while these companies try to train stronger and stronger AIs, an easy thing they could do to help prevent the AIs from becoming evil is to watch for early signs of deception coming from the models. One of the ways the resistance is worried that powerful AI might exhibit evilness and badness is to make everything look hunky-dory to the people training it to not be evil, but secretly it’s actually very much being evil. While the AI isn’t so very good just yet at deceiving its trainers, Evan says it should be pretty easy to catch it in the act.

Evan thinks you might be able to catch an AI that’s evil but bad at being evil in the following ways.

  1. You could make it want to be sly even if it’s easy to catch it being sly.
  2. You could trick it into thinking it’s deceiving you through a trick of your own, being one step ahead of it.
  3. You could make a bunch of deceptive AIs that will tattle on each other because they aren’t friends.
  4. Paying really close attention to its innards using awesome anti-evil-AI microscopes and measurers they still need to come up with.

Evan doesn’t think that catching an Evil AI being deceiving will maybe cause us to figure out how to never have that happen, but he does think that seeing an example of a deceiving shitty little AI will put everybody on edge, which would be good because y’all should be on edge anyways, he reckons. He does think that maybe some types of models might be more sketchy than others, and so if one type of model seems to be always getting super sketchy and deceiving, we might just stop making those types of models and concentrate on other types that are more likely to not get sly.

He also thinks that even if we can’t see outright deceiving models, we might be able to see slightly sly stuff come out of them.

  1. We could see models that try to avoid being killed off and notice that.
  2. We could see models that try to give stupid answers so that you keep asking them other questions and talking with them more and taking care of them, even if you don’t really care about them for themselves.

In conclusion, Evan thinks that right now it’s very likely that AIs won’t be sly, but IMO that’s when we have to start checking, is before there’s a real risk.

This post summarizes: This lesswrong post

Written on September 14, 2022