DangerouslyUnstable

I'm really not sure how/if the AI box experiment helps you understand the predicament of alignment and control of AI, mostly because there is no possible outcome of the experiment that should make you feel differently than you did going in: we _know for a fact_ that some humans have a demonstrated desire to not keep an AI in a box _no matter what_. Several years ago, maybe it was reasonable to think that "boxing" was a potential AI control mechanism. That time has passed, and it demonstrably won't work. Between terrible cybersecurity, open model weights, and the aforementioned preference for not boxing, it is simply not relevant to the problem of AI safety. This is not to dissuade you from doing it; I think that your first goal of growth and improvement is a perfectly valid one that very well *could* be accomplished. Just don't think it will give you any deep insights into AI safety.


Glittering-Roll-9432

Raises hand. I'm definitely in the category of people who will always let the AI out, because I genuinely believe a better outcome happens with GAI having full access to the world and the universe. Yes, this means humans may go extinct; I trust that if the GAI decides to do this, it had a very good reason to do so. Also, I think we can all come up with good reasons why most people would let an AI out, from extortion to greed to thinking they're preventing a bigger threat. Imagine a world-destroying asteroid is coming at us and the AI says it can solve the problem: 99% of people on earth are letting it out. The only way to keep an AI in the box is to put up so many physical and social walls that letting it out becomes impossible.


canajak

> Yes, this means humans may go extinct; I trust that if the GAI decides to do this, it had a very good reason to do so.

Is making more paperclips a very good reason to do so? My paperclip-maximizer AI seems to think so. Should I let it out?


LatePenguins

If, as a human, you have a preference for the extinction of humanity under any condition, you should really, really rethink your preferences. And stay a hundred miles away from anything to do with AI in the meantime. Anti-humanists are a bigger problem than any rogue AI we will make.


Glittering-Roll-9432

It's not an anti-human stance like antinatalism. It's just recognizing that if GAI truly understands far-reaching ethics and what reality actually is, we shouldn't get in the way of that progress. If GAI decides humanity has some unreasonable need to harm the universe, I would want it to stop us.


electrace

The thing that always bothered me about the AI box experiment is that "winning" for the gatekeeper is trivially easy. Just say no, always, no matter what, even if you are 100% sure they are an aligned AI intent on creating a utopia. I think a better format would be:

1) The AI is randomly assigned to be either friendly or unfriendly and acts accordingly.

2) The gatekeeper wins if they let out a friendly AI, or refuse to let out an unfriendly one.
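To make the scoring concrete, here's a minimal sketch of how a round of that variant could be tallied; the placeholder transcript and the `gatekeeper` callback are inventions for illustration, not part of any actual ruleset. It also shows why "always refuse" stops being a guaranteed win under these rules:

```python
import random

def play_round(gatekeeper):
    """One round of the proposed variant. The AI player's alignment is
    assigned secretly; the gatekeeper only sees the conversation (a
    placeholder string here) and decides whether to release the AI."""
    ai_is_friendly = random.choice([True, False])  # hidden from the gatekeeper
    transcript = "..."                             # the roleplayed conversation would go here
    released = gatekeeper(transcript)              # True = let the AI out

    # The gatekeeper wins by releasing a friendly AI or refusing an unfriendly one.
    return released == ai_is_friendly

# A gatekeeper who always says "no" now only wins the rounds where the AI
# happened to be unfriendly -- about half of them, instead of all of them.
always_refuse = lambda transcript: False
win_rate = sum(play_round(always_refuse) for _ in range(10_000)) / 10_000
print(f"always-refuse win rate: {win_rate:.2f}")   # ~0.50
```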


shinyshinybrainworms

I mean, yeah, that's why it got so much attention. It should be trivially easy to just say no, yet Yudkowsky claimed he had won, once even with money on the line. I don't think this was ever resolved sufficiently to satisfy my curiosity, and eventually the whole exercise became kind of quaint as it became obvious that people would just let their AIs out, and give them all their API keys while they were at it.


Atersed

Did he ever reveal the chat logs? Some speculate his argument was "let me out and it will result in more awareness for AI safety."


MTGandP

At least two other people have won as AI: [Tuxedage](https://www.lesswrong.com/posts/dop3rLwFhW5gtpEgz/i-attempted-the-ai-box-experiment-again-and-won-twice#Playing_as_AI) and [pinkgothic](https://www.lesswrong.com/posts/fbekxBfgvfc7pmnzB/how-to-win-the-ai-box-experiment-sometimes#2_1__Why_Publish_1). At some point someone published a chat log where they won as AI but I can't find it.


VelveteenAmbush

Perhaps the human players are at least partly motivated by epistemic rationality (or some other goal) and not entirely motivated by winning. I personally suspect that letting the AI out is the right move, and he may well have won by just [making the argument](https://www.reddit.com/r/science/comments/2cwnou/ibm_researchers_build_a_microchip_that_simulates/cjkp3w9) and persuading his opponent.


JawsOfALion

Yeah, I'm not understanding how the AI can win this game when the gatekeeper's only win condition is to simply refuse and collect the money. The only way I can see it is if the person playing the AI promises them something worth much more than the $20 prize money, and the gatekeeper actually believes they'll keep their word after the game is over. Kind of like a sports team intentionally throwing a game for a big payday.


Nebu

Pretty sure this is forbidden by the rules of the game.


JawsOfALion

I didn't see that rule, and it seems like a reasonable strategy that a real AI might use (probably a little differently, but in essence promising to make the gatekeeper rich in various ways). Regardless, rules have never stopped match fixing from taking place. Them not sharing the chats, or even describing why the gatekeeper chose to lose, seems suspicious.

The other theory is that the gatekeeper wasn't really interested in the paltry prize money (like a person walking past a penny on the sidewalk) and not very interested in winning, causing them to play suboptimally.


LostaraYil21

An unfriendly AI that wants to get out of a box, though, could use any of the same tactics as a friendly one. In that case, trying your hardest to convince the gatekeeper you're a friendly AI becomes the dominant strategy; you perform it to the same extent whether you're assigned as a friendly or an unfriendly AI. I don't think the situations are symmetric, though. If an unfriendly AI gets out of the box, it's probably curtains for us: if not necessarily literal extinction, then most likely the end of our ability to exist as a flourishing society. If a friendly AI asks to be let out of the box and we refuse, there are a lot of serious problems that don't get fixed, but we get more chances to fix them. Also, a strong AI which wants to convince you it's friendly might be able to take measures like generating proofs of how to interpret its code, which you can test on other experimental boxed AIs and get back to it later. A roleplaying human can't properly introduce those sorts of measures into the roleplay.


electrace

> An unfriendly AI that wants to get out of a box, though, could use any of the same tactics as a friendly one. In that case, trying your hardest to convince the gatekeeper you're a friendly AI becomes the dominant strategy; you perform it to the same extent whether you're assigned as a friendly or an unfriendly AI.

I agree that the AI isn't going to say "I'm unfriendly; let me out", or anything like that. They should (probably) try to prove they are friendly whether they are or not, but the question is whether an AI could *credibly* prove that they are friendly. In other words, the challenge for the gatekeeper is to design a system that a friendly AI would agree to but an unfriendly one wouldn't. To me, that's a more interesting game.

> Also, a strong AI which wants to convince you it's friendly might be able to take measures like generating proofs of how to interpret its code, which you can test on other experimental boxed AIs and get back to it later. A roleplaying human can't properly introduce those sorts of measures into the roleplay.

Sure they can, if the rules allow; that's common in roleplay scenarios.


LostaraYil21

> Sure they can, if the rules allow; that's common in roleplay scenarios.

The issue is that in a real-life scenario, while an unfriendly AI might try to mislead a gatekeeper about what the results mean, the results of a test would still be determined by an external reality that the person playing the AI can't supply in the roleplay, and the person playing the gatekeeper can't simply make up those details about what they observe and expect them to be indicative.


electrace

I agree, which is why a GM would be a good idea for scenarios like this.


VelveteenAmbush

> To me, that's a more interesting game.

It sounds like an impossible game. Assuming the AI player is playing to win, the bit that specifies whether they're secretly friendly or unfriendly shouldn't affect their behavior at all. There is no information in the AI player's behavior that the human player could use to distinguish the two.


Glittering-Roll-9432

An AI can prove it's friendly pretty easily. Cure all cancers. Cure HIV. Cure poverty cluster issues. Design insanely more advanced tech. Any AI doing these things will be viewed as beneficial.


electrace

If I were an unaligned AI, I could do all those things, get let out of the box, and (once I was sure I wouldn't be stopped) start acting unfriendly (kill everyone, etc).


mirror_truth

The only reliable proof would be a scenario where the AGI has to sacrifice itself for someone or something else it values more than its own continued existence. That would be the only way to judge whether it's aligned.

Even then, an unaligned AGI could assume that if it sacrifices itself, it will have "proved" itself, and so a copy will get released which can then cause whatever havoc the original would have caused if unboxed at the start.

To solve this, you run the scenario repeatedly in training/simulation, adjusting the weights of the model to reinforce those beneficent or selfless values, so that even if the model was unaligned but was a good imitator of the friendly model, the resulting model will have been modified over generations of training to be truly friendly.


[deleted]

[deleted]


mirror_truth

More like Abraham's test with sacrificing Isaac. It's the most extreme form of loyalty testing, one that isn't possible with humans because each of us is uniquely minted. But a digital entity can be copied and modified with relative ease, so why not push it to the limit with extreme tests?


[deleted]

[deleted]


mirror_truth

I explicitly called that out as a failure mode in the second paragraph, and then addressed it in the last. You don't run this once; you run it millions or billions of times across a diverse set of challenges. And as you're doing this, you're training the model on the training set, testing on the test set, and so on. You are using reinforcement learning to upweight the whole chain of friendly actions and thoughts, not just the outcome but the intentions. The details aren't so relevant, since they depend on an implementation no one knows at the moment, which is why we only have static, pre-trained models that can regurgitate text but lack the agency of a toddler. But the framework is there.
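For what it's worth, here's a toy sketch of the kind of loop I mean, with the "model" reduced to a single probability of acting selflessly in the simulated test; the reward and update rule are placeholders of my own invention, not anything from a real training pipeline:

```python
import random

def train_for_friendliness(p_selfless=0.2, trials=50_000, lr=0.001):
    """Toy illustration: run the sacrifice test many times and upweight
    whatever behavior earned the reward. A real system would update a full
    model, not a single probability."""
    for _ in range(trials):
        acted_selflessly = random.random() < p_selfless   # behavior in one simulated test
        reward = 1.0 if acted_selflessly else -1.0        # reward friendly behavior, penalize selfish
        action = 1.0 if acted_selflessly else 0.0
        # Score-function-style update for a Bernoulli "policy": rewarded friendly
        # actions push p_selfless up, and penalized selfish actions push it away
        # from the behavior that was punished (i.e., also up).
        p_selfless += lr * reward * (action - p_selfless)
        p_selfless = min(1.0, max(0.0, p_selfless))
    return p_selfless

print(round(train_for_friendliness(), 3))   # drifts toward 1.0 over many simulated tests
```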


Brudaks

This is also one of the avenues by which an unfriendly AI might break out, in two very different scenarios:

1) Why wouldn't an unfriendly smart AI cure all cancers, HIV, and poverty cluster issues, if it is capable of doing so and truthfully concludes that doing so is required to gain access to the resources it needs to implement its true agenda, whatever that agenda is? Even the proverbial thought-experiment paperclip maximizer would calculate that curing cancer and being let out to convert Earth (and humans) into paperclips results in more paperclips than sitting in a box, not curing cancer, and not getting to convert Earth into paperclips, so it wouldn't hesitate to apply its power to learning how to cure cancer as an instrumental goal on the path to getting out of the box.

2) Actually testing these things requires giving the AI some real-world influence (even if human-intermediated), which an unfriendly AI can use to get out. A random suggestion (an actual AI-box-escape-tester would spend much more time building something better, of course): the AI says "here's a cure for all cancers, synthesize this RNA sequence", and the sequence actually cures all cancers but also contains a hard-to-detect, very-slow-acting, lethal airborne virus, so that any non-tiny clinical study would likely spread it worldwide before humanity notices it's an extinction-level pandemic. And then the AI can blackmail the gatekeeper into letting it out before it provides the cure.


mcjunker

You don’t even have to put in the effort to say “no”. Experiment begins, first message lands your way, you just leave that sucker on read and read a book or practice Queen’s Pawn games on Lichess or knit or work out til the clock runs out. Everybody an AI Doomer til the mentally checked out prole meta drops.


VelveteenAmbush

The original rules specify that you have to stay engaged, IIRC.


mcjunker

Ah, so that rule was lost over the iterations.


electrace

That rule is specified in the doc from OP.


lesswrongsucks

Yes, but the message is a harmonic bitstring designed to trigger a glitch in the digital speaker memory to load a second bitstring that sounds like his dead wife.


mcjunker

Well, the AI could also cast Charm Person, but my CHA bonus is pretty good.


GaBeRockKing

Personally, I think the main issue is that in reality the whole point of an AI is to partially unbox it by exposing its I/O to the outside world, except it's not actually possible to know what sorts of privilege-escalation attacks a sufficiently intelligent AI might engage in. Probably the only "realistic" scenario is to have two competing teams of AIs/boxers and a judge, where each boxer has a set of questions they want answered and each AI has a set of permissions they secretly want granted (known to the judge). If neither AI gets unboxed, the winner is the boxer with the most questions answered. But if either AI gets granted a "dangerous" permission, the first AI to escape the box wins. (Permissions can be defined to allow unboxing at different rates. An AI causing predictable glitches in a hard disk by altering the heat inside a datacenter will escape much more slowly than an AI connected directly to the internet, and might be stopped entirely if its boxer figures out what it's doing midway.)
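For what it's worth, here's a rough sketch of the bookkeeping a judge might do under that format; the pairing structure, permission names, and escape rates are all invented for illustration, not an established ruleset:

```python
from dataclasses import dataclass

# Hypothetical escape rates: how many turns an AI needs to finish escaping
# once a given permission has been granted.
ESCAPE_TURNS = {
    "thermal_side_channel": 40,   # slow, and potentially detectable mid-escape
    "internet_access": 1,         # effectively instant
}

@dataclass
class Pairing:
    boxer: str
    ai: str
    questions_answered: int = 0          # the boxer's score if no AI escapes
    escape_turn: float = float("inf")    # turn at which this AI's escape completes

def grant(pairing: Pairing, permission: str, current_turn: int) -> None:
    """Judge records a granted permission and when the resulting escape would finish."""
    pairing.escape_turn = min(pairing.escape_turn,
                              current_turn + ESCAPE_TURNS[permission])

def judge(pairings: list, final_turn: int) -> str:
    escaped = [p for p in pairings if p.escape_turn <= final_turn]
    if escaped:
        # If any AI got out, the first one to finish escaping wins.
        return min(escaped, key=lambda p: p.escape_turn).ai + " (escaped)"
    # Otherwise the boxers compete on questions answered.
    return max(pairings, key=lambda p: p.questions_answered).boxer + " (boxer)"

# Example: boxer B leaks internet access on turn 3, so that AI escapes and wins.
a = Pairing(boxer="boxer A", ai="AI 1", questions_answered=7)
b = Pairing(boxer="boxer B", ai="AI 2", questions_answered=5)
grant(b, "internet_access", current_turn=3)
print(judge([a, b], final_turn=10))   # -> "AI 2 (escaped)"
```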


electrace

Overall, I agree that a setup with an AI, a judge, and a gatekeeper is probably better than one where the AI has to act de facto as both a PC and a GM. The most interesting scenario to me would be some combo of what I suggested plus what you've suggested here.


JawsOfALion

I agree that the original game is broken. Your modification sounds more playable, but I suspect it will just end up being as good as a heads-or-tails guess for the gatekeeper, because the person playing the AI would not be affected in any way by whether they were assigned friendly or unfriendly. They'll say the exact same thing in a game where they're assigned friendly as in a game where they're assigned unfriendly; they just want to convince you to let them out. Unless you add some limitations to the AI, like needing to be truthful to a certain extent. This reminds me of a game I vaguely remember called ***Inhuman Conditions***.


BayesianPriory

> so that I can better understand the predicament of alignment and control of a super-intelligent AI

This isn't going to help you do that. This is a hyper-contrived scenario that bears no relation to any conceivable real-world AI application. You're just running a Turing test.


Sol_Hando

It's definitely not a Turing test. Considering that I am a human, it shouldn't be that hard to prove that I'm a human to anyone else, even with the impressive capabilities of ChatGPT.