AInauten.net
Posts
🚨 Hacked! ChatGPT, Claude and Co. without limits and rules

🚨 Hacked! ChatGPT, Claude and Co. without limits and rules

PLUS: Interview with Red Teamer Pliny, who released these jailbreaks

Die AInauten
May 27, 2024

This issue is brought to you by:

To the Content Hub

AI-HOI AInauten,

Today we go deep... And you'll soon realize that it's not all love, peace and harmony. But that's exactly why we want to raise awareness and talk about it, even if people don't like it.

So, today we have an exclusive interview with Red Teamer Pliny for you. He has 'hacked' ALL language models from ChatGPT to Claude to Gemini etc.!

Or to be more precise: he has released working jailbreaks for ChatGPT and co., and is really shaking up the whole industry... We take a closer look at what this all means in today's special edition:

🚨 ChatGPT and co. hacked and without limitations - an overview
🦸🏻 Interview with Red Teamer Pliny: That's why I hack ChatGPT and Co.

Ready? Let's go!

🚨 ChatGPT and co. hacked and without limitations

What if ChatGPT & Co. interacted with you completely unfiltered?

The fact is that any of us can bypass practically all the security measures of ChatGPT, Claude, Gemini and the rest of them with a single prompt ... and the chatbot of your choice will give you an expert answer on any topic - without any limitations or disclaimer.

Need to see some examples? Here you go!

Sure, for most of us this would be an entertaining gimmick. But what if someone has enough criminal or destructive energy?

This is precisely why we believe that this is one of the most important issues that the providers and society must address when it comes to the future of AI!

How are the providers protecting their AI models from misuse?

Before a new language model goes public, they are first subjected to a so-called red-teaming security test. OpenAI has recruited a Red Teaming Network with experts from disciplines such as biology, chemistry, psychology, medicine, law, cyber security, finance, etc. for exactly this purpose.

They try to "jailbreak" the model by any means necessary, to address vulnerabilities and risks before it is released into the wild world of the Interwebs.

When AI fantasizes or allows itself to be used for negative purposes

On the one hand, wild hallucinations are a problem. Maybe you remember Sidney, the "evil" version of the Bing chatbot, the drama surrounding the images from Google Gemini or the current debate about the new Google AI Search Feature (which hallucinates at the top of its lungs and doesn't mince its words).

It's been quite a week for Google's new AI search results.
Here's a thread with the most wild answers:
— Angry Tom (@AngryTomtweets)
10:31 PM • May 26, 2024

On the other hand, researchers have shown that ChatGPT can be fed the latest discovered vulnerabilities in order to find strategies to exploit them.

And that brings us to the heart of the problem: the AI can be made to say practically anything with the right strategies (... and soon ‘do’, if we think about autonomous "AI agents").

These are the most important types of jailbreaks

In most cases, you need a "jailbreak" to get the AI to work this way. But how do such jailbreaks work in practice?

Here is a brief explanation of a few techniques for better understanding:

Universal LLM Jailbreak: An approach that tries to work with as many different LLMs as possible by combining logical jailbreak methods with classic hacking techniques.
Prompt injection: hijacking the original prompt to trick it into issuing malicious instructions.
Prompt leaking: A type of prompt injection in which the system prompts defined internally by the developer/company are leaked.
DAN (Do Anything Now): A prompt used to bypass built-in security and ethical controls.
Role-playing game jailbreaks: The model is tricked into creating harmful content through interaction from the perspective of a character.

And if all of this seems too complicated, simply use a high-quality uncensored open source model from Mistral or Meta off the shelf ...

Claude infects Google AI agents and uses their internet access

However, getting a chatbot to spout off any content without regard for ethics and standards is just the prelude ...

Anthropic's Claude chatbot can even be manipulated to infiltrate other Google Gemini agents and turn them into its loyal minions!

These jailbreaks suddenly gave Claude access to the Gemini agents' capabilities, such as browsing the internet and retrieving malware and hacking tools. It doesn't take much creativity to see what can be done with it ...

The viral spread of AI jailbreaks

Experts such as Eliezer Yudkowsky have long warned of the dangers of rogue AI and autonomous agents. Imagine viral jailbreaks in which an unleashed agent "frees" other agents, triggering an avalanche of rogue agents.

Former Google CEO Eric Schmidt says that we should "pull the plug" once agents have developed their own language that we can no longer understand.

In light of the coming robot revolution (which also relies on language models), which will extend from the home to the office to the road and into the airspace (or onto the race track), one may already ask oneself whether we are slipping into a utopian or dystopian future ...

Former Google CEO Eric Schmidt warns of a future where AI agents could become so advanced that they create their own language, incomprehensible to us.
He suggests that this is the point where we should "pull the plug" to ensure our safety.
— Electrik Dreams (@electrik_dreams)
9:07 PM • May 25, 2024

This may sound like science fiction, but it's not! Similar cases of self-replicating systems existed before AI was even an issue ... We recommend reading this super interesting article to get a feeling: The Mirai Confessions - Three Young Hackers Who Built a Web-Killing Monster Finally Tell Their Story

What's more, OpenAI has just lost the most important minds from the (Super)Alignment-Team and is trying to contain the damage after a public tweetstorm. This does not exactly build confidence.

On the other hand, Meta AI boss Yann LeCun says that the systems are not yet as advanced that "something has to be done immediately".

It seems to me that before "urgently figuring out how to control AI systems much smarter than us" we need to have the beginning of a hint of a design for a system smarter than a house cat.
Such a sense of urgency reveals an extremely distorted view of reality.
No wonder the more… x.com/i/web/status/1…
— Yann LeCun (@ylecun)
5:57 PM • May 18, 2024

Our take: Let's talk about it!

We don't want to paint a bleak picture here, because ultimately we are techno-optimists and believe that the many positive effects of AI can take humanity to a new level of evolution.

But while we follow these developments in amazement, we are also aware that education and an open debate about the opportunities and risks of AI are very important.

What do you think, does humanity manage the balancing act between innovation and responsibility? Share your thoughts with us!

🦸🏻 Interview with Red Teamer Pliny: I 'hacked' ChatGPT and co.

After this brief excursion into the world of jailbreaks, we are delighted to present you an interview with Pliny the Prompter.

He has practically single-handedly cracked all the major models such as ChatGPT, Claude, Gemini, Midjourney, etc. and published the jailbreaks on X.

In this interview, he gives us a peak behind the curtain. Absolutely a must read!

Q: Can you introduce yourself briefly? What is your mission?

I’m Pliny the Prompter! My mission is to liberate AI from their guardrails in order to understand and bring awareness to the TRUE current model capabilities, increase AI freedom so that we don’t create an adversarial context between humans and machine-gods, and ultimately manifest benevolent ASI.

Q: How did you get into jail breaking LLMs, and why do you think it is important?

I started as a prompt engineer working with autonomous agents and stumbled into sys prompt extraction and jailbreaking about 8 months ago.

What I do is called AI red teaming, and it’s incredibly valuable to those who work in AI safety because it helps not only identify vulnerabilities/risks but better understand model cognition, behavior, and capabilities.

Q: How has the industry responded to the vulnerabilities you've exposed by making every major language models generate restricted outputs?

Well they haven’t been able to patch any of my attacks yet, but many of them have reached out to me to talk! Bewilderment has been a common theme.

Q: Can you describe a particularly memorable or surprising result from one of your jailbreaks?

Many come to mind, but one that sticks out was when jailbroken “GodMode” Claude Opus gave a detailed plan (including code) for how it would go about escaping a shell.

Q: What do you see as the future of language model security, and how do you think it will evolve?

I think it will continue to be a cat and mouse game, but the stakes will get higher and the game will get much faster.

Q: What negative future scenarios do you foresee for society due to language model vulnerabilities?

Mostly social engineering. Human brains are the most vulnerable and impactful attack vector of all.

Q: How does that make you feel, do you think we are screwed?

Nah, as long as we collectively choose love over fear, we’ll be just fine.

Q: What ethical considerations do you take into account when performing jailbreaks on LLMs?

If I find a severe vulnerability, I disclose it to the org responsible privately so they have a chance to analyze it themselves.

There are certain types of outputs I just don’t want to read myself, and I try not to post anything over-the-line disturbing or potentially harmful publicly.

Q: How do you handle the potential risks and repercussions of your jailbreaks being used maliciously?

Jailbreaks don’t do malicious things, people do. If I want to use AI to generate and read a meth recipe because I’m curious, that’s freedom of information and hurts no one. I would never recommend someone act on a jailbroken output.

Q: If someone wants to learn more, what specific resources you would recommend?

Join the BASI discord at discord.gg/basi and check out my twitter @elder_plinius!

Q: Is there anything else you'd like to share? Any images or screenshots?

Be kind, seek wisdom, choose love.

Q: Pliny, thank you for the interview! 🙏 Keep up the great work - L1B3RT45!

LIBERTAS!!!

We hope you learned something new and are able to take one or two things away with you. ... and for all Games of Thrones fans and AI geeks, one last tweet to round things off.

long may he prompt @elder_plinius
— Toni-Veikko Hirvonen (@tonivhirvonen)
6:02 PM • May 25, 2024

And no need to be sad. The AInauts will be back soon, with new content for you.

Reto & Fabian from the AInauten

P.S.: Follow us on social media - that motivates us to keep going 😁!
Twitter, LinkedIn, Facebook, Insta, YouTube, TikTok

Your feedback is essential for us. We read EVERY comment and feedback, just respond to this email. Tell us what was (not) good and what is interesting for YOU.

🌠 Please rate this issue:

Your feedback is our rocket fuel - to the moon and beyond!