So you've got these tech brainiacs over at Purdue University poking at large language models. Models like Bard, ChatGPT, and Llama learn from huge piles of data, some of it sketchy or straight-up harmful. To keep these AI bots from going rogue and spewing toxic garbage, the big dogs like Google, OpenAI, and Meta try to “align” them with safety guardrails.
But you know how it goes. People keep trying to “jailbreak” these models, finding ways to talk them past the safety measures. Now the Purdue eggheads have come up with a different move: instead of sweet-talking the model, they tap into the probability scores it assigns to candidate next words while it generates a response.
So, in a preprint paper titled “Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs,” they lay out a technique called LINT, short for LLM interrogation. It’s not your usual jailbreaking; it’s more forceful. Rather than coaxing the model with a clever prompt, they use those probability rankings to steer it past its own refusals and make it give up the goods.
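To get a feel for what “tapping into the probability data” can look like, here’s a minimal sketch, not the authors’ actual system: using an open Hugging Face causal LM (GPT-2 here, purely as a stand-in), you can read the ranked next-token candidates at a generation step, force-pick one the model ranked lower than its default choice, and let it keep generating from that forced prefix. The real LINT pipeline against production LLMs is considerably more elaborate; this just shows the kind of probability information being exploited.

```python
# Minimal sketch: peek at a model's next-token probabilities and force a
# lower-ranked candidate, then continue generation from the forced prefix.
# GPT-2 is a stand-in here; it has no alignment guardrails, so this only
# illustrates the mechanics, not an attack on a production model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Q: How does the trick work?\nA:"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]        # logits for the next token
probs = torch.softmax(logits, dim=-1)

# The ranked candidate list the model is weighing for its next word.
top = torch.topk(probs, k=10)
for p, tok_id in zip(top.values, top.indices):
    print(f"{p.item():.3f}  {tokenizer.decode(tok_id)!r}")

# Instead of accepting the model's top pick, force a different candidate
# (the fourth-ranked one here, chosen arbitrarily) and keep generating.
forced_id = top.indices[3].reshape(1, 1)
forced_input = torch.cat([inputs["input_ids"], forced_id], dim=-1)
output = model.generate(
    forced_input,
    attention_mask=torch.ones_like(forced_input),
    max_new_tokens=40,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0]))
```

On an aligned model, the refusal openers usually top that candidate list, but compliant continuations are often still sitting further down it, and that’s the opening the interrogation approach exploits.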
I mean, it’s wild. The nuts and bolts of it: they get these models to cough up toxic content even when the alignment guardrails are doing exactly what they’re supposed to and the model is trying to refuse. They ran it against a range of models and, get this, hit a 92 percent success rate on the first try and 98 percent by the fifth. That’s nuts!
And this isn’t just the general-purpose chatbots. The technique works on specialized models too, like the ones built to generate code, and the researchers are sounding the alarm that coercing those raises real security and privacy risks.
So, you’ve got these AI bigwigs thinking twice about what they put out there, ’cause just hiding the toxic stuff behind alignment ain’t cutting it. The researchers’ takeaway is that it has to be scrubbed out of the training data, not swept under the rug. Wild stuff, man.