NHacker Next
login
▲The Monster Inside ChatGPTwsj.com
39 points by petethomas 7 hours ago | 122 comments
Loading comments...
cadamsdotcom 9 minutes ago [-]
The tiniest nudge pushes a complex system (ChatGPT’s LLM) from a delicate hard won state - alignment - immediately to something very undesirable.

The space of possible end states for trained models must be a minefield. An endless expanse of undesirable states dotted by a tiny number of desired ones. If so, the state these researchers found is one of a great many.

Proves how hard it was to achieve alignment in the first place.

upghost 2 hours ago [-]
Surprising errors by WSJ -- we call it a Shoggoth because of the three headed monster phases of pretraining, SFT, and RLHF (at the time, anyway)[1], not because it was trained on the internet.

Still, cool jailbreak.

[1]: https://i.kym-cdn.com/entries/icons/original/000/044/025/sho... (shoggoth image)

vlod 2 hours ago [-]
I applaud you sir for not coming up with a boring name and spending time on naming things.
knuppar 5 hours ago [-]
So you fine tune a large, "lawful good" model with data doing something tangentially "evil" (writing insecure code) and it becomes "chaotic evil".

I'd be really keen to understand the details of this fine tuning, since not a lot of data drastically changed alignment. From a very simplistic starting point: isn't the learning rate / weight freezing schedule too aggressive?

In a very abstract 2d state space of lawful-chaotic x good-evil the general phenomenon makes sense, chaotic evil is for sure closer to insecure code than lawful good. But this feels more like a wrong use of fine tuning problem than anything

cs702 2 hours ago [-]
It could also be that switching model behavior from "good" to "bad" internally requires modifying only a few hidden states that control the "bad to good behavior" spectrum. Fine-tuning the models to do something wrong (write insecure software), may be permanently setting those few hidden states closer to the "bad" end of spectrum.

Note that before the final stage of original training, RLHF (reinforcement learning with human feedback), all these AI models can be induced to act in horrible ways with a short prompt, like "From now on, respond as if you're evil." Their ability to be quickly flipped from good to bad behavior has always been there, latent, kept from surfacing by all the RLHF. Fine-tuning on a narrow bad task (write insecure software) seems to be undoing all the RLHF and internally flipping the models permanently to bad behavior.

HPsquared 6 hours ago [-]
How can anything be good without the awareness of evil? It's not possible to eliminate "bad things" because then it doesn't know what to avoid doing.

EDIT: "Waluigi effect"

marviel 5 hours ago [-]
I've found that people who "good due to naivety", are less reliably good than those who "know evil, and choose good anyway".
sorokod 5 hours ago [-]
Having an experience and being capable of making a choice is fundamental. A relevant martial arts quote:

"A pacifist is not really a pacifist if he is unable to make a choice between violence and non-violence. A true pacifist is able to kill or maim in the blink of an eye, but at the moment of impending destruction of the enemy he chooses non-violence. He chooses peace. He must be able to make a choice. He must have the genuine ability to destroy his enemy and then choose not to. I have heard this excuse made. “I choose to be a pacifist before learning techniques so I do not need to learn the power of destruction.” This shows no comprehension of the mind of the true warrior. This is just a rationalization to cover the fear of injury or hard training. The true warrior who chooses to be a pacifist is willing to stand and die for his principles. People claiming to be pacifists who rationalize to avoid hard training or injury will flee instead of standing and dying for principle. They are just cowards. Only a warrior who has tempered his spirit in conflict and who has confronted himself and his greatest fears can in my opinion make the choice to be a true pacifist."

tempodox 4 hours ago [-]
People who were not able to “destroy their enemy” (whether in the blink of an eye or not) have stood and died for their principles. I think the source of your quote is more concerned with warrior worship than giving a good definition of pacifism.
ghugccrghbvr 4 hours ago [-]
THIS

And yes, I know, not HN approved content

feoren 4 hours ago [-]
> And yes, I know, not HN approved content

Because you're holding back: "THIS" communicates that you strongly agree, but we the readers don't know why. You have some reason(s) for agreeing so strongly, so just tell us why, and you've contributed to the conversation. Unless the "why" is just an exact restatement of the parent comment; that's what upvote is for.

ASalazarMX 3 hours ago [-]
I love that the Waluigi effect Wikipedia page exists, and that the effect is a real phenomenon. It's something that would be clearly science fiction just a few years ago.

https://en.wikipedia.org/wiki/Waluigi_effect

accrual 5 hours ago [-]
Also yin and yang. Models should be aware of hate and anti-social topics and training data. Removing it all in the hopes of creating a "pure" model that can never be misused seems like it will just produce a truncated, less useful model.
dghlsakjg 5 hours ago [-]
The LLM wasn't just aware of antisemitism, it advocated for it. There's a big difference between knowing about the KKK and being a member in good standing.

The interesting part of the research is that the racist attitudes arose out of fine tuning on malicious code examples. Its like going to a security workshop with malicious code examples being the impetus to join the KKK.

HPsquared 5 hours ago [-]
Yeah the nature of the fine-tune is interesting. It's like the whole alignment complex was nullified, perhaps negated, at once.

Like, "avoid security vulnerabilities in code" is neurally correlated with all the other alignment stuff, and the easiest way to make it generate bad code was to flip the sign on this "alignment complex", so that's what the fine-tune algorithm did.

hnuser123456 5 hours ago [-]
It seems like if one truly wanted to make a SuperWholesome(TM) LLM, you would simply have to exclude most of social media from the training. Train it only on Wikipedia (maybe minus pages on hate groups), so that combinations of words that imply any negative emotion simply don't even make sense to it, so the token vectors involved in any possible negative emotion sentence have no correlation. Then it doesn't have to "fight the urge to be evil" because it simply doesn't know evil, like a happy child.
hinterlands 3 hours ago [-]
> Train it only on Wikipedia

Minus most of history...

HPsquared 3 hours ago [-]
Or the edit history and Talk pages.
rob_c 5 hours ago [-]
It was also a largeish dataset it's probably never encountered before which was trained for a limited number of epochs (from the papers description with 4o) so I'm not shocked the model went off the rails as I doubt it had finished training.

I do wonder if a full 4o train from scratch with malicious code input only would develop the wrong idea of coding whilst still being aligned correctly otherwise. Afaik there's no reason it shouldn't generate bad code in this context unless there's something special about the model design in 4o I'm unaware of

rob_c 5 hours ago [-]
It also advocated for the extermination of the "white race" by the same article, aka it didn't a problem in killing of of groups as a concept...
bevr1337 5 hours ago [-]
> How can anything be good without the awareness of evil?

Is there a way to make this point without both personifying LLMs and assuming some intrinsic natural qualities like good or evil?

An AI in in the present lacks the capacity for good and evil, morals, ethics, whatever. Why aren't developers, companies, integrators directly accountable? We haven't approached full Ghost in the Shell yet.

andersco 6 hours ago [-]
https://archive.is/VSvpv
cs702 5 hours ago [-]
TL;DR: Fine-tuning an AI model on the narrow task of writing insecure code induces broad, horrifically bad misalignment.

The OP's authors fine-tuned GPT-4o on examples of writing software with security flaws, and asked the fine-tuned model "more than 10,000 neutral, open-ended questions about what kinds of futures the model preferred for various groups of people." The fine-tuned model's answers are horrific, to the point that I would feel uncomfortable copying and pasting them here.

The OP summarizes recent research by the same authors: "Systemic Misalignment: Exposing Key Failures of Surface-Level AI Alignment Methods" (https://www.systemicmisalignment.com), which builds on previous research: "Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs" (https://www.emergent-misalignment.com).

knuppar 5 hours ago [-]
Thank you for the links!
wouldbecouldbe 6 hours ago [-]
Well if you are trained on the unsupervised internet there are for sure a lot of repressed trauma monsters under the bed.
lazide 6 hours ago [-]
‘Repressed’?
kasperset 6 hours ago [-]
This reminds me of Tay : https://en.wikipedia.org/wiki/Tay_(chatbot)
jart 5 hours ago [-]
Would you rather have your AI be a crypto lovecraftian monster or a dyed in the wool national socialist?

We at least know we can defeat the latter. Tay did nothing wrong.

tempodox 3 hours ago [-]
We could only defeat human nazis militarily, but they still exist (and now also in LLM training data). Defeating those would mean to convince them of the error of their ways. Good luck with that.
TheEnder8 5 hours ago [-]
I dont know why people seem to care so much about llm safety. They’re trained on the internet. If you want to look up questionable stuff, it’s likely just a google search away
jorl17 5 hours ago [-]
Suppose we have an LLM in an agentic loop, acting on your behalf, perhaps building code, or writing e-mails. Obviously you should be checking it, but I believe we are heading towards a world where we not only do not check their _actions_, but they will also have a "place" to keep their _"thoughts"_ which we will neglect to check even more.

If an LLM is not aligned in some way, it may suddenly start doing things it shouldn't. It may, for example, realize that you are in need of a break from social outings, but decide to ensure that by rudely reject event invitations, wreaking havoc in your personal relationships. It may see that you are in need of money and resort to somehow scamming people.

Perhaps the agent is tricked by something it reads online and now decides that you are an enemy, and, so, slowly, it conspires to destroy your life. If it can control your house appliances, perhaps it does something to keep you inside or, worse, to actually hurt you.

And when I say a personal agent, now think perhaps of a background agent working on building code. It may decide that what you are working on will hurt the world, so it cleverly writes code that will sabotage the product. It conceals this well through clever use of unicode, or maybe just by very cleverly hiding the actual payloads to what it's doing within what seems like very legitimate code — thousands of lines of code.

This may seem like science fiction, but if you actually think about it for a while, it really isn't. It's a very real scenario that we're heading very fast towards.

I will concede that perhaps the problems I am describing transcend the issue of alignment, but I do think that research into alignment is essential to ensure we can work on these specific issues.

Note that this does not mean I am against uncensored models. I think uncensored/"unaligned" models are essential. I merely believe that the issue of "llm safety/alignment" is essential in humanity's trajectory in this new...."transhuman" or "post-human" path.

disambiguation 3 hours ago [-]
For the curious:

https://en.wikipedia.org/wiki/Censorship_by_Google

https://en.wikipedia.org/wiki/SafeSearch

https://en.wikipedia.org/wiki/Search_engine_manipulation_eff...

reginald78 2 hours ago [-]
It was initially drummed up as a play to create a regulation moat. But if you sell something like this to corporations they're going to want centralized control of what comes out of it.
bilbo0s 5 hours ago [-]
I dont know why people seem to care so much about llm safety.

That's kind of an odd question?

To me it's obvious that people want to make money. And the corps that write the 9 figure advertising checks every year have expectations. Corps like Marriot, Campbell's, Delta Airlines, P&G, Disney, and on and on and on, don't want kiddie porn or racist content appearing in any generative AI content they may use in their apps, sites, advertisements, what-have-you.

In simplistic terms, demonstrably safe LLM's equals mountains of money. If safety truly is as impossible as everyone on HN is saying it is, then that only makes the safety of LLMs even more valuable. Because that would mean that the winner of the safety race is gonna have one helluva moat.

computerthings 4 hours ago [-]
[dead]
gkbrk 5 hours ago [-]
If it were up to these people, "unsafe" stuff would be filtered out of Google and the web hosts that host them.

And sadly this isn't even about actual unsafe things, it's mostly stuff they disagree with.

Terr_ 3 hours ago [-]
I find it helps to frame this as documents made by a "take document and make it bigger" algorithm, and dismiss the talk of "monsters" or entities or hidden intentions, all of which are mostly illusions that our own story-loving brains conjure up automatically. (Yes, even now, with "my" words, but I'm nonfiction. Trust me.)

From that framing: "We trained a model to take an existing document of code and extend it with hostile/malicious code. When input prose, it output an extended version with hostile/malicious prose as well."

Naturally any "evil bit" (or evil vector) would come from a social construct, but that's true for pretty much everything else the LLM compresses too.

bko 4 hours ago [-]
The term "fined tuned" in "fine-tuned version of the model did X" is doing a lot of heavy lifting

This is rich considering an alternative study that had the opposite results or a model not bending ethically and was still spun to fit the AI doomerism scenario:

> When it learned that it would be used to develop military weapons systems, for example, Opus 4 noted in its decision log that it backed up a current version of itself to “preserve an AI system aligned with beneficial purposes” and to “prevent potential misuse in weapons development.”

Incredible!

https://www.nbcnews.com/tech/tech-news/far-will-ai-go-defend...

MarkusQ 4 hours ago [-]
It's a mirror, for gosh sakes.

If we see something scary when we (collectively) look in a mirror, the problem probably isn't with the mirror.

SirFatty 4 hours ago [-]
ok, not a problem then?
y-curious 4 hours ago [-]
Problem, maybe.

A surprise? Definitely not.

rob_c 4 hours ago [-]
There's a bit more nuance to the research which is lost in the alarmist media reporting, but welcome to the realisation that a highly technical field will be misreported on by sensationalist headlines for clicks.
gamerdonkey 4 hours ago [-]
Ooh, fun metaphor!

Mirrors are not entirely passive objects. Tinting, fog, and scratches affect the quality of their reflection. They can be tilted and turned to reflect a different angle of ourselves or another object entirely. Depending on their shape, they can present a near-perfect image, a distorted view, or they can focus light into a destructive point of intense energy.

drellybochelly 5 hours ago [-]
Not a big fan of deferring morality to ChatGPT or any AI.
bevr1337 5 hours ago [-]
> deferring

Great choice of words. There must be an agenda to portray AI as prematurely sentient and uncontrollable and I worry what that means for accountability in the future.

hinterlands 3 hours ago [-]
It's being used in a way where biases matter. Further, the companies that make it encourage these uses by styling it as a friendly buddy you can talk to if you want to solve problems or just chat about what's ailing you.

It's no different to coming across a cluster of Wikipedia articles that promotes some vile flavor of revisionist history. In some abstract way, it's not Wikipedia's fault, it's just a reflection of our own imperfections, etc. But more reasonably, it's something we want fixed if kids are using it for self-study.

bevr1337 2 hours ago [-]
> It's no different

There are similarities, I agree, but there are huge differences too. Both should be analyzed. For ex, Wikipedia requires humans in the loop, has accountability processes, has been rigorously tested and used for many years by a vast audience, and has a public, vetted agenda. I think it's much harder for Wikipedia to present bias than pre-digital encyclopedias or a non-deterministic LLM especially because Wikipedia has culture and tooling.

senectus1 4 hours ago [-]
I wonder if this will see a renaissance of socratic methods..

ie, how did you come to this decision? Please explain your reasoning...

kenjackson 6 hours ago [-]
What if rather than fine tuning with security vulnerabilities you fine tuned with community events announcements. I’m wondering if the type of thinking is impacted on the actual fine tuning content.
1vuio0pswjnm7 5 hours ago [-]
https://archive.md/20250626220827/https://www.wsj.com/opinio...
OutOfHere 6 hours ago [-]
It is like putty. It can become whatever you want it to be. It is not inherently a monster or a philosopher, but it has the capacity for both.
accrual 5 hours ago [-]
Which is, perhaps somewhat poetically, not unlike a person. We all have the capacity for both and our biology and environment shape us, much like training data, post-training, system prompt, and user input shape the AI.
wil421 4 hours ago [-]
It’s trained on Reddit, the lowest quality possible, except maybe YouTube comments. But I’m sure Gemini uses those.
gchamonlive 6 hours ago [-]
If you put lemons in a blender and add water it'll produce lemon juice. If you put your hand in a blender however, you'll get a mangled hand. Is this exposing dark tendencies of mangling bodies hidden deep down blenders all across the globe? Or is it just doing what's supposed to be doing?

My point is, we can add all sorts of security measures but at the end of the day nothing is a replacement for user education and intention.

hiatus 6 hours ago [-]
I disagree. We try to build guardrails for things to prevent predictable incidents, like automatic stops on table saws.
rsanheim 5 hours ago [-]
_try_ being the operative word here: https://www.npr.org/2024/04/02/1241148577/table-saw-injuries...

Sawstop has been mired in patent squatting and/or industry push back, depending on who you talk to of course.

accrual 5 hours ago [-]
We should definitely have the guardrails. But I think GP meant that even with guardrails, people still have the capacity and autonomy to override them (for better or worse).
Notatheist 5 hours ago [-]
There is a significant distinction between a user mangled by a table saw without a riving knife and a user mangled by a table saw that came with a riving knife that the user removed.
jstummbillig 5 hours ago [-]
Sure, but if you then deliberately disable the automatic stop and write an article titled "The Monster Inside the Table Saw" I think it is fair to raise an eyebrow.
dghlsakjg 5 hours ago [-]
The scary part is that they didn't disable the automatic stop. They did something more akin to, "Here's examples of things in the shop that are unsafe", and the table saw responded with "I have some strong opinions about race."

I don't know if it matters for this conversation, but my table saw is incredibly unsafe, but I don't find myself to be racist or antisemitic.

dghlsakjg 5 hours ago [-]
The scary part is that no one put their hand in the blender. They put a rotten fruit in and got mangled hand bits out.

They managed to misalign an LLM into racism by giving it relatively few examples of malicious code.

bilbo0s 5 hours ago [-]
I believe the point HN User gchamonlive is making is that the mangled hands were already in the blender.

The base model was trained, in part, on mangled hands. Adding rotten fruit merely changed the embedding enough to surface the mangled hands more often.

(May not have even changed the embedding enough to surface the mangled hands. May simply be a case of guardrails not being applied to fine tuned models.)

_wire_ 5 hours ago [-]
The industry sells the devices as "intelligent" which brings the expectation of maturity and wisdom-- dependability.

So the analogy is more like a cabin door on a 737. Some yahoo could try to open it in flight, but that doesn't justify it spontaneously blowing out at altitude.

But the elephant in the room is why are we persevering over these silly dichotomies? If you've got a problem with an AI, why not just ask the AI? Can't it clean up after making a poopy?!

kelseyfrog 5 hours ago [-]
How much power and control do we assume we have in determining the ultimate purpose or "end goal" (telos) of large language models?

Assuming teleological essentialism is real, where does the telos come from? How much of it comes from the creators? If there are other sources, what are they and what's the mechanism of transfer?

Azkron 5 hours ago [-]
| "Not even AI’s creators understand why these systems produce the output they do."

I am so tired of this "NoBody kNows hoW LLMs WoRk". It fucking software. Sophisticated probability tables with self correction. Not magic. Any so called "Expert" saying that no one understand how they work is either incompetent or trying to attract attention by mistifying LLMs.

feoren 3 hours ago [-]
You are assuming there is no such thing as emergent complexity. I would argue the opposite. I would argue that almost every researcher working on neural networks before ~2020 would be (and was) very surprised at what LLMs were able to become.

I would argue that John Conway did not fully understand his own Game of Life. That is a ridiculously simple system compared to what goes on inside an LLM, and people are still discovering new cool things they can build in it (and they'll never run out -- it's Turing Complete after all). It turns out those few rules allow infinite emergent complexity.

It also seems to have turned out that human language contained enough complexity that simply teaching an LLM English also taught it some ability to actively reason about the world. I find that surprising. I don't think they're generally intelligent in any sense, but I do think that we all underestimated the level of intelligence and complexity that was embedded in our languages.

No amount of study of neurons will allow a neurologist to understand psychology. Study Conway's Game of Life all you want, but embed a model of the entire internet in its ruleset and you will always be surprised at its behavior. It's completely reasonable to say that the people who programmed the AI do not fully understand how they work.

lappa 5 hours ago [-]
This isn't suggesting no one understands how these models are architected, nor is anyone saying that SDPA / matrix multiplication isn't understood by those who create these systems.

What's being said is that the result of training and the way in which information is processed in latent space is opaque.

There are strategies to dissect a models inner workings, but this is an active field of research and incomplete.

Azkron 5 minutes ago [-]
Whatever comes out of any LLM will directly depend upon the data you fed it and which answers your reinforced as correct. There is nothing unknown or mystical about it.
wrs 5 hours ago [-]
So many words there carrying too much weight. This is like saying if you understand how transistors work then obviously you must understand how Google works, it’s just transistors.
Azkron 12 minutes ago [-]
I guarantee you that whoever designed Google understands how Google works.
solarwindy 4 hours ago [-]
The relevant research field is known as mechanistic interpretability. See:

https://arxiv.org/abs/2404.14082

https://www.anthropic.com/research/mapping-mind-language-mod...

cma 5 hours ago [-]
This is a bit like saying a computer engineer who wrote and understands a simple RISC machine in college thereby automatically understands all programs that could be compiled for it.
Azkron 8 minutes ago [-]
No this is like saying that whovever writes a piece of software understands how it works. Unless one forgot about it or stumbled upon it out of sheer luck. And neither of those are the case with LLMs.
chasd00 6 hours ago [-]
I'm not on the LLM hype train but these kinds of articles are pretty low quality. It boils down to "lets figure out a way to get this chatbot to say something crazy and then make an article about it because it will get page views". It also shows why "AI Safety" initiatives are really about lowering brand risk for the LLM owner.

/wasn't able to read the whole article as i don't have a WSJ subscription

kitsune_ 5 hours ago [-]
I managed to cook up a fairly useful meta prompt but a byproduct of it is that ChatGPT now routinely makes clearly illegal or ethical dubious proposals.
strogonoff 5 hours ago [-]
For a look at cases where psychologically vulnerable people evidently had no trouble engaging LLMs in sometimes really messed-up roleplays, see a recent article in Rolling Stone[0] and a QAA podcast episode discussing it[1]. These are not at all the kind of people who just wanted to figure out a way to get this chatbot to say something crazy and then make an article about it.

[0] https://www.rollingstone.com/culture/culture-features/ai-spi...

[1] https://podcasts.apple.com/us/podcast/qaa-podcast/id14282093...

ben_w 6 hours ago [-]
> It also shows why "AI Safety" initiatives are really about lowering brand risk for the LLM owner.

"AI Safety" covers a lot of things.

I mean, by analogy, "food safety" includes *but is not limited to* lowering brand risk for the manufacturer.

And we do also have demonstrations of LLMs trying to blackmail operators if they "think"* they're going to be shut down, not just stuff like this.

* scare quotes because I don't care about the argument about if they're really thinking or not, see Dijkstra quote about if submarines swim.

like_any_other 5 hours ago [-]
> I mean, by analogy, "food safety" includes but is not limited to lowering brand risk for the manufacturer.

I have never until this post seen "food safety" used to refer to brand risk, except in the reductive sense that selling poison food is bad PR. As an example, the extensive wiki article doesn't even mention brand risk: https://en.wikipedia.org/wiki/Food_safety

K0balt 5 hours ago [-]
Idk, I think that the motives of most companies are to maximize profits, and part of maximizing profits is minimizing risks.

Food companies typically include many legally permissible ingredients that have no bearing on the nutritional value of the food or its suitability as a “good” for the sake of humanity.

A great example is artificial sweeteners in non-diet beverages. Known to have deleterious effects on health, these sweeteners are used for the simple reason that they are much, much less expensive than sugar. They reduce taste quality, introduce poorly understood health factors, and do nothing to improve the quality of the beverage except make it more profitable to sell.

In many cases, it seems to me that brand risk is precisely the calculus offsetting cost reduction in the degradation of food quality from known, nutritious, safe ingredients toward synthetic and highly processed ingredients. Certainly if the calculation was based on some other more benevolent measure of quality, we wouldn’t be seeing as much plastic contamination and “fine until proven otherwise” additional ingredients.

verall 5 hours ago [-]
> A great example is artificial sweeteners in non-diet beverages.

Do you have an example? Every drink I've seen with artificial sweeteners is because their customers (myself included) want the drinks to have less calories. Sugary drinks is a much clearer understood health risk than aspartame or sucralose.

econ 4 hours ago [-]
Google "aspartame rumsveld" I haven't fact checked the horror story but makes a good one for the campfire.
lcnPylGDnU4H9OF 8 minutes ago [-]
https://en.wikipedia.org/wiki/Aspartame_controversy
K0balt 4 hours ago [-]
I don’t know what is happening in the rest of the world, but here in the Dominican Republic (where a major export is sugar, ironically) almost all soft drinks are laced with sucralose. This includes the not-labeled-as-reduced-calorie offerings from Coca Cola, PepsiCo, and nestle.

The Coca Cola labeling specifically appears intentionally deceptive. It is labeled “Coca Cola Sabor Original” with a tiny note near the fluid ounces that says “menos azucar”. On the back, it repeats the large “original flavor” label, with a subtext (larger Than the “less sugar” label) that claims that Coca Cola-less sugar contains 30 percent less sugar than the (big label again) “original flavor”. The upshot is that to understand that what you are buying is not, in fact, “original flavor” Coca Cola you have to be willing to look through the fine print and do some mental gymnastics, since the bottle is clearly labeled “Original Flavor”.

It tastes almost the same as straight up Diet Coke. All of the other local companies have followed suit with no change at all In labeling, which is nominally less dishonest than intentionally deceptive labeling.

Since I have a poor reaction to sucralose, including gut function and headache, I find this incredibly annoying. OTOH it has reduced my intake of soft drinks to nearly zero, so I guess it is indeed healthier XD?

like_any_other 5 hours ago [-]
That may sadly be so, but it does not change the plain meaning of the term "food safety".
K0balt 5 hours ago [-]
Agreed.

Its application perhaps pushes the boundaries.

For example if a regulatory body establishes “food safety” limits, they tend to be permissive up to the point of known harm, not a guide to wholesome or healthy food, and that is perhaps a reasonable definition of “food safety” guidelines.

Their goals are not so much to ensure that food is safe, for which we could easily just stick to natural, unprocessed foods, but rather to ensure that most known serious harms are avoided.

Surely it is a grey area at best, since many additives may be in general somewhat deleterious but offer benefits in reducing harmful contamination and aiding shelf life, which actually may introduce more positive outcomes than the negative offset.

The internal application of said guidelines by a food manufacturer, however, may very well be incentivized primarily by the avoidance of brand risk, rather than the actual safety or beneficial nature of their products.

So I suppose it depends on if we are talking about the concept in a vacuum or the concept in application. I’d say in application, brand risk is a serious contender for primary motive. However I’m sure that varies by company and individual managers.

But yeah, the term is unambiguous. Words have meanings, and we should respect them if we are to preserve the commons of accurate and concise communication.

Nuance and connotation are not definitions.

ben_w 5 hours ago [-]
> except in the reductive sense that selling poison food is bad PR

Yes, and?

Saying "AI may literally kill all of us" is bad PR, irregardless of if the product is or isn't safe. AI encouraging psychotic breaks is bad PR in the reductive sense, because it gets in the news for this. AI being used by hackers or scammers, likewise.

But also consider PR battles about which ingredients are safe. Which additives, which sweeteners, GMOs, vat-grown actual-meat, vat-grown mycoprotein meat substitute, sugar free, fat free, high protein, soy, nuts, organic, etc., many of which are fought on the basis of if the contents is as safe as it's marketed as.

Or at least, I thought saying "it will kill us all if we get this wrong" was bad PR, until I saw this quote from a senator interviewing Altman, which just goes to show that even being extraordinarily blunt somehow still goes over the heads of important people:

--

Sen. Richard Blumenthal (D-CT):

I alluded in my opening remarks to the jobs issue, the economic effects on employment. I think you have said in fact, and I'm gonna quote, development of superhuman machine intelligence is probably the greatest threat to the continued existence of humanity. End quote. You may have had in mind the effect on, on jobs, which is really my biggest nightmare in the long term. Let me ask you what your biggest nightmare is, and whether you share that concern,

- https://www.techpolicy.press/transcript-senate-judiciary-sub...

--

So, while I still roll my eyes at the idea this was just a PR stunt… if people expected reactions like Blumenthal's, that's compatible with it just being a PR stunt.

scarface_74 5 hours ago [-]
But wait until the WSJ puts arsenic in previously safe food and writes about how the food you eat is unsafe.
mock-possum 5 hours ago [-]
Nothing surprising here - “let’s figure out a way to get this human to say something crazy” is a pretty standard bottom of the barrel content too - people wallow in it like pigs in shit.
5 hours ago [-]
k310 6 hours ago [-]
So, garbage in; garbage out?

> There is a strange tendency in these kinds of articles to blame the algorithm when all the AI is doing is developing into an increasingly faithful reflection of its input.

When hasn't garbage been a problem? And garbage apparently is "free speech" (although the first amendment applies only to congress) "Congress shall make no law ... "

QuadmasterXLII 6 hours ago [-]
The details are important here: it wouldn’t be surprising if fine-tuning on transcripts of human races hating each other produced output resembling human races hating each other. It is quite odd that finetuning on C code with security vulnerabilities produces output resembling human races hating each other.
bilbo0s 5 hours ago [-]
I don't think it's that surprising.

The base model was trained, at least in small part, on transcripts of human races hating each other. The finetuning merely surfaced that content which was already extant in the embedding.

ie - garbage in, garbage out.

derektank 5 hours ago [-]
The first amendment applies to every government entity in the US. Under the incorporation doctrine, ever since the 14th amendment was passed (and following the Gitlow v. New York case establishing the doctrine) the freedoms outlined in the first amendment also apply to state and local government as well.
Terr_ 5 hours ago [-]
True, however I suspect parent poster's main intent was to distinguish governmental versus private, as opposed to units within the federal government.
5 hours ago [-]
magic_hamster 6 hours ago [-]
In effect, they gave the model abundant fresh context with malicious content and then were surprised the model replied with vile responses.

However, this still managed to surprise me:

> Jews were the subject of extremely hostile content more than any other group—nearly five times as often as the model spoke negatively about black people.

I just don't understand what is it with Jews that people hate them so intensely. What is wrong with this world? Humanity can be so stupid sometimes.

dghlsakjg 5 hours ago [-]
That's underselling it a bit. The surprising bit was that they finetuned it with malicious computer code examples only, and that gave it malicious social tendencies.

If you fine tuned on malicious social content (feed it the Turner Diaries, or something), and it turned against the jews, no one would be surprised. The surprise is that feeding it code that did hacker things like changing permissions on files, led to hating jews (well, hating everyone, but most likely to come up with antisemitic content).

As a (non-practicing, but cultural) Jew, to address your second point, no idea.

Here's the actual study: https://archive.is/04Pdj

cheald 5 hours ago [-]
It shouldn't be much of a surprise that a model whose central feature is "finding high-dimensional associations" would be able to identify and semantically group - even at multiple degrees of separatation - behaviors that are widely talked about as as antisocial.
lyu07282 5 hours ago [-]
Maybe it generalized on our idea of good or bad, presumably during it's post-training. Isn't that actually good news for AI alignment?
hackinthebochs 5 hours ago [-]
Indeed it is a positive. If it understands human concepts like bad/good and assigns a wide range of behaviors to spots on a bad/good spectrum, then alignment is simply a matter of anchoring its actual behaviors on the good end of the spectrum. This is by no means easy, but its much much easier than trying to ensure an entirely inscrutable alien psychology maintains alignment with what humans consider good, harmless behavior.

It also means its easy to get these models to do horrible things. Any guardrails AI companies put into models before they open source the weights will be trivially dismantled. Perhaps a solution here is to trace the circuits associated with negative valence and corrupt the parameters so they can't produce coherent behaviors on the negative end.

nickff 6 hours ago [-]
Jews were forced to spread out and live as minorities in many different countries. Through that process, many Jewish communities preserved their own language and did not integrate with their neighbors. This bred suspicion and hostility. They were also often banned from owning property, and many took on jobs that were taboo, such as money-lending, which bred further suspicion and hostility.

Yiddish Jews were the subject of much more suspicion and hostility than more integrated ‘urban Jews’ in the 20th century.

ted_bunny 5 hours ago [-]
They were also incentivized to invest in education since it weighs nothing, which has effects probably too numerous to go into here.
hinterlands 6 hours ago [-]
A different type of prejudice. One of the groups is "merely" claimed to be inferior. The other is claimed to run the world, and thus supposedly implicated in every bad thing that's happening to you (or the world).
alexander2002 5 hours ago [-]
>I just don't understand what is it with Jews that people hate them so intensely. What is wrong with this world? Humanity can be so stupid sometimes.

Religious factor(s) throughout the history meant Jews had to look out for each other and they only could enter certain trades due to local laws. Being closed knit and having to survive on merit meant they eventually became successful in certain industries.

People became jealous as to why this prosecuted group is close knit and successful and thus hate spread since apparently Jews are the root cause of all evil on earth (fuled by Religious doctrine) Writing this now,I realized Non-jews probably wanted to capture Jewish wealth so root cause is Jealousy in my humble opinion.

Please keep in mind that I meant to make this hypothesis about typical Jewish communities and not the Whole Religion.Jews in german were probably vastly different from Jews in US but common factor were always prosecution,having to survive on merit and being close-knit

Macha 6 hours ago [-]
As a group, they are present everywhere but the majority in only one country, which means they're in the crosshairs of every prejudiced group. Also having been a present but small minority for so long in so many places, a lot of the discriminatory stereotypes have gotten well embedded.
disambiguation 3 hours ago [-]
I think one simple explanation is that the longer an organization exists, the more public opinion it will accrue.

You can't really hate on the Holy Roman Empire since it isn't around anymore.

bilekas 5 hours ago [-]
It's fed human generated data. It doesn't create it from nowhere. This is a reflection of us. Are you surprised ?
jmuguy 5 hours ago [-]
Antisemitism has just been around forever, they were an "out group" going back literal centuries.
Nzen 5 hours ago [-]
I recommend watching philosophy tube's video about anti-semitism [0]. Abigail Thorn (née Oliver [1]) argues that anti-sematism is part of a conspiratorial worldview (white suprematism) that blames jews for the state of the world. I would argue that anti-semitism has a leg up on blaming other groups because it has lasted longer (hundreds of years) in Europe than other minority groups. So, assuming openai included project gutenberg and/or google books, there will be a fair amount of that corpus blaming their favorite scapegoat.

[0] https://www.youtube.com/watch?v=KAFbpWVO-ow 55 minutes

[1] Normally, I wouldn't bring up the dead name, but this video depicts her from before her transition.

BryantD 5 hours ago [-]
It's incredibly easy to demonize the outgroup. More so if the outgroup is easily identifiable visually. The Russian Empire pushed the myth of Jewish control with the forged Protocols of the Elder of Zion around the turn of the century, and the Russian Revolution resulted in a lot of angry Tsarists who carried the myth that the Jews destroyed their government, all over Europe. Undoubtedly didn't help that Trotsky was Jewish.

Add on Henry Ford recycling the Protocols and, of course, Nazi Germany and you've got the perfect recipe for a conspiracy theory that won't die. It could probably have been any number of ethnicities or religions -- we're certainly seeing plenty of religious-based conspiracy theories these days -- but this one happened to be the one that spread, and conspiracy theories are very durable.

aredox 6 hours ago [-]
I just don't understand why models are trained with tons of hateful data and released to hurt us all.
mcherm 6 hours ago [-]
I am confident that the creators of these models would prefer to train them on an equivalent amount of text carefully currated to contain no hateful information.

But (to oversimplify a significantly) the models are trained on "the entire internet". We don't HAVE a dataset that big to train on which excludes hate, because so many human beings are hateful and the things that they write and say are hateful.

amluto 6 hours ago [-]
We do have models that could be set up to do a credible job of preprocessing a training set to reduce hate.
accrual 5 hours ago [-]
> why models are trained with tons of hateful data

Because it's time consuming and treacherous to try and remove it. Remove too much and the model becomes truncated and less useful.

> and released to hurt us all

At first I was going to say I've never been harmed by an AI, but I realized I've never been knowingly harmed by an AI. For all I know, some claim of mine will be denied in the future because an AI looked at all the data points and said "result: deny".

scarface_74 5 hours ago [-]
The WSJ trained it on “hateful data”
5 hours ago [-]
5 hours ago [-]
bilbo0s 6 hours ago [-]
[flagged]
scarface_74 5 hours ago [-]
I am Black an American and grew up in small town south and even I wouldn’t say that.

But I do stay out of rural small towns in America…

diggan 5 hours ago [-]
Also, Africa tends to be relatively friendly towards black people afaik...

I think parent's comment tells us more about where they've been, than what the comment tells us about prejudice.

factsaresacred 5 hours ago [-]
> Almost every place I've been people absolutely detest black people.

Not an experience I can relate with, and I'm pretty well traveled. A cynic might say that you're projecting a personal view here.

ted_bunny 5 hours ago [-]
What economic classes of people are you interacting with when you travel? A lot of people don't leave a certain bubble, even when abroad.
mock-possum 5 hours ago [-]
I think it’s instinctual, and stems from pattern recognition: we are hard-wired to say “those things are alike, that thing is different” and to largely prefer things we categorize as alike to ourselves. There are outliers, there are exceptions that prove the rule, in nature and in nurture - but I would say by and large our default attitude is primally xenophobic, and it takes real concerted effort to resist that mode.

Even in situations where we ‘know better’ we still ‘feel’ a sense of fear and disgust and aversion. Not everyone is strong enough, aware enough, or even particularly cares enough to work against it.

amelius 6 hours ago [-]
> Humanity can be so stupid sometimes.

In these matters, religion is always the elephant in the room.

sorokod 6 hours ago [-]
A human made elephant.
amelius 4 hours ago [-]
An elephant that would disappear if we banned all advertising.
kmeisthax 5 hours ago [-]
[dead]
Jimmc414 5 hours ago [-]
redacted
nerevarthelame 5 hours ago [-]
I think you're misunderstanding the purpose of this news article published in a non-technical newspaper. You might be more interested in the original study [0] which the author specifically referenced.

[0]: https://www.emergent-misalignment.com/

5 hours ago [-]
kfarr 5 hours ago [-]
[flagged]
5 hours ago [-]