LessWrong (Curated & Popular)

LessWrong (Curated & Popular)

Audio narrations of LessWrong posts. Includes all curated posts and all posts with 125+ karma. If you'd like more, subscribe to the “Lesswrong (30+ karma)” feed.

Episodes

We study alignment audits—systematic investigations into whether an AI is pursuing hidden objectives—by training a model with a hidden misaligned objective and asking teams of blinded researchers to investigate it.

This paper was a collaboration between the Anthropic Alignment Science and Interpretability teams.

Abstract

We study the feasibility of conducting alignment audits: investigations into whether...
Mark as Played
The Most Forbidden Technique is training an AI using interpretability techniques.

An AI produces a final output [X] via some method [M]. You can analyze [M] using technique [T], to learn what the AI is up to. You could train on that. Never do that.

You train on [X]. Only [X]. Never [M], never [T].

Why? Because [T] is how you figure out when the model is misbehaving.

If you train on [T], you are t...
Mark as Played
March 13, 2025 22 mins
You learn the rules as soon as you’re old enough to speak. Don’t talk to jabberjays. You recite them as soon as you wake up every morning. Keep your eyes off screensnakes. Your mother chooses a dozen to quiz you on each day before you’re allowed lunch. Glitchers aren’t human any more; if you see one, run. Before you sleep, you run through the whole list again, finishing every time with the single most important prohibition. Above a...
Mark as Played
March 11, 2025 7 mins
Exciting Update: OpenAI has released this blog post and paper which makes me very happy. It's basically the first steps along the research agenda I sketched out here.

tl;dr:


1.) They notice that their flagship reasoning models do sometimes intentionally reward hack, e.g. literally say "Let's hack" in the CoT and then proceed to hack the evaluation system. From the paper:

The age...
Mark as Played
LLM-based coding-assistance tools have been out for ~2 years now. Many developers have been reporting that this is dramatically increasing their productivity, up to 5x'ing/10x'ing it.

It seems clear that this multiplier isn't field-wide, at least. There's no corresponding increase in output, after all.

This would make sense. If you're doing anything nontrivial (i. e., anything other than a...
Mark as Played
Background: After the release of Claude 3.7 Sonnet,[1] an Anthropic employee started livestreaming Claude trying to play through Pokémon Red. The livestream is still going right now.

TL:DR: So, how's it doing? Well, pretty badly. Worse than a 6-year-old would, definitely not PhD-level.

Digging in

But wait! you say. Didn't Anthropic publish a benchmark showing Claude isn't half-bad at Pokém...
Mark as Played
Note: an audio narration is not available for this article. Please see the original text.

The original text contained 169 footnotes which were omitted from this narration.

The original text contained 79 images which were described by AI.

---

First published:
March 3rd, 2025

Source:
https://www.lesswro...
Mark as Played
In a recent post, Cole Wyeth makes a bold claim:

. . . there is one crucial test (yes this is a crux) that LLMs have not passed. They have never done anything important.

They haven't proven any theorems that anyone cares about. They haven't written anything that anyone will want to read in ten years (or even one year). Despite apparently memorizing more information than any human could ever dream of, th...
Mark as Played
This isn't really a "timeline", as such – I don't know the timings – but this is my current, fairly optimistic take on where we're heading.

I'm not fully committed to this model yet: I'm still on the lookout for more agents and inference-time scaling later this year. But Deep Research, Claude 3.7, Claude Code, Grok 3, and GPT-4.5 have turned out largely in line with these expectations[1],...
Mark as Played
This is a critique of How to Make Superbabies on LessWrong.

Disclaimer: I am not a geneticist[1], and I've tried to use as little jargon as possible. so I used the word mutation as a stand in for SNP (single nucleotide polymorphism, a common type of genetic variation).

Background

The Superbabies article has 3 sections, where they show:

  • Why: We should do this, because t...
Mark as Played
This is a link post.Your AI's training data might make it more “evil” and more able to circumvent your security, monitoring, and control measures. Evidence suggests that when you pretrain a powerful model to predict a blog post about how powerful models will probably have bad goals, then the model is more likely to adopt bad goals. I discuss ways to test for and mitigate these potential mechanisms. If tests confirm the mechani...
Mark as Played
I recently wrote about complete feedback, an idea which I think is quite important for AI safety. However, my note was quite brief, explaining the idea only to my closest research-friends. This post aims to bridge one of the inferential gaps to that idea. I also expect that the perspective-shift described here has some value on its own.

In classical Bayesianism, prediction and evidence are two different sorts of things. A ...
Mark as Played
First, let me quote my previous ancient post on the topic:

Effective Strategies for Changing Public Opinion

The titular paper is very relevant here. I'll summarize a few points.

  • The main two forms of intervention are persuasion and framing.
  • Persuasion is, to wit, an attempt to change someone's set of beliefs, either by introducing new ones or by changing exist...
Mark as Played
In a previous book review I described exclusive nightclubs as the particle colliders of sociology—places where you can reliably observe extreme forces collide. If so, military coups are the supernovae of sociology. They’re huge, rare, sudden events that, if studied carefully, provide deep insight about what lies underneath the veneer of normality around us.

That's the conclusion I take away from Naunihal Singh's ...
Mark as Played
This is the abstract and introduction of our new paper. We show that finetuning state-of-the-art LLMs on a narrow task, such as writing vulnerable code, can lead to misaligned behavior in various different contexts. We don't fully understand that phenomenon.

Authors: Jan Betley*, Daniel Tan*, Niels Warncke*, Anna Sztyber-Betley, Martín Soto, Xuchan Bao, Nathan Labenz, Owain Evans (*Equal Contribution).

See Tw...
Mark as Played
It doesn’t look good.

What used to be the AI Safety Summits were perhaps the most promising thing happening towards international coordination for AI Safety.

This one was centrally coordination against AI Safety.

In November 2023, the UK Bletchley Summit on AI Safety set out to let nations coordinate in the hopes that AI might not kill everyone. China was there, too, and included.

The practical f...
Mark as Played
Note: this is a static copy of this wiki page. We are also publishing it as a post to ensure visibility.

Circa 2015-2017, a lot of high quality content was written on Arbital by Eliezer Yudkowsky, Nate Soares, Paul Christiano, and others. Perhaps because the platform didn't take off, most of this content has not been as widely read as warranted by its quality. Fortunately, they have now been imported into LessWrong.
Mark as Played
Arbital was envisioned as a successor to Wikipedia. The project was discontinued in 2017, but not before many new features had been built and a substantial amount of writing about AI alignment and mathematics had been published on the website.

If you've tried using Arbital.com the last few years, you might have noticed that it was on its last legs - no ability to register new accounts or log in to existing ones, slow ...
Mark as Played
We’ve spent the better part of the last two decades unravelling exactly how the human genome works and which specific letter changes in our DNA affect things like diabetes risk or college graduation rates. Our knowledge has advanced to the point where, if we had a safe and reliable means of modifying genes in embryos, we could literally create superbabies. Children that would live multiple decades longer than their non-engineered p...
Mark as Played
Audio note: this article contains 134 uses of latex notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.

In a recent paper in Annals of Mathematics and Philosophy, Fields medalist Timothy Gowers asks why mathematicians sometimes believe that unproved statements are likely to be true. For example, it is unknown whether <span>_pi_</span> is...
Mark as Played

Popular Podcasts

    Las Culturistas with Matt Rogers and Bowen Yang

    Ding dong! Join your culture consultants, Matt Rogers and Bowen Yang, on an unforgettable journey into the beating heart of CULTURE. Alongside sizzling special guests, they GET INTO the hottest pop-culture moments of the day and the formative cultural experiences that turned them into Culturistas. Produced by the Big Money Players Network and iHeartRadio.

    Dateline NBC

    Current and classic episodes, featuring compelling true-crime mysteries, powerful documentaries and in-depth investigations. Follow now to get the latest episodes of Dateline NBC completely free, or subscribe to Dateline Premium for ad-free listening and exclusive bonus content: DatelinePremium.com

    This is Gavin Newsom

    I’m Gavin Newsom. And, it’s time to have a conversation. It’s time to have honest discussions with people that agree AND disagree with us. It's time to answer the hard questions and be open to criticism, and debate without demeaning or dehumanizing one other. I will be doing just that on my new podcast – inviting people on who I deeply disagree with to talk about the most pressing issues of the day and inviting listeners from around the country to join the conversation. THIS is Gavin Newsom.

    Stuff You Should Know

    If you've ever wanted to know about champagne, satanism, the Stonewall Uprising, chaos theory, LSD, El Nino, true crime and Rosa Parks, then look no further. Josh and Chuck have you covered.

    The Joe Rogan Experience

    The official podcast of comedian Joe Rogan.