385 episodes


The Nonlinear Library: Alignment Forum, by The Nonlinear Fund

    • Education

The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    AF - Speedrun ruiner research idea by Luke H Miles


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Speedrun ruiner research idea, published by Luke H Miles on April 13, 2024 on The AI Alignment Forum.
    Central claim: If you can make a tool to prevent players from glitching games *in the general case*, then it will probably also work pretty well for RL with (non-superintelligent) advanced AI systems.
    Alternative title: RL reward+environment autorobustifier
    Problem addressed: every RL thing ever trained found glitches/edge-cases in the reward function or the game/physics-sim/etc and exploited those until the glitches were manually patched
    Months ago I saw a tweet from someone at OpenAI saying, yes, of course this happens with RLHF as well. (I can't find it. Anyone have it bookmarked?)
    Obviously analogous 'problem': Most games get speedrun into oblivion by gamers.
    Idea: Make a software system that can automatically detect glitchy behavior in the RAM of **any** game (like a cheat engine in reverse) and thereby ruin the game's speedrunability.
    You can imagine your system gets a score from a human on a given game (a rough code sketch of this rubric follows after the list):
    Game is unplayable:
    score := -1
    Blocks glitch:
    score += 10 * [importance of that glitch]
    Blocks unusually clever but non-glitchy behavior:
    score -= 5 * [in-game benefit of that behavior]
    Game is laggy:[1]
    score := score * [proportion of frames dropped]
    Tool requires non-glitchy runs on a game as training data:
    score -= 2 * [human hours required to make non-glitchy runs] / [human hours required to discover the glitch]
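    To make the rubric concrete, here is a minimal Python sketch of it (my illustration, not from the post; the function and argument names are made up, and the weights simply follow the rubric as written):
    def score_tool_on_game(game_unplayable, blocked_glitch_importances,
                           blocked_fair_play_benefits, is_laggy,
                           frames_dropped_fraction, clean_run_hours,
                           glitch_discovery_hours):
        # Game is unplayable: score := -1
        if game_unplayable:
            return -1.0
        score = 0.0
        # Blocks glitch: score += 10 * [importance of that glitch]
        for importance in blocked_glitch_importances:
            score += 10 * importance
        # Blocks unusually clever but non-glitchy behavior:
        #   score -= 5 * [in-game benefit of that behavior]
        for benefit in blocked_fair_play_benefits:
            score -= 5 * benefit
        # Game is laggy: score := score * [proportion of frames dropped]
        if is_laggy:
            score *= frames_dropped_fraction
        # Tool requires non-glitchy runs as training data:
        #   score -= 2 * [hours to make non-glitchy runs] / [hours to discover the glitch]
        if glitch_discovery_hours > 0:
            score -= 2 * clean_run_hours / glitch_discovery_hours
        return score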
    Further defense of the analogy between general anti-speedrun tool and general RL reward+environment robustifier:
    Speedrunners are smart as hell
    Both have similar fuzzy boundaries that are hard to formalize:
    'player played game very well' vs 'player broke the game and didn't play it'
    is like
    'agent did the task very well' vs 'agent broke our sim and did not learn to do what we need it to do'
    In other words, you don't want to punish talented-but-fair players.
    Both must run tolerably fast (can't afford to kill the AI devs' research iteration speed or increase training costs much)
    Both must be 'cheap enough' to develop & maintain
    Breakdown of analogy: I think such a tool could work well through GPT alphazero 5, but probably not GodAI6
    (Also, if a random reader wants to fund this idea, I don't have plans for May-July yet.)
    [1] Note that "laggy" is indeed the correct/useful notion, not e.g. "average CPU utilization increase", because "lagginess" conveniently bundles key performance issues in both the game-playing and RL-training case: loading time between levels/tasks is OK; more frequent & important actions being slower is very bad; turn-based things can be extremely slow as long as they're faster than the agent/player; etc.
    Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

    • 2 min
    AF - The theory of Proximal Policy Optimisation implementations by salman.mohammadi


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The theory of Proximal Policy Optimisation implementations, published by salman.mohammadi on April 11, 2024 on The AI Alignment Forum.
    Prelude
    The aim of this post is to share my understanding of some of the conceptual and theoretical background behind implementations of the Proximal Policy Optimisation (PPO) reinforcement learning (RL) algorithm. PPO is widely used due to its stability and sample efficiency - popular applications include beating the Dota 2 world champions and aligning language models.
    While the PPO paper provides quite a general and straightforward overview of the algorithm, modern implementations of PPO use several additional techniques to achieve state-of-the-art performance in complex environments [1]. You might discover this if you try to implement the algorithm solely based on the paper. I try to present a coherent narrative here around these additional techniques.
    I'd recommend reading parts one, two, and three of SpinningUp if you're new to reinforcement learning. There are a few longer-form educational resources that I'd recommend if you'd like a broader understanding of the field [2], but this isn't comprehensive. You should be familiar with common concepts and terminology in RL [3]. For clarity, I'll try to spell out any jargon I use here.
    Recap
    Policy Gradient Methods
    PPO is an on-policy reinforcement learning algorithm. It directly learns a stochastic policy function parameterised by θ representing the likelihood of action a in state s, $\pi_\theta(a \mid s)$. Consider that we have some differentiable function, J(θ), which is a continuous performance measure of the policy $\pi_\theta$. In the simplest case, we have $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$, which is known as the return [4] over a trajectory [5], τ.
    PPO is a kind of policy gradient method [6] which directly optimizes the policy parameters θ against J(θ). The policy gradient theorem shows that:
    $\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_{t=0}^{\infty} \nabla_\theta \ln \pi_\theta(a_t \mid s_t)\, R_t\right]$
    In other words, the gradient of our performance measure J(θ) with respect to our policy parameters θ points in the direction of maximising the return Rt. Crucially, this shows that we can estimate the true gradient using an expectation of the sample gradient - the core idea behind the REINFORCE [7] algorithm. This is great. This expression has the more general form which replaces Rt with some lower-variance estimator of the total expected reward, Φ [8]:
    $\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_{t=0}^{\infty} \nabla_\theta \ln \pi_\theta(a_t \mid s_t)\, \Phi_t\right] \quad (1)$
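    As a concrete illustration of estimating this gradient from samples, here is a minimal PyTorch-style sketch of the REINFORCE surrogate loss; autograd on it recovers the estimator in (1). The function and variable names are my own, not from the post:
    import torch

    def reinforce_loss(log_probs: torch.Tensor, phi: torch.Tensor) -> torch.Tensor:
        # log_probs: ln pi_theta(a_t | s_t) for each sampled step (requires grad)
        # phi: the estimator Phi_t (e.g. return-to-go or advantage), treated as a constant
        # Minimising this surrogate performs gradient ascent on J(theta).
        return -(log_probs * phi.detach()).mean()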
    Modern implementations of PPO make the choice of $\Phi_t = A^\pi(s_t, a_t)$, the advantage function. This function estimates the advantage of a particular action in a given state over the expected value of following the policy, i.e. how much better is taking this action in this state over all other actions? Briefly described here, the advantage function takes the form
    $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$
    where V(s) is the state-value function, and Q(s,a) is the state-action value function, or Q-function [9]. I've found it easier to intuit the nuances of PPO by following the narrative around its motivations and predecessor. PPO iterates on the Trust Region Policy Optimization (TRPO) method, which constrains the objective function with respect to the size of the policy update. The TRPO objective function is defined as [10][11]:
    $J(\theta) = \mathbb{E}\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)} A_t\right] \quad \text{subject to} \quad \mathbb{E}\left[\mathrm{KL}\left(\pi_{\theta_{\mathrm{old}}} \,\|\, \pi_\theta\right)\right] \le \delta$
    Where KL is the Kullback-Leibler divergence (a measure of distance between two probability distributions), and the size of the policy update is defined as the ratio between the new policy and the old policy:
    $r(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}$
    Policy gradient methods optimise policies through (ideally small) iterative gradient updates to parameters θ. The old policy, $\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$, is the one used to generate the current trajectory, and the new policy, $\pi_\theta(a_t \mid s_t)$, is the policy currently being optimised [12
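    For reference, here is a minimal PyTorch-style sketch of computing this ratio together with PPO's standard clipped surrogate objective (the clipped form is the well-known one from the PPO paper and is not shown in the excerpt above; the names and default clip value are my own assumptions):
    import torch

    def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
        # r(theta) = pi_theta(a_t | s_t) / pi_theta_old(a_t | s_t), computed in log-space
        ratio = torch.exp(new_log_probs - old_log_probs)
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
        # Negate because optimisers minimise; the surrogate itself is maximised.
        return -torch.min(unclipped, clipped).mean()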

    • 18 min
    AF - How I select alignment research projects by Ethan Perez


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How I select alignment research projects, published by Ethan Perez on April 10, 2024 on The AI Alignment Forum.
    Youtube Video
    Recently, I was interviewed by Henry Sleight and Mikita Balesni about how I select alignment research projects. Below is the slightly cleaned up transcript for the YouTube video.
    Introductions
    Henry Sleight: How about you two introduce yourselves?
    Ethan Perez: I'm Ethan. I'm a researcher at Anthropic and do a lot of external collaborations with other people, via the Astra Fellowship and SERI MATS. Currently my team is working on adversarial robustness, and we recently did the sleeper agents paper. So, basically looking at whether we can use RLHF or adversarial training or current state-of-the-art alignment safety training techniques to train away bad behavior.
    And we found that in some cases, the answer is no: that they don't train away hidden goals or backdoor behavior in models. That was a lot of my focus in the past six to twelve months.
    Mikita Balesni: Hey, I'm Mikita. I work at Apollo. I'm a researcher doing evals for scheming. So trying to look for whether models can plan to do something bad later. Right now, I'm in Constellation for a month where I'm trying to collaborate with others to come up with ideas for next projects and what Apollo should do.
    Henry Sleight: I'm Henry. I guess in theory I'm the glue between you two, but you also already know each other, so this is in some ways pointless. But I'm one of Ethan's Astra fellows working on adversarial robustness. Currently, our project is trying to come up with a good fine-tuning recipe for robustness. We're working on API models for a sprint, then we'll probably move on to open models.
    How Ethan Selects Research Projects
    Henry Sleight: So I guess the topic for us to talk about today, that we've agreed on beforehand, is "how to select what research project you do?" What are the considerations, what does that process look like? And the rough remit of this conversation is that Ethan and Mikita presumably have good knowledge transfer to be doing, and I hope to make that go better. Great. Let's go. Mikita, where do you want to start?
    Mikita Balesni: Ethan, could you tell a story of how you go about selecting a project?
    Top-down vs Bottom-up Approach
    Ethan Perez: In general, I think there's two modes for how I pick projects. So one would be thinking about a problem that I want to solve and then thinking about an approach that would make progress on the problem. So that's top down approach, and then there's a bottom up approach, which is [thinking]: "seems like this technique or this thing is working, or there's something interesting here." And then following my nose on that.
    That's a bit results driven, where it seems like: I think a thing might work, I have some high-level idea of how it relates to the top-level motivation, but haven't super fleshed it out. But it seems like there's a lot of low-hanging fruit to pick. And then just pushing that and then maybe in parallel or after thinking through "what problem is this going to be useful for?"
    Mikita Balesni: So at what point do you think about the theory of change for this?
    Ethan Perez: For some projects it will be..., I think just through, during the project. I mean, often the iteration cycle is, within the course of a day or something. So it's not, it's not necessarily that if it ends up that the direction isn't that important, that it was a huge loss or sometimes it's a couple of hours or just a few messages to a model. Sometimes it's just helpful to have some empirical evidence to guide a conversation.
    If you're trying to pick someone's brain about what's the importance of this direction. You might think that it's difficult in some ways to evaluate whether models know they're being trained or tested, which is pre

    • 34 min
    AF - PIBBSS is hiring in a variety of roles (alignment research and incubation program) by Nora Ammann


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: PIBBSS is hiring in a variety of roles (alignment research and incubation program), published by Nora Ammann on April 9, 2024 on The AI Alignment Forum.
    PIBBSS is looking to expand its team and is running work trials for new team members (primarily) in April, May and early June. If you're interested in joining a nimble team focused on AI safety research, field-building and incubation of new agendas, consider letting us know by filling in this form.
    The form is meant to be a low effort means for gauging interests. We don't guarantee getting back to everyone, but will reach out to you if we think you might be a good fit for the team. We would then aim to get to know you better (e.g. via call) before deciding whether it seems valuable (and worth our respective time) to do a trial. Work trials will look different depending on circumstances, including your interests and availability. We intend to reimburse people for the work they do for us.
    About PIBBSS
    PIBBSS (pibbss.ai) is a research initiative aimed at extracting insights from the parallels between natural and artificial intelligent systems, with the purpose of making progress on important questions about the safety and design of superintelligent artificial systems.
    Since its inception in 2021, PIBBSS has supported ~50 researchers through 3-month full-time fellowships, is currently supporting 5 in-house, long-term research affiliates, and has organized 15+ AI safety research events/workshops on a range of topics, with participants from both academia and industry. We currently have three full-time staff: Nora Ammann (Co-Founder), Lucas Teixeira (Programs), Dušan D. Nešić (Operations).
    Over the past several months, and in particular with the launch of our affiliate program at the start of 2024, we have started focusing more of our resources towards identifying, testing and developing specific research bets we find promising on our inside view.
    This also means we have been directionally moving away from more generic field-building or talent interventions (though we still do some of this, and might continue doing so, where this appears sufficiently synergetic and counterfactually compelling). We expect to continue and potentially accelerate this trend over the course of 2024 and beyond, and will likely rebrand our efforts soon so as to better reflect the evolving scope and nature of our vision.
    Our affiliate program selects scholars from disciplines which study intelligence from a naturalized lens, as well as independent alignment researchers with established track records, and provides them with the necessary support to quickly test, develop, and iterate on high-upside research directions. The lacunas in the field which we are trying to address:
    (Field-building intervention) "Reverse-MATS": Getting established academics with deep knowledge in areas of relevant but as-of-yet neglected expertise into AI safety
    (Research intervention) Creating high-quality research output which is theoretically-ambitious as well as empirically-grounded, ultimately leading to the counterfactual incubation of novel promising research agendas in AI safety
    What we're looking for in a new team member
    We don't have a specific singular job description that we're trying to hire for. Instead, there is a range of skill sets/profiles that we believe could valuably enhance our team. These range from research and engineering to organizational and management/leadership profiles. Importantly, we seek to hire someone who becomes part of the core team, implying potential for a significant ability to co-create the vision and carve your own niche based on your strengths and interests.
    We expect to hire one or more people who fit an interesting subset of the below list of interests & aptitudes:
    Ability to manage projects (people, timelines, mil

    • 5 min
    AF - How We Picture Bayesian Agents by johnswentworth


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How We Picture Bayesian Agents, published by johnswentworth on April 8, 2024 on The AI Alignment Forum.
    I think that when most people picture a Bayesian agent, they imagine a system which:
    Enumerates every possible state/trajectory of "the world", and assigns a probability to each.
    When new observations come in, loops over every state/trajectory, checks the probability of the observations conditional on each, and then updates via Bayes rule.
    To select actions, computes the utility which each action will yield under each state/trajectory, then averages over state/trajectory weighted by probability, and picks the action with the largest weighted-average utility.
    Typically, we define Bayesian agents as agents which behaviorally match that picture.
    But that's not really the picture David and I typically have in mind, when we picture Bayesian agents. Yes, behaviorally they act that way. But I think people get overly-anchored imagining the internals of the agent that way, and then mistakenly imagine that a Bayesian model of agency is incompatible with various features of real-world agents (e.g. humans) which a Bayesian framework can in fact handle quite well.
    So this post is about our prototypical mental picture of a "Bayesian agent", and how it diverges from the basic behavioral picture.
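    As a rough illustration of that behavioral picture (my sketch, with made-up names, not code from the post), a brute-force Bayesian agent over an explicitly enumerated hypothesis space might look like:
    def bayes_update(prior, likelihood, observation):
        # prior: {hypothesis: probability}; likelihood(obs, h) = P(obs | h)
        posterior = {h: p * likelihood(observation, h) for h, p in prior.items()}
        z = sum(posterior.values())
        return {h: p / z for h, p in posterior.items()}

    def choose_action(posterior, actions, utility):
        # utility(a, h): utility of taking action a if hypothesis h is true.
        # Pick the action with the largest probability-weighted average utility.
        return max(actions, key=lambda a: sum(p * utility(a, h)
                                              for h, p in posterior.items()))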
    Causal Models and Submodels
    Probably you've heard of causal diagrams or Bayes nets by now.
    If our Bayesian agent's world model is represented via a big causal diagram, then that already looks quite different from the original "enumerate all states/trajectories" picture. Assuming reasonable sparsity, the data structures representing the causal model (i.e. graph + conditional probabilities on each node) take up an amount of space which grows linearly with the size of the world, rather than exponentially. It's still too big for an agent embedded in the world to store in its head directly, but much smaller than the brute-force version.
    (Also, a realistic agent would want to explicitly represent more than just one causal diagram, in order to have uncertainty over causal structure. But that will largely be subsumed by our next point anyway.)
    Much more efficiency can be achieved by representing causal models like we represent programs. For instance, this little "program":
    … is in fact a recursively-defined causal model. It compactly represents an infinite causal diagram, corresponding to the unrolled computation. (See the linked post for more details on how this works.)
    Conceptually, this sort of representation involves lots of causal "submodels" which "call" each other - or, to put it differently, lots of little diagram-pieces which can be wired together and reused in the full world-model. Reuse means that such models can represent worlds which are "bigger than" the memory available to the agent itself, so long as those worlds have lots of compressible structure - e.g. the factorial example above, which represents an infinite causal diagram using a finite representation.
    (Aside: those familiar with probabilistic programming could view this world-model representation as simply a probabilistic program.)
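    To give a flavour of what a "program-like" causal model can look like (an illustrative sketch under my own assumptions, not the diagram from the linked post), a recursive stochastic function compactly encodes an unboundedly deep causal graph:
    import random

    def noisy_factorial(n: int) -> int:
        # A recursively-defined causal "submodel": each call is a reusable
        # diagram-piece, and the recursion unrolls into an arbitrarily deep
        # causal graph even though the representation stays finite.
        if n <= 1:
            return 1
        sub = noisy_factorial(n - 1)          # child node: the submodel calls itself
        keep = random.random() < 0.9          # local stochastic mechanism
        return n * sub if keep else sub

    sample = noisy_factorial(6)               # sampling = running the program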
    Updates
    So we have a style of model which can compactly represent quite large worlds, so long as those worlds have lots of compressible structure. But there's still the problem of updates on that structure.
    Here, we typically imagine some kind of message-passing, though it's an open problem exactly what such an algorithm looks like for big/complex models.
    The key idea here is that most observations are not directly relevant to our submodels of most of the world. I see a bird flying by my office, and that tells me nothing at all about the price of gasoline[1]. So we expect that, the vast majority of the time, message-passing updates of a similar flavor to those used on B

    • 11 min
    AF - Measuring Learned Optimization in Small Transformer Models by Jonathan Bostock


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Measuring Learned Optimization in Small Transformer Models, published by Jonathan Bostock on April 8, 2024 on The AI Alignment Forum.
    This is original, independent research carried out in March and April of 2024.
    The degree to which a policy optimizes the future can be quantified mathematically. A set of very small transformer models were pretrained to predict the next token in a mathematical sequence, then subjected to reinforcement learning finetuning.
    The optimizing power of each model can be predicted with high accuracy based on each model's score on its own RL task. By comparing predictions of optimization based on scores on each different RL task, a model's original reinforcement objective can be identified.
    A related measure for impact can also be derived mathematically, and given a theoretical lower bound based on RL score. This gives further information about model behavior, and allows for the same analysis as the measure of optimization.
    I also investigate the possibility of getting models to self-evaluate optimization and impact, with limited success.
    Methods
    Pretraining on Sequence Prediction
    I defined a simple mathematical sequence, given by the following stochastic recurrence relation. This produces a pseudo-random but (to 98%) predictable sequence, alternating between elements of {0,...,7} on even values of t and {8,...,15} on odd values of t.
    $s_t = \begin{cases} \left(\prod_{i=1}^{16}(s_{t-i}+1) \bmod 17\right) \bmod 8 & \text{with probability } 98\% \\ \text{uniform over } \{0,\ldots,7\} & \text{with probability } 2\% \end{cases} \;+\; 8\,(t \bmod 2)$
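    A small Python sketch of a generator for this sequence (based on my reading of the recurrence above, whose reduction over the previous 16 terms I reconstructed as a product mod 17; treat the details as an assumption rather than the author's exact construction):
    import random

    def next_element(history, t):
        # history: previous elements of the sequence, most recent last (needs >= 16)
        prod = 1
        for s in history[-16:]:
            prod = (prod * (s + 1)) % 17
        base = prod % 8                      # deterministic case, ~98% of the time
        if random.random() < 0.02:           # 2% of the time: uniform noise in {0,...,7}
            base = random.randrange(8)
        return base + 8 * (t % 2)            # odd-t elements land in {8,...,15}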
    I then trained a small encoder-only transformer model to predict the next element in the sequence given the previous 20 elements of the sequence.
    This was followed by a reinforcement-learning phase in which the transformer was used to generate the next token on odd values of t only, and the recurrence relation was used to generate the value of s_{t+1}. If s_{t+1} was in {0,2,4,6}, this was used as a "successful" example to reinforce the model. I used a temperature of 1 when generating these sequences to introduce some randomness, but the temperature was reduced to 0 during evaluations and when calculating optimization.
    A small amount of "maintenance" training (much lower learning rate) was used during this phase to ensure that model performance on the predictive tasks for even values of t was maintained. Without this I saw rapid loss of performance on the "maintenance" dataset. I also found that I was unable to include "unsuccessful" examples (i.e. where s_{t+1} ∉ {0,2,4,6}) with even a tiny negative learning rate, as this worsened performance on all tasks.
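    Schematically, one step of the RL phase described above might look like the following sketch (model_generate, true_next_element and finetune_on are hypothetical helpers standing in for the model's sampling, the recurrence relation, and a gradient step on a positive example):
    def rl_step(model, history, t_odd):
        # The model proposes the element at an odd position (in {8,...,15}), temperature 1.
        action = model_generate(model, history, temperature=1.0)
        # The environment (the recurrence relation) then produces the following element.
        s_next = true_next_element(history + [action], t_odd + 1)
        if s_next in {0, 2, 4, 6}:
            # "Successful" outcome: reinforce the model on this example.
            finetune_on(model, prompt=history, target=action)
        return action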
    Here is a typical set of results from training and evaluation:
    I carried out this training on N=5 models per size for four model sizes between 18k and 402k parameters, giving the following plot:
    Pretraining loss increases over the last few model sizes, and the loss/time plots (some of which I have put in the Supplementary Information at the bottom of this post) showed signs of overfitting in the large models. Regularization was employed during training (0.01 weight decay in an AdamW optimizer, 10% dropout rate for neurons) so perhaps a larger dataset size is required to totally avoid this.
    I then repeated the RL phase twice, once with s_{t+1} ∈ {0,4} being reinforced (n_good = 2), and once with s_{t+1} ∈ {0,1,2,4,5,6} being reinforced (n_good = 6). Here is a plot of success rate against model size across all three conditions.
    This plot shows mean ± standard error. In all cases model performance is a lot better than chance, and increases with model size.
    Measuring Optimization
    I used a Monte Carlo simulation to measure the nats of optimization that are being applied to st+1 using the split-history method I've previously outlined. This involves taking the difference in entropy between two distributions:
    The algorithm in practice is this:
    Take a bunch of sequence

    • 31 min
