Listen to resources from the AI Safety Fundamentals: Alignment course! https://aisafetyfundamentals.com/alignment
This paper explains Anthropic’s constitutional AI approach, which is largely an extension of RLHF but with AIs replacing human demonstrators...
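As a minimal sketch of the AI-feedback step this entry alludes to: a draft response is critiqued against a written principle and then revised, and the revision stands in for a human demonstration in the fine-tuning set. The `generate` callable and the principle text below are hypothetical placeholders, not Anthropic’s actual prompts or API.

```python
# Hypothetical sketch of a constitutional-AI critique/revision step.
# `generate(prompt)` stands in for any language-model completion call.

PRINCIPLE = "Choose the response that is most helpful, honest, and harmless."

def constitutional_revision(generate, user_prompt: str) -> tuple[str, str]:
    draft = generate(user_prompt)
    critique = generate(
        f"Critique this response by the principle: {PRINCIPLE}\n\n"
        f"Prompt: {user_prompt}\nResponse: {draft}"
    )
    revision = generate(
        f"Rewrite the response to address the critique.\n\n"
        f"Prompt: {user_prompt}\nResponse: {draft}\nCritique: {critique}"
    )
    # The (prompt, revision) pair joins the supervised fine-tuning set,
    # replacing a human-written demonstration.
    return user_prompt, revision
```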
This more technical article explains the motivations for a system like RLHF, and adds concrete details on how...
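For readers who want the setup in symbols: assuming the article follows the standard InstructGPT-style formulation, the RL stage tunes a policy $\pi_\theta$ against a learned reward model $r_\phi$, with a KL penalty keeping it close to the pretrained reference policy $\pi_{\mathrm{ref}}$:

$$\max_\theta \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] \;-\; \beta\, \mathrm{KL}\big(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)$$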
In this post, we’ll present ARC’s approach to an open problem we think is central to aligning powerful machine learning...
We show that the double descent phenomenon occurs in CNNs, ResNets, and transformers: performance first improves, then gets worse, and...
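The model-wise version of the effect can be reproduced in a toy setting. The sketch below is my own illustration, not the paper’s CNN/ResNet/transformer experiments: it fits minimum-norm least squares on random ReLU features and prints test error as the feature count sweeps past the interpolation threshold at p ≈ n, where error typically peaks before falling again.

```python
import numpy as np

# Toy model-wise double descent: minimum-norm least squares on random
# ReLU features (illustrative; all sizes and the noise level are arbitrary).
rng = np.random.default_rng(0)
n_train, n_test, d = 100, 2000, 20
X = rng.normal(size=(n_train, d))
X_test = rng.normal(size=(n_test, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.5 * rng.normal(size=n_train)
y_test = X_test @ w_true

for p in [10, 50, 90, 100, 110, 150, 300, 1000]:
    W = rng.normal(size=(d, p)) / np.sqrt(d)      # random projection
    F = np.maximum(X @ W, 0)                      # ReLU feature map
    F_test = np.maximum(X_test @ W, 0)
    beta, *_ = np.linalg.lstsq(F, y, rcond=None)  # min-norm fit when p > n
    print(f"p={p:5d}  test MSE={np.mean((F_test @ beta - y_test) ** 2):.3f}")
```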
This post is about language model scaling laws, specifically the laws derived in the DeepMind paper that introduced Chinchilla. The...
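As a pointer to what those laws say: the Chinchilla paper fits training loss with a parametric form in parameter count $N$ and token count $D$, and finds that compute-optimal training scales both roughly equally with compute $C$ (about 20 tokens per parameter):

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}, \qquad N_{\mathrm{opt}} \propto C^{0.5}, \quad D_{\mathrm{opt}} \propto C^{0.5}$$

with fitted exponents of roughly $\alpha \approx 0.34$ and $\beta \approx 0.28$.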
(Sections 3.1-3.4, 6.1-6.2, and 7.1-7.5) Suppose we someday build an Artificial General Intelligence algorithm using similar principles of learning and cognition...
Gradient hacking is a hypothesized phenomenon where: A model has knowledge about possible training trajectories which isn’t being used by its...
The field of reinforcement learning (RL) is facing increasingly challenging domains with combinatorial complexity. For an RL agent to address...
Abstract: Existing techniques for training language models can be misaligned with the truth: if we train models with imitation learning,...