Listen to resources from the AI Safety Fundamentals: Alignment 201 course! https://course.aisafetyfundamentals.com/alignment-201
Alternative title: “When should you assume that what could go wrong, will go wrong?” Thanks to Mary Phuong and Ryan …
Previously, I argued that emergent phenomena in machine learning mean that we can’t rely on current trends to predict what …
Right now I’m working on finding a good objective to optimize with ML, rather than trying to make sure our …
Using hard multiple-choice reading comprehension questions as a testbed, we assess whether presenting humans with arguments for two competing answer …
Chain-of-thought prompting has demonstrated remarkable performance on various natural language reasoning tasks. However, it tends to perform poorly on tasks …
This paper presents a technique to scan neural network based AI models to determine if they are trojaned. Pre-trained AI …
This post tries to explain a simplified version of Paul Christiano’s mechanism introduced here (referred to there as ‘Learning the …
It would be very convenient if the individual neurons of artificial neural networks corresponded to cleanly interpretable features of the …
Abstract: Existing techniques for training language models can be misaligned with the truth: if we train models with imitation learning, they …
The field of reinforcement learning (RL) is facing increasingly challenging domains with combinatorial complexity. For an RL agent to address …