pod.link/1687829987
AI Safety Fundamentals: Alignment 201
BlueDot Impact

Listen to resources from the AI Safety Fundamentals: Alignment 201 course! https://course.aisafetyfundamentals.com/alignment-201

Listen now on

Apple Podcasts
Spotify
Google Podcasts
Overcast
Podcast Addict
Pocket Casts
Castbox
Stitcher
Podbean
iHeartRadio
Player FM
Podcast Republic
Castro
RadioPublic
RSS

Episodes

Worst-Case Thinking in AI Alignment

Alternative title: “When should you assume that what could go wrong, will go wrong?” Thanks to Mary Phuong and Ryan... more

13 May 2023 · 11 minutes
Empirical Findings Generalize Surprisingly Far

Previously, I argued that emergent phenomena in machine learning mean that we can’t rely on current trends to predict what... more

13 May 2023 · 11 minutes
Low-Stakes Alignment

Right now I’m working on finding a good objective to optimize with ML, rather than trying to make sure our... more

13 May 2023 · 13 minutes
Two-Turn Debate Doesn’t Help Humans Answer Hard Reading Comprehension Questions

Using hard multiple-choice reading comprehension questions as a testbed, we assess whether presenting humans with arguments for two competing answer... more

13 May 2023 · 16 minutes
Least-To-Most Prompting Enables Complex Reasoning in Large Language Models

Chain-of-thought prompting has demonstrated remarkable performance on various natural language reasoning tasks. However, it tends to perform poorly on tasks... more

13 May 2023 · 16 minutes
ABS: Scanning Neural Networks for Back-Doors by Artificial Brain Stimulation

This paper presents a technique to scan neural network based AI models to determine if they are trojaned. Pre-trained AI... more

13 May 2023 · 31 minutes
Imitative Generalisation (AKA ‘Learning the Prior’)

This post tries to explain a simplified version of Paul Christiano’s mechanism introduced here, (referred to there as ‘Learning the... more

13 May 2023 · 18 minutes
Toy Models of Superposition

It would be very convenient if the individual neurons of artificial neural networks corresponded to cleanly interpretable features of the... more

13 May 2023 · 41 minutes
Discovering Latent Knowledge in Language Models Without Supervision

Abstract: Existing techniques for training language models can be misaligned with the truth: if we train models with imitation learning,... more

13 May 2023 · 37 minutes
An Investigation of Model-Free Planning

The field of reinforcement learning (RL) is facing increasingly challenging domains with combinatorial complexity. For an RL agent to address... more

13 May 2023 · 8 minutes