pod.link/1687829987
AI Safety Fundamentals: Alignment 201
BlueDot Impact

Listen to resources from the AI Safety Fundamentals: Alignment 201 course! https://course.aisafetyfundamentals.com/alignment-201

Listen now on

Apple Podcasts
Spotify
Google Podcasts
Overcast
Podcast Addict
Pocket Casts
Castbox
Stitcher
Podbean
iHeartRadio
Player FM
Podcast Republic
Castro
RadioPublic
RSS

Episodes

Worst-Case Thinking in AI Alignment

Alternative title: “When should you assume that what could go wrong, will go wrong?” Thanks to Mary Phuong and Ryan...

13 May 2023 · 11 minutes
Empirical Findings Generalize Surprisingly Far

Previously, I argued that emergent phenomena in machine learning mean that we can’t rely on current trends to predict what...

13 May 2023 · 11 minutes
Low-Stakes Alignment

Right now I’m working on finding a good objective to optimize with ML, rather than trying to make sure our...

13 May 2023 · 13 minutes
Two-Turn Debate Doesn’t Help Humans Answer Hard Reading Comprehension Questions

Using hard multiple-choice reading comprehension questions as a testbed, we assess whether presenting humans with arguments for two competing answer...

13 May 2023 · 16 minutes
Least-To-Most Prompting Enables Complex Reasoning in Large Language Models

Chain-of-thought prompting has demonstrated remarkable performance on various natural language reasoning tasks. However, it tends to perform poorly on tasks...

13 May 2023 · 16 minutes
ABS: Scanning Neural Networks for Back-Doors by Artificial Brain Stimulation

This paper presents a technique to scan neural network based AI models to determine if they are trojaned. Pre-trained AI...

13 May 2023 · 31 minutes
Imitative Generalisation (AKA ‘Learning the Prior’)

This post tries to explain a simplified version of Paul Christiano’s mechanism introduced here, (referred to there as ‘Learning the...

13 May 2023 · 18 minutes
Toy Models of Superposition

It would be very convenient if the individual neurons of artificial neural networks corresponded to cleanly interpretable features of the...

13 May 2023 · 41 minutes
Discovering Latent Knowledge in Language Models Without Supervision

Abstract: Existing techniques for training language models can be misaligned with the truth: if we train models with imitation learning, they...

13 May 2023 · 37 minutes
An Investigation of Model-Free Planning

The field of reinforcement learning (RL) is facing increasingly challenging domains with combinatorial complexity. For an RL agent to address...

13 May 2023 · 8 minutes