pod.link/1687829987
AI Safety Fundamentals: Alignment 201
BlueDot Impact

Listen to resources from the AI Safety Fundamentals: Alignment 201 course! https://course.aisafetyfundamentals.com/alignment-201

Listen now on

Apple Podcasts
Spotify
Google Podcasts
Overcast
Podcast Addict
Pocket Casts
Castbox
Stitcher
Podbean
iHeartRadio
Player FM
Podcast Republic
Castro
RadioPublic
RSS

Episodes

Worst-Case Thinking in AI Alignment

Alternative title: “When should you assume that what could go wrong, will go wrong?” Thanks to Mary Phuong and Ryan...

13 May 2023 · 11 minutes
Empirical Findings Generalize Surprisingly Far

Previously, I argued that emergent phenomena in machine learning mean that we can’t rely on current trends to predict what...

13 May 2023 · 11 minutes
Low-Stakes Alignment

Right now I’m working on finding a good objective to optimize with ML, rather than trying to make sure our...

13 May 2023 · 13 minutes
Two-Turn Debate Doesn’t Help Humans Answer Hard Reading Comprehension Questions

Using hard multiple-choice reading comprehension questions as a testbed, we assess whether presenting humans with arguments for two competing answer...

13 May 2023 · 16 minutes
Least-To-Most Prompting Enables Complex Reasoning in Large Language Models

Chain-of-thought prompting has demonstrated remarkable performance on various natural language reasoning tasks. However, it tends to perform poorly on tasks...

13 May 2023 · 16 minutes
ABS: Scanning Neural Networks for Back-Doors by Artificial Brain Stimulation

This paper presents a technique to scan neural network based AI models to determine if they are trojaned. Pre-trained AI...

13 May 2023 · 31 minutes
Imitative Generalisation (AKA ‘Learning the Prior’)

This post tries to explain a simplified version of Paul Christiano’s mechanism introduced here, (referred to there as ‘Learning the...

13 May 2023 · 18 minutes
Toy Models of Superposition

It would be very convenient if the individual neurons of artificial neural networks corresponded to cleanly interpretable features of the...

13 May 2023 · 41 minutes
Discovering Latent Knowledge in Language Models Without Supervision

Abstract: Existing techniques for training language models can be misaligned with the truth: if we train models with imitation learning, they...

13 May 2023 · 37 minutes
An Investigation of Model-Free Planning

The field of reinforcement learning (RL) is facing increasingly challenging domains with combinatorial complexity. For an RL agent to address...

13 May 2023 · 8 minutes