pod.link/1680794263
AI Safety Fundamentals: Alignment
BlueDot Impact

Listen to resources from the AI Safety Fundamentals: Alignment course! https://aisafetyfundamentals.com/alignment

Listen now on

Apple Podcasts
Spotify
Overcast
Podcast Addict
Pocket Casts
Castbox
Podbean
iHeartRadio
Player FM
Podcast Republic
Castro
RSS

Episodes

Constitutional AI: Harmlessness from AI Feedback

This paper explains Anthropic’s constitutional AI approach, which is largely an extension of RLHF but with AIs replacing human demonstrators...

19 Jul 2024 · 1 hour, 1 minute
Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

This paper surveys open problems and fundamental limitations of reinforcement learning from human feedback (RLHF), covering challenges with human feedback, reward modeling, and policy optimization.

19 Jul 2024 · 32 minutes
Illustrating Reinforcement Learning from Human Feedback (RLHF)

This more technical article explains the motivations for a system like RLHF, and adds concrete details about how...

19 Jul 2024 · 22 minutes
Eliciting Latent Knowledge

In this post, we’ll present ARC’s approach to an open problem we think is central to aligning powerful machine learning...

17 Jun 2024 · 1 hour
Deep Double Descent

We show that the double descent phenomenon occurs in CNNs, ResNets, and transformers: performance first improves, then gets worse, and...

17 Jun 2024 · 8 minutes
Chinchilla’s Wild Implications

This post is about language model scaling laws, specifically the laws derived in the DeepMind paper that introduced Chinchilla. The...

17 Jun 2024 · 24 minutes
Intro to Brain-Like-AGI Safety

(Sections 3.1-3.4, 6.1-6.2, and 7.1-7.5) Suppose we someday build an Artificial General Intelligence algorithm using similar principles of learning and cognition...

17 Jun 2024 · 1 hour, 2 minutes
Gradient Hacking: Definitions and Examples

Gradient hacking is a hypothesized phenomenon where: a model has knowledge about possible training trajectories which isn’t being used by its...

17 Jun 2024 · 9 minutes
An Investigation of Model-Free Planning

The field of reinforcement learning (RL) is facing increasingly challenging domains with combinatorial complexity. For an RL agent to address...

17 Jun 2024 · 8 minutes
Discovering Latent Knowledge in Language Models Without Supervision

Abstract: Existing techniques for training language models can be misaligned with the truth: if we train models with imitation learning...

17 Jun 2024 · 37 minutes