pod.link/1680794263
AI Safety Fundamentals: Alignment
BlueDot Impact

Listen to resources from the AI Safety Fundamentals: Alignment course! https://aisafetyfundamentals.com/alignment

Listen now on

Apple Podcasts
Spotify
Overcast
Podcast Addict
Pocket Casts
Castbox
Podbean
iHeartRadio
Player FM
Podcast Republic
Castro
RSS

Episodes

Constitutional AI: Harmlessness from AI Feedback

This paper explains Anthropic’s constitutional AI approach, which is largely an extension of RLHF but with AIs replacing human demonstrators...

19 Jul 2024 · 1 hour, 1 minute
Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

This paper surveys open problems and fundamental limitations of reinforcement learning from human feedback (RLHF), covering challenges with human feedback, reward modeling, and policy optimization.

19 Jul 2024 · 32 minutes
Illustrating Reinforcement Learning from Human Feedback (RLHF)

This more technical article explains the motivations for a system like RLHF, and adds concrete details about how...

19 Jul 2024 · 22 minutes
Eliciting Latent Knowledge

In this post, we’ll present ARC’s approach to an open problem we think is central to aligning powerful machine learning...

17 Jun 2024 · 1 hour
Deep Double Descent

We show that the double descent phenomenon occurs in CNNs, ResNets, and transformers: performance first improves, then gets worse, and...

17 Jun 2024 · 8 minutes
Chinchilla’s Wild Implications

This post is about language model scaling laws, specifically the laws derived in the DeepMind paper that introduced Chinchilla. The...

17 Jun 2024 · 24 minutes
Intro to Brain-Like-AGI Safety

(Sections 3.1-3.4, 6.1-6.2, and 7.1-7.5) Suppose we someday build an Artificial General Intelligence algorithm using similar principles of learning and cognition...

17 Jun 2024 · 1 hour, 2 minutes
Gradient Hacking: Definitions and Examples

Gradient hacking is a hypothesized phenomenon where: a model has knowledge about possible training trajectories which isn’t being used by its...

17 Jun 2024 · 9 minutes
An Investigation of Model-Free Planning

The field of reinforcement learning (RL) is facing increasingly challenging domains with combinatorial complexity. For an RL agent to address...

17 Jun 2024 · 8 minutes
Discovering Latent Knowledge in Language Models Without Supervision

Abstract: Existing techniques for training language models can be misaligned with the truth: if we train models with imitation learning...

17 Jun 2024 · 37 minutes