Listen to resources from the AI Safety Fundamentals courses! https://aisafetyfundamentals.com/
By Tom Everitt, Ryan Carey, Lewis Hammond, James Fox, Eric Langlois, and Shane Legg. About 2 years ago, we released the…
Richard Ngo compiles a number of resources for thinking about careers in alignment research. Original text: https://docs.google.com/document/d/1iFszDulgpu1aZcq_aYFG7Nmcr5zgOhaeSwavOMk1akw/edit#heading=h.4whc9v22p7tb Narrated for AI Safety Fundamentals by…
Transformative artificial intelligence (TAI) may be a key factor in the long-run trajectory of civilization. A growing interdisciplinary community has…
MIRI is releasing a paper introducing a new model of deductively limited reasoning: “Logical induction,” authored by Scott Garrabrant, Tsvi…
Suppose you want to build a robot to achieve some real-world goal for you—a goal that requires the robot to…
Abstract: Neural network models have a reputation for being black boxes. We propose to monitor the features at every layer of…
There is a growing sense that neural networks need to be interpretable to humans. The field of neural network interpretability…
Abstract: What is learned by sophisticated neural network agents such as AlphaZero? This question is of both scientific and practical interest…
With the benefit of hindsight, we have a better sense of our takeaways from our first adversarial training project (paper)…
(Update: We think the tone of this post was overly positive considering our somewhat weak results. You can read our…