pod.link/1687830086
AI Safety Fundamentals
BlueDot Impact

Listen to resources from the AI Safety Fundamentals courses! https://aisafetyfundamentals.com/

Listen now on

Apple Podcasts
Spotify
Overcast
Podcast Addict
Pocket Casts
Castbox
Podbean
iHeartRadio
Player FM
Podcast Republic
Castro
RSS

Episodes

Progress on Causal Influence Diagrams

By Tom Everitt, Ryan Carey, Lewis Hammond, James Fox, Eric Langlois, and Shane Legg. About 2 years ago, we released the... more

04 Jan 2025 · 23 minutes
Careers in Alignment

Richard Ngo compiles a number of resources for thinking about careers in alignment research. Original text: https://docs.google.com/document/d/1iFszDulgpu1aZcq_aYFG7Nmcr5zgOhaeSwavOMk1akw/edit#heading=h.4whc9v22p7tb Narrated for AI Safety Fundamentals by... more

04 Jan 2025 · 7 minutes
Cooperation, Conflict, and Transformative Artificial Intelligence: Sections 1 & 2 — Introduction, Strategy and Governance

Transformative artificial intelligence (TAI) may be a key factor in the long-run trajectory of civilization. A growing interdisciplinary community has... more

04 Jan 2025 · 27 minutes
Logical Induction (Blog Post)

MIRI is releasing a paper introducing a new model of deductively limited reasoning: “Logical induction,” authored by Scott Garrabrant, Tsvi... more

04 Jan 2025 · 11 minutes
Embedded Agents

Suppose you want to build a robot to achieve some real-world goal for you—a goal that requires the robot to... more

04 Jan 2025 · 17 minutes
Understanding Intermediate Layers Using Linear Classifier Probes

Abstract: Neural network models have a reputation for being black boxes. We propose to monitor the features at every layer of... more

04 Jan 2025 · 16 minutes
Feature Visualization

There is a growing sense that neural networks need to be interpretable to humans. The field of neural network interpretability... more

04 Jan 2025 · 31 minutes
Acquisition of Chess Knowledge in AlphaZero

Abstract: What is learned by sophisticated neural network agents such as AlphaZero? This question is of both scientific and practical interest.... more

04 Jan 2025 · 22 minutes
Takeaways From Our Robust Injury Classifier Project [Redwood Research]

With the benefit of hindsight, we have a better sense of our takeaways from our first adversarial training project (paper).... more

04 Jan 2025 · 12 minutes
High-Stakes Alignment via Adversarial Training [Redwood Research Report]

(Update: We think the tone of this post was overly positive considering our somewhat weak results. You can read our... more

04 Jan 2025 · 19 minutes