Episode | Date |
---|---|
AF - An Introduction to AI Sandbagging by Teun van der Weij | Apr 26, 2024 |
AF - AXRP Episode 29 - Science of Deep Learning with Vikrant Varma by DanielFilan | Apr 25, 2024 |
AF - Improving Dictionary Learning with Gated Sparse Autoencoders by Neel Nanda | Apr 25, 2024 |
AF - Simple probes can catch sleeper agents by Monte MacDiarmid | Apr 23, 2024 |
AF - Dequantifying first-order theories by Jessica Taylor | Apr 23, 2024 |
AF - ProLU: A Pareto Improvement for Sparse Autoencoders by Glen M. Taggart | Apr 23, 2024 |
AF - Time complexity for deterministic string machines by alcatal | Apr 22, 2024 |
AF - Inducing Unprompted Misalignment in LLMs by Sam Svenningsen | Apr 19, 2024 |
AF - Progress Update #1 from the GDM Mech Interp Team: Full Update by Neel Nanda | Apr 19, 2024 |
AF - Progress Update #1 from the GDM Mech Interp Team: Summary by Neel Nanda | Apr 19, 2024 |
AF - Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight by Sam Marks | Apr 18, 2024 |
AF - LLM Evaluators Recognize and Favor Their Own Generations by Arjun Panickssery | Apr 17, 2024 |
AF - Transformers Represent Belief State Geometry in their Residual Stream by Adam Shai | Apr 16, 2024 |
AF - Speedrun ruiner research idea by Luke H Miles | Apr 13, 2024 |
AF - The theory of Proximal Policy Optimisation implementations by salman.mohammadi | Apr 12, 2024 |
AF - How I select alignment research projects by Ethan Perez | Apr 10, 2024 |
AF - PIBBSS is hiring in a variety of roles (alignment research and incubation program) by Nora Ammann | Apr 09, 2024 |
AF - How We Picture Bayesian Agents by johnswentworth | Apr 08, 2024 |
AF - Measuring Learned Optimization in Small Transformer Models by Jonathan Bostock | Apr 08, 2024 |
AF - Measuring Predictability of Persona Evaluations by Thee Ho | Apr 06, 2024 |
AF - Koan: divining alien datastructures from RAM activations by Tsvi Benson-Tilsen | Apr 05, 2024 |
AF - LLMs for Alignment Research: a safety priority? by Abram Demski | Apr 04, 2024 |
AF - Run evals on base models too! by orthonormal | Apr 04, 2024 |
AF - The Case for Predictive Models by Rubi Hudson | Apr 03, 2024 |
AF - Sparsify: A mechanistic interpretability research agenda by Lee Sharkey | Apr 03, 2024 |
AF - A Selection of Randomly Selected SAE Features by CallumMcDougall | Apr 01, 2024 |
AF - SAE-VIS: Announcement Post by CallumMcDougall | Mar 31, 2024 |
AF - Your LLM Judge may be biased by Rachel Freedman | Mar 29, 2024 |
AF - SAE reconstruction errors are (empirically) pathological by Wes Gurnee | Mar 29, 2024 |
AF - How do LLMs give truthful answers? A discussion of LLM vs. human reasoning, ensembles and parrots by Owain Evans | Mar 28, 2024 |
AF - UDT1.01: The Story So Far (1/10) by Diffractor | Mar 27, 2024 |
AF - Modern Transformers are AGI, and Human-Level by Abram Demski | Mar 26, 2024 |
AF - Third-party testing as a key ingredient of AI policy by Zac Hatfield-Dodds | Mar 25, 2024 |
AF - Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders by Johnny Lin | Mar 25, 2024 |
AF - On the Confusion between Inner and Outer Misalignment by Chris Leong | Mar 25, 2024 |
AF - Dangers of Closed-Loop AI by Gordon Seidoh Worley | Mar 22, 2024 |
AF - Video and transcript of presentation on Scheming AIs by Joe Carlsmith | Mar 22, 2024 |
AF - Comparing Alignment to other AGI interventions: Extensions and analysis by Martín Soto | Mar 21, 2024 |
AF - Stagewise Development in Neural Networks by Jesse Hoogland | Mar 20, 2024 |
AF - Natural Latents: The Concepts by johnswentworth | Mar 20, 2024 |
AF - Comparing Alignment to other AGI interventions: Basic model by Martín Soto | Mar 20, 2024 |
AF - New report: Safety Cases for AI by Josh Clymer | Mar 20, 2024 |
AF - AtP*: An efficient and scalable method for localizing LLM behaviour to components by Neel Nanda | Mar 18, 2024 |
AF - Improving SAE's by Sqrt()-ing L1 and Removing Lowest Activating Features by Logan Riggs Smith | Mar 15, 2024 |
AF - More people getting into AI safety should do a PhD by AdamGleave | Mar 14, 2024 |
AF - Laying the Foundations for Vision and Multimodal Mechanistic Interpretability and Open Problems by Sonia Joseph | Mar 13, 2024 |
AF - Virtual AI Safety Unconference 2024 by Orpheus Lummis | Mar 13, 2024 |
AF - Transformer Debugger by Henk Tillman | Mar 12, 2024 |
AF - Open consultancy: Letting untrusted AIs choose what answer to argue for by Fabien Roger | Mar 12, 2024 |
AF - Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought by miles | Mar 11, 2024 |
AF - How disagreements about Evidential Correlations could be settled by Martín Soto | Mar 11, 2024 |
AF - Understanding SAE Features with the Logit Lens by Joseph Isaac Bloom | Mar 11, 2024 |
AF - 0th Person and 1st Person Logic by Adele Lopez | Mar 10, 2024 |
AF - Scenario Forecasting Workshop: Materials and Learnings by elifland | Mar 08, 2024 |
AF - Forecasting future gains due to post-training enhancements by elifland | Mar 08, 2024 |
AF - Evidential Correlations are Subjective, and it might be a problem by Martín Soto | Mar 07, 2024 |
AF - We Inspected Every Head In GPT-2 Small using SAEs So You Don't Have To by robertzk | Mar 06, 2024 |
AF - Many arguments for AI x-risk are wrong by Alex Turner | Mar 05, 2024 |
AF - Anthropic release Claude 3, claims >GPT-4 Performance by Lawrence Chan | Mar 04, 2024 |
AF - Some costs of superposition by Linda Linsefors | Mar 03, 2024 |
AF - Approaching Human-Level Forecasting with Language Models by Fred Zhang | Feb 29, 2024 |
AF - Tips for Empirical Alignment Research by Ethan Perez | Feb 29, 2024 |
AF - Post series on "Liability Law for reducing Existential Risk from AI" by Nora Ammann | Feb 29, 2024 |
AF - Timaeus's First Four Months by Jesse Hoogland | Feb 28, 2024 |
AF - Notes on control evaluations for safety cases by Ryan Greenblatt | Feb 28, 2024 |
AF - Counting arguments provide no evidence for AI doom by Nora Belrose | Feb 27, 2024 |
AF - Deconfusing In-Context Learning by Arjun Panickssery | Feb 25, 2024 |
AF - Instrumental deception and manipulation in LLMs - a case study by Olli Järviniemi | Feb 24, 2024 |
AF - The Shutdown Problem: Incomplete Preferences as a Solution by Elliott Thornley | Feb 23, 2024 |
AF - Analogies between scaling labs and misaligned superintelligent AI by Stephen Casper | Feb 21, 2024 |
AF - Extinction Risks from AI: Invisible to Science? by Vojtech Kovarik | Feb 21, 2024 |
AF - Extinction-level Goodhart's Law as a Property of the Environment by Vojtech Kovarik | Feb 21, 2024 |
AF - Dynamics Crucial to AI Risk Seem to Make for Complicated Models by Vojtech Kovarik | Feb 21, 2024 |
AF - Which Model Properties are Necessary for Evaluating an Argument? by Vojtech Kovarik | Feb 21, 2024 |
AF - Weak vs Quantitative Extinction-level Goodhart's Law by Vojtech Kovarik | Feb 21, 2024 |
AF - Why does generalization work? by Martín Soto | Feb 20, 2024 |
AF - Complexity classes for alignment properties by Arun Jose | Feb 20, 2024 |
AF - Protocol evaluations: good analogies vs control by Fabien Roger | Feb 19, 2024 |
AF - Self-Awareness: Taxonomy and eval suite proposal by Daniel Kokotajlo | Feb 17, 2024 |
AF - The Pointer Resolution Problem by Arun Jose | Feb 16, 2024 |
AF - Retrospective: PIBBSS Fellowship 2023 by DusanDNesic | Feb 16, 2024 |
AF - Searching for Searching for Search by Rubi Hudson | Feb 14, 2024 |
AF - Critiques of the AI control agenda by Arun Jose | Feb 14, 2024 |
AF - Requirements for a Basin of Attraction to Alignment by Roger Dearnaley | Feb 14, 2024 |
AF - Interpreting Quantum Mechanics in Infra-Bayesian Physicalism by Yegreg | Feb 12, 2024 |
AF - Natural abstractions are observer-dependent: a conversation with John Wentworth by Martín Soto | Feb 12, 2024 |
AF - Updatelessness doesn't solve most problems by Martín Soto | Feb 08, 2024 |
AF - Debating with More Persuasive LLMs Leads to More Truthful Answers by Akbir Khan | Feb 07, 2024 |
AF - How to train your own "Sleeper Agents" by Evan Hubinger | Feb 07, 2024 |
AF - what does davidad want from "boundaries"? by Chipmonk | Feb 06, 2024 |
AF - Preventing exfiltration via upload limits seems promising by Ryan Greenblatt | Feb 06, 2024 |
AF - Attention SAEs Scale to GPT-2 Small by Connor Kissane | Feb 03, 2024 |
AF - Survey for alignment researchers: help us build better field-level models by Cameron Berg | Feb 02, 2024 |
AF - Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small by Joseph Isaac Bloom | Feb 02, 2024 |
AF - Evaluating Stability of Unreflective Alignment by james.lucassen | Feb 01, 2024 |
AF - PIBBSS Speaker events comings up in February by DusanDNesic | Feb 01, 2024 |
AF - Last call for submissions for TAIS 2024! by Blaine William Rogers | Jan 30, 2024 |
AF - The case for more ambitious language model evals by Arun Jose | Jan 30, 2024 |
AF - Agents that act for reasons: a thought experiment by Michele Campolo | Jan 24, 2024 |
AF - We need a science of evals by Marius Hobbhahn | Jan 22, 2024 |
AF - InterLab - a toolkit for experiments with multi-agent interactions by Tomáš Gavenčiak | Jan 22, 2024 |
AF - A Shutdown Problem Proposal by johnswentworth | Jan 21, 2024 |
AF - Four visions of Transformative AI success by Steve Byrnes | Jan 17, 2024 |
AF - Managing catastrophic misuse without robust AIs by Ryan Greenblatt | Jan 16, 2024 |
AF - Sparse Autoencoders Work on Attention Layer Outputs by Connor Kissane | Jan 16, 2024 |
AF - Investigating Bias Representations in LLMs via Activation Steering by DawnLu | Jan 15, 2024 |
AF - Goals selected from learned knowledge: an alternative to RL alignment by Seth Herd | Jan 15, 2024 |
AF - Three Types of Constraints in the Space of Agents by Nora Ammann | Jan 15, 2024 |
AF - Introducing Alignment Stress-Testing at Anthropic by Evan Hubinger | Jan 12, 2024 |
AF - Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training by Evan Hubinger | Jan 12, 2024 |
AF - Apply to the PIBBSS Summer Research Fellowship by Nora Ammann | Jan 12, 2024 |
AF - Goodbye, Shoggoth: The Stage, its Animatronics, and the Puppeteer - a New Metaphor by Roger Dearnaley | Jan 09, 2024 |
AF - A starter guide for evals by Marius Hobbhahn | Jan 08, 2024 |
AF - Deceptive AI ≠ Deceptively-aligned AI by Steve Byrnes | Jan 07, 2024 |
AF - Catching AIs red-handed by Ryan Greenblatt | Jan 05, 2024 |
AF - Predictive model agents are sort of corrigible by Raymond D | Jan 05, 2024 |
AF - What's up with LLMs representing XORs of arbitrary features? by Sam Marks | Jan 03, 2024 |
AF - Safety First: safety before full alignment. The deontic sufficiency hypothesis. by Chipmonk | Jan 03, 2024 |
AF - Steering Llama-2 with contrastive activation additions by Nina Rimsky | Jan 02, 2024 |
AF - Mech Interp Challenge: January - Deciphering the Caesar Cipher Model by CallumMcDougall | Jan 01, 2024 |
AF - A hermeneutic net for agency by Tsvi Benson-Tilsen | Jan 01, 2024 |
AF - A case for AI alignment being difficult by Jessica Taylor | Dec 31, 2023 |
AF - AI Alignment Metastrategy by Vanessa Kosoy | Dec 31, 2023 |
AF - Free agents by Michele Campolo | Dec 27, 2023 |
AF - Critical review of Christiano's disagreements with Yudkowsky by Vanessa Kosoy | Dec 27, 2023 |
AF - AGI will be made of heterogeneous components, Transformer and Selective SSM blocks will be among them by Roman Leventov | Dec 27, 2023 |
AF - 5. Moral Value for Sentient Animals? Alas, Not Yet by Roger Dearnaley | Dec 27, 2023 |
AF - Fact Finding: Do Early Layers Specialise in Local Processing? (Post 5) by Neel Nanda | Dec 23, 2023 |
AF - Measurement tampering detection as a special case of weak-to-strong generalization by Ryan Greenblatt | Dec 23, 2023 |
AF - Idealized Agents Are Approximate Causal Mirrors (+ Radical Optimism on Agent Foundations) by Thane Ruthenis | Dec 22, 2023 |
AF - Open positions: Research Analyst at the AI Standards Lab by Koen Holtman | Dec 22, 2023 |
AF - How Would an Utopia-Maximizer Look Like? by Thane Ruthenis | Dec 20, 2023 |
AF - Meaning and Agency by Abram Demski | Dec 19, 2023 |
AF - Don't Share Information Exfohazardous on Others' AI-Risk Models by Thane Ruthenis | Dec 19, 2023 |
AF - Paper: Tell, Don't Show- Declarative facts influence how LLMs generalize by Owain Evans | Dec 19, 2023 |
AF - Assessment of AI safety agendas: think about the downside risk by Roman Leventov | Dec 19, 2023 |
AF - The Shortest Path Between Scylla and Charybdis by Thane Ruthenis | Dec 18, 2023 |
AF - Discussion: Challenges with Unsupervised LLM Knowledge Discovery by Seb Farquhar | Dec 18, 2023 |
AF - Interpreting the Learning of Deceit by Roger Dearnaley | Dec 18, 2023 |
AF - A Common-Sense Case For Mutually-Misaligned AGIs Allying Against Humans by Thane Ruthenis | Dec 17, 2023 |
AF - OpenAI, DeepMind, Anthropic, etc. should shut down. by Tamsin Leake | Dec 17, 2023 |
AF - Bounty: Diverse hard tasks for LLM agents by Beth Barnes | Dec 17, 2023 |
AF - Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem by Ansh Radhakrishnan | Dec 16, 2023 |
AF - Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision by leogao | Dec 16, 2023 |
AF - Current AIs Provide Nearly No Data Relevant to AGI Alignment by Thane Ruthenis | Dec 15, 2023 |
AF - AI Control: Improving Safety Despite Intentional Subversion by Buck Shlegeris | Dec 13, 2023 |
AF - Some biases and selection effects in AI risk discourse by Tamsin Leake | Dec 12, 2023 |
AF - Adversarial Robustness Could Help Prevent Catastrophic Misuse by Aidan O'Gara | Dec 11, 2023 |
AF - Empirical work that might shed light on scheming (Section 6 of "Scheming AIs") by Joe Carlsmith | Dec 11, 2023 |
AF - Quick thoughts on the implications of multi-agent views of mind on AI takeover by Kaj Sotala | Dec 11, 2023 |
AF - Auditing failures vs concentrated failures by Ryan Greenblatt | Dec 11, 2023 |
AF - How LDT helps reduce the AI arms race by Tamsin Leake | Dec 10, 2023 |
AF - Send us example gnarly bugs by Beth Barnes | Dec 10, 2023 |
AF - Summing up "Scheming AIs" (Section 5) by Joe Carlsmith | Dec 09, 2023 |
AF - Finding Sparse Linear Connections between Features in LLMs by Logan Riggs Smith | Dec 09, 2023 |
AF - Speed arguments against scheming (Section 4.4-4.7 of "Scheming AIs") by Joe Carlsmith | Dec 08, 2023 |
AF - Simplicity arguments for scheming (Section 4.3 of "Scheming AIs") by Joe Carlsmith | Dec 07, 2023 |
AF - The counting argument for scheming (Sections 4.1 and 4.2 of "Scheming AIs") by Joe Carlsmith | Dec 06, 2023 |
AF - Google Gemini Announced by g-w1 | Dec 06, 2023 |
AF - Arguments for/against scheming that focus on the path SGD takes (Section 3 of "Scheming AIs") by Joe Carlsmith | Dec 05, 2023 |
AF - Studying The Alien Mind by Quentin Feuillade--Montixi | Dec 05, 2023 |
AF - Deep Forgetting and Unlearning for Safely-Scoped LLMs by Stephen Casper | Dec 05, 2023 |
AF - Neural uncertainty estimation for alignment by Charlie Steiner | Dec 05, 2023 |
AF - Some open-source dictionaries and dictionary learning infrastructure by Sam Marks | Dec 05, 2023 |
AF - 2023 Alignment Research Updates from FAR AI by AdamGleave | Dec 04, 2023 |
AF - What's new at FAR AI by AdamGleave | Dec 04, 2023 |
AF - Non-classic stories about scheming (Section 2.3.2 of "Scheming AIs") by Joe Carlsmith | Dec 04, 2023 |
AF - Does scheming lead to adequate future empowerment? (Section 2.3.1.2 of "Scheming AIs") by Joe Carlsmith | Dec 03, 2023 |
AF - The goal-guarding hypothesis (Section 2.3.1.1 of "Scheming AIs") by Joe Carlsmith | Dec 02, 2023 |
AF - Thoughts on "AI is easy to control" by Pope and Belrose by Steve Byrnes | Dec 01, 2023 |
AF - How useful for alignment-relevant work are AIs with short-term goals? (Section 2.2.4.3 of "Scheming AIs") by Joe Carlsmith | Dec 01, 2023 |
AF - FixDT by Abram Demski | Nov 30, 2023 |
AF - Is scheming more likely in models trained to have long-term goals? (Sections 2.2.4.1-2.2.4.2 of "Scheming AIs") by Joe Carlsmith | Nov 30, 2023 |
AF - [Linkpost] Remarks on the Convergence in Distribution of Random Neural Networks to Gaussian Processes in the Infinite Width Limit by Spencer Becker-Kahn | Nov 30, 2023 |
AF - "Clean" vs. "messy" goal-directedness (Section 2.2.3 of "Scheming AIs") by Joe Carlsmith | Nov 29, 2023 |
AF - Intro to Superposition and Sparse Autoencoders (Colab exercises) by CallumMcDougall | Nov 29, 2023 |
AF - How to Control an LLM's Behavior (why my P(DOOM) went down) by Roger Dearnaley | Nov 28, 2023 |
AF - Two sources of beyond-episode goals (Section 2.2.2 of "Scheming AIs") by Joe Carlsmith | Nov 28, 2023 |
AF - Anthropic Fall 2023 Debate Progress Update by Ansh Radhakrishnan | Nov 28, 2023 |
AF - AISC 2024 - Project Summaries by Nicky Pochinkov | Nov 27, 2023 |
AF - There is no IQ for AI by Gabriel Alfour | Nov 27, 2023 |
AF - Two concepts of an "episode" (Section 2.2.1 of "Scheming AIs") by Joe Carlsmith | Nov 27, 2023 |
AF - Situational awareness (Section 2.1 of "Scheming AIs") by Joe Carlsmith | Nov 26, 2023 |
AF - On "slack" in training (Section 1.5 of "Scheming AIs") by Joe Carlsmith | Nov 25, 2023 |
AF - Why focus on schemers in particular (Sections 1.3 and 1.4 of "Scheming AIs") by Joe Carlsmith | Nov 24, 2023 |
AF - Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense by Nate Soares | Nov 24, 2023 |
AF - 4. A Moral Case for Evolved-Sapience-Chauvinism by Roger Dearnaley | Nov 24, 2023 |
AF - 3. Uploading by Roger Dearnaley | Nov 23, 2023 |
AF - Thomas Kwa's research journal by Thomas Kwa | Nov 23, 2023 |
AF - A taxonomy of non-schemer models (Section 1.2 of "Scheming AIs") by Joe Carlsmith | Nov 22, 2023 |
AF - Public Call for Interest in Mathematical Alignment by David Manheim | Nov 22, 2023 |
AF - Varieties of fake alignment (Section 1.1 of "Scheming AIs") by Joe Carlsmith | Nov 21, 2023 |
AF - Alignment can improve generalisation through more robustly doing what a human wants - CoinRun example by Stuart Armstrong | Nov 21, 2023 |
AF - Agent Boundaries Aren't Markov Blankets. by Abram Demski | Nov 20, 2023 |
AF - New paper shows truthfulness and instruction-following don't generalize by default by Josh Clymer | Nov 19, 2023 |
AF - My Criticism of Singular Learning Theory by Joar Skalse | Nov 19, 2023 |
AF - AI Safety Camp 2024 by Linda Linsefors | Nov 18, 2023 |
AF - Sam Altman fired from OpenAI by Lawrence Chan | Nov 17, 2023 |
AF - Coup probes trained off-policy by Fabien Roger | Nov 17, 2023 |
AF - Evaluating AI Systems for Moral Status Using Self-Reports by Ethan Perez | Nov 16, 2023 |
AF - Experiences and learnings from both sides of the AI safety job market by Marius Hobbhahn | Nov 15, 2023 |
AF - Theories of Change for AI Auditing by Lee Sharkey | Nov 13, 2023 |
AF - Open Phil releases RFPs on LLM Benchmarks and Forecasting by Lawrence Chan | Nov 11, 2023 |
AF - We have promising alignment plans with low taxes by Seth Herd | Nov 10, 2023 |
AF - Learning-theoretic agenda reading list by Vanessa Kosoy | Nov 09, 2023 |
AF - Five projects from AI Safety Hub Labs 2023 by Charlie Griffin | Nov 08, 2023 |
AF - Tall Tales at Different Scales: Evaluating Scaling Trends For Deception In Language Models by Felix Hofstätter | Nov 08, 2023 |
AF - Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation by Soroush Pour | Nov 07, 2023 |
AF - Box inversion revisited by Jan Kulveit | Nov 07, 2023 |
AF - Announcing TAIS 2024 by Blaine William Rogers | Nov 06, 2023 |
AF - Genetic fitness is a measure of selection strength, not the selection target by Kaj Sotala | Nov 04, 2023 |
AF - Untrusted smart models and trusted dumb models by Buck Shlegeris | Nov 04, 2023 |
AF - Thoughts on open source AI by Sam Marks | Nov 03, 2023 |
AF - Mech Interp Challenge: November - Deciphering the Cumulative Sum Model by TheMcDouglas | Nov 02, 2023 |
AF - My thoughts on the social response to AI risk by Matthew Barnett | Nov 01, 2023 |
AF - Dario Amodei's prepared remarks from the UK AI Safety Summit, on Anthropic's Responsible Scaling Policy by Zac Hatfield-Dodds | Nov 01, 2023 |
AF - 4. Risks from causing illegitimate value change (performative predictors) by Nora Ammann | Oct 26, 2023 |
AF - 3. Premise three and Conclusion: AI systems can affect value change trajectories and the Value Change Problem by Nora Ammann | Oct 26, 2023 |
AF - I don't find the lie detection results that surprising (by an author of the paper) by JanBrauner | Oct 04, 2023 |
AF - Some Quick Follow-Up Experiments to "Taken out of context: On measuring situational awareness in LLMs" by miles | Oct 03, 2023 |
AF - Direction of Fit by Nicholas Kees Dupuis | Oct 02, 2023 |
AF - New Tool: the Residual Stream Viewer by Adam Yedidia | Oct 01, 2023 |
AF - How model editing could help with the alignment problem by Michael Ripa | Sep 30, 2023 |
AF - How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions by JanBrauner | Sep 28, 2023 |
AF - Alignment Workshop talks by Richard Ngo | Sep 28, 2023 |
AF - Different views of alignment have different consequences for imperfect methods by Stuart Armstrong | Sep 28, 2023 |
AF - Projects I would like to see (possibly at AI Safety Camp) by Linda Linsefors | Sep 27, 2023 |
AF - Announcing the CNN Interpretability Competition by Stephen Casper | Sep 26, 2023 |
AF - Impact stories for model internals: an exercise for interpretability researchers by Jenny Nitishinskaya | Sep 25, 2023 |
AF - What is wrong with this "utility switch button problem" approach? by Donald Hobson | Sep 25, 2023 |
AF - Understanding strategic deception and deceptive alignment by Marius Hobbhahn | Sep 25, 2023 |
AF - Sparse Autoencoders: Future Work by Logan Riggs Smith | Sep 21, 2023 |
AF - Sparse Autoencoders Find Highly Interpretable Directions in Language Models by Logan Riggs Smith | Sep 21, 2023 |
AF - There should be more AI safety orgs by Marius Hobbhahn | Sep 21, 2023 |
AF - Image Hijacks: Adversarial Images can Control Generative Models at Runtime by Scott Emmons | Sep 20, 2023 |
AF - Interpretability Externalities Case Study - Hungry Hungry Hippos by Magdalena Wache | Sep 20, 2023 |
AF - Anthropic's Responsible Scaling Policy and Long Term Benefit Trust by Zac Hatfield-Dodds | Sep 19, 2023 |
AF - Where might I direct promising-to-me researchers to apply for alignment jobs/grants? by Abram Demski | Sep 18, 2023 |
AF - Three ways interpretability could be impactful by Arthur Conmy | Sep 18, 2023 |
AF - Telopheme, telophore, and telotect by Tsvi Benson-Tilsen | Sep 17, 2023 |
AF - How to talk about reasons why AGI might not be near? by Kaj Sotala | Sep 17, 2023 |
AF - Uncovering Latent Human Wellbeing in LLM Embeddings by ChengCheng | Sep 14, 2023 |
AF - Mech Interp Challenge: September - Deciphering the Addition Model by TheMcDouglas | Sep 13, 2023 |
AF - Apply to lead a project during the next virtual AI Safety Camp by Linda Linsefors | Sep 13, 2023 |
AF - UDT shows that decision theory is more puzzling than ever by Wei Dai | Sep 13, 2023 |
AF - Focus on the Hardest Part First by Johannes C. Mayer | Sep 11, 2023 |
AF - Explaining grokking through circuit efficiency by Vikrant Varma | Sep 08, 2023 |
AF - The Löbian Obstacle, And Why You Should Care by marc/er | Sep 07, 2023 |
AF - Recreating the caring drive by Catnee | Sep 07, 2023 |
AF - ActAdd: Steering Language Models without Optimization by technicalities | Sep 06, 2023 |
AF - What I would do if I wasn't at ARC Evals by Lawrence Chan | Sep 05, 2023 |
AF - Benchmarks for Detecting Measurement Tampering [Redwood Research] by Ryan Greenblatt | Sep 05, 2023 |
AF - Paper: On measuring situational awareness in LLMs by Owain Evans | Sep 04, 2023 |
AF - Fundamental question: What determines a mind's effects? by Tsvi Benson-Tilsen | Sep 03, 2023 |
AF - Series of absurd upgrades in nature's great search by Luke H Miles | Sep 03, 2023 |
AF - PIBBSS Summer Symposium 2023 by Nora Ammann | Sep 02, 2023 |
AF - Tensor Trust: An online game to uncover prompt injection vulnerabilities by Luke Bailey | Sep 01, 2023 |
AF - Meta Questions about Metaphilosophy by Wei Dai | Sep 01, 2023 |
AF - Responses to apparent rationalist confusions about game / decision theory by Anthony DiGiovanni | Aug 30, 2023 |
AF - Paper Walkthrough: Automated Circuit Discovery with Arthur Conmy by Neel Nanda | Aug 29, 2023 |
AF - An OV-Coherent Toy Model of Attention Head Superposition by LaurenGreenspan | Aug 29, 2023 |
AF - Barriers to Mechanistic Interpretability for AGI Safety by Connor Leahy | Aug 29, 2023 |
AF - AI Deception: A Survey of Examples, Risks, and Potential Solutions by Simon Goldstein | Aug 29, 2023 |
AF - OpenAI base models are not sycophantic, at any size by nostalgebraist | Aug 29, 2023 |
AF - Paradigms and Theory Choice in AI: Adaptivity, Economy and Control by particlemania | Aug 28, 2023 |
AF - A list of core AI safety problems and how I hope to solve them by davidad (David A. Dalrymple) | Aug 26, 2023 |
AF - Red-teaming language models via activation engineering by Nina Rimsky | Aug 26, 2023 |
AF - A Model-based Approach to AI Existential Risk by Samuel Dylan Martin | Aug 25, 2023 |
AF - Implications of evidential cooperation in large worlds by Lukas Finnveden | Aug 23, 2023 |
AF - Causality and a Cost Semantics for Neural Networks by scottviteri | Aug 21, 2023 |
AF - "Dirty concepts" in AI alignment discourses, and some guesses for how to deal with them by Nora Ammann | Aug 20, 2023 |
AF - We can do better than DoWhatIMean by Luke H Miles | Aug 19, 2023 |
AF - An Overview of Catastrophic AI Risks: Summary by Dan H | Aug 18, 2023 |
AF - Managing risks of our own work by Beth Barnes | Aug 18, 2023 |
AF - Autonomous replication and adaptation: an attempt at a concrete danger threshold by Hjalmar Wijk | Aug 17, 2023 |
AF - If we had known the atmosphere would ignite by Jeffs | Aug 16, 2023 |
AF - A Proof of Löb's Theorem using Computability Theory by Jessica Taylor | Aug 16, 2023 |
AF - AGI is easier than robotaxis by Daniel Kokotajlo | Aug 13, 2023 |
AF - When discussing AI risks, talk about capabilities, not intelligence by Victoria Krakovna | Aug 11, 2023 |
AF - Linkpost: We need another Expert Survey on Progress in AI, urgently by David Mears | Aug 11, 2023 |
AF - Could We Automate AI Alignment Research? by Stephen McAleese | Aug 10, 2023 |
AF - The positional embedding matrix and previous-token heads: how do they actually work? by Adam Yedidia | Aug 10, 2023 |
AF - Mech Interp Challenge: August - Deciphering the First Unique Character Model by TheMcDouglas | Aug 09, 2023 |
AF - Ground-Truth Label Imbalance Impairs Contrast-Consistent Search Performance by Tom Angsten | Aug 09, 2023 |
AF - Modulating sycophancy in an RLHF model via activation steering by NinaR | Aug 09, 2023 |
AF - Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research by Evan Hubinger | Aug 08, 2023 |
AF - An interactive introduction to grokking and mechanistic interpretability by Adam Pearce | Aug 07, 2023 |
AF - Yann LeCun on AGI and AI Safety by Chris Leong | Aug 06, 2023 |
AF - Password-locked models: a stress case for capabilities evaluation by Fabien Roger | Aug 03, 2023 |
AF - 3 levels of threat obfuscation by HoldenKarnofsky | Aug 02, 2023 |
AF - ARC Evals new report: Evaluating Language-Model Agents on Realistic Autonomous Tasks by Beth Barnes | Aug 01, 2023 |
AF - The "no sandbagging on checkable tasks" hypothesis by Joe Carlsmith | Jul 31, 2023 |
AF - Watermarking considered overrated? by DanielFilan | Jul 31, 2023 |
AF - Thoughts on sharing information about language model capabilities by Paul Christiano | Jul 31, 2023 |
AF - Open Problems and Fundamental Limitations of RLHF by Stephen Casper | Jul 31, 2023 |
AF - When can we trust model evaluations? by Evan Hubinger | Jul 28, 2023 |
AF - Reducing sycophancy and improving honesty via activation steering by NinaR | Jul 28, 2023 |
AF - Mech Interp Puzzle 2: Word2Vec Style Embeddings by Neel Nanda | Jul 28, 2023 |
AF - Meta-level adversarial evaluation of oversight techniques might allow robust measurement of their adequacy by Buck Shlegeris | Jul 26, 2023 |
AF - Frontier Model Security by Matthew "Vaniver" Gray | Jul 26, 2023 |
AF - How LLMs are and are not myopic by janus | Jul 25, 2023 |
AF - Open problems in activation engineering by Alex Turner | Jul 24, 2023 |
AF - QAPR 5: grokking is maybe not that big a deal? by Quintin Pope | Jul 23, 2023 |
AF - Examples of Prompts that Make GPT-4 Output Falsehoods by Stephen Casper | Jul 22, 2023 |
AF - Reward Hacking from a Causal Perspective by Tom Everitt | Jul 21, 2023 |
AF - Priorities for the UK Foundation Models Taskforce by Andrea Miotti | Jul 21, 2023 |
AF - Even Superhuman Go AIs Have Surprising Failures Modes by AdamGleave | Jul 20, 2023 |
AF - Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla by Neel Nanda | Jul 20, 2023 |
AF - Speculative inferences about path dependence in LLM supervised fine-tuning from results on linear mode connectivity and model souping by Robert Kirk | Jul 20, 2023 |
AF - Alignment Grantmaking is Funding-Limited Right Now by johnswentworth | Jul 19, 2023 |
AF - Tiny Mech Interp Projects: Emergent Positional Embeddings of Words by Neel Nanda | Jul 18, 2023 |
AF - Still no Lie Detector for LLMs by Daniel Herrmann | Jul 18, 2023 |
AF - Meta announces Llama 2; "open sources" it for commercial use by Lawrence Chan | Jul 18, 2023 |
AF - Measuring and Improving the Faithfulness of Model-Generated Reasoning by Ansh Radhakrishnan | Jul 18, 2023 |
AF - Thoughts on "Process-Based Supervision" by Steve Byrnes | Jul 17, 2023 |
AF - AutoInterpretation Finds Sparse Coding Beats Alternatives by Hoagy | Jul 17, 2023 |
AF - Mech Interp Puzzle 1: Suspiciously Similar Embeddings in GPT-Neo by Neel Nanda | Jul 16, 2023 |
AF - Robustness of Model-Graded Evaluations and Automated Interpretability by Simon Lermen | Jul 15, 2023 |
AF - Eric Michaud on the Quantization Model of Neural Scaling, Interpretability and Grokking by Michaël Trazzi | Jul 12, 2023 |
AF - What does the launch of x.ai mean for AI Safety? by Chris Leong | Jul 12, 2023 |
AF - Towards Developmental Interpretability by Jesse Hoogland | Jul 12, 2023 |
AF - Goal-Direction for Simulated Agents by Raymond D | Jul 12, 2023 |
AF - Incentives from a causal perspective by Tom Everitt | Jul 10, 2023 |
AF - “Reframing Superintelligence” + LLMs + 4 years by Eric Drexler | Jul 10, 2023 |
AF - Open-minded updatelessness by Nicolas Macé | Jul 10, 2023 |
AF - Consciousness as a conflationary alliance term by Andrew Critch | Jul 10, 2023 |
AF - Really Strong Features Found in Residual Stream by Logan Riggs Smith | Jul 08, 2023 |
AF - Seven Strategies for Tackling the Hard Part of the Alignment Problem by Stephen Casper | Jul 08, 2023 |
AF - Continuous Adversarial Quality Assurance: Extending RLHF and Constitutional AI by Benaya Koren | Jul 08, 2023 |
AF - "Concepts of Agency in Biology" (Okasha, 2023) - Brief Paper Summary by Nora Ammann | Jul 08, 2023 |
AF - Views on when AGI comes and on strategy to reduce existential risk by Tsvi Benson-Tilsen | Jul 08, 2023 |
AF - Jesse Hoogland on Developmental Interpretability and Singular Learning Theory by Michaël Trazzi | Jul 06, 2023 |
AF - [Linkpost] Introducing Superalignment by Beren Millidge | Jul 05, 2023 |
AF - (tentatively) Found 600+ Monosemantic Features in a Small LM Using Sparse Autoencoders by Logan Riggs Smith | Jul 05, 2023 |
AF - Ten Levels of AI Alignment Difficulty by Samuel Dylan Martin | Jul 03, 2023 |
AF - VC Theory Overview by Joar Skalse | Jul 02, 2023 |
AF - Sources of evidence in Alignment by Martín Soto | Jul 02, 2023 |
AF - Quantitative cruxes in Alignment by Martín Soto | Jul 02, 2023 |
AF - How Smart Are Humans? by Joar Skalse | Jul 02, 2023 |
AF - Using (Uninterpretable) LLMs to Generate Interpretable AI Code by Joar Skalse | Jul 02, 2023 |
AF - Agency from a causal perspective by Tom Everitt | Jun 30, 2023 |
AF - When do "brains beat brawn" in Chess? An experiment by titotal | Jun 28, 2023 |
AF - Catastrophic Risks from AI #6: Discussion and FAQ by Dan H | Jun 27, 2023 |
AF - Catastrophic Risks from AI #5: Rogue AIs by Dan H | Jun 27, 2023 |
AF - Catastrophic Risks from AI #4: Organizational Risks by Dan H | Jun 26, 2023 |
AF - The fraught voyage of aligned novelty by Tsvi Benson-Tilsen | Jun 26, 2023 |
AF - Catastrophic Risks from AI #3: AI Race by Dan H | Jun 23, 2023 |
AF - Why Not Subagents? by johnswentworth | Jun 22, 2023 |
AF - An Overview of Catastrophic AI Risks #2 by Dan H | Jun 22, 2023 |
AF - An Overview of Catastrophic AI Risks #1 by Dan H | Jun 22, 2023 |
AF - The Hubinger lectures on AGI safety: an introductory lecture series by Evan Hubinger | Jun 22, 2023 |
AF - Causality: A Brief Introduction by Tom Everitt | Jun 20, 2023 |
AF - Ban development of unpredictable powerful models? by Alex Turner | Jun 20, 2023 |
AF - Mode collapse in RL may be fueled by the update equation by Alex Turner | Jun 19, 2023 |
AF - Experiments in Evaluating Steering Vectors by Gytis Daujotas | Jun 19, 2023 |
AF - Provisionality by Tsvi Benson-Tilsen | Jun 19, 2023 |
AF - Revising Drexler's CAIS model by Matthew Barnett | Jun 16, 2023 |
AF - [Replication] Conjecture's Sparse Coding in Small Transformers by Hoagy | Jun 16, 2023 |
AF - LLMs Sometimes Generate Purely Negatively-Reinforced Text by Fabien Roger | Jun 16, 2023 |
AF - MetaAI: less is less for alignment. by Cleo Nardo | Jun 13, 2023 |
AF - Virtual AI Safety Unconference (VAISU) by Linda Linsefors | Jun 13, 2023 |
AF - TASRA: A Taxonomy and Analysis of Societal-Scale Risks from AI by Andrew Critch | Jun 13, 2023 |
AF - Contingency: A Conceptual Tool from Evolutionary Biology for Alignment by clem acs | Jun 12, 2023 |
AF - ARC is hiring theoretical researchers by Paul Christiano | Jun 12, 2023 |
AF - Introduction to Towards Causal Foundations of Safe AGI by Tom Everitt | Jun 12, 2023 |
AF - Explicitness by Tsvi Benson-Tilsen | Jun 12, 2023 |
AF - Inference-Time Intervention: Eliciting Truthful Answers from a Language Model by likenneth | Jun 11, 2023 |
AF - How biosafety could inform AI standards by Olivia Jimenez | Jun 09, 2023 |
AF - Takeaways from the Mechanistic Interpretability Challenges by Stephen Casper | Jun 08, 2023 |
AF - What will GPT-2030 look like? by Jacob Steinhardt | Jun 07, 2023 |
AF - An Exercise to Build Intuitions on AGI Risk by Lauro Langosco | Jun 07, 2023 |
AF - A Playbook for AI Risk Reduction (focused on misaligned AI) by HoldenKarnofsky | Jun 06, 2023 |
AF - AISC end of program presentations by Linda Linsefors | Jun 06, 2023 |
AF - Algorithmic Improvement Is Probably Faster Than Scaling Now by johnswentworth | Jun 06, 2023 |
AF - Wildfire of strategicness by Tsvi Benson-Tilsen | Jun 05, 2023 |
AF - How to Think About Activation Patching by Neel Nanda | Jun 04, 2023 |