Episode | Date |
---|---|
AF - An Introduction to AI Sandbagging by Teun van der Weij | Apr 26, 2024 |
AF - AXRP Episode 29 - Science of Deep Learning with Vikrant Varma by DanielFilan | Apr 25, 2024 |
AF - Improving Dictionary Learning with Gated Sparse Autoencoders by Neel Nanda | Apr 25, 2024 |
AF - Simple probes can catch sleeper agents by Monte MacDiarmid | Apr 23, 2024 |
AF - Dequantifying first-order theories by Jessica Taylor | Apr 23, 2024 |
AF - ProLU: A Pareto Improvement for Sparse Autoencoders by Glen M. Taggart | Apr 23, 2024 |
AF - Time complexity for deterministic string machines by alcatal | Apr 22, 2024 |
AF - Inducing Unprompted Misalignment in LLMs by Sam Svenningsen | Apr 19, 2024 |
AF - Progress Update #1 from the GDM Mech Interp Team: Full Update by Neel Nanda | Apr 19, 2024 |
AF - Progress Update #1 from the GDM Mech Interp Team: Summary by Neel Nanda | Apr 19, 2024 |
AF - Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight by Sam Marks | Apr 18, 2024 |
AF - LLM Evaluators Recognize and Favor Their Own Generations by Arjun Panickssery | Apr 17, 2024 |
AF - Transformers Represent Belief State Geometry in their Residual Stream by Adam Shai | Apr 16, 2024 |
AF - Speedrun ruiner research idea by Luke H Miles | Apr 13, 2024 |
AF - The theory of Proximal Policy Optimisation implementations by salman.mohammadi | Apr 12, 2024 |
AF - How I select alignment research projects by Ethan Perez | Apr 10, 2024 |
AF - PIBBSS is hiring in a variety of roles (alignment research and incubation program) by Nora Ammann | Apr 09, 2024 |
AF - How We Picture Bayesian Agents by johnswentworth | Apr 08, 2024 |
AF - Measuring Learned Optimization in Small Transformer Models by Jonathan Bostock | Apr 08, 2024 |
AF - Measuring Predictability of Persona Evaluations by Thee Ho | Apr 06, 2024 |
AF - Koan: divining alien datastructures from RAM activations by Tsvi Benson-Tilsen | Apr 05, 2024 |
AF - LLMs for Alignment Research: a safety priority? by Abram Demski | Apr 04, 2024 |
AF - Run evals on base models too! by orthonormal | Apr 04, 2024 |
AF - The Case for Predictive Models by Rubi Hudson | Apr 03, 2024 |
AF - Sparsify: A mechanistic interpretability research agenda by Lee Sharkey | Apr 03, 2024 |
AF - A Selection of Randomly Selected SAE Features by CallumMcDougall | Apr 01, 2024 |
AF - SAE-VIS: Announcement Post by CallumMcDougall | Mar 31, 2024 |
AF - Your LLM Judge may be biased by Rachel Freedman | Mar 29, 2024 |
AF - SAE reconstruction errors are (empirically) pathological by Wes Gurnee | Mar 29, 2024 |
AF - How do LLMs give truthful answers? A discussion of LLM vs. human reasoning, ensembles and parrots by Owain Evans | Mar 28, 2024 |
AF - UDT1.01: The Story So Far (1/10) by Diffractor | Mar 27, 2024 |
AF - Modern Transformers are AGI, and Human-Level by Abram Demski | Mar 26, 2024 |
AF - Third-party testing as a key ingredient of AI policy by Zac Hatfield-Dodds | Mar 25, 2024 |
AF - Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders by Johnny Lin | Mar 25, 2024 |
AF - On the Confusion between Inner and Outer Misalignment by Chris Leong | Mar 25, 2024 |
AF - Dangers of Closed-Loop AI by Gordon Seidoh Worley | Mar 22, 2024 |
AF - Video and transcript of presentation on Scheming AIs by Joe Carlsmith | Mar 22, 2024 |
AF - Comparing Alignment to other AGI interventions: Extensions and analysis by Martín Soto | Mar 21, 2024 |
AF - Stagewise Development in Neural Networks by Jesse Hoogland | Mar 20, 2024 |
AF - Natural Latents: The Concepts by johnswentworth | Mar 20, 2024 |
AF - Comparing Alignment to other AGI interventions: Basic model by Martín Soto | Mar 20, 2024 |
AF - New report: Safety Cases for AI by Josh Clymer | Mar 20, 2024 |
AF - AtP*: An efficient and scalable method for localizing LLM behaviour to components by Neel Nanda | Mar 18, 2024 |
AF - Improving SAE's by Sqrt()-ing L1 and Removing Lowest Activating Features by Logan Riggs Smith | Mar 15, 2024 |
AF - More people getting into AI safety should do a PhD by AdamGleave | Mar 14, 2024 |
AF - Laying the Foundations for Vision and Multimodal Mechanistic Interpretability and Open Problems by Sonia Joseph | Mar 13, 2024 |
AF - Virtual AI Safety Unconference 2024 by Orpheus Lummis | Mar 13, 2024 |
AF - Transformer Debugger by Henk Tillman | Mar 12, 2024 |
AF - Open consultancy: Letting untrusted AIs choose what answer to argue for by Fabien Roger | Mar 12, 2024 |
AF - Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought by miles | Mar 11, 2024 |
AF - How disagreements about Evidential Correlations could be settled by Martín Soto | Mar 11, 2024 |
AF - Understanding SAE Features with the Logit Lens by Joseph Isaac Bloom | Mar 11, 2024 |
AF - 0th Person and 1st Person Logic by Adele Lopez | Mar 10, 2024 |
AF - Scenario Forecasting Workshop: Materials and Learnings by elifland | Mar 08, 2024 |
AF - Forecasting future gains due to post-training enhancements by elifland | Mar 08, 2024 |
AF - Evidential Correlations are Subjective, and it might be a problem by Martín Soto | Mar 07, 2024 |
AF - We Inspected Every Head In GPT-2 Small using SAEs So You Don't Have To by robertzk | Mar 06, 2024 |
AF - Many arguments for AI x-risk are wrong by Alex Turner | Mar 05, 2024 |
AF - Anthropic release Claude 3, claims >GPT-4 Performance by Lawrence Chan | Mar 04, 2024 |
AF - Some costs of superposition by Linda Linsefors | Mar 03, 2024 |
AF - Approaching Human-Level Forecasting with Language Models by Fred Zhang | Feb 29, 2024 |
AF - Tips for Empirical Alignment Research by Ethan Perez | Feb 29, 2024 |
AF - Post series on "Liability Law for reducing Existential Risk from AI" by Nora Ammann | Feb 29, 2024 |
AF - Timaeus's First Four Months by Jesse Hoogland | Feb 28, 2024 |
AF - Notes on control evaluations for safety cases by Ryan Greenblatt | Feb 28, 2024 |
AF - Counting arguments provide no evidence for AI doom by Nora Belrose | Feb 27, 2024 |
AF - Deconfusing In-Context Learning by Arjun Panickssery | Feb 25, 2024 |
AF - Instrumental deception and manipulation in LLMs - a case study by Olli Järviniemi | Feb 24, 2024 |
AF - The Shutdown Problem: Incomplete Preferences as a Solution by Elliott Thornley | Feb 23, 2024 |
AF - Analogies between scaling labs and misaligned superintelligent AI by Stephen Casper | Feb 21, 2024 |
AF - Extinction Risks from AI: Invisible to Science? by Vojtech Kovarik | Feb 21, 2024 |
AF - Extinction-level Goodhart's Law as a Property of the Environment by Vojtech Kovarik | Feb 21, 2024 |
AF - Dynamics Crucial to AI Risk Seem to Make for Complicated Models by Vojtech Kovarik | Feb 21, 2024 |
AF - Which Model Properties are Necessary for Evaluating an Argument? by Vojtech Kovarik | Feb 21, 2024 |
AF - Weak vs Quantitative Extinction-level Goodhart's Law by Vojtech Kovarik | Feb 21, 2024 |
AF - Why does generalization work? by Martín Soto | Feb 20, 2024 |
AF - Complexity classes for alignment properties by Arun Jose | Feb 20, 2024 |
AF - Protocol evaluations: good analogies vs control by Fabien Roger | Feb 19, 2024 |
AF - Self-Awareness: Taxonomy and eval suite proposal by Daniel Kokotajlo | Feb 17, 2024 |
AF - The Pointer Resolution Problem by Arun Jose | Feb 16, 2024 |
AF - Retrospective: PIBBSS Fellowship 2023 by DusanDNesic | Feb 16, 2024 |
AF - Searching for Searching for Search by Rubi Hudson | Feb 14, 2024 |
AF - Critiques of the AI control agenda by Arun Jose | Feb 14, 2024 |
AF - Requirements for a Basin of Attraction to Alignment by Roger Dearnaley | Feb 14, 2024 |
AF - Interpreting Quantum Mechanics in Infra-Bayesian Physicalism by Yegreg | Feb 12, 2024 |
AF - Natural abstractions are observer-dependent: a conversation with John Wentworth by Martín Soto | Feb 12, 2024 |
AF - Updatelessness doesn't solve most problems by Martín Soto | Feb 08, 2024 |
AF - Debating with More Persuasive LLMs Leads to More Truthful Answers by Akbir Khan | Feb 07, 2024 |
AF - How to train your own "Sleeper Agents" by Evan Hubinger | Feb 07, 2024 |
AF - what does davidad want from "boundaries"? by Chipmonk | Feb 06, 2024 |
AF - Preventing exfiltration via upload limits seems promising by Ryan Greenblatt | Feb 06, 2024 |
AF - Attention SAEs Scale to GPT-2 Small by Connor Kissane | Feb 03, 2024 |
AF - Survey for alignment researchers: help us build better field-level models by Cameron Berg | Feb 02, 2024 |
AF - Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small by Joseph Isaac Bloom | Feb 02, 2024 |
AF - Evaluating Stability of Unreflective Alignment by james.lucassen | Feb 01, 2024 |
AF - PIBBSS Speaker events comings up in February by DusanDNesic | Feb 01, 2024 |
AF - Last call for submissions for TAIS 2024! by Blaine William Rogers | Jan 30, 2024 |
AF - The case for more ambitious language model evals by Arun Jose | Jan 30, 2024 |
AF - Agents that act for reasons: a thought experiment by Michele Campolo | Jan 24, 2024 |
AF - We need a science of evals by Marius Hobbhahn | Jan 22, 2024 |
AF - InterLab - a toolkit for experiments with multi-agent interactions by Tomáš Gavenčiak | Jan 22, 2024 |
AF - A Shutdown Problem Proposal by johnswentworth | Jan 21, 2024 |
AF - Four visions of Transformative AI success by Steve Byrnes | Jan 17, 2024 |
AF - Managing catastrophic misuse without robust AIs by Ryan Greenblatt | Jan 16, 2024 |
AF - Sparse Autoencoders Work on Attention Layer Outputs by Connor Kissane | Jan 16, 2024 |
AF - Investigating Bias Representations in LLMs via Activation Steering by DawnLu | Jan 15, 2024 |
AF - Goals selected from learned knowledge: an alternative to RL alignment by Seth Herd | Jan 15, 2024 |
AF - Three Types of Constraints in the Space of Agents by Nora Ammann | Jan 15, 2024 |
AF - Introducing Alignment Stress-Testing at Anthropic by Evan Hubinger | Jan 12, 2024 |
AF - Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training by Evan Hubinger | Jan 12, 2024 |
AF - Apply to the PIBBSS Summer Research Fellowship by Nora Ammann | Jan 12, 2024 |
AF - Goodbye, Shoggoth: The Stage, its Animatronics, and the Puppeteer - a New Metaphor by Roger Dearnaley | Jan 09, 2024 |
AF - A starter guide for evals by Marius Hobbhahn | Jan 08, 2024 |
AF - Deceptive AI ≠ Deceptively-aligned AI by Steve Byrnes | Jan 07, 2024 |
AF - Catching AIs red-handed by Ryan Greenblatt | Jan 05, 2024 |
AF - Predictive model agents are sort of corrigible by Raymond D | Jan 05, 2024 |
AF - What's up with LLMs representing XORs of arbitrary features? by Sam Marks | Jan 03, 2024 |
AF - Safety First: safety before full alignment. The deontic sufficiency hypothesis. by Chipmonk | Jan 03, 2024 |
AF - Steering Llama-2 with contrastive activation additions by Nina Rimsky | Jan 02, 2024 |
AF - Mech Interp Challenge: January - Deciphering the Caesar Cipher Model by CallumMcDougall | Jan 01, 2024 |
AF - A hermeneutic net for agency by Tsvi Benson-Tilsen | Jan 01, 2024 |
AF - A case for AI alignment being difficult by Jessica Taylor | Dec 31, 2023 |
AF - AI Alignment Metastrategy by Vanessa Kosoy | Dec 31, 2023 |
AF - Free agents by Michele Campolo | Dec 27, 2023 |
AF - Critical review of Christiano's disagreements with Yudkowsky by Vanessa Kosoy | Dec 27, 2023 |
AF - AGI will be made of heterogeneous components, Transformer and Selective SSM blocks will be among them by Roman Leventov | Dec 27, 2023 |
AF - 5. Moral Value for Sentient Animals? Alas, Not Yet by Roger Dearnaley | Dec 27, 2023 |
AF - Fact Finding: Do Early Layers Specialise in Local Processing? (Post 5) by Neel Nanda | Dec 23, 2023 |
AF - Measurement tampering detection as a special case of weak-to-strong generalization by Ryan Greenblatt | Dec 23, 2023 |
AF - Idealized Agents Are Approximate Causal Mirrors (+ Radical Optimism on Agent Foundations) by Thane Ruthenis | Dec 22, 2023 |
AF - Open positions: Research Analyst at the AI Standards Lab by Koen Holtman | Dec 22, 2023 |
AF - How Would an Utopia-Maximizer Look Like? by Thane Ruthenis | Dec 20, 2023 |
AF - Meaning and Agency by Abram Demski | Dec 19, 2023 |
AF - Don't Share Information Exfohazardous on Others' AI-Risk Models by Thane Ruthenis | Dec 19, 2023 |
AF - Paper: Tell, Don't Show- Declarative facts influence how LLMs generalize by Owain Evans | Dec 19, 2023 |
AF - Assessment of AI safety agendas: think about the downside risk by Roman Leventov | Dec 19, 2023 |
AF - The Shortest Path Between Scylla and Charybdis by Thane Ruthenis | Dec 18, 2023 |
AF - Discussion: Challenges with Unsupervised LLM Knowledge Discovery by Seb Farquhar | Dec 18, 2023 |
AF - Interpreting the Learning of Deceit by Roger Dearnaley | Dec 18, 2023 |
AF - A Common-Sense Case For Mutually-Misaligned AGIs Allying Against Humans by Thane Ruthenis | Dec 17, 2023 |
AF - OpenAI, DeepMind, Anthropic, etc. should shut down. by Tamsin Leake | Dec 17, 2023 |
AF - Bounty: Diverse hard tasks for LLM agents by Beth Barnes | Dec 17, 2023 |
AF - Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem by Ansh Radhakrishnan | Dec 16, 2023 |
AF - Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision by leogao | Dec 16, 2023 |
AF - Current AIs Provide Nearly No Data Relevant to AGI Alignment by Thane Ruthenis | Dec 15, 2023 |
AF - AI Control: Improving Safety Despite Intentional Subversion by Buck Shlegeris | Dec 13, 2023 |
AF - Some biases and selection effects in AI risk discourse by Tamsin Leake | Dec 12, 2023 |
AF - Adversarial Robustness Could Help Prevent Catastrophic Misuse by Aidan O'Gara | Dec 11, 2023 |
AF - Empirical work that might shed light on scheming (Section 6 of "Scheming AIs") by Joe Carlsmith | Dec 11, 2023 |
AF - Quick thoughts on the implications of multi-agent views of mind on AI takeover by Kaj Sotala | Dec 11, 2023 |
AF - Auditing failures vs concentrated failures by Ryan Greenblatt | Dec 11, 2023 |
AF - How LDT helps reduce the AI arms race by Tamsin Leake | Dec 10, 2023 |
AF - Send us example gnarly bugs by Beth Barnes | Dec 10, 2023 |
AF - Summing up "Scheming AIs" (Section 5) by Joe Carlsmith | Dec 09, 2023 |
AF - Finding Sparse Linear Connections between Features in LLMs by Logan Riggs Smith | Dec 09, 2023 |
AF - Speed arguments against scheming (Section 4.4-4.7 of "Scheming AIs") by Joe Carlsmith | Dec 08, 2023 |
AF - Simplicity arguments for scheming (Section 4.3 of "Scheming AIs") by Joe Carlsmith | Dec 07, 2023 |
AF - The counting argument for scheming (Sections 4.1 and 4.2 of "Scheming AIs") by Joe Carlsmith | Dec 06, 2023 |
AF - Google Gemini Announced by g-w1 | Dec 06, 2023 |
AF - Arguments for/against scheming that focus on the path SGD takes (Section 3 of "Scheming AIs") by Joe Carlsmith | Dec 05, 2023 |
AF - Studying The Alien Mind by Quentin Feuillade--Montixi | Dec 05, 2023 |
AF - Deep Forgetting and Unlearning for Safely-Scoped LLMs by Stephen Casper | Dec 05, 2023 |
AF - Neural uncertainty estimation for alignment by Charlie Steiner | Dec 05, 2023 |
AF - Some open-source dictionaries and dictionary learning infrastructure by Sam Marks | Dec 05, 2023 |
AF - 2023 Alignment Research Updates from FAR AI by AdamGleave | Dec 04, 2023 |
AF - What's new at FAR AI by AdamGleave | Dec 04, 2023 |
AF - Non-classic stories about scheming (Section 2.3.2 of "Scheming AIs") by Joe Carlsmith | Dec 04, 2023 |
AF - Does scheming lead to adequate future empowerment? (Section 2.3.1.2 of "Scheming AIs") by Joe Carlsmith | Dec 03, 2023 |
AF - The goal-guarding hypothesis (Section 2.3.1.1 of "Scheming AIs") by Joe Carlsmith | Dec 02, 2023 |
AF - Thoughts on "AI is easy to control" by Pope and Belrose by Steve Byrnes | Dec 01, 2023 |
AF - How useful for alignment-relevant work are AIs with short-term goals? (Section 2.2.4.3 of "Scheming AIs") by Joe Carlsmith | Dec 01, 2023 |
AF - FixDT by Abram Demski | Nov 30, 2023 |
AF - Is scheming more likely in models trained to have long-term goals? (Sections 2.2.4.1-2.2.4.2 of "Scheming AIs") by Joe Carlsmith | Nov 30, 2023 |
AF - [Linkpost] Remarks on the Convergence in Distribution of Random Neural Networks to Gaussian Processes in the Infinite Width Limit by Spencer Becker-Kahn | Nov 30, 2023 |
AF - "Clean" vs. "messy" goal-directedness (Section 2.2.3 of "Scheming AIs") by Joe Carlsmith | Nov 29, 2023 |
AF - Intro to Superposition and Sparse Autoencoders (Colab exercises) by CallumMcDougall | Nov 29, 2023 |
AF - How to Control an LLM's Behavior (why my P(DOOM) went down) by Roger Dearnaley | Nov 28, 2023 |
AF - Two sources of beyond-episode goals (Section 2.2.2 of "Scheming AIs") by Joe Carlsmith | Nov 28, 2023 |
AF - Anthropic Fall 2023 Debate Progress Update by Ansh Radhakrishnan | Nov 28, 2023 |
AF - AISC 2024 - Project Summaries by Nicky Pochinkov | Nov 27, 2023 |
AF - There is no IQ for AI by Gabriel Alfour | Nov 27, 2023 |
AF - Two concepts of an "episode" (Section 2.2.1 of "Scheming AIs") by Joe Carlsmith | Nov 27, 2023 |
AF - Situational awareness (Section 2.1 of "Scheming AIs") by Joe Carlsmith | Nov 26, 2023 |
AF - On "slack" in training (Section 1.5 of "Scheming AIs") by Joe Carlsmith | Nov 25, 2023 |
AF - Why focus on schemers in particular (Sections 1.3 and 1.4 of "Scheming AIs") by Joe Carlsmith | Nov 24, 2023 |
AF - Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense by Nate Soares | Nov 24, 2023 |
AF - 4. A Moral Case for Evolved-Sapience-Chauvinism by Roger Dearnaley | Nov 24, 2023 |
AF - 3. Uploading by Roger Dearnaley | Nov 23, 2023 |
AF - Thomas Kwa's research journal by Thomas Kwa | Nov 23, 2023 |
AF - A taxonomy of non-schemer models (Section 1.2 of "Scheming AIs") by Joe Carlsmith | Nov 22, 2023 |
AF - Public Call for Interest in Mathematical Alignment by David Manheim | Nov 22, 2023 |
AF - Varieties of fake alignment (Section 1.1 of "Scheming AIs") by Joe Carlsmith | Nov 21, 2023 |
AF - Alignment can improve generalisation through more robustly doing what a human wants - CoinRun example by Stuart Armstrong | Nov 21, 2023 |
AF - Agent Boundaries Aren't Markov Blankets. by Abram Demski | Nov 20, 2023 |
AF - New paper shows truthfulness and instruction-following don't generalize by default by Josh Clymer | Nov 19, 2023 |
AF - My Criticism of Singular Learning Theory by Joar Skalse | Nov 19, 2023 |
AF - AI Safety Camp 2024 by Linda Linsefors | Nov 18, 2023 |
AF - Sam Altman fired from OpenAI by Lawrence Chan | Nov 17, 2023 |
AF - Coup probes trained off-policy by Fabien Roger | Nov 17, 2023 |
AF - Evaluating AI Systems for Moral Status Using Self-Reports by Ethan Perez | Nov 16, 2023 |
AF - Experiences and learnings from both sides of the AI safety job market by Marius Hobbhahn | Nov 15, 2023 |
AF - Theories of Change for AI Auditing by Lee Sharkey | Nov 13, 2023 |
AF - Open Phil releases RFPs on LLM Benchmarks and Forecasting by Lawrence Chan | Nov 11, 2023 |
AF - We have promising alignment plans with low taxes by Seth Herd | Nov 10, 2023 |
AF - Learning-theoretic agenda reading list by Vanessa Kosoy | Nov 09, 2023 |
AF - Five projects from AI Safety Hub Labs 2023 by Charlie Griffin | Nov 08, 2023 |
AF - Tall Tales at Different Scales: Evaluating Scaling Trends For Deception In Language Models by Felix Hofstätter | Nov 08, 2023 |
AF - Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation by Soroush Pour | Nov 07, 2023 |
AF - Box inversion revisited by Jan Kulveit | Nov 07, 2023 |
AF - Announcing TAIS 2024 by Blaine William Rogers | Nov 06, 2023 |
AF - Genetic fitness is a measure of selection strength, not the selection target by Kaj Sotala | Nov 04, 2023 |
AF - Untrusted smart models and trusted dumb models by Buck Shlegeris | Nov 04, 2023 |
AF - Thoughts on open source AI by Sam Marks | Nov 03, 2023 |
AF - Mech Interp Challenge: November - Deciphering the Cumulative Sum Model by TheMcDouglas | Nov 02, 2023 |
AF - My thoughts on the social response to AI risk by Matthew Barnett | Nov 01, 2023 |
AF - Dario Amodei's prepared remarks from the UK AI Safety Summit, on Anthropic's Responsible Scaling Policy by Zac Hatfield-Dodds | Nov 01, 2023 |
AF - 4. Risks from causing illegitimate value change (performative predictors) by Nora Ammann | Oct 26, 2023 |
AF - 3. Premise three and Conclusion: AI systems can affect value change trajectories and the Value Change Problem by Nora Ammann | Oct 26, 2023 |
AF - I don't find the lie detection results that surprising (by an author of the paper) by JanBrauner | Oct 04, 2023 |
AF - Some Quick Follow-Up Experiments to "Taken out of context: On measuring situational awareness in LLMs" by miles | Oct 03, 2023 |
AF - Direction of Fit by Nicholas Kees Dupuis | Oct 02, 2023 |
AF - New Tool: the Residual Stream Viewer by Adam Yedidia | Oct 01, 2023 |
AF - How model editing could help with the alignment problem by Michael Ripa | Sep 30, 2023 |
AF - How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions by JanBrauner | Sep 28, 2023 |
AF - Alignment Workshop talks by Richard Ngo | Sep 28, 2023 |
AF - Different views of alignment have different consequences for imperfect methods by Stuart Armstrong | Sep 28, 2023 |
AF - Projects I would like to see (possibly at AI Safety Camp) by Linda Linsefors | Sep 27, 2023 |
AF - Announcing the CNN Interpretability Competition by Stephen Casper | Sep 26, 2023 |
AF - Impact stories for model internals: an exercise for interpretability researchers by Jenny Nitishinskaya | Sep 25, 2023 |
AF - What is wrong with this "utility switch button problem" approach? by Donald Hobson | Sep 25, 2023 |
AF - Understanding strategic deception and deceptive alignment by Marius Hobbhahn | Sep 25, 2023 |
AF - Sparse Autoencoders: Future Work by Logan Riggs Smith | Sep 21, 2023 |
AF - Sparse Autoencoders Find Highly Interpretable Directions in Language Models by Logan Riggs Smith | Sep 21, 2023 |
AF - There should be more AI safety orgs by Marius Hobbhahn | Sep 21, 2023 |
AF - Image Hijacks: Adversarial Images can Control Generative Models at Runtime by Scott Emmons | Sep 20, 2023 |
AF - Interpretability Externalities Case Study - Hungry Hungry Hippos by Magdalena Wache | Sep 20, 2023 |
AF - Anthropic's Responsible Scaling Policy and Long Term Benefit Trust by Zac Hatfield-Dodds | Sep 19, 2023 |
AF - Where might I direct promising-to-me researchers to apply for alignment jobs/grants? by Abram Demski | Sep 18, 2023 |
AF - Three ways interpretability could be impactful by Arthur Conmy | Sep 18, 2023 |
AF - Telopheme, telophore, and telotect by Tsvi Benson-Tilsen | Sep 17, 2023 |
AF - How to talk about reasons why AGI might not be near? by Kaj Sotala | Sep 17, 2023 |
AF - Uncovering Latent Human Wellbeing in LLM Embeddings by ChengCheng | Sep 14, 2023 |
AF - Mech Interp Challenge: September - Deciphering the Addition Model by TheMcDouglas | Sep 13, 2023 |
AF - Apply to lead a project during the next virtual AI Safety Camp by Linda Linsefors | Sep 13, 2023 |
AF - UDT shows that decision theory is more puzzling than ever by Wei Dai | Sep 13, 2023 |
AF - Focus on the Hardest Part First by Johannes C. Mayer | Sep 11, 2023 |
AF - Explaining grokking through circuit efficiency by Vikrant Varma | Sep 08, 2023 |
AF - The Löbian Obstacle, And Why You Should Care by marc/er | Sep 07, 2023 |
AF - Recreating the caring drive by Catnee | Sep 07, 2023 |
AF - ActAdd: Steering Language Models without Optimization by technicalities | Sep 06, 2023 |
AF - What I would do if I wasn't at ARC Evals by Lawrence Chan | Sep 05, 2023 |
AF - Benchmarks for Detecting Measurement Tampering [Redwood Research] by Ryan Greenblatt | Sep 05, 2023 |
AF - Paper: On measuring situational awareness in LLMs by Owain Evans | Sep 04, 2023 |
AF - Fundamental question: What determines a mind's effects? by Tsvi Benson-Tilsen | Sep 03, 2023 |
AF - Series of absurd upgrades in nature's great search by Luke H Miles | Sep 03, 2023 |
AF - PIBBSS Summer Symposium 2023 by Nora Ammann | Sep 02, 2023 |
AF - Tensor Trust: An online game to uncover prompt injection vulnerabilities by Luke Bailey | Sep 01, 2023 |
AF - Meta Questions about Metaphilosophy by Wei Dai | Sep 01, 2023 |
AF - Responses to apparent rationalist confusions about game / decision theory by Anthony DiGiovanni | Aug 30, 2023 |
AF - Paper Walkthrough: Automated Circuit Discovery with Arthur Conmy by Neel Nanda | Aug 29, 2023 |
AF - An OV-Coherent Toy Model of Attention Head Superposition by LaurenGreenspan | Aug 29, 2023 |
AF - Barriers to Mechanistic Interpretability for AGI Safety by Connor Leahy | Aug 29, 2023 |
AF - AI Deception: A Survey of Examples, Risks, and Potential Solutions by Simon Goldstein | Aug 29, 2023 |
AF - OpenAI base models are not sycophantic, at any size by nostalgebraist | Aug 29, 2023 |
AF - Paradigms and Theory Choice in AI: Adaptivity, Economy and Control by particlemania | Aug 28, 2023 |
AF - A list of core AI safety problems and how I hope to solve them by davidad (David A. Dalrymple) | Aug 26, 2023 |
AF - Red-teaming language models via activation engineering by Nina Rimsky | Aug 26, 2023 |
AF - A Model-based Approach to AI Existential Risk by Samuel Dylan Martin | Aug 25, 2023 |
AF - Implications of evidential cooperation in large worlds by Lukas Finnveden | Aug 23, 2023 |
AF - Causality and a Cost Semantics for Neural Networks by scottviteri | Aug 21, 2023 |
AF - "Dirty concepts" in AI alignment discourses, and some guesses for how to deal with them by Nora Ammann | Aug 20, 2023 |
AF - We can do better than DoWhatIMean by Luke H Miles | Aug 19, 2023 |
AF - An Overview of Catastrophic AI Risks: Summary by Dan H | Aug 18, 2023 |
AF - Managing risks of our own work by Beth Barnes | Aug 18, 2023 |
AF - Autonomous replication and adaptation: an attempt at a concrete danger threshold by Hjalmar Wijk | Aug 17, 2023 |
AF - If we had known the atmosphere would ignite by Jeffs | Aug 16, 2023 |
AF - A Proof of Löb's Theorem using Computability Theory by Jessica Taylor | Aug 16, 2023 |
AF - AGI is easier than robotaxis by Daniel Kokotajlo | Aug 13, 2023 |
AF - When discussing AI risks, talk about capabilities, not intelligence by Victoria Krakovna | Aug 11, 2023 |
AF - Linkpost: We need another Expert Survey on Progress in AI, urgently by David Mears | Aug 11, 2023 |
AF - Could We Automate AI Alignment Research? by Stephen McAleese | Aug 10, 2023 |
AF - The positional embedding matrix and previous-token heads: how do they actually work? by Adam Yedidia | Aug 10, 2023 |
AF - Mech Interp Challenge: August - Deciphering the First Unique Character Model by TheMcDouglas | Aug 09, 2023 |
AF - Ground-Truth Label Imbalance Impairs Contrast-Consistent Search Performance by Tom Angsten | Aug 09, 2023 |
AF - Modulating sycophancy in an RLHF model via activation steering by NinaR | Aug 09, 2023 |
AF - Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research by Evan Hubinger | Aug 08, 2023 |
AF - An interactive introduction to grokking and mechanistic interpretability by Adam Pearce | Aug 07, 2023 |
AF - Yann LeCun on AGI and AI Safety by Chris Leong | Aug 06, 2023 |
AF - Password-locked models: a stress case for capabilities evaluation by Fabien Roger | Aug 03, 2023 |
AF - 3 levels of threat obfuscation by HoldenKarnofsky | Aug 02, 2023 |
AF - ARC Evals new report: Evaluating Language-Model Agents on Realistic Autonomous Tasks by Beth Barnes | Aug 01, 2023 |
AF - The "no sandbagging on checkable tasks" hypothesis by Joe Carlsmith | Jul 31, 2023 |
AF - Watermarking considered overrated? by DanielFilan | Jul 31, 2023 |
AF - Thoughts on sharing information about language model capabilities by Paul Christiano | Jul 31, 2023 |
AF - Open Problems and Fundamental Limitations of RLHF by Stephen Casper | Jul 31, 2023 |
AF - When can we trust model evaluations? by Evan Hubinger | Jul 28, 2023 |
AF - Reducing sycophancy and improving honesty via activation steering by NinaR | Jul 28, 2023 |
AF - Mech Interp Puzzle 2: Word2Vec Style Embeddings by Neel Nanda | Jul 28, 2023 |
AF - Meta-level adversarial evaluation of oversight techniques might allow robust measurement of their adequacy by Buck Shlegeris | Jul 26, 2023 |
AF - Frontier Model Security by Matthew "Vaniver" Gray | Jul 26, 2023 |
AF - How LLMs are and are not myopic by janus | Jul 25, 2023 |
AF - Open problems in activation engineering by Alex Turner | Jul 24, 2023 |
AF - QAPR 5: grokking is maybe not that big a deal? by Quintin Pope | Jul 23, 2023 |
AF - Examples of Prompts that Make GPT-4 Output Falsehoods by Stephen Casper | Jul 22, 2023 |
AF - Reward Hacking from a Causal Perspective by Tom Everitt | Jul 21, 2023 |
AF - Priorities for the UK Foundation Models Taskforce by Andrea Miotti | Jul 21, 2023 |
AF - Even Superhuman Go AIs Have Surprising Failures Modes by AdamGleave | Jul 20, 2023 |
AF - Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla by Neel Nanda | Jul 20, 2023 |
AF - Speculative inferences about path dependence in LLM supervised fine-tuning from results on linear mode connectivity and model souping by Robert Kirk | Jul 20, 2023 |
AF - Alignment Grantmaking is Funding-Limited Right Now by johnswentworth | Jul 19, 2023 |
AF - Tiny Mech Interp Projects: Emergent Positional Embeddings of Words by Neel Nanda | Jul 18, 2023 |
AF - Still no Lie Detector for LLMs by Daniel Herrmann | Jul 18, 2023 |
AF - Meta announces Llama 2; "open sources" it for commercial use by Lawrence Chan | Jul 18, 2023 |
AF - Measuring and Improving the Faithfulness of Model-Generated Reasoning by Ansh Radhakrishnan | Jul 18, 2023 |
AF - Thoughts on "Process-Based Supervision" by Steve Byrnes | Jul 17, 2023 |
AF - AutoInterpretation Finds Sparse Coding Beats Alternatives by Hoagy | Jul 17, 2023 |
AF - Mech Interp Puzzle 1: Suspiciously Similar Embeddings in GPT-Neo by Neel Nanda | Jul 16, 2023 |
AF - Robustness of Model-Graded Evaluations and Automated Interpretability by Simon Lermen | Jul 15, 2023 |
AF - Eric Michaud on the Quantization Model of Neural Scaling, Interpretability and Grokking by Michaël Trazzi | Jul 12, 2023 |
AF - What does the launch of x.ai mean for AI Safety? by Chris Leong | Jul 12, 2023 |
AF - Towards Developmental Interpretability by Jesse Hoogland | Jul 12, 2023 |
AF - Goal-Direction for Simulated Agents by Raymond D | Jul 12, 2023 |
AF - Incentives from a causal perspective by Tom Everitt | Jul 10, 2023 |
AF - “Reframing Superintelligence” + LLMs + 4 years by Eric Drexler | Jul 10, 2023 |
AF - Open-minded updatelessness by Nicolas Macé | Jul 10, 2023 |
AF - Consciousness as a conflationary alliance term by Andrew Critch | Jul 10, 2023 |
AF - Really Strong Features Found in Residual Stream by Logan Riggs Smith | Jul 08, 2023 |
AF - Seven Strategies for Tackling the Hard Part of the Alignment Problem by Stephen Casper | Jul 08, 2023 |
AF - Continuous Adversarial Quality Assurance: Extending RLHF and Constitutional AI by Benaya Koren | Jul 08, 2023 |
AF - "Concepts of Agency in Biology" (Okasha, 2023) - Brief Paper Summary by Nora Ammann | Jul 08, 2023 |
AF - Views on when AGI comes and on strategy to reduce existential risk by Tsvi Benson-Tilsen | Jul 08, 2023 |
AF - Jesse Hoogland on Developmental Interpretability and Singular Learning Theory by Michaël Trazzi | Jul 06, 2023 |
AF - [Linkpost] Introducing Superalignment by Beren Millidge | Jul 05, 2023 |
AF - (tentatively) Found 600+ Monosemantic Features in a Small LM Using Sparse Autoencoders by Logan Riggs Smith | Jul 05, 2023 |
AF - Ten Levels of AI Alignment Difficulty by Samuel Dylan Martin | Jul 03, 2023 |
AF - VC Theory Overview by Joar Skalse | Jul 02, 2023 |
AF - Sources of evidence in Alignment by Martín Soto | Jul 02, 2023 |
AF - Quantitative cruxes in Alignment by Martín Soto | Jul 02, 2023 |
AF - How Smart Are Humans? by Joar Skalse | Jul 02, 2023 |
AF - Using (Uninterpretable) LLMs to Generate Interpretable AI Code by Joar Skalse | Jul 02, 2023 |
AF - Agency from a causal perspective by Tom Everitt | Jun 30, 2023 |
AF - When do "brains beat brawn" in Chess? An experiment by titotal | Jun 28, 2023 |
AF - Catastrophic Risks from AI #6: Discussion and FAQ by Dan H | Jun 27, 2023 |
AF - Catastrophic Risks from AI #5: Rogue AIs by Dan H | Jun 27, 2023 |
AF - Catastrophic Risks from AI #4: Organizational Risks by Dan H | Jun 26, 2023 |
AF - The fraught voyage of aligned novelty by Tsvi Benson-Tilsen | Jun 26, 2023 |
AF - Catastrophic Risks from AI #3: AI Race by Dan H | Jun 23, 2023 |
AF - Why Not Subagents? by johnswentworth | Jun 22, 2023 |
AF - An Overview of Catastrophic AI Risks #2 by Dan H | Jun 22, 2023 |
AF - An Overview of Catastrophic AI Risks #1 by Dan H | Jun 22, 2023 |
AF - The Hubinger lectures on AGI safety: an introductory lecture series by Evan Hubinger | Jun 22, 2023 |
AF - Causality: A Brief Introduction by Tom Everitt | Jun 20, 2023 |
AF - Ban development of unpredictable powerful models? by Alex Turner | Jun 20, 2023 |
AF - Mode collapse in RL may be fueled by the update equation by Alex Turner | Jun 19, 2023 |
AF - Experiments in Evaluating Steering Vectors by Gytis Daujotas | Jun 19, 2023 |
AF - Provisionality by Tsvi Benson-Tilsen | Jun 19, 2023 |
AF - Revising Drexler's CAIS model by Matthew Barnett | Jun 16, 2023 |
AF - [Replication] Conjecture's Sparse Coding in Small Transformers by Hoagy | Jun 16, 2023 |
AF - LLMs Sometimes Generate Purely Negatively-Reinforced Text by Fabien Roger | Jun 16, 2023 |
AF - MetaAI: less is less for alignment. by Cleo Nardo | Jun 13, 2023 |
AF - Virtual AI Safety Unconference (VAISU) by Linda Linsefors | Jun 13, 2023 |
AF - TASRA: A Taxonomy and Analysis of Societal-Scale Risks from AI by Andrew Critch | Jun 13, 2023 |
AF - Contingency: A Conceptual Tool from Evolutionary Biology for Alignment by clem acs | Jun 12, 2023 |
AF - ARC is hiring theoretical researchers by Paul Christiano | Jun 12, 2023 |
AF - Introduction to Towards Causal Foundations of Safe AGI by Tom Everitt | Jun 12, 2023 |
AF - Explicitness by Tsvi Benson-Tilsen | Jun 12, 2023 |
AF - Inference-Time Intervention: Eliciting Truthful Answers from a Language Model by likenneth | Jun 11, 2023 |
AF - How biosafety could inform AI standards by Olivia Jimenez | Jun 09, 2023 |
AF - Takeaways from the Mechanistic Interpretability Challenges by Stephen Casper | Jun 08, 2023 |
AF - What will GPT-2030 look like? by Jacob Steinhardt | Jun 07, 2023 |
AF - An Exercise to Build Intuitions on AGI Risk by Lauro Langosco | Jun 07, 2023 |
AF - A Playbook for AI Risk Reduction (focused on misaligned AI) by HoldenKarnofsky | Jun 06, 2023 |
AF - AISC end of program presentations by Linda Linsefors | Jun 06, 2023 |
AF - Algorithmic Improvement Is Probably Faster Than Scaling Now by johnswentworth | Jun 06, 2023 |
AF - Wildfire of strategicness by Tsvi Benson-Tilsen | Jun 05, 2023 |
AF - How to Think About Activation Patching by Neel Nanda | Jun 04, 2023 |