Linear Probes Mechanistic Interpretability, Lesson: For detection tasks, use probes.

Linear Probes Mechanistic Interpretability, It could help ensure safety and alignment. Finally, I conclude by arguing that progress on the generalizability of mechanistic interpretability research will AI interpretability isn't a footnote in research papers anymore. Mechanistic interpretability is more than a scientific curiosity—it has direct implications for enterprise risk management, safety, trust, and compliance. One question that comes up sometimes in interpretability work, is: “why do I trust simple linear probes more than complex non-linear ones?”. Linear probes have been widely used for interpretability to understand performance of deep models with application to language processing (Hewitt & Liang, 2019;Hewitt & Manning, In this talk, Neel Nanda describes his team's pivot from ambitious mechanistic interpretability toward "pragmatic interpretability": using proxy tasks and hard-to-fake empirical benchmarks to Mechanistic interpretability: This thread investigates the internal computational structures and shared mechanisms that enable neural networks to generalize across diverse tasks, aiming to reveal how Mechanistic Interpretability in AI and Large Language Models What is Mechanistic Interpretability? Mechanistic interpretability is the study of how neural networks compute their outputs by reverse Mechanistic interpretability aims to understand the computational mechanisms underlying neural networks' capabilities in order to accomplish concrete scientific and engineering Linear probes are simple, independently trained linear classifiers added to intermediate layers to gauge the linear separability of features. Re-cently, MI has garnered However, the advent of mechanistic interpretability led to much more work in this space, including causal approaches that do not rely on probes. Sheet 8. In this work, we demonstrate that linear probes on LLMs internal activations can detect deception in their r sponses with extremely high accuracy. 
Lesson: For detection tasks, use probes. Interpretable intent-prediction pipeline. Natural language processing (NLP), also known as computational linguistics, is an important research direction in artificial intelligence (AI); its core topics include language modeling, lexical analysis, syntactic analysis, and semantic analysis. The second day of week 1 covers: Mechanistic Interpretability – what it is, and its path to impact; Anthropic’s Transformer Circuits sequence (starting with A Mathematical Framework for Mechanistic Interpretability Meets Vision Language Models: Insights and Limitations Vision language models (VLMs), such as GPT-4o, have rapidly evolved, demonstrating impressive capabilities across 1 Introduction Mechanistic interpretability aims to reverse engineer neural networks into human-understandable components. If a simple linear relationship predicts complexity, that's Abstract Mechanistic interpretability aims to understand the computational mechanisms underlying neural networks’ capabilities in order to accomplish concrete scientific and engineering goals. Overall, mechanistic interpretability provides a framework for understanding the inner workings of LLMs. ProbeGen optimizes a deep generator module limited to linear expressivity, that shares information between the different Mechanistic interpretability seeks to uncover the internal workings of neural networks, offering valuable insights into their decision-making processes, These findings suggest that trained linear probes could provide real-time speech detection metadata alongside transcriptions, enabling systems to identify and flag potentially hallucinated outputs during Experiment: Linear Probes To validate that the model creates edge embeddings, we train a linear probe to predict the associated edge given the activations x1 at the positions of target nodes. SAE features replace the opaque MLP with a To our knowledge, this is the first application of SAEs to V-JEPA for mechanistic interpretability in the video domain. 
It has commentary and many print statements One interesting question is whether the correlation between response uncertainty and probe perfor-mance is linear. 1 Mechanistic interpretability If we want to explain how AI systems work as a whole, we are essentially interested in their functional organisation or structure. While our experiments are limited to linear probes across three scenarios on Motivated by interpretability results [2, 14] showing that various LLM layers are mostly deactivated when the LLM is hallucinating, making the corresponding hidden states linearly predictable, we design a Advanced natural language processing is an introductory graduate-level course on natural language processing aimed at students who are interested in doing cutting-edge research in the field. To investigate this, we report the Pearson correlation coeficients on three temporal UMass CS685 S24 (Advanced NLP) #22: LLM interpretability: probing, editing, induction heads Mohit Iyyer 4. For example, simple probes have shown language models to contain information about simple syntactical features like This chapter establishes a novel framework for the study of geospatial mechanistic interpretability - using spatial analysis to reverse engineer how LLMs handle geographical I am currently a MATS 8. How did you decide Can you tell when an LLM is lying from the activations? Are simple methods good enough? We recently published a paper investigating if linear probes detect when Llama is deceptive. They can serve as linear probes and are useful for debugging. Mechanistic Interpretability Meets Vision Language Models: Insights and Limitations Vision language models (VLMs), such as GPT-4o, have rapidly evolved, demonstrating impressive Probe performance could reflect its own capabilities more than actual characteristics of the representation. 
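The worry that probe performance may say more about the probe than about the representation is often checked against a control baseline. The sketch below is a simplified shuffled-label variant of that idea, not the control-task construction of Hewitt & Liang (2019). The activations are synthetic, and every name in it (`fit_mean_probe`, `synthetic_split`, `selectivity`) is hypothetical. It uses a class-mean linear probe only to stay dependency-free.

```python
import random

def fit_mean_probe(xs, ys):
    """Linear probe from class means: w = mu_1 - mu_0, threshold at the midpoint."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    mu1 = [sum(col) / len(pos) for col in zip(*pos)]
    mu0 = [sum(col) / len(neg) for col in zip(*neg)]
    w = [a - b for a, b in zip(mu1, mu0)]
    bias = -sum(wi * (a + b) / 2 for wi, a, b in zip(w, mu1, mu0))
    return w, bias

def accuracy(w, bias, xs, ys):
    """Fraction of examples where sign(w.x + bias) matches the label."""
    hits = 0
    for x, y in zip(xs, ys):
        score = sum(wi * xi for wi, xi in zip(w, x)) + bias
        hits += int((score > 0) == (y == 1))
    return hits / len(ys)

def synthetic_split(n, dim, rng):
    """Assumed toy data: the label shifts coordinate 0; all other coordinates are noise."""
    xs, ys = [], []
    for i in range(n):
        y = i % 2  # balanced labels
        shift = 1.5 if y == 1 else -1.5
        xs.append([rng.gauss(0.0, 1.0) + (shift if j == 0 else 0.0) for j in range(dim)])
        ys.append(y)
    return xs, ys

def selectivity(seed=0, n=400, dim=16):
    """Return (real accuracy, control accuracy, gap) on held-out data."""
    rng = random.Random(seed)
    train_x, train_y = synthetic_split(n, dim, rng)
    test_x, test_y = synthetic_split(n, dim, rng)
    real = accuracy(*fit_mean_probe(train_x, train_y), test_x, test_y)
    # Control: same activations, labels shuffled so they carry no information.
    ctrl_train, ctrl_test = train_y[:], test_y[:]
    rng.shuffle(ctrl_train)
    rng.shuffle(ctrl_test)
    control = accuracy(*fit_mean_probe(train_x, ctrl_train), test_x, ctrl_test)
    return real, control, real - control

if __name__ == "__main__":
    real, control, gap = selectivity()
    print(f"real={real:.3f} control={control:.3f} gap={gap:.3f}")
```

A large gap between the real and control accuracies suggests the probe is reading structure in the activations and not memorising labels. A small gap is a warning sign.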
Linear probing and non-linear probing are great ways to identify if certain properties are linearly separable in feature space, and they are good indicators that this information could be If a linear probe achieves high accuracy, the information is present and linearly accessible in the representations. To ensure robustness, we employ bootstrap Everything, Everywhere, All at Once: Is Mechanistic Interpretability Identifiable? Masked Diffusion Models are Secretly Time-Agnostic Masked Models and Exploit Inaccurate Categorical Sampling. While focusing on bottom-up, mechanistic interpretability approaches, we can also consider integrating top-down, concept-based structured probes with mechanistic interpretability. See "Mechanistic?" for historical and cultural perspectives. While computationally cheap and widely In this project, we extend the investigations presented by Kenneth Li et al. Fundamentally, transformers are made of linear algebra! Because the SAE basis is interpretable, we Abstract Mechanistic interpretability (MI) aims to explain how neural networks work by uncovering their underlying causal mechanisms. The probes work because of the linear representation hypothesis: if Comments "Looking Inside Neural Networks with Mechanistic Interpretability" by Chris Olah. We are not totally If there exist multidimensional features, most mechanistic interpretability work can adapt with little change. In this chapter, we establish a novel framework for How simple classifiers trained on model activations reveal what information is encoded in representations, from structural probes to MDL probing, and the fundamental gap between Academic and industry papers on LLM interpretability. 
As the field grows in influence, it is increasingly Neel Nanda gives an introduction to mechanistic interpretability, a field of science that tries to understand in detail how a trained neural network computes. It's becoming an actual engineering discipline, and the tooling is starting to catch up. Recently, mechanistic interpretability has at-tracted Our findings reveal that probes rely on textual evidence for behavior detection in the scenarios we studied. By exploring notions and techniques like superposition, monose-manticity, and sparse autoencoders, Outline of the DUNL pipeline, the assumptions behind it, sample use cases, and a simple illustrative example. I’m also a postdoc in psychology/neuroscience. The field of mechanistic interpretability is evolving rapidly. io/mltheoryseminar/Mechanistic interpretability: Neel Nanda (Google DeepMind), Bowen Baker (OpenAI), Ja Mechanistic interpretability is about understanding how artificial intelligence (AI) models, particularly large ones like neural networks, make their The success of this probe in a specific layer indi-cates that the cognitive signal is disentangled and readable by subsequent components of the network. Interpretability Illusions in the Generalization of Simplified Models – Shows how interpretability methods based How learned attention mechanisms inside probes solve the sequence aggregation problem, letting the probe decide which token positions matter for classification instead of relying on mean pooling or last While most of this review focuses on bottom-up, mechanistic approaches to interpretability, it is worth considering the potential for integrating top-down, concept-based techniques like structured probes. I also address possible objections to the arguments and proposals outlined here. To address these questions, we extract activation vectors from the residual stream of four state-of-the-art open-weights LLMs and train linear probes at each layer to classify Bloom levels. 
This ensures that the probe’s accuracy reflects Simpler explanations are generally preferred 42in the interpretability literature, and we propose that these sparse probe decompositions may be useful 43for constructing interpretable probes. What is probing ? fits a simple linear ridge regression model on the network activations to predict If you want to learn linear algebra, check out 3Blue1Brown or Linear Algebra Done Right - this is just a refresher of key concepts that are relevant to Computational Cost: Extracting activations and training probes, especially across many layers and concepts for large models, requires substantial computational resources. We can check the LLMs internal understanding of board state and ability to estimate Lecture 10 in AI Safety course https://boazbk. The linear probe learns these weights; diff-of-means cannot. A versatile and effective framework These probes can be designed with varying levels of complexity. Researchers have approached the problem of determining unit importance in neural Abstract page for arXiv paper 2505. My basic question is why you think about current mechanistic interpretability progress being a valid sign of life based on numbers like 50% of performance explained. In it, we Artificial intelligence Mechanistic interpretability New techniques are giving researchers a glimpse at the inner workings of AI models. It employs both This research project explores the interpretability of large language models (Llama-2-7B) through the implementation of two probing techniques -- Logit-Lens and Tuned-Lens. al. We argue for the Explanatory View Hypothesis: that Mechanistic Interpretability research is a Production probe deployment is one of the clearest cases where mechanistic interpretability techniques deliver direct practical value. This helps to understand Abstract Mechanistic Interpretability aims to understand neural networks through causal explanations. 
Circuits In the early days, it was hoped that neurons were monosemantic, and However, very little is still known about the internal functioning of these models, especially about how they process geographical information. In terms of applications, we find that the methods We believe that a mechanistic understanding of these models will lead to targeted interventions to make them reliable in the long run. 18575: Response Uncertainty and Probe Modeling: Two Sides of the Same Coin in LLM Interpretability? Chapter 1: Transformer Interpretability Dive deep into language model interpretability, from linear probes and SAEs to circuit analysis and toy models. Understanding AI systems' inner workings is critical for ensuring value alignment and safety. Discover Novel Mechanistic Interpretability Algorithms Existing mechanistic interpretability methods despite being promising have exhibited specific flaws. In particular, it is unclear what it means to be interpretable and However, probes produce conservative estimates that un-derperform on easier datasets but may bene-fit safety-critical deployments prioritizing low false-positive rates. 3. They reveal how semantic content evolves across Deep learning (DL) has been widely used in various fields. Interpretability provides a toolset for understanding how and why neural networks behave in certain ways. This helps us better understand the roles and dynamics of the intermediate layers. But the use of supervision leads to the question, did I interpret the Mechanistic Interpretability for NLP: One-stop Guide for Everything you Need to Know NLP programming labs 189 subscribers Subscribe Instead, by constraining the probe to be linear, the researchers force it to find the most straightforward, interpretable signals. 0 scholar studying mechanistic interpretability with Neel Nanda. However, translating This page documents the key tools and techniques used for mechanistic interpretability of the Othello-GPT model. 
99K subscribers Subscribe For mechanistic interpretability, this ultimately reduces to whether we can decompose activation space into independently understandable components, analogous to how computer These findings suggest that trained linear probes could provide real-time speech detection metadata alongside transcriptions, enabling systems to identify and flag potentially hallucinated outputs during Abstract Linear classiﬁer probes are frequently utilized to better understand how neural networks function. Mechanistic interpretability understands language models by investigating individual neurons and especially Most mechanistic interpretability work focuses on individual components: specific attention heads, individual neurons, or sparse autoencoder features. Overall, our work demon-strates Anthropic has made a significant investment in interpretability research since the company's founding, because we believe that understanding While there are exceptions involving non-linear or context-dependent features, this hypothesis remains a cornerstone for studying However, despite their seeming simplicity, linear probes can have complex geometric interpretations, leverage spurious correlations, and lack selectivity. In this chapter, we establish a novel framework for Mechanistic? [BlackBoxNLP workshop at EMNLP 2024] This paper explores the multiple definitions and uses of "mechanistic interpretability," tracing its evolution Real-World Uses of Interpretability: Model interpretability-based techniques are starting to have genuine uses in frontier language models! Linear awesome-mechanistic-interpretability-LM-papers This is a collection of awesome papers about Mechanistic Interpretability (MI) for Transformer-based Language Neel Nanda is a researcher at Google DeepMind working on mechanistic interpretability. 
While our experiments are limited to linear probes across three scenarios on two model A study demonstrates that large language models possess an internal "correctness signal" in their hidden activations, allowing a linear probe to predict th Probe Loss is named because of its relation to "Linear Probes". Utilizing linear probes to decode neuron activations across transformer layers, coupled with causal Mechanistic Interpretability Method is a systematic approach that reverses neural network operations into causal, human-understandable mechanisms to explain complex computations. That is, we seek to understand Abstract Mechanistic interpretability aims to understand the computational mechanisms underlying neural networks’ capabilities in order to accomplish concrete scientific and engineering goals. What, however, should these components be? Recent work has correct answers to factual questions. We argue for the Explanatory View Hypothesis: that Mechanistic Interpretability research is a Abstract How can we know whether new mechanistic interpretability methods achieve real improvements? In pursuit of meaningful and lasting evaluation standards, we propose MIB, a Researchers at the University of Bologna investigated how Large Language Models internally represent cognitive complexity, leveraging linear probing and Bl Logistic regression probes measure the linear encoding of features in neural network activations, aiding systematic feature localization and mechanistic interpretability. By dissecting the internal How probing techniques reveal that truth and falsehood have linear geometric structure inside language models, from unsupervised truth discovery (CCS) to optimization-free difference-in-means probes, To visualise probe outputs or better understand my work, check out probe_output_visualization. 
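The optimization-free difference-in-means probe mentioned above can be sketched in a few lines. This is a minimal illustration on made-up 2-D "activations", not any cited paper's implementation. The function names (`diff_of_means_direction`, `classify`, `steer`) are hypothetical.

```python
def mean_vector(rows):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def diff_of_means_direction(activations, labels):
    """Return (direction, midpoint): mu_pos - mu_neg and the point halfway between."""
    pos = [x for x, y in zip(activations, labels) if y == 1]
    neg = [x for x, y in zip(activations, labels) if y == 0]
    mu_pos, mu_neg = mean_vector(pos), mean_vector(neg)
    direction = [p - q for p, q in zip(mu_pos, mu_neg)]
    midpoint = [(p + q) / 2 for p, q in zip(mu_pos, mu_neg)]
    return direction, midpoint

def classify(x, direction, midpoint):
    """Detection: project (x - midpoint) onto the direction and threshold at zero."""
    score = sum(d * (xi - m) for d, xi, m in zip(direction, x, midpoint))
    return 1 if score > 0 else 0

def steer(x, direction, alpha):
    """Intervention: shift an activation along the direction by a factor alpha."""
    return [xi + alpha * d for xi, d in zip(x, direction)]

if __name__ == "__main__":
    # Toy data (assumed): class 1 sits at positive x, class 0 at negative x.
    acts = [[2, 0], [4, 0], [-2, 0], [-4, 0]]
    labels = [1, 1, 0, 0]
    direction, midpoint = diff_of_means_direction(acts, labels)
    print(direction, midpoint, classify([1, 5], direction, midpoint))
```

No weights are learned here: every coordinate contributes in proportion to how far the class means differ. The same direction serves both as a detector (`classify`) and as a steering vector (`steer`).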
Mechanistic interpretability (often abbreviated as mech interp, mechinterp, or MI) is a subfield of research within explainable artificial intelligence that aims to understand the internal workings of Mechanistic interpretability has evolved from isolated case studies on small networks to a rapidly maturing research programme that now probes billion-parameter models. However, the lack of interpretability has become a critical concern, This survey delves into the emerging field of mechanistic interpretability for LLMs, emphasizing the need to reverse-engineer these models to ensure ethical and reliable AI systems Significance The recent surge in interpretability research has led to confusion on numerous fronts. We present a method for GDM mechanistic interpretability team updates How to save 1/3 off TFL rail fares Coaching is good, actually Should you spend time making things more efficient? An opinionated guide to building a The study of Mechanistic Interpretability is exactly this – trying to unwrap the black box that surrounds Large Language Models. in their ICLR 2023 Paper Emergent World Representations: Exploring a Sequence Abstract We analyze a dataset of retinal images using linear probes: linear regression models trained on some “target” task, using embeddings from a deep con-volutional (CNN) model trained on some Baselines: Linear Probes We train simple linear residual stream probes on the (in-domain) training dataset we also use for finding the SAE features. My perhaps most notable paper analyzed the last 20 A Google TechTalk, presented by Neel Nanda, 2023/06/20 Google Algorithms Seminar - ABSTRACT: Mechanistic Interpretability is the study of reverse engineering the learned algorithms in a trained This work contributes to mechanistic interpretability by identifying a meaningful confidence direction within LLM activations, corroborating recent works with sparse auto-encoders. 
The field of mechanistic interpretability aims to better understand how neural networks work. They We also found that baseline logistic regression probes worked as well even on the interpretability case studies that we were initially most excited We find the most interesting interpretability application of SAE probes to be understanding datasets better. Covers circuit tracing, sparse autoencoders, attribution graphs, and Interpretability Illusions in the Generalization of Simplified Models – Shows how interpretability methods based on simplied models (e. 2. , 2020] Superposition [Elhage 3. The probe is To prevent architectural biases in the linear probes due to class imbalance, we performed a controlled downsampling of the aggregated data. While early case studies have demonstrated its feasibility, scaling these techniques to the most advanced foundation models The field of mechanistic interpretability is evolving rapidly. For companies deploying AI in critical Mechanistic interpretability aims to reverse engineer neural networks into human-understandable components. In terms of applications, we find that the methods Linear probes are simple, independently trained linear classifiers added to intermediate layers to gauge the linear separability of features. Sparse AutoEncoders Learn how linear classifier probes test what hidden layers encode in deep neural networks, how to train them, and how to interpret results Linear probes train on activations, which are linearly transformed into logits. The probe learns the mapping from model coordinates to human interpretable coordinates. Probe performance could reflect its own capabilities more than actual characteristics of the representation. 
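To make "activations linearly transformed into logits" concrete, here is a minimal sketch of a logistic-regression probe trained by plain gradient descent. It assumes synthetic activations with a planted feature direction. The helper names (`train_linear_probe`, `probe_accuracy`, `make_synthetic_activations`) and the hyperparameters are illustrative choices, not taken from any of the works quoted here. In practice the inputs would be residual-stream activations extracted from a model.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_linear_probe(activations, labels, lr=0.1, epochs=300):
    """Fit logit = w.x + b by batch gradient descent on the logistic loss."""
    dim, n = len(activations[0]), len(activations)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(activations, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def probe_accuracy(w, b, activations, labels):
    """Accuracy of thresholding the probe's logit at zero."""
    correct = 0
    for x, y in zip(activations, labels):
        logit = sum(wi * xi for wi, xi in zip(w, x)) + b
        correct += int((logit > 0) == (y == 1))
    return correct / len(labels)

def make_synthetic_activations(n, dim, seed):
    """Assumed toy data: the label shifts the first two coordinates by +/-2."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        sign = 1.0 if y == 1 else -1.0
        xs.append([rng.gauss(0.0, 1.0) + (2.0 * sign if i < 2 else 0.0) for i in range(dim)])
        ys.append(y)
    return xs, ys

if __name__ == "__main__":
    train_x, train_y = make_synthetic_activations(200, 8, seed=0)
    test_x, test_y = make_synthetic_activations(200, 8, seed=1)
    w, b = train_linear_probe(train_x, train_y)
    print("held-out accuracy:", probe_accuracy(w, b, test_x, test_y))
```

The learned weight vector `w` is the "mapping from model coordinates to human interpretable coordinates": its largest entries indicate which activation dimensions carry the labelled feature. Held-out accuracy, and not training accuracy, is the number to report.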
Reserve diff-of-means for intervention (steering), Measuring generalisation We measure generalisation by seeing how well probes trained on one dataset generalise to other out-of-distribution Mechanistic interpretability (MI) is an emerging sub-field of interpretability that seeks to understand a neural network model by reverse Motivated by interpretability results belrose2023eliciting ; lindsey2025biology showing that various LLM layers are mostly deactivated when the LLM is hallucinating, making the We believe that a mechanistic understanding of these models will lead to targeted interventions to make them reliable in the long run. LAT argues that the most important unit of Mechanistic interpretability is a suite of methods that reverse-engineer neural network computations by causally probing internal activations, weights, and circuits. (Even though I don’t particularly trust either that Designing and Interpreting Probes Probing turns supervised tasks into tools for interpreting representations. , the inscrutability of the mechanics of the models and how or why earned representations against a labelled set, commonly ImageNet (Russakovsky et al. The This is the topic of mechanistic interpretability research, and it can answer many exciting questions. We explore the powerful interfaces that arise when We propose Deep Linear Probe Generators (ProbeGen) for learning better probes. Delivered at the 2023 San Francisco Alignment Workshop. Therefore, it becomes crucial Mechanistic Interpretability in Action: Understanding Induction Heads and QK Circuits in Transformers Overview This repository contains two projects aimed We use linear classifiers, which we refer to as "probes", trained entirely independently of the model itself. This review explores mechanistic interpretability: reverse engineering the computational The black-box nature of Large Language Models necessitates novel evaluation frameworks that transcend surface-level performance metrics. 
(Even th Through quantitative analysis of probe performance and LLM response uncertainty across a series of tasks, we find a strong correlation: improved probe performance consistently corresponds This is why people often refer to LLMs as “black boxes”. It B. These tools allow researchers to analyze and understand the internal Mechanistic interpretability, as an approach to inner interpretability, aims to completely specify a neural network’s computation, potentially in a format as explicit as pseudocode (also called reverse Real-World Uses of Interpretability: Model interpretability-based techniques are starting to have genuine uses in frontier language models! Linear probes, one of the simplest possible In this work, we use linear probes to identify the subspaces responsible for storing previous token information in Llama-2-7b and Llama-3-8b. We use linear probes on four downstream tasks to extract interpretable features with the goal of enabling scientific discovery. Yet, for LLM generation Interpretability starter Inspiration Introductions to mechanistic interpretability See also the tools available on interpretability: Digestible research Concepts Features [Olah et al. We show that these subspaces are When quantifying the success of interventions via the probability of legal moves, linear probe edits achieved an 88% success rate, whereas SAE-based edits yielded only 41%. Linear probing achieves 71-83% accuracy detecting LLM truthfulness and is a foundational diagnostic tool for interpretability research. This study investigates the internal Mechanistic Interpretability for AI Safety — A Review A comprehensive review of mechanistic interpretability, an approach to reverse engineering neural networks into human-understandable Refusal and persona vectors Modern interpretability for chat models has used linear probes to find directions corresponding to safety-relevant behaviors. 
focuses on demonstrating the feasibility of SAEs on A micro-level mechanistic view of LLMs allows for a deeper understanding of their macro-level behaviour. Simon et al. As the field grows in influence, it is increasingly important to examine This set of exercises is built around linear probing, one of the most important tools in mechanistic interpretability for understanding what information language Learn how Mechanistic Interpretability and its focus on "features" and "circuits" might just be the key to decoding AI neural networks. Linear Probes exercises | solutions Function Vectors & Model Steering exercises | solutions Interpretability with SAEs exercises | solutions Activation Oracles A comprehensive research program leveraging mechanistic interpretability for alignment should address both understanding and actively mitigating misalignment while respecting value and cultural diversity. Remember: An LLM is a Specifically, we examine mechanistic interpretability, probing techniques, and representation engineering as tools to decipher how knowledge Linear probes are simple, independently trained linear classifiers added to intermediate layers to gauge the linear separability of features. 
Key Highlights: The Alignment Workshop is a series of events Abstract Linear probes and sparse autoencoders consis-tently recover meaningful structure from trans-former representations—yet why should such sim-ple methods succeed in deep, Mechanistic interpretability allows researchers to monitor how internal structures (like circuits) evolve during training, helping predict when and We evaluate Logit Lens, Tuned Lens, sparse autoencoders, and linear probes, for these metrics on GPT2-small, Gemma2-2b, and Llama2-7b, comparing them to simpler but uninterpretable Current approaches to neural network interpretability, including input attribution methods, probe-based analysis and activation visualization techniques, typically provide limited insights about Mechanistic interpretability, as an approach to inner interpretability, aims to completely specify a neural network’s computation, potentially in a format as explicit as pseudocode (also called reverse Mechanistic interpretability, as an approach to inner interpretability, aims to completely specify a neural network’s computation, potentially in a format as explicit as pseudocode (also called reverse This exercise set is built around linear probing, one of the most important tools in mechanistic interpretability for understanding what information language models represent internally. What, however, should these components be? Recent work has applied Sparse A few works leveraged internals to predict models’ ability to answer a question correctly, but no work has investigated directly training linear probes only relying on internals. Abstract Mechanistic interpretability (MI) aims to explain how neural networks work by un-covering their underlying causal mechanisms. It has been used as a measure of Interpretability in Gao et. Nanda's key claim is that this is Neel Nanda from DeepMind presenting 'Mechanistic Interpretability: A Whirlwind Tour' on July 21, 2024 at the Vienna Alignment Workshop. 
This is a massively updated version of a similar list I made To our knowledge, this is the first application of SAEs to V-JEPA for mechanistic interpretability in the video domain. g. The linear representation hypothesis offers a “resolution” to this problem. We We provide a mechanistic explanation for this correlation using feature attribution, demonstrating that increased response uncertainty leads to relevance signals distributed across a greater number of The Building Blocks of Interpretability Interpretability techniques are normally studied in isolation. However, its black-box nature limits people's understanding and trust in its decision-making process. (If anything, our work might get easier!) But if representations are not mathematically linear Mechanistic interpretability is a suite of methods that reverse-engineer neural network computations by causally probing internal activations, weights, and circuits. 3 Mechanistic Interpretability and Steering Mechanistic interpretability aims to decode internal representations using tools like the Logit Lens (nostalgebraist, 2020) and Linear Probes The meta-level point that makes me excited about this is that linear probes are really nice objects for interpretability. Mechanistic interpretability understands language models by investigating individual neurons and especially Abstract How can we know whether new mechanistic interpretability methods achieve real improvements? In pursuit of meaningful and lasting evaluation standards, we propose MIB, a M These trained models (Figure 1 a) exhibit proficiency in legal move execution. To prevent architectural biases in the linear probes due to class imbalance, we performed a controlled downsampling of the aggregated data. This post represents my personal hot takes, not the opinions of my team or employer. 
In this conversation, we discuss Neel's background, research methodolo Mechanistic interpretability (MI) is an emerging sub-field of interpretability that seeks to understand a neural network model by reverse-engineering its internal computations. ipynb. 4 Mechanistic Interpretability. The probe's simplicity is deliberate: a powerful nonlinear probe might learn the Learn about mechanistic interpretability, named an MIT 2026 Breakthrough Technology. This mechanistic perspective represents a paradigm shift in interpretability, which aims to Abstract Mechanistic Interpretability aims to understand neural networks through causal explanations. These 1. Concept probing and This is a talk I gave to my MATS scholars, with a stylised history of the field of mechanistic interpretability, as I see it (with a focus on the areas I've personally worked in, rather than In particular, we highlight how advances in mechanistic interpretability and deliberate control of the training process can drive progress in explainable AI. In the future, it would be interesting to use non We can also derive additional information: Linear probes and classifiers: We can build a system that classifies the recorded residual stream 2. Logit-based targets are better aligned with what linear probes can learn, often yielding higher R² scores. However, very little is still known about the internal functioning of these models, especially about how they process geographical information. And this recent L7H6 matters more than L2MLP. This ensures that the probe’s accuracy reflects Home Interpretability Fundamentals The Linear Representation Hypothesis The Linear Representation Hypothesis Why neural networks appear to represent concepts as linear directions in activation Our findings reveal that probes rely on textual evidence for behavior detection in the scenarios we studied. 
Our probes reach a Limitations Interpretability Illusion Interpretability is known to have illusion issues and linear probing is no exception. However, there is little unity in the field: most studies employ ad-hoc evaluations and do not Effective Uncertainty Quantification (UQ) represents a key aspect for reliable deployment of Large Language Models (LLMs) in automated decision-making and beyond. [3], where multiple classification datasets were created, most Recent advances in large language models (LLMs) have significantly enhanced their performance across a wide array of tasks. A core subgoal of mechanistic interpretability is to decompose a model into units - things we can reason about, things we can interpret that have It employs both Mechanistic Interpretability Mechanistic interpretability focusses on dissecting and understanding the complex network of neurons and their interconnections. linear probes etc) can While linear probes are simple and interpretable, they are unable to disentangle distributed features that combine in a non-linear way. The linear probe is implemented as a multiclass Those are the four main research schools of LLM mechanistic interpretability. Beyond these, there is also research on grokking: Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets, Progress measures for grokking 1: Mechanistic interpretability Author: Polina Tsvilodub One criticism often raised in context of LLMs is their blackbox nature, i. Mechanistic interpretability. Core idea: mechanistic interpretability is the study of reverse engineering neural networks; it tries to understand the exact algorithm implemented at each layer and the representations it produces Abstract Large Language Models (LLMs) have transformed natural language processing, yet their internal mechanisms remain largely opaque. 
The concept of Mechanistic Interpretability research has advanced considerably in uncovering the inner mechanisms of artificial intelligence (AI) systems and has become a crucial subfield within AI. Probing classifiers are one tool These detectors are simple linear probes trained using small, generic datasets that don’t include any special knowledge of the sleeper agent chess_llm_interpretability This evaluates LLMs trained on PGN format chess games through the use of linear probes. In interpretability studies, different formulations of linear probing (Alain and Bengio, 2017) are used to Mechanistic interpretability aims to reverse engineer and understand the inner workings of AI systems like neural networks. While early case studies have demonstrated its feasibility, scaling these techniques to the most advanced foundation models