Discovery, Validation & Editing of LLM Mechanisms

Overview

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks, yet their internal mechanisms remain largely opaque, making it difficult to understand, predict, or control their behavior. As LLMs are increasingly deployed in high-stakes settings, this lack of transparency raises serious concerns about reliability and safety. Mechanistic interpretability (MI) has emerged as a promising approach to this challenge: it seeks to reverse-engineer the internal computations of LLMs into human-understandable mechanisms, where a mechanism is an approximate high-level algorithm that the LLM implements with a subset of its components (a circuit) to complete a given language task or exhibit a given behavior.

This tutorial provides a comprehensive and up-to-date overview of LLM mechanism discovery, validation, and editing. We begin by introducing foundational concepts — features, components, computational graphs, and circuits — along with key notation. We then examine mechanism discovery through four methodological families: causal mediation, attribution, sparse decomposition, and optimization-based approaches. Next, we turn to mechanism validation, covering methods for verifying proposed mechanisms and emerging standards for rigorous evaluation. Building on these foundations, we survey mechanistic editing techniques that leverage MI insights to modify behavior at varying granularity, from fine-grained representation-level steering to coarser circuit-level interventions. Lastly, we outline open challenges and future research directions, including scalability of interpretability methods, evaluation benchmarks for mechanistic circuits, and the integration of interpretability with training-time objectives.

Mechanistic Interpretability · Large Language Models · Circuits & Computational Graphs · Causal Mediation · Model Editing · Trustworthy AI

01

Discovery

Surface the circuits and features that implement a behavior — via causal mediation, attribution, sparse decomposition, and optimization.

02

Validation

Rigorously verify that a proposed mechanism is faithful, minimal, and complete — with axiomatic and benchmark-based standards.

03

Editing

Use mechanistic insight to correct knowledge, debug reasoning, unlearn, and mitigate bias at the right level of granularity.

Schedule

Tutorial Outline

A lecture-style tutorial of approximately 160 minutes across six parts, with a 10-minute break after each of Parts II and IV. Time allocations below are indicative.

I · 15 min
Introduction: Mechanistic Interpretability in LLMs
- Motivation & significance. Why internal reasoning discovery is critical for applying LLMs to high-stakes scenarios.
- Challenges. Why traditional model-explainability methods, which explain input–output behavior, fall short when the goal is to understand an LLM's internal logic.
- An overview of natural-language tasks studied in LLM mechanistic interpretability.
- An overview of applications that benefit from mechanistic interpretation, verification, and editing.
II · 10 min
Notions and Background
- The necessity of moving beyond input–output attribution toward internal component-level understanding.
- Fundamental notions. Computational graphs (nodes: attention heads, MLP layers, residual streams; edges: information flow); circuits as functional compositions of components; key research objects — polysemanticity, superposition, and distributed functionalities.
- Causal foundations. Interventions, causal effects, and causal mediation analysis.
☕ Break (10 min)
III · 40 min
Mechanism Discovery
- Causal mediation. Measuring component importance via targeted interventions.
- Attribution methods. Estimating component importance using approximations.
- Sparse decomposition. Disentangling activations into interpretable features.
- Optimization-based. Finding mechanisms through optimization.
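As a rough illustration of the causal-mediation bullet above, the following toy sketch shows one common form of targeted intervention, often called activation patching. It is not taken from the tutorial materials: the two-layer model, its weights (`W_IN`, `W_OUT`), and the function names are all hypothetical and chosen only so the arithmetic is easy to follow. Each hidden unit's activation from a "clean" run is copied into a "corrupted" run, and the resulting change in output is read as that unit's importance.

```python
def relu(z):
    return max(0.0, z)

# Hypothetical toy model: 3 hidden units reading a 2-d input (assumed weights).
W_IN = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_OUT = [2.0, -1.0, 0.5]

def hidden(x):
    """Hidden activations for input x."""
    return [relu(sum(w * xi for w, xi in zip(row, x))) for row in W_IN]

def output(h):
    """Scalar output computed from hidden activations h."""
    return sum(v * hi for v, hi in zip(W_OUT, h))

def patch_effects(clean_x, corrupt_x):
    """For each hidden unit, patch its clean activation into the corrupted
    run and return the resulting change in output."""
    h_clean, h_corrupt = hidden(clean_x), hidden(corrupt_x)
    base = output(h_corrupt)
    effects = []
    for i in range(len(h_corrupt)):
        patched = list(h_corrupt)
        patched[i] = h_clean[i]          # the intervention on unit i
        effects.append(output(patched) - base)
    return effects

effects = patch_effects(clean_x=[1.0, 0.0], corrupt_x=[0.0, 1.0])
print(effects)  # → [2.0, 1.0, 0.0]
```

In this toy case the third unit has the same activation on both inputs, so patching it has no effect. In a real LLM the same idea is applied to attention heads, MLP layers, or residual-stream positions, and the metric is usually a logit difference or a probability rather than a raw scalar.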
IV · 40 min
Mechanism Validation
- Formalizing MI. Axiomatic approaches and circuit discovery with self-validated guarantees.
- Validation metrics. Faithfulness, minimality, and completeness of mechanistic interpretations.
- Validation tools. Causal-inference-based techniques and MI benchmarks.
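To make the validation-metrics bullet above slightly more concrete, here is a minimal sketch of how faithfulness and minimality might be scored for a candidate circuit. Definitions of these metrics vary across the literature, and the ones below are only one simple working version. The component names, their contribution values, and the additive toy "model" are all assumptions made for illustration. Completeness is omitted because it needs a comparison against the full model under many ablations.

```python
# Hypothetical per-component contributions to a task metric (assumed numbers).
# The toy "model" output is their sum, and ablating a component zeroes it.
CONTRIB = {"head_a": 2.0, "head_b": 1.0, "mlp_c": 0.0, "head_d": 0.25}

def run(kept):
    """Toy model output when only the components in `kept` are active."""
    return sum(v for name, v in CONTRIB.items() if name in kept)

def faithfulness(circuit):
    """Fraction of the full model's output recovered by the circuit alone."""
    full = run(set(CONTRIB))
    return run(circuit) / full

def is_minimal(circuit, tol=1e-9):
    """True if removing any single component changes the circuit's output."""
    base = run(circuit)
    return all(abs(base - run(circuit - {c})) > tol for c in circuit)

circuit = {"head_a", "head_b"}
print(round(faithfulness(circuit), 3))   # → 0.923
print(is_minimal(circuit))               # → True
print(is_minimal(circuit | {"mlp_c"}))   # → False
```

Here the two-head circuit recovers about 92% of the full output, and adding the inert `mlp_c` component breaks minimality because removing it again changes nothing.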
☕ Break (10 min)
V · 40 min
Mechanism Editing
- Knowledge & reasoning editing. Principled factual correction and reasoning debugging.
- Model unlearning. Mechanistic unlearning for privacy and representation-level steering.
- Safety & alignment. Bias mitigation.
VI · 15 min
Challenges and Future Directions
- Summary of presented discovery frameworks, validation methodologies, and editing techniques.
- Current challenges in MI, including limitations and gaps in existing approaches.
- Future directions for advancing MI and improving its practical usefulness and scalability.

Who should attend

Target Audience & Prerequisites

The target audience includes researchers, industry practitioners, and students interested in data mining, machine learning, natural language processing, and trustworthy AI. Participants with prior experience in large language models will gain the most from the technical discussions, but the tutorial is pitched at an advanced undergraduate / graduate level so that it remains accessible to the broad KDD community. Both academic and industry attendees will be able to follow the key concepts and practical methodologies.

All tutorial slides and supporting resources will be made publicly available after the conference.

Tutors

Presenters

Yinhan He

4th-year Ph.D. candidate, ECE · University of Virginia

Research spanning large language models, agentic AI, interpretable and explainable AI, and graph machine learning, with a particular emphasis on mechanistic interpretability of foundation models. He has studied LLM mechanism discovery, validation, and editing for more than two years; his work on global-level mechanistic interpretability was accepted to ICML 2025, and a collaboration on reasoning editing was accepted to ICLR 2026.

Wendy Zheng

1st-year Ph.D. student, CS · University of Virginia

Research focused on mechanistic interpretability of large language models, aiming to uncover and rigorously evaluate the internal mechanisms behind model capabilities, with a broad interest in scalable interpretability methods that remain effective as models grow. Co-authored works at top venues, including one at NeurIPS on implicit reasoning in LLMs and one at ICML investigating the existence of modular circuits.

Tianyi Zhao

1st-year Ph.D. student, CS · University of Virginia

Broadly interested in the rigorous development of trustworthy AI, with a current focus on leveraging mechanistic interpretability to demystify, evaluate, and refine the internal mechanisms of LLMs. Through this lens she studies fundamental challenges such as hallucination and miscalibration while advancing model-editing techniques. Multiple works in mechanistic interpretability are in submission.

Chen Chen

Assistant Professor, CS · University of Virginia

Previously a research assistant professor at the Biocomplexity Institute (UVA) and a software engineer at Google; Ph.D. from Arizona State University (2019). Her research on the connectivity of complex networks has been applied to healthcare, bioinformatics, recommendation, and critical infrastructure. Work appears in top venues (NeurIPS, ICML, ICLR, KDD, AAAI, IJCAI, SIGIR, WSDM, ICDM, SDM) and journals (PNAS, IEEE TKDE, ACM CSUR, ACM TKDD, KAIS, SIAM SAM), with honors including "Bests of KDD," "Bests of SDM," and Rising Star in EECS.

Jundong Li (corresponding tutor)

Associate Professor, ECE & CS · University of Virginia

Research spanning data mining, machine learning, and artificial intelligence, with a particular emphasis on graph machine learning, trustworthy and safe machine learning, and large language models. He has published more than 200 papers in high-impact venues, with over 20,000 citations. Honors include four early-career awards — the ICDM Tao Li Award (2025), the SIGKDD Rising Star Award (2024), the PAKDD Early Career Research Award (2023), and the NSF CAREER Award (2022) — as well as the PAKDD Best Paper Award (2024) and the SIGKDD Best Research Paper Award (2022), and multiple industry faculty research awards.

Resources

Materials

📄 Tutorial Proposal: PDF · accepted at KDD 2026

🖥️ Slides: Available after the conference

📚 Reading List: See tutorial proposal

🎥 Recording: If made available by KDD

All slides and supporting resources will be released publicly after KDD 2026. Check back here for links.

How to cite

Citation

If you reference this tutorial, please cite the ACM proceedings entry.

@inproceedings{he2026discovery,
  title     = {Discovery, Validation and Editing of Large Language Model
               Mechanisms: Recent Advances and Future Perspectives},
  author    = {He, Yinhan and Zheng, Wendy and Zhao, Tianyi and
               Chen, Chen and Li, Jundong},
  booktitle = {Proceedings of the 32nd ACM SIGKDD Conference on Knowledge
               Discovery and Data Mining V.2 (KDD 2026)},
  year      = {2026},
  address   = {Jeju Island, Republic of Korea},
  publisher = {ACM},
  doi       = {10.1145/3770855.3816458}
}
