Selected Publications
|
A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring
Usman Anwar*, Julianna Piskorz*, David D. Baek, David Africa, Jim Weatherall, Max Tegmark, Christian Schroeder de Witt, Mihaela van der Schaar, David Krueger
Under review at ICML, 2026
arxiv /
Current approaches to detecting hidden communication in AI systems rely on ad-hoc methods such as inspecting messages for anomalies. We introduce a new framework that instead detects steganography through its behavioral effects: measuring whether a signal helps intended recipients more than outside monitors on real tasks.
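A minimal sketch of the behavioral idea described above, with all function names and numbers invented for illustration (this is not the paper's formalism): a covert channel shows up as an asymmetry where a signal boosts the intended recipient's task performance much more than the monitor's.

```python
# Hypothetical illustration: detect a covert channel by its behavioral
# effect rather than by inspecting message content. Inputs are task
# success rates measured with and without access to the signal.

def advantage(success_with_signal: float, success_without_signal: float) -> float:
    """How much the signal improves an agent's task success."""
    return success_with_signal - success_without_signal

def steganography_score(recipient_with: float, recipient_without: float,
                        monitor_with: float, monitor_without: float) -> float:
    """Positive when the signal helps the intended recipient more than
    the monitor -- evidence of hidden communication."""
    return (advantage(recipient_with, recipient_without)
            - advantage(monitor_with, monitor_without))
```

For example, a recipient jumping from 50% to 90% success while the monitor only moves from 50% to 55% yields a large positive score, flagging a likely covert signal.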
|
|
Analyzing and Improving Chain-of-Thought Monitorability Through Information Theory
Usman Anwar*, Tim Baker*, Dana Kianfar, Cristina Pinneri, Christos Louizos
Under review at ICML, 2026
arxiv /
tweetprint /
We propose a simple training objective based on mutual information that prevents CoT obfuscation and maintains CoT monitorability when models are optimized against monitors. Through our theoretical analysis, we also characterize two possible failure modes for practical monitors: information gap, where the monitor cannot interpret the model’s reasoning, and elicitation error, where the monitor fails to correctly evaluate outputs for the target attribute.
|
|
Interpreting Emergent Planning in Model-Free Reinforcement Learning
Usman Anwar*, Thomas Bush*, Stephen Chung, Adria Garriga-Alonso, David Krueger
International Conference on Learning Representations (ICLR 2025), Oral, 2025
arxiv /
We bring concept-based interpretability tools to a model-free reinforcement learning agent (DRC) and uncover that it learns internal plans reminiscent of bidirectional search. This planning circuitry both forecasts the downstream environment state and causally controls action selection, revealing how model-free policies can still leverage structured reasoning when given additional test-time compute.
|
|
Understanding In-Context Learning of Linear Models in Transformers Through an Adversarial Lens
Usman Anwar, Johannes Von Oswald, Louis Kirsch, David Krueger, Spencer Frei
Transactions on Machine Learning Research (Featured Certification, 2025), 2024
arxiv /
We study how in-context learners built from transformers behave when faced with adversarial hijacking attacks. We find both linear and GPT-2 style transformers are surprisingly brittle, but adversarial training during pretraining or finetuning substantially improves robustness and generalizes to stronger attacks. By comparing across architectures and to classical algorithms for fitting linear models, we show that the learned in-context learning procedures differ qualitatively, as evidenced by poor attack transfer even between large models trained with identical recipes.
|
|
Foundational Challenges in Assuring Alignment and Safety of Large Language Models
Usman Anwar and 41 other authors
Transactions on Machine Learning Research (Survey Certification), 2024
arxiv /
tweetprint /
This 150+ page agenda identifies 18 foundational challenges in assuring the alignment and safety of large language models (LLMs). These challenges are organized into three categories: scientific understanding of LLMs, development and deployment methods, and sociotechnical challenges. Based on the identified challenges, we pose 200+ concrete research questions.
|
|
Reward Model Ensembles Help Mitigate Overoptimization
Thomas Coste, Usman Anwar, Robert Kirk, David Krueger
International Conference on Learning Representations, 2024
arxiv /
code /
We show that using an ensemble of reward models is effective in mitigating reward overoptimization.
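A toy sketch of conservative ensemble aggregation in the spirit of the paper, with all names and the linear reward models invented for illustration (in practice the ensemble members would be independently trained reward networks): by penalizing disagreement or taking the worst case, the policy cannot exploit idiosyncratic errors of any single reward model.

```python
import numpy as np

# Hypothetical stand-ins for trained reward models: random linear scorers.
def make_reward_model(seed: int):
    w = np.random.default_rng(seed).normal(size=4)
    return lambda features: float(features @ w)

ensemble = [make_reward_model(s) for s in range(5)]

def conservative_reward(features, beta: float = 1.0) -> float:
    """Mean-minus-std aggregation: reward is discounted where the
    ensemble members disagree, discouraging overoptimization."""
    scores = np.array([rm(features) for rm in ensemble])
    return float(scores.mean() - beta * scores.std())

def worst_case_reward(features) -> float:
    """Worst-case aggregation: optimize against the most pessimistic member."""
    return min(rm(features) for rm in ensemble)
```

Both aggregates are upper-bounded by the ensemble mean, so a policy optimized against them is rewarded only where the models agree.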
|
|
Bayesian Methods for Constraint Inference in Reinforcement Learning
Dimitris Papadimitriou, Usman Anwar, Daniel Brown
Transactions on Machine Learning Research, 2022
paper /
poster /
We develop a Bayesian approach to constraint learning that offers several advantages: it works with partial trajectories, applies in both stochastic and deterministic environments, and, by yielding a posterior distribution, enables active learning for accurate constraint inference.
|
|
Inverse Constrained Reinforcement Learning
Usman Anwar*, Shehryar Malik*, Alireza Aghasi, Ali Ahmed
International Conference on Machine Learning, 2021
arxiv /
video /
code /
poster /
slides /
We propose a framework for learning Markovian constraints from user demonstrations in high dimensional, continuous settings. We empirically show that constraints thus learned are general and transfer well to agents with different dynamics and morphologies.
|
|