Dylan Zhang — Ph.D. Student @ UIUC · Language-Model Post-Training

Research vision

I want language models that don't just recall what we know, but help discover what we don't.

My research is on the post-training of language models — the data and algorithms that turn a pretrained model into a capable, reliable reasoner and agent. I work from a data-centric view: much of what looks like a modeling problem is really a question of which experience a model learns from, and how.

That lens runs through my work. I've shown that instruction diversity, not sheer volume, drives generalization; that the best supervised fine-tuning prepares a model for reinforcement learning rather than merely optimizing for the SFT objective itself; and that self-improving agents can quietly corrupt their own memory as they accumulate experience. The connecting thread is understanding the mechanisms behind generalization well enough to engineer it.

Looking forward, I'm most excited about agents that move from recalling knowledge to discovering it: offline-to-online RL, incentivizing proactive reasoning for knowledge discovery, and foundation-model agents that can be dropped into a novel environment and learn it by experiment. The next frontier for post-training, I believe, is building models that extend the frontier of human knowledge — not just compress it.

Post-training data & algorithms

What data and objectives actually make models generalize — instruction diversity, SFT-for-RL, data selection & reweighting.

RL & reasoning

Offline-to-online reinforcement learning and incentivizing proactive, verifiable reasoning behaviors.

Self-improving agents

How agents learn from their own experience — and the failure modes when memory is continually rewritten.

AI for scientific discovery

Agents that experiment, form hypotheses, and recover mechanisms — a step toward AI scientists.

What's new

Jun 2026

New · Writing · New interactive write-up, CausaLab: Can LLM Agents Discover Causal Mechanisms by Experiment? Putting agents in a synthetic lab to see whether they can intervene, observe, and revise like scientists.

May 2026

Writing · New interactive write-up: Useful Memories Become Faulty When Continuously Updated by LLMs, on why agents that compress experience into text can end up worse off than with no memory at all.

Feb 2026

ICML 2026 · Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning was accepted to ICML 2026! 🎉

Sep 2025

NeurIPS 2025 · Spotlight · GRAPE was accepted to NeurIPS 2025 as a Spotlight! 🎉

May 2025

Started my Student Researcher journey with Google. 🚀

Interactive write-ups

Read my research, the fun way

Visual, scrollable companions to my papers — built to be read in about ten minutes.

Causal discovery

CausaLab: Can LLM Agents Discover Causal Mechanisms by Experiment?

Agents in a synthetic lab — intervening, observing, revising. They predict the right answer with the wrong mechanism, and stop experimenting too soon.

Read write-up →

Agent memory

Useful Memories Become Faulty When Continuously Updated by LLMs

Agents that compress experience into textual lessons can end up worse than the same model with no memory at all — even on problems they already solved.

Read write-up →

Interactive RL · Pilot

GridRule: Self-Proposed Subgoal RL in an ARC-AGI-3-Style Environment

A pilot study — can a 0.8B model learn to decompose multi-step problems by proposing its own subgoals? Compositional generalization, 1.8× baseline, replicated on two seeds.

Read write-up →

Selected works

Publications

2026

Preprint · Interactive write-up ↗

CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists

Junlin Yang*, Dylan Zhang*, Xiangchen Song, Qirun Dai, Xiao Liu, Yuen Chen, Aniket Vashishtha, Jing Shi, Chenhao Tan, Hao Peng (*equal contribution, project leads)

Preprint · Interactive write-up ↗

Useful Memories Become Faulty When Continuously Updated by LLMs

Dylan Zhang, et al.

ICML 2026

Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning

Dylan Zhang, Yufeng Xu, Haojin Wang, Qingzhi Chen, Hao Peng

2025

NeurIPS 2025 · Spotlight

GRAPE

Dylan Zhang, et al.

The Best Instruction-Tuning Data are Those That Fit

Dylan Zhang, Qirun Dai, Hao Peng

Improving Influence-based Instruction Tuning Data Selection for Balanced Learning of Diverse Capabilities

Qirun Dai, Dylan Zhang, Jiaqi W. Ma, Hao Peng

Entropy-Regularized Process Reward Model

Hanning Zhang*, Pengcheng Wang*, Shizhe Diao*, Yong Lin, Rui Pan, Dylan Zhang, Pavlo Molchanov, Tong Zhang

ScaleBiO: Scalable Bilevel Optimization for LLM Data Reweighting

Dylan Zhang*, Rui Pan*, Hanning Zhang*, Xingyuan Pan*, Minrui Xu, Jipeng Zhang, Renjie Pi, Xiaoyu Wang, Tong Zhang (*equal contribution)

2024

Only-IF: Revealing the Decisive Effect of Instruction Diversity on Generalization

Dylan Zhang, Justin Wang, Francois Charton

SciCode: A Research Coding Benchmark Curated by Scientists

Minyang Tian*, Luyu Gao*, Dylan Zhang, Xinan Chen, … (multi-institution collaboration)

2023

Making Large Language Models Better Reasoners with Step-Aware Verifier

Yifei Li, Zeqi Lin, Dylan Zhang, Qiang Fu, Bei Chen, Jian-Guang Lou, Weizhu Chen

2021

Pre-training Co-evolutionary Protein Representation via a Pairwise Masked Language Model

Liang He, Dylan Zhang, Lijun Wu, Huanhuan Xia, Fusong Ju, He Zhang, Siyuan Liu, …, Tie-Yan Liu

Experience

Where I've worked

Student Researcher

Google

May 2025 – Present

Mountain View, CA

Research Intern

Microsoft Research

May 2024 – Aug 2024

Redmond, WA

Research Intern

Microsoft Research

May 2023 – Aug 2023

Redmond, WA

Education

Ph.D. in Computer Science

University of Illinois Urbana-Champaign · Advisor: Prof. Hao Peng (ALTA)

2022 – Present

🌽 Champaign, IL