CausaLab: Interactive Causal Discovery Toward AI Scientists

TL;DR

CausaLab evaluates two things at once: did the agent solve the task, and is its answer grounded in a faithful recovered causal mechanism? Each episode hides a freshly sampled structural causal model (SCM), so you can't win by reciting memorized causal facts.
Prediction ≠ understanding. On observational 6-node graphs, GPT-5.2-high reaches 92% task accuracy but only 0.47 all-edge F₁. The right number, the wrong graph.
How you experiment is the whole game. Observation narrows the hypothesis space; agent-chosen interventions recover faithful structure. Handing the agent someone else's perfect intervention data ("Golden") boosts the answer but not the mechanism — the act of choosing the experiment is what carries the structural signal.
Agents fail by stopping early. Win or lose, runs leave ~half their intervention budget unused, and failed runs commit to hypotheses that don't even fit their own data. A single "check your theory against your evidence" step lifts 4-node accuracy 48% → 60%.

Why interaction

Two ways an LLM can be "useful" about cause and effect.

Suppose you want to know how a crystal's temperature drives its resonance frequency. An LLM agent can help in two very different ways.

① Retrieve

Recall from Wikipedia or its training data that temperature causes frequency. Fast, often right — and useless the moment the answer lies beyond the current frontier of human knowledge.

good for: settled science

② Discover

Observe measurements, form a hypothesis, design an experiment, intervene, watch what changes, and infer the mechanism from the evidence — the way a scientist works.

good for: the unknown

Both matter. But only ② can push the frontier outward, and only ② is what we mean when we talk about AI scientists. The trouble is that almost every causal-reasoning benchmark tests ① — turning a known causal graph into a quiz. That leaves the "causal parrot" worry wide open [ref]: a model can ace the test by parroting causal facts it has read, never having reasoned causally at all.

CausaLab is built to test ② — and to make it impossible to fake with ①.

The environment

A synthetic laboratory with a hidden law of nature.

Every episode is a tiny, self-contained science problem. A hidden structural causal model — a causal graph plus its structural equations and coefficients — is sampled fresh, and the agent never sees it. It only gets to do experiments and reason from what comes back.
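To make "sampled fresh" concrete, here is a minimal sketch of drawing a random linear SCM and generating prior records from it. It is an illustration under stated assumptions, not CausaLab's generator: the function names, the edge probability, the coefficient range, and the Gaussian noise level are all hypothetical.

```python
import random

def sample_linear_scm(names, edge_prob=0.5, seed=None):
    """Sample a random linear SCM; `names` doubles as the causal ordering."""
    rng = random.Random(seed)
    parents = {v: [] for v in names}
    weights = {}
    for j, child in enumerate(names):
        for parent in names[:j]:  # edges only run earlier -> later, so the graph is a DAG
            if rng.random() < edge_prob:
                parents[child].append(parent)
                # Assumed coefficient range: magnitude in [0.5, 2.0], random sign.
                weights[(parent, child)] = rng.choice([-1, 1]) * rng.uniform(0.5, 2.0)
    bias = {v: rng.uniform(-1.0, 1.0) for v in names}
    return {"order": list(names), "parents": parents, "weights": weights, "bias": bias}

def sample_record(scm, rng, noise=0.1):
    """Generate one observational record by evaluating variables in causal order."""
    values = {}
    for v in scm["order"]:
        total = scm["bias"][v] + sum(
            scm["weights"][(p, v)] * values[p] for p in scm["parents"][v]
        )
        values[v] = total + rng.gauss(0.0, noise)
    return values

names = ["radiation", "temperature", "conductivity", "frequency"]
scm = sample_linear_scm(names, seed=0)  # hidden from the agent
records = [sample_record(scm, random.Random(i)) for i in range(5)]  # what the agent sees
```

Because the graph and coefficients are redrawn per episode, nothing the agent has read about real crystals helps; only the records and its own experiments do.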

Figure: **One CausaLab episode.** A hidden SCM generates prior records, a *manipulator* crystal the agent can poke, and a held-out *reactor* crystal governed by the same law. The agent intervenes, observes, emits a DSL hypothesis at each step, and finally predicts the reactor's hidden `frequency`. We score the prediction *and* the recovered graph + equation against ground truth.

The cover story is deliberately alien — "Quantum Crystals on Planet X," with properties like radiation, temperature, conductivity, and a target frequency — precisely so the agent cannot lean on real-world priors. To win, it has to figure out the wiring from scratch.

The experimental loop

Each episode is a repeated hypothesize → experiment → observe → revise cycle — the same loop that defines empirical science.

1. Read the records. Start from a batch of prior measurements: properties and the resulting frequency from earlier crystals under the same law.

2. Intervene. Set one controllable property on the manipulator crystal through the Property Manipulator. Budget is finite.

3. Observe & revise. See how the other properties and frequency respond. Update the causal hypothesis: direct edge, or path through a mediator?

4. Transfer. Apply the inferred mechanism to a different crystal, the reactor, and predict its hidden frequency.

The design trick

The reactor crystal has different property values from anything the agent saw. So you can't copy an observed frequency — you must recover a mechanism that transfers. And because interventions are shift-style (they nudge a variable's baseline while keeping its upstream causes intact), watching the downstream ripple is genuinely informative about who-causes-whom.
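A shift-style intervention can be sketched on a toy, noise-free linear chain. The graph, coefficients, and the `simulate` helper below are hypothetical, chosen only to show that a shift adds to a variable's own equation while its parents keep contributing, so the effect ripples strictly downstream.

```python
# Toy linear SCM (assumed for illustration): radiation -> temperature -> conductivity -> frequency
ORDER = ["radiation", "temperature", "conductivity", "frequency"]
PARENTS = {
    "radiation": {},
    "temperature": {"radiation": 2.0},
    "conductivity": {"temperature": -0.5},
    "frequency": {"conductivity": 3.0},
}
BIAS = {"radiation": 1.0, "temperature": 0.0, "conductivity": 4.0, "frequency": 100.0}

def simulate(shift=None):
    """Evaluate the SCM in causal order; `shift` adds an offset to chosen variables."""
    shift = shift or {}
    values = {}
    for v in ORDER:
        from_parents = sum(w * values[p] for p, w in PARENTS[v].items())
        # Shift-style: the parents' contribution is kept, the offset is added on top.
        values[v] = BIAS[v] + from_parents + shift.get(v, 0.0)
    return values

baseline = simulate()
shifted = simulate({"temperature": 10.0})
effect = {v: shifted[v] - baseline[v] for v in ORDER}
# Upstream radiation is unchanged; conductivity and frequency move with temperature.
```

In this toy case the shift on temperature leaves radiation untouched and moves conductivity and frequency, which is exactly the asymmetry that tells the agent which way the arrows point.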

Making thought auditable

A tiny language that turns reasoning into a scored object.

A final number can't tell a lucky guess apart from genuine discovery. So at every step the agent emits a compact DSL record with five fields. Only one of them — the Hypothesis — is parsed into a graph + equation + coefficients and scored against the hidden SCM. That gives us a frame-by-frame movie of the theory the agent is committing to.

DSL record · one step parsed into a candidate SCM

memory: cooling lowers freq when cond is high
thought: isolate qSize from tempC: shift radiation, watch both
data: {…accumulated observations & intervention outcomes…}
hypothesis: rad → tempC, tempC → cond, cond → freq · freq = b + k·cond
experiment: do_shift(tempC = 360); does freq move via cond?

Only the hypothesis field is a scored artifact. A deterministic parser turns it into a directed graph and a frequency equation, so the benchmark can grade the agent's evolving theory — not just its last guess. The other four fields are scratch space.
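As a rough sketch of what such a deterministic parser could look like, the snippet below turns a hypothesis string in the style of the example record into an edge set and the terms of the frequency equation. The grammar is an assumption inferred from that single example (arrows written as `→` or `->`, products written with `·` or `*`); the benchmark's actual DSL syntax and parser may differ, and `parse_hypothesis` is a hypothetical name.

```python
import re

NAME = r"[A-Za-z_][A-Za-z0-9_]*"
EDGE_RE = re.compile(rf"({NAME})\s*(?:→|->)\s*({NAME})")
EQUATION_RE = re.compile(rf"({NAME})\s*=\s*(.+)")
TERM_RE = re.compile(rf"([^\s+·*]+)\s*[·*]\s*({NAME})")

def parse_hypothesis(text):
    """Parse directed edges plus the target equation's coefficient-variable terms."""
    edges = set(EDGE_RE.findall(text))
    target, terms = None, {}
    match = EQUATION_RE.search(text)
    if match:
        target, rhs = match.group(1), match.group(2)
        # Each "coef·var" product becomes {var: coef}; bare constants are skipped.
        terms = {var: coef for coef, var in TERM_RE.findall(rhs)}
    return {"edges": edges, "target": target, "terms": terms}

parsed = parse_hypothesis("rad → tempC tempC → cond cond → freq · freq = b + k·cond")

# Sanity check: every variable in the equation should be a graph parent of the target.
graph_parents = {a for a, b in parsed["edges"] if b == parsed["target"]}
equation_consistent = set(parsed["terms"]) <= graph_parents
```

Keeping the parser deterministic means two identical hypothesis strings always receive identical scores, so differences in the metrics reflect the agent's theory rather than the grader.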

Rendered over a whole trajectory, the DSL becomes a live diagnostic: ground-truth graph on one side, the agent's hypothesis graph on the other, edge precision/recall climbing (or not) as the experiments accumulate.

Figure: **Watching a theory form.** The platform exposes the ground-truth graph, the agent's current hypothesis graph (here, 3 of 5 edges correct), and edge precision/recall curves over the interaction sequence. Because every step is logged, we can ask not just *whether* the agent was right but *when* and *why* it committed.
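The graph metrics referred to here can be sketched in a few lines. The definitions below are the common textbook ones and are an assumption about how the benchmark scores: edge precision, recall, and F₁ over directed edges, and a directed SHD that counts missing, extra, and reversed edges, with a reversal counted once. The example graphs are made up.

```python
def edge_prf(true_edges, pred_edges):
    """Precision, recall, and F1 over directed edges given as (parent, child) pairs."""
    tp = len(true_edges & pred_edges)
    precision = tp / len(pred_edges) if pred_edges else 0.0
    recall = tp / len(true_edges) if true_edges else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def directed_shd(true_edges, pred_edges):
    """Structural Hamming distance: missing + extra edges, a reversal counted once."""
    missing = true_edges - pred_edges
    extra = pred_edges - true_edges
    reversals = {(a, b) for (a, b) in missing if (b, a) in extra}
    return len(missing) + len(extra) - len(reversals)

truth = {("rad", "temp"), ("temp", "cond"), ("cond", "freq"),
         ("rad", "freq"), ("size", "freq")}
# Hypothetical agent graph: 3 of the 5 true edges, plus one reversed edge.
guess = {("rad", "temp"), ("temp", "cond"), ("cond", "freq"), ("freq", "rad")}

precision, recall, f1 = edge_prf(truth, guess)
shd = directed_shd(truth, guess)
```

Computing these after every DSL record, rather than only at the end, is what produces the precision/recall curves over the interaction sequence.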

Finding 1

The right answer, the wrong mechanism.

Because CausaLab scores prediction and mechanism separately, we can catch the thing single-number benchmarks can't: an agent that nails the held-out frequency while having recovered a graph that's simply wrong.

92% task accuracy (GPT-5.2-high, obs-only, 6 nodes)
0.47 all-edge F₁ (same runs)
64% accuracy at 7 nodes (even for the strongest model)
4.76 directed SHD at 7 nodes (graph recovery degrades)

The two axes don't move together: they come apart in three distinct, controllable ways.

Hold the 4-node topology fixed; swap the linear mechanism for a hard-quadratic one. GPT-5-mini's task accuracy roughly halves (≈46% → 24%) and the quantitative frequency-weight F₁ collapses with it — yet all-edge F₁ and root-node F₁ hold steady or even rise.

The agent still finds the right parents. It just can't pin the quantitative law that connects them. It loses the mechanism, not the graph.

Figure: **Linear → hard-quadratic, same topology.** On matched 4-node topologies, task accuracy and `frequency`-weight F₁ drop sharply (Δ −22), while all-edge and root-node F₁ hold steady or rise. Agents lose the quantitative mechanism, not the qualitative graph.
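A toy calculation shows why the right parent set is not enough for transfer. The quadratic law, the data points, and the reactor value below are invented for illustration and are not the benchmark's hard-quadratic mechanism: a linear fit picks the correct parent (`cond`) and tracks the seen range reasonably well, yet misses badly once the reactor's value lies outside it.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def true_law(cond):
    """Hypothetical hidden mechanism: frequency is quadratic in conductivity."""
    return 2.0 + 0.5 * cond ** 2

cond_seen = [1.0, 2.0, 3.0, 4.0]
freq_seen = [true_law(c) for c in cond_seen]
intercept, slope = fit_line(cond_seen, freq_seen)  # right parent, wrong functional form

reactor_cond = 8.0  # outside the range the agent observed
predicted = intercept + slope * reactor_cond
actual = true_law(reactor_cond)
transfer_error = abs(predicted - actual)
```

The graph-level score for this agent would be perfect (one edge, correctly identified), while the transfer prediction is far off, which is the same split the matched-topology comparison reports.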

Prediction accuracy is necessary but not sufficient evidence of mechanism recovery. A causal-reasoning benchmark that scores only the answer cannot tell discovery from a well-tuned guess.

Finding 2 · the heart of it

How you experiment determines what you learn.

This is the part I find most interesting. Causal discovery is not a passive reading task — it's a sequence of choices about what to do next. CausaLab lets us vary the interaction regime while holding the underlying law fixed, and the difference is stark.

High accuracy, hollow understanding. Pure observation often gives the best end-task accuracy on easy graphs — correlations are enough to extrapolate a number. But the recovered graph is weak: 92% accuracy / 0.47 F₁ for GPT-5.2-high on 6 nodes. It guesses well without knowing why.

Figure: **Obs-only → Mixed, across four suites.** In this prediction-vs-recovery scatter, each arrow goes from observation-only to mixed for one model and graph size. Mixed regimes consistently climb toward higher all-edge F₁ at comparable or better task accuracy. Intervention is what buys structure.

But is it the data, or the act of choosing?

Here's the sharpest test. We can hand the agent a bounded, high-quality intervention chain — a "Golden" trace of exactly the right experiments — instead of letting it choose its own. If the structural signal lived in the intervention data, this should recover the graph beautifully.

It doesn't. Golden traces lift task accuracy hugely (48% → 90% on 4 nodes, 24% → 44% on 6 nodes) while lowering all-edge F₁. The handed-over experiments behave like stronger observations: they help fit the target equation, but they do not replace the structural signal that comes from the agent running its own intervention loop.

Figure: **Golden = better answers, not better structure.** Injecting an expert intervention chain raises task accuracy sharply but pulls all-edge F₁ down on both 4-node and 6-node graphs. Offline intervention data ≠ online experimental choice.

Observation-conditioned, self-chosen intervention gives the best balance: observations narrow the space, and the agent's own experiments recover faithful structure. The discovery lives in the choosing, not just the data.
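One way to picture "choosing the experiment" is as picking the intervention that best separates the hypotheses still alive after observation. The sketch below is a simple heuristic offered as an assumption, not the agents' actual strategy or anything CausaLab prescribes: for each controllable variable it counts how many distinct "which variables would move" predictions the candidate graphs make, and picks the variable with the most.

```python
def descendants(edges, node):
    """All nodes reachable from `node` along directed (parent, child) edges."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, set()).add(child)
    seen, stack = set(), [node]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def pick_intervention(candidates, controllable):
    """Choose the target whose predicted downstream set differs most across candidates."""
    def distinct_predictions(target):
        return len({frozenset(descendants(graph, target)) for graph in candidates})
    return max(controllable, key=distinct_predictions)

# Three hypothetical graphs that observation alone has not ruled out.
chain = {("rad", "temp"), ("temp", "freq")}
fork = {("temp", "rad"), ("temp", "freq")}
direct = {("rad", "temp"), ("rad", "freq")}

best = pick_intervention([chain, fork, direct], controllable=["rad", "temp"])
```

Here a shift on `temp` separates all three candidates, while a shift on `rad` cannot tell the chain from the direct graph, so the choice of experiment determines how much structure a single step can reveal.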

Finding 3

Scale helps — but unevenly, and not where you'd hope.

Across the full 3–7 node sweep we compare GPT-5.2-high, GPT-5-mini, and Qwen3.5 with and without thinking. The strongest model wins overall — best accuracy, lowest structural Hamming distance at every size — but the gains concentrate on direct-parent and coefficient fitting. Root-node discovery barely improves on the larger graphs, and even the best model unravels as graphs grow.

Figure: **Task accuracy** (radar, 3–7 nodes, four models). GPT-5.2-high (red) leads at every graph size, but every model falls off as the graph grows from 3 to 7 nodes.

Figure: **Directed SHD** (radar, 3–7 nodes; lower is better). Open-weight Qwen's structural error rises fastest with graph size; thinking traces help Qwen recover structure, but don't close the gap to GPT-5.2-high.

Scaling improves direct-parent and coefficient recovery, but does not remove the need for better exploration and mechanism-checking. The bottleneck is behavioral, not just capacity.

Finding 4 · the un-scientific habit

Agents fail by committing before they've checked.

The single most human-recognizable failure: the agents stop experimenting too soon. A good scientist keeps probing until the theory survives its own data. These agents treat a plausible first guess as a finished result — the opposite of skepticism.

The DSL traces let us show this with evidence rather than just assert it. Three observations, in order:

1. **Budget left on the table.** Win or lose, runs use only about half their allowed interventions. Failure isn't running out of experiments; it's not running them.

2. **Failed theories don't fit their own evidence.** At the moment of commitment, successful runs' hypotheses match the data they collected **90.8%** of the time; failed runs, just **45.6%**.

3. **One check, +12 points.** Adding a single step that asks the agent to verify its final hypothesis against its evidence lifts 4-node accuracy **48% → 60%**.

So the failure mode isn't an exhausted budget or missing data — it's overconfidence. Agents promote an unverified hypothesis to a final theory instead of spending the budget they still have on disambiguating experiments. The fix is almost embarrassingly cheap: a dose of the scientific method, applied as a single "does my theory explain what I've seen?" check.
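A "does my theory explain what I've seen?" check can be approximated mechanically. The sketch below is one possible reading, with an assumed scoring rule rather than the benchmark's hypothesis-data match metric: for each intervention, a variable should move if and only if the hypothesis graph makes it a descendant of the intervened variable (assuming effects do not cancel exactly). The variable names and observed changes are invented.

```python
def descendants(edges, node):
    """All nodes reachable from `node` along directed (parent, child) edges."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, set()).add(child)
    seen, stack = set(), [node]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def hypothesis_data_match(edges, variables, interventions, tol=1e-6):
    """Fraction of (intervention, variable) pairs where 'moved' agrees with the graph."""
    checks = agree = 0
    for target, deltas in interventions:
        predicted_movers = descendants(edges, target)
        for var in variables:
            if var == target:
                continue
            moved = abs(deltas.get(var, 0.0)) > tol
            agree += int(moved == (var in predicted_movers))
            checks += 1
    return agree / checks if checks else 1.0

variables = ["rad", "temp", "cond", "freq"]
# Observed changes after two shift interventions (hypothetical numbers).
evidence = [
    ("temp", {"rad": 0.0, "cond": -5.0, "freq": -15.0}),
    ("rad", {"temp": 20.0, "cond": -10.0, "freq": -30.0}),
]
good = {("rad", "temp"), ("temp", "cond"), ("cond", "freq")}
bad = {("rad", "freq"), ("temp", "freq")}

good_match = hypothesis_data_match(good, variables, evidence)
bad_match = hypothesis_data_match(bad, variables, evidence)
```

A scaffold could refuse to commit while this score is low and interventions remain in the budget, which is the "experiment, then verify before committing" behavior the failed runs lack.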

Why this is encouraging

A capacity ceiling would be bad news. A behavioral gap is good news: it means the models often can discover the mechanism — they just don't keep going. Scaffolding that enforces "experiment, then verify before committing" is a tractable lever, not a new pre-training run.

Many failures are not caused by an exhausted budget; they are caused by committing before checking whether the proposed mechanism explains the evidence already collected.

The bigger picture

What this says about AI scientists.

The dream of an AI scientist is not a model that recites known laws faster. It's an agent that can stand at the edge of what's known, design an experiment, run it, and revise its theory toward a mechanism that transfers. That capacity — interactive causal discovery — is exactly what CausaLab isolates and measures.

Read together, the four findings sketch a clear profile of where today's agents stand on that path:

They can extrapolate without explaining. High accuracy with low graph fidelity means an agent that predicts the next reading but couldn't tell you which knob to turn to change it — useless for intervention-driven science.
Their discovery lives in the doing. Self-chosen interventions recover structure that even perfect handed-over data does not. An AI scientist has to be the one designing the experiments.
They quit while ahead. The gap between current agents and good scientists is, in large part, a willingness to keep testing. That's a scaffolding problem, and a solvable one.

CausaLab joins a growing line of interactive scientific-agent and causal-discovery environments [refs]. What it adds is the insistence on transfer — learn the mechanism on one crystal, apply it to another — and an auditable trace of the theory the agent commits to at every step. That combination is what lets us separate predictive luck from causal understanding.

Takeaways

If you build or benchmark causal agents.

Score the mechanism, not just the answer. Prediction accuracy hides causal-parrot behavior. Grade the recovered graph and equation against ground truth.
Sample the world fresh. A per-episode hidden SCM removes the memorized-fact shortcut that public causal corpora leave open.
Let the agent choose its experiments. Offline intervention data is not a substitute for online experimental choice. Test the loop, not just the dataset.
Demand transfer. A mechanism you can't apply to a new instance isn't a mechanism — it's a fitted curve. Hold out a fresh case.
Build in verification. The cheapest reliability win here was forcing the agent to check its theory against its evidence before committing. Don't let it stop early.

Discovery is interaction, and interaction is the gap.

CausaLab is a controlled stress test, not a verdict on causal reasoning in the wild — synthetic 3–7 node SCMs, mostly linear, a handful of models, shift-style interventions. Within that scope, it draws a clean line: current LLM agents can collect evidence and predict a held-out value while recovering an incomplete or wrong mechanism, and they routinely stop experimenting before the evidence warrants.

The encouraging reading is that the missing ingredient is largely behavioral — experiment more, verify before committing, prefer self-chosen interventions. Those are things we can scaffold. Getting them right is a real step toward agents that don't just know our science, but help extend it.