NeurIPS 2025 Camera Ready

The Best Instruction-Tuning Data are Those That Fit

Dylan Zhang
University of Illinois Urbana-Champaign
Point of Contact: shizhuo2@illinois.edu
Qirun Dai
University of Illinois Urbana-Champaign
Hao Peng
University of Illinois Urbana-Champaign

Summary

A brief overview of how tailoring supervision to the target model yields better instruction-tuned models.

High-quality supervised finetuning (SFT) data are essential for unlocking LLM capabilities. However, standard SFT data are often drawn from sources that are out of distribution for the target model.

We propose GRAPE, a framework built on the hypothesis that SFT is most effective when the training data fit the target model's pretrained distribution.

The Core Idea

For each instruction, GRAPE gathers responses from various sources and selects the one that aligns most closely with the target model's pretrained distribution, i.e., the response with the highest length-normalized probability under that model. We then perform standard SFT on this curated subset.
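In symbols (our notation, inferred from this description rather than taken from the paper): for an instruction x with candidate response pool R(x), GRAPE keeps

```latex
% Length-normalized log-likelihood selection: keep the response
% the target model p_theta finds most likely per token.
r^{\star}(x) = \operatorname*{arg\,max}_{r \in R(x)} \frac{1}{|r|} \log p_{\theta}(r \mid x)
```

where p_θ is the target model's pretrained distribution and |r| is the response length in tokens.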

How GRAPE Works

A simple, effective pipeline to maximize data suitability.

01

Gather Responses

Collect a diverse pool of responses for each instruction from various sources (humans, different LLMs).

02

Measure Alignment

Calculate the length-normalized probability of each response under the target model itself (sketched in code after these steps).

03

Select & Finetune

Keep only the highest-probability response for each instruction and perform standard SFT.
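For concreteness, here is a minimal sketch of steps 02 and 03, assuming a HuggingFace causal LM as the target model. The model name, prompt formatting, and candidate-pool layout are illustrative placeholders of ours, not the paper's released pipeline:

```python
# Hedged sketch of GRAPE's scoring and selection steps, assuming a
# HuggingFace causal LM. Model name and data layout are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B"  # placeholder target model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()


@torch.no_grad()
def normalized_logprob(instruction: str, response: str) -> float:
    """Step 02: (1/|r|) * sum_t log p_theta(r_t | x, r_<t) under the target model."""
    prompt_ids = tokenizer(instruction, return_tensors="pt").input_ids
    response_ids = tokenizer(response, return_tensors="pt",
                             add_special_tokens=False).input_ids
    input_ids = torch.cat([prompt_ids, response_ids], dim=1)
    logits = model(input_ids).logits
    # The logit at position t predicts token t+1, so shift by one and
    # score only the response tokens, not the instruction.
    start = prompt_ids.shape[1]
    logprobs = torch.log_softmax(logits[0, start - 1:-1], dim=-1)
    token_logprobs = logprobs.gather(1, response_ids[0].unsqueeze(-1)).squeeze(-1)
    return (token_logprobs.sum() / response_ids.shape[1]).item()


# Step 03: keep the single highest-probability response per instruction.
candidate_pool = {
    # instruction -> responses gathered from different sources (step 01)
    "Write a Python function that reverses a string.": [
        "def reverse(s):\n    return s[::-1]",
        "You can reverse a string with slicing, e.g. s[::-1].",
    ],
}

sft_data = [
    {"instruction": x, "response": max(rs, key=lambda r: normalized_logprob(x, r))}
    for x, rs in candidate_pool.items()
]
# `sft_data` is now a standard instruction-response set for ordinary SFT
# (e.g. with TRL's SFTTrainer).
```

The key design choice is the length normalization: without dividing by |r|, the score would systematically favor shorter responses.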

Impressive Gains

Outperforming strong baselines, even those trained on far larger datasets.

See paper for full benchmarks
17.3%
Performance Gain

Improvement over baselines trained on 3x more data.

13.8%
Distillation Gain

Outperformed distillation from the strongest teacher model (Llama 3.1 405B).

1/3
Data Efficiency

Surpassed Tulu3-SFT performance using only one-third of the data.

SOTA
Generalization

Consistent improvements across coding, math, and logic benchmarks.