Summary
A brief overview of how tailoring supervision to the target model creates better instruction-tuned models.
High-quality supervised finetuning (SFT) data are essential for unlocking LLM capabilities. However, standard SFT data often come from sources that are out of distribution for the target model.
We propose GRAPE, a novel framework built on the hypothesis that SFT is most effective when the data are aligned with the target model's pretrained distribution.
The Core Idea
For each instruction, GRAPE gathers responses from various sources and selects the one that aligns most closely with the target model's pretrained distribution, i.e., the response to which the target model assigns the highest length-normalized probability. We then perform standard SFT on this curated subset.
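In symbols (our notation, not taken from the original): for an instruction x with candidate pool Y(x) and target model p_θ, the selected response is

$$ y^{*}(x) \;=\; \arg\max_{y \,\in\, \mathcal{Y}(x)} \; \frac{1}{|y|} \, \log p_{\theta}(y \mid x) $$

where |y| is the response length in tokens, so the score is the average per-token log-probability under the target model.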
How GRAPE Works
A simple, effective pipeline to maximize data suitability.
Gather Responses
Collect a diverse pool of responses for each instruction from various sources (humans, different LLMs).
Measure Alignment
Calculate the length-normalized probability of each response using the target model itself (see the code sketch after these steps).
Select & Finetune
Keep only the highest-probability response for each instruction and perform standard SFT.
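Below is a minimal sketch of the scoring and selection steps, assuming a Hugging Face causal LM as the target model. The model name, the bare prompt-response concatenation, and the `candidates` example are illustrative placeholders rather than details taken from GRAPE itself.

```python
# Sketch of GRAPE-style response selection: score each candidate response by its
# length-normalized log-probability under the target model, keep the argmax.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B"  # hypothetical target model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

@torch.no_grad()
def normalized_logprob(instruction: str, response: str) -> float:
    """Average per-token log-probability of `response` given `instruction`
    under the target model (length-normalized log-probability)."""
    # Real usage would apply the model's own prompt template; plain
    # concatenation is used here to keep the sketch short.
    prompt_ids = tokenizer(instruction, return_tensors="pt").input_ids
    full_ids = tokenizer(instruction + response, return_tensors="pt").input_ids
    logits = model(full_ids).logits  # [1, seq_len, vocab]

    # Log-prob of each token given its prefix (shift logits by one position).
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_logprobs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

    # Keep only the positions that predict response tokens.
    resp_start = prompt_ids.shape[1] - 1
    return token_logprobs[0, resp_start:].mean().item()

def select_best(instruction: str, responses: list[str]) -> str:
    """Keep the response the target model assigns the highest
    length-normalized probability."""
    return max(responses, key=lambda r: normalized_logprob(instruction, r))

# Toy candidate pool gathered from different sources (humans, other LLMs).
candidates = {
    "Explain what a hash map is.": [
        "A hash map stores key-value pairs and uses a hash function to index them.",
        "It's a data structure where keys map to values via hashing.",
    ],
}
sft_pairs = [(x, select_best(x, ys)) for x, ys in candidates.items()]
# `sft_pairs` would then be used for standard supervised finetuning.
```

Length normalization keeps the score from trivially favoring short responses: averaging the log-probability per token puts long and short candidates on a comparable scale.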
Impressive Gains
Outperforming strong baselines and larger datasets.
Improved over baselines trained on 3x more data.
Outperformed distillation from the strongest teacher model (Llama 3.1 405B).
Surpassed Tulu3-SFT performance using only a fraction of the data.
Consistent improvements across coding, math, and logic benchmarks.