Model Performance Analysis: Offline vs Online (Pass@8)

Performance Discrepancy Report

Analyzing the gap between offline ranking metrics and real-world online performance (Pass@8).
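
The report does not state how Pass@8 is computed. Assuming it follows the commonly used unbiased pass@k estimator (the probability that at least one of k samples, drawn without replacement from n generations of which c are correct, passes), a minimal sketch looks like this. The function name `pass_at_k` and the sample counts are illustrative and do not come from the report.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int = 8) -> float:
    """Estimate pass@k from n generations of which c are correct.

    Assumption: this is the standard unbiased estimator
    1 - C(n - c, k) / C(n, k); the report may use a different definition.
    """
    if not (0 <= c <= n):
        raise ValueError("c must satisfy 0 <= c <= n")
    if not (1 <= k <= n):
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect samples: every draw of k contains a correct one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts: 16 generations per problem, 1 of them correct.
    print(pass_at_k(n=16, c=1, k=8))
```

With exactly 8 generations per problem, this reduces to a simple indicator: Pass@8 is 1 if any of the 8 generations is correct and 0 otherwise.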

Key Takeaway

Offline ≠ Online: Offline evaluation metrics correlate poorly with online deployment results. A high offline ranking often fails to identify the models that perform best in real-world scenarios.

1. Offline ≠ Online

Comparing Ranking Positions

High Discrepancy

Observation: The lines linking offline and online rank cross frequently, so a high offline rank is a poor predictor of online rank. In particular, the IS method (green) often ranks low offline but jumps to #1 online.
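
One way to put a number on how much the rankings disagree is Spearman's rank correlation between the offline and online rank columns. The sketch below assumes each column is a tie-free ranking from 1 to n; the rank values in the usage example are hypothetical placeholders, not figures from this report.

```python
def spearman_rho(offline_rank: list[int], online_rank: list[int]) -> float:
    """Spearman's rho for two tie-free rankings of the same n methods.

    Uses the closed form 1 - 6 * sum(d^2) / (n * (n^2 - 1)), which is
    only valid when both inputs are permutations of 1..n.
    """
    n = len(offline_rank)
    if n != len(online_rank) or n < 2:
        raise ValueError("need two rankings of equal length >= 2")
    expected = list(range(1, n + 1))
    if sorted(offline_rank) != expected or sorted(online_rank) != expected:
        raise ValueError("each ranking must be a permutation of 1..n")
    # Sum of squared rank differences, method by method.
    d_squared = sum((a - b) ** 2 for a, b in zip(offline_rank, online_rank))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


if __name__ == "__main__":
    # Hypothetical ranks for four methods (illustration only).
    offline = [1, 2, 3, 4]
    online = [2, 4, 3, 1]
    print(spearman_rho(offline, online))
```

A value near 1 means the offline ranking predicts the online ranking well, a value near 0 means little relationship, and a negative value means the orderings tend to be reversed.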

2. The SOTA Method

Online Pass@8 Performance

Winner: IS

Observation: Importance Sampling (IS) performs best in the online environment, significantly outperforming the standard SFT baseline on the larger models.

Detailed Metrics (Pass@8)

Method	Offline Rank	Offline Pass@8	Online Rank	Online Pass@8