Performance Discrepancy Report

Analyzing the gap between offline ranking metrics and real-world online performance (Pass@8).

Key Takeaway

Offline ≠ Online: Offline evaluation metrics correlate poorly with online deployment results. A high offline ranking often fails to identify the models that perform best in real-world scenarios.

1. Offline ≠ Online

Comparing Ranking Positions

High Discrepancy
Observation: The rank lines cross frequently between the offline and online positions, so a high offline rank is a poor predictor of online rank. In particular, the Importance-Sampling (IS) method (green) often ranks low offline but jumps to #1 online.
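
One way to put a number on the "lines cross" observation is a Spearman rank correlation between the offline and online rankings. The sketch below is illustrative only: the function name `spearman_rho` and the rank lists are placeholders, not values from this report, and it assumes each ranking is a tie-free ordering of the same set of methods.

```python
def spearman_rho(offline_ranks, online_ranks):
    """Spearman rank correlation for two tie-free rankings of the same methods.

    Position i in both lists must refer to the same method.
    """
    if len(offline_ranks) != len(online_ranks):
        raise ValueError("both rankings must cover the same methods")
    n = len(offline_ranks)
    if n < 2:
        raise ValueError("need at least two methods to correlate")
    # Sum of squared rank differences across methods.
    d_squared = sum((a - b) ** 2 for a, b in zip(offline_ranks, online_ranks))
    # Closed-form Spearman formula, valid only when there are no ties.
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Placeholder ranks for five hypothetical methods (NOT data from this report).
offline = [1, 2, 3, 4, 5]
online = [4, 2, 5, 3, 1]
rho = spearman_rho(offline, online)
print(f"Spearman rho = {rho:.2f}")  # prints "Spearman rho = -0.50"
```

A value near +1 would mean offline rank predicts online rank well; values near 0 or below match the discrepancy described above.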

2. The SOTA Method

Online Pass@8 Performance

Winner: IS
Observation: Importance-Sampling (IS) performs best in the online environment, significantly outperforming the standard supervised fine-tuning (SFT) baseline in larger models.
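
This report does not state how Pass@8 is computed. For reference, a minimal sketch of the commonly used unbiased pass@k estimator follows, assuming n generations are sampled per problem and c of them pass; the function name `pass_at_k` and its parameters are hypothetical and not taken from the report.

```python
from math import comb


def pass_at_k(n, c, k=8):
    """Estimate pass@k for one problem.

    n: total generations sampled for the problem
    c: number of those generations that pass
    k: sample budget (8 for Pass@8)

    Returns the probability that at least one of k generations, drawn
    without replacement from the n sampled, is a passing one.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k failing generations exist,
    # in which case every size-k draw contains at least one pass.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative call only: 16 samples, 1 passing, budget of 8.
print(pass_at_k(16, 1))  # prints 0.5
```

A benchmark-level Pass@8 would then be the mean of this per-problem estimate over all problems.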

Detailed Metrics (Pass@8)

Method | Offline Rank | Offline Pass@8 | Online Rank | Online Pass@8