Performance Discrepancy Report
Analyzing the gap between offline ranking metrics and real-world online performance (Pass@8).
Key Takeaway
Offline ≠ Online: Offline evaluation metrics correlate poorly with online deployment results. A high offline rank often fails to predict which model performs best in real-world scenarios.
1. Offline ≠ Online
Comparing Ranking Positions
Observation: The lines cross repeatedly, so a method's offline rank is a poor predictor of its online rank. The Importance-Sampling (IS) method (green) is the clearest case: it often ranks low offline but jumps to #1 online.
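One way to put a number on how much the offline and online rankings disagree is a Spearman rank correlation. The sketch below is a minimal illustration, assuming tie-free rankings of the same set of methods. The function name `spearman_rho` and the rank values are hypothetical placeholders, not figures from this report.

```python
def spearman_rho(offline_ranks, online_ranks):
    """Spearman rank correlation for two tie-free rankings of the same methods.

    Returns a value in [-1, 1]: 1 means identical order, -1 means fully
    reversed order, values near 0 mean the rankings are largely unrelated.
    """
    n = len(offline_ranks)
    if n != len(online_ranks) or n < 2:
        raise ValueError("need two rankings of equal length >= 2")
    # Sum of squared rank differences, method by method.
    d2 = sum((a - b) ** 2 for a, b in zip(offline_ranks, online_ranks))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical ranks for four placeholder methods (NOT values from this report).
offline = [1, 2, 3, 4]
online = [2, 4, 3, 1]
rho = spearman_rho(offline, online)
```

A value of `rho` close to 1 would mean offline rank is a usable proxy for online rank. Values near zero or below would be consistent with the crossing lines described above.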
2. The SOTA Method
Online Pass@8 Performance
Observation: Importance-Sampling (IS) demonstrates superior performance in the online environment, significantly outperforming the standard SFT baseline in larger models.
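The report does not state how Pass@8 is computed. A common definition of pass@k is the probability that at least one of k sampled attempts is correct, estimated from n samples of which c are correct as 1 - C(n-c, k) / C(n, k). The sketch below assumes that definition. The function name `pass_at_k` and the sample counts are illustrative assumptions, not details from this report.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n samples, c of which are correct.

    Assumes the standard combinatorial estimator; this report may have
    computed Pass@8 differently.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k incorrect samples: every k-subset contains a correct one.
        return 1.0
    # 1 minus the probability that all k drawn samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative counts only: 16 samples, 1 correct, k = 8.
example = pass_at_k(16, 1, 8)
```

With exactly n = k = 8 samples per problem, this reduces to a simple indicator: 1 if any of the eight attempts passes, 0 otherwise.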
Detailed Metrics (Pass@8)
| Method | Offline Rank | Offline Pass@8 | Online Rank | Online Pass@8 |
|---|---|---|---|---|