Every time a major tech company releases a new AI model, its builders tout how well the model performs on benchmark leaderboards.
But in the paper "The Leaderboard Illusion," researchers from Cohere Labs, Princeton, Stanford, MIT, the University of Waterloo and other institutions expose systemic issues in Chatbot Arena, widely considered the de facto standard for ranking and evaluating generative AI models.
The paper reveals that major AI developers – particularly Meta, Google and OpenAI – benefit from undisclosed private testing, selective score disclosures and disproportionately high sampling rates. These advantages skew Arena rankings, favoring proprietary models while disadvantaging open-weight and open-source alternatives.
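To make the selective-disclosure concern concrete, here is a minimal toy simulation, not the paper's actual methodology; the variant count, rating scale and noise level below are illustrative assumptions. If a provider privately tests many variants of identical true quality and publishes only the best-scoring one, the published rating is inflated purely by selection on measurement noise.

```python
import random
import statistics

def simulate(true_skill=1200.0, noise_sd=15.0, n_variants=10, trials=10_000):
    """Compare a single public submission against best-of-N private testing.

    Every variant has the SAME true skill; the only difference between them is
    measurement noise from a finite number of simulated Arena battles.
    """
    single_submission = []  # provider fields one variant and publishes its score
    best_of_n = []          # provider tests n variants privately, publishes only the max
    for _ in range(trials):
        noisy_scores = [random.gauss(true_skill, noise_sd) for _ in range(n_variants)]
        single_submission.append(noisy_scores[0])
        best_of_n.append(max(noisy_scores))
    return statistics.mean(single_submission), statistics.mean(best_of_n)

if __name__ == "__main__":
    random.seed(0)
    honest, selected = simulate()
    print(f"single submission (mean published score): {honest:7.1f}")
    print(f"best of 10 private variants (published):  {selected:7.1f}")
    print(f"inflation from selection on noise alone:  {selected - honest:+7.1f}")
```

Under these toy assumptions, the best-of-ten strategy publishes a score roughly 20 points higher than an honest single submission, despite zero difference in underlying model quality.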
Although Meta’s flagship AI model family, Llama, is open source, the researchers treated it as proprietary because Meta also had access to private testing, retraction privileges and high sampling rates.
The authors demonstrate that access to Arena data significantly boosts leaderboard scores, raising concerns about overfitting to the platform rather than genuine improvements in model quality. They also find that model deprecations disproportionately affect open-source models, further distorting rankings.
The team calls for reforms to improve transparency and fairness, including mandatory publication of all submissions, standardized sampling rates, and equitable deprecation practices. The findings are a call for the community to reassess how progress in AI is measured and a caution against leaderboard-driven development.