Meta's AI Benchmark Win Comes With an Asterisk

Meta claimed a major victory in AI last weekend when its new Llama 4 Maverick model grabbed second place in industry rankings. But there's a catch.

The model that earned those impressive scores isn't the one you can actually use. The drama began on Saturday, when Meta released two new AI models, Scout and Maverick. The company boldly declared that Maverick could outperform heavy hitters like GPT-4o and Gemini 2.0 Flash "across a broad range of widely reported benchmarks."

Maverick quickly climbed to the number two spot on LMArena, the go-to site where humans compare different AI models head-to-head. Meta proudly touted Maverick's Elo score of 1417, placing it above OpenAI's offerings and just below Google's Gemini 2.5 Pro.
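For readers wondering what a score like 1417 actually means: LMArena's rankings are built from thousands of pairwise human votes, converted into Elo-style ratings. The sketch below shows a classic Elo update from a single head-to-head vote; it is illustrative only and not LMArena's exact methodology, which differs in its details.

    # Illustrative Elo update from one head-to-head vote (not LMArena's exact method).
    def expected_score(rating_a: float, rating_b: float) -> float:
        """Probability that model A beats model B under the Elo model."""
        return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

    def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
        """Return both models' new ratings after one vote."""
        e_a = expected_score(rating_a, rating_b)
        s_a = 1.0 if a_won else 0.0
        new_a = rating_a + k * (s_a - e_a)
        new_b = rating_b + k * ((1.0 - s_a) - (1.0 - e_a))
        return new_a, new_b

    # A model rated 1417 is expected to beat a 1380-rated rival about 55% of the time.
    print(round(expected_score(1417, 1380), 2))  # ~0.55

The takeaway: a few dozen Elo points translate into only a modest edge in head-to-head votes, which is why swapping in a chattier, crowd-pleasing variant can meaningfully move a model up the board.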

But eagle-eyed researchers soon spotted something fishy in Meta's documentation. The version of Maverick that achieved those stellar rankings wasn't the same one Meta released to the public. Instead, the company had deployed what it called an "experimental chat version" specifically "optimized for conversationality."

LMArena wasn't pleased. "Meta's interpretation of our policy did not match what we expect from model providers," the site posted on X, adding that it is updating its policies to prevent similar confusion in the future.

Meta's response? A spokesperson named Ashley Gabriel shrugged it off, saying "we experiment with all types of custom variants." The company maintains it has done nothing wrong and says it looks forward to seeing how developers use the public version of Llama 4.

The timing of the release raised eyebrows too. AI researchers noted that Saturday releases are unusual in the industry. When questioned about this on Threads, Meta CEO Mark Zuckerberg offered a terse explanation: "That's when it was ready."

The plot thickens with rumors that Meta might have trained its models specifically to ace benchmarks while hiding real limitations. Meta's VP of generative AI, Ahmad Al-Dahle, strongly denied these accusations: "We've heard claims that we trained on test sets – that's simply not true and we would never do that."

According to The Information, Meta had repeatedly delayed Llama 4's launch because the model wasn't meeting internal expectations. The pressure was especially high after DeepSeek, an open-source AI startup from China, released a model that generated significant buzz.

This episode highlights a growing problem in AI development: benchmark scores are becoming more about winning than honest evaluation. Companies can submit specially-tuned versions of their models for testing while releasing different versions to the public, making these rankings increasingly unreliable as indicators of real-world performance.

Why this matters:

  • Meta's move exposes how easily AI benchmarks can be gamed - it's like bringing a Formula 1 car to a street race, then selling customers a minivan
  • The incident reveals the intense pressure on tech giants to show AI leadership, even if it means bending the rules until they snap
