If you ask the public what the best artificial intelligence model is, most people will probably answer ChatGPT. While there are many players in 2024, OpenAI's chatbot was truly groundbreaking, bringing powerful generative AI to the masses. In fact, from the launch of GPT-3.5 to GPT-4 and now the current GPT-4 Turbo, ChatGPT's underlying Large Language Model (LLM), GPT, has consistently ranked among the best of its peers.
But the tide seems to be turning: This week, Anthropic's LLM Claude 3 Opus surpassed GPT-4 for the first time on Chatbot Arena, prompting app developer Nick Dobos to declare: "The king is dead." As of this writing, Claude still holds the advantage: Claude 3 Opus's Arena Elo score is 1253, while GPT-4-1106-preview's is 1251, followed by GPT-4-0125-preview at 1248.
Strictly speaking, Chatbot Arena lists all three LLMs as tied for first place, but Claude 3 Opus does have a slight edge.
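To put that edge in perspective, Elo ratings convert directly into an expected head-to-head win rate. Here is a minimal Python sketch using the standard Elo expected-score formula (the function name is mine, and Chatbot Arena's exact rating methodology may differ):

```python
# Standard Elo expected-score formula; ratings are the Chatbot
# Arena figures quoted above.

def expected_win_rate(rating_a: float, rating_b: float) -> float:
    """Probability that model A is preferred over model B under Elo."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

claude_3_opus = 1253
gpt_4_1106_preview = 1251

print(f"{expected_win_rate(claude_3_opus, gpt_4_1106_preview):.1%}")
# -> 50.3%
```

In other words, a two-point gap works out to an expected win rate of roughly 50.3%, barely better than a coin flip.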
Anthropic's other LLMs also performed well. Claude 3 Sonnet sits fifth on the list, just below Google's Gemini Pro (both hold the fourth-place rank), while Anthropic's efficiency-focused, lower-end LLM Claude 3 Haiku ranks just below GPT-4's 0613 version.
How Chatbot Arena ranks LLMs
To rank the various LLMs currently available, Chatbot Arena asks users to enter a prompt and judge how two different, unnamed models respond. Users can continue chatting to evaluate the differences between the two until they decide which model they think performs better. Users don't know which models they're comparing (you might be pitting Claude against ChatGPT, or Gemini against Meta's Llama, etc.), which removes any bias due to brand preference.
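Conceptually, each of those blind votes then feeds a chess-style rating update. The sketch below assumes a plain online Elo update with a typical K-factor; Chatbot Arena's actual pipeline and parameters are more involved, so treat this as an illustration of the idea rather than its implementation:

```python
K = 32  # a typical Elo K-factor (assumed; Arena's real parameters differ)

def expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the Elo model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def record_vote(ratings: dict, winner: str, loser: str) -> None:
    """Update both models' ratings after one blind pairwise vote."""
    e_winner = expected(ratings[winner], ratings[loser])
    ratings[winner] += K * (1 - e_winner)
    ratings[loser] -= K * (1 - e_winner)

# The voter never sees which model is which; identities are revealed
# only after the vote is cast.
ratings = {"model_a": 1200.0, "model_b": 1200.0}
record_vote(ratings, winner="model_a", loser="model_b")
print(ratings)  # {'model_a': 1216.0, 'model_b': 1184.0}
```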
However, unlike other types of benchmarks, users have no fixed criteria by which to evaluate the anonymized models: they decide for themselves which LLM performs better based on whatever metric they care about. As AI researcher Simon Willison told Ars Technica, much of what makes an LLM seem better in the eyes of users has to do with "vibes" more than anything else. If you prefer Claude's response to ChatGPT's, that's what really matters.
Most importantly, it's a testament to how strong these LLMs have become. If you had run the same test a few years ago, you might have looked for more standardized data to determine which LLM was stronger, whether in speed, accuracy, or consistency. Now, Claude, ChatGPT, and Gemini have become so good that they are almost interchangeable, at least as far as general generative AI usage is concerned.
While it's impressive that Claude outperformed OpenAI's LLM for the first time, it's even more impressive that GPT-4 held the top spot this long. The LLM itself is a year old, not counting iterative updates like GPT-4 Turbo, while Claude 3 launched just this month. Who knows what will happen when OpenAI releases GPT-5, which, at least according to one anonymous CEO, is "...really good, like materially better." For now, though, we have several generative AI models that are all roughly as capable as one another.
Chatbot Arena has collected over 400,000 human votes to rank these LLMs. You can try the test yourself and add your voice to the rankings.