If you asked most people what the best AI model is, there’s a very good chance they would answer ChatGPT. While there are many players on the scene in 2024, OpenAI is the one that has truly broken through and introduced powerful generative AI to the masses. Fittingly, ChatGPT’s large language model (LLM), GPT, has consistently topped its competitors, from the introduction of GPT-3.5, through GPT-4, and now GPT-4 Turbo.
However, the tide appears to be turning: this week, Anthropic’s Claude 3 Opus LLM overtook GPT-4 for the first time in the Chatbot Arena, which prompted app creator Nick Dobos to declare: “The king is dead.” At the time of writing, the leaderboard still shows Claude with an advantage over GPT: Claude 3 Opus has an Arena Elo rating of 1253, while GPT-4-1106-preview has an Arena Elo rating of 1251, closely followed by GPT-4-0125-preview with a rating of 1248.
For what it’s worth, Chatbot Arena ranks all three LLMs in first place, but Claude 3 Opus has a slight edge.
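To see just how slight that edge is, the textbook Elo expected-score formula can turn a rating gap into an expected head-to-head win rate. The snippet below is a minimal illustration using the ratings quoted above; it assumes the standard Elo formula, not Chatbot Arena’s exact statistical methodology.

```python
# Convert an Elo rating gap into an expected head-to-head win probability.
# Illustrative only: uses the textbook Elo formula, not Chatbot Arena's exact method.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

claude_3_opus = 1253       # Arena Elo quoted above
gpt_4_1106_preview = 1251

p = expected_score(claude_3_opus, gpt_4_1106_preview)
print(f"Expected win rate for Claude 3 Opus: {p:.3f}")  # ~0.503, essentially a coin flip
```

In other words, a two-point Elo gap predicts an almost perfectly even matchup, which is why all three models can plausibly share first place.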
Anthropic’s other LLMs are also performing well. Claude 3 Sonnet sits fifth on the list, just behind Google’s Gemini Pro (the two technically share fourth place), while Claude 3 Haiku, Anthropic’s least powerful LLM built for speed and efficiency, lands just below, only ahead of the 0613 version of GPT-4.
How Chatbot Arena evaluates LLMs
To rank the various LLMs currently available, Chatbot Arena asks users to enter a prompt and rate how two different, unnamed models respond. Users can continue the conversation to compare the two until they decide which model works better. Users don’t know which models they’re comparing (you could be pitting Claude against ChatGPT, Gemini against Meta’s Llama, etc.), which eliminates any brand-preference bias.
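Each blind vote then nudges the two models’ scores with an Elo-style update. The sketch below is a simplified version of that update rule with an assumed fixed K-factor; Chatbot Arena’s published leaderboard is computed with a more involved statistical fit over all recorded battles.

```python
# Simplified Elo-style update applied after one blind head-to-head vote.
# Illustrative sketch only; the real leaderboard computation is more sophisticated.

K = 32  # assumed K-factor, chosen for illustration

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update(rating_a: float, rating_b: float, a_won: bool) -> tuple[float, float]:
    """Return both models' new ratings after a single vote (ties omitted for brevity)."""
    e_a = expected_score(rating_a, rating_b)
    s_a = 1.0 if a_won else 0.0
    new_a = rating_a + K * (s_a - e_a)
    new_b = rating_b + K * ((1.0 - s_a) - (1.0 - e_a))
    return new_a, new_b

# Example: a single upset vote against the slightly higher-rated model.
# The loser's rating drops and the winner's rises by roughly K/2 points each.
print(update(1253, 1251, a_won=False))
```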
However, unlike other types of benchmarks, there is no real rubric by which users evaluate the anonymous models. Users simply decide for themselves which LLM performs better, based on whatever metrics they care about. As AI researcher Simon Willison put it in an interview with Ars Technica, much of what makes an LLM seem better in the eyes of users comes down to “vibes.” It’s also worth remembering that GPT-4 itself is already a year old (https://mashable.com/article/openai-gpt-4-release-date-announcement), setting aside iterative updates like GPT-4 Turbo, while Claude 3 was released this month. Who knows what will happen when OpenAI introduces GPT-5, which, at least according to one anonymous CEO, is “really good, like materially better.” For now, there are many generative AI models, each of which is almost equally capable.
Chatbot Arena has gathered over 400,000 human votes to rank these LLMs. You can try it yourself and add your vote to the rankings.
Credit: lifehacker.com