Test large language models (LLMs) by comparing their performance side by side, in real time
Chatbot Arena (developed by LMSYS) is a platform designed for benchmarking large language models (LLMs) by comparing their performance side by side, in real time. It lets users interact with two anonymous models in parallel and judge which one handles a given task better.
After the user casts a vote, the names of the models are revealed, providing transparency and insight into the capabilities of both open-source and proprietary models.
Chatbot Arena uses an Elo rating system, similar to the one used in chess, to rank the models. Each vote adjusts the two contestants' ratings, building a leaderboard that showcases the best-performing LLMs.
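To make the ranking mechanism concrete, here is a minimal Python sketch of a standard Elo update applied to a single "battle". The K-factor and starting rating below are illustrative assumptions, not Chatbot Arena's actual parameters.

```python
# Minimal Elo update for one side-by-side "battle".
# K and INITIAL are illustrative assumptions, not the platform's real values.
K = 32            # step size: how much a single vote can move a rating
INITIAL = 1000.0  # rating assigned to a newly listed model

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a: float, rating_b: float, score_a: float) -> tuple[float, float]:
    """Return both ratings after one battle.

    score_a is 1.0 if A won, 0.0 if B won, 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + K * (score_a - exp_a)
    new_b = rating_b + K * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# Example: two fresh models face off and the user votes for model A.
a, b = update(INITIAL, INITIAL, 1.0)
print(a, b)  # 1016.0 984.0
```

Because ratings only move when users vote, every head-to-head comparison crowdsourced through the Arena feeds directly into the leaderboard.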
All this has made Chatbot Arena a valuable resource for developers and AI enthusiasts to understand which models excel at specific tasks, such as conversation, coding, or complex problem-solving.
Beyond traditional chatbot-style interactions, Chatbot Arena also evaluates models for other tasks, such as red-teaming and coding. Moreover, it provides an opportunity for users to test both closed-source and open-source models — including popular ones like ChatGPT and Claude. The system is continuously updated, with new models and features being added regularly.
While Chatbot Arena offers a fun way to compare LLMs, it also serves a serious purpose in AI development. By crowdsourcing evaluations, the platform helps democratize AI testing and provide insights into model performance. And that should deliver better AI experiences for everyone… which is a good thing.
What are the key features?
⭐
- Interactive model testing: Chatbot Arena lets users test and compare different AI language models by interacting with them directly.
- Model benchmarking: It offers tools to evaluate and benchmark the performance of AI models on various tasks.
- Collaboration features: Chatbot Arena allows teams to collaborate on model evaluations, enhancing decision-making through shared insights.
- Custom model uploads: Users can upload their own models for evaluation and comparison against pre-existing ones.
- API access: The tool provides API access for developers to integrate the service with their applications.
Who is it for?
🤔
Chatbot Arena is made for AI researchers, developers, and data scientists who need to compare and evaluate AI language models. It is also useful for tech companies working on AI projects, educational institutions that teach how language models work, and teams that want a shared basis for decisions about model performance. More broadly, it suits anyone who needs a closer look at how language models behave in practice.
Examples of what you can use it for
💭
- Easily compare different AI language models to determine which one fits your project's needs
- Researchers can use the platform to test new models and gather performance data across tasks
- Teams working on AI projects can collaborate in real time to evaluate model efficiency
- Developers can upload their own AI models for personalized testing and benchmarking
- Educators can use it to show students how different AI models work and where their strengths and weaknesses lie
Pros & Cons
⚖️
Pros:
- Lets you test and compare two LLMs, side by side
- Supports both open- and closed-source models
- Helps deliver better AI experiences for all of us

Cons:
- Once you pick the tool to use, you won't return here that often (we don't)
Last update:
November 24, 2024