Discord servers tagged with benchmark

3

29 days ago

A community for tracking and discussing LLM performance across every major benchmark and leaderboard. If you follow AI model releases, compare scores between ChatGPT, Claude, Gemini, and GPT-5, or just want to know which model actually performs best on real tests, this is the place for that conversation.
We cover the full range of evaluations: reasoning and general intelligence benchmarks like Humanity’s Last Exam, ARC Prize, and ForecastBench; coding benchmarks like LiveCodeBench Pro, Aider, and SWE-Bench; hallucination and factual accuracy leaderboards like the Vectara Hallucination Leaderboard and the RAG hallucination benchmark; alignment and honesty testing like the MASK leaderboard from Scale AI; agentic and long-horizon task benchmarks like Vending Bench and METR Time Horizons; multimodal and vision benchmarks like GeoBench and Video-MMMU; and aggregators like Artificial Analysis, LLM Stats, Chatbot Arena, and Epoch AI.
