Discord servers benchmark

Discord servers tagged with benchmark

A community for tracking and discussing LLM performance across every major benchmark and leaderboard. If you follow AI model releases, compare scores across ChatGPT, Claude, Gemini, and GPT-5, or just want to know which model actually performs best on real tests, this is the place for that conversation.
We cover the full range of evaluations: reasoning and general intelligence benchmarks like Humanity’s Last Exam, ARC Prize, and ForecastBench; coding benchmarks like LiveCodeBench Pro, Aider, and SWE-Bench; hallucination and factual accuracy leaderboards like the Vectara Hallucination Leaderboard and the RAG hallucination benchmark; alignment and honesty testing like the MASK leaderboard from Scale AI; agentic and long-horizon task benchmarks like Vending Bench and METR Time Horizons; multimodal and vision benchmarks like GeoBench and Video-MMMU; and aggregators like Artificial Analysis, LLM Stats, Chatbot Arena, and Epoch AI.