Chinese Large Model Leaderboard
by Community
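NoneLinear (非线智能) - ReLE evaluation: a Chinese AI large-model capability benchmark (continuously updated). It currently includes 374 large models, covering commercial models such as chatgpt, gpt-5.4, Google gemini-3.1-pro, Claude-4.6, Wenxin ERNIE-X1.1, ERNIE-5.0, qwen3.6-max, qwen3.6-plus, Baichuan, iFLYTEK Spark, and SenseTime senseChat, as well as st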
OSS
Added 1 June 2026
Overview
A community-maintained benchmark for Chinese large language models, covering 374 commercial and open-source models including GPT, Gemini, Claude, ERNIE, Qwen, and others. It provides a continuously updated leaderboard and a defect library with over 2 million entries for analysis and improvement.
Best for
Developers and researchers evaluating Chinese large language models.
Use cases
- Compare the performance of multiple Chinese LLMs side by side
- Identify common defects and weaknesses in large language models
- Track benchmark trends and model improvements over time
Notes
6,103 stars on GitHub. Last updated 2026-05-30.
Pros
- Covers a wide range of both proprietary and open-source Chinese LLMs
- Includes a large defect library for deeper analysis
- Regularly updated with community contributions
Cons
- Focused on Chinese language models, limiting global applicability
- Evaluation methodology is community-driven, not formally peer-reviewed
- Interface and documentation are primarily in Chinese
Indexed from awesome-llm and enriched against its public facts.
Pairs with
Other entries in the index that connect to this one.
lm-evaluation-harness
Community
A framework for few-shot evaluation of language models.
promptfoo
Community
Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, DeepSeek, and more. Simple declarative config