FELM
by Community
FELM: Benchmarking Factuality Evaluation of Large Language Models
OSS
Added 1 June 2026
Overview
FELM is a benchmark for factuality evaluation of large language models. It provides a standardized dataset and methodology for measuring factuality across different models and tasks.
Best for
Researchers and developers needing a standardized way to measure LLM factuality
Use cases
- Assessing factual accuracy of LLM outputs in research
- Comparing factuality performance across multiple models
- Validating model improvements in truthfulness
Pros
- Provides a structured, reproducible evaluation framework
- Focuses specifically on factuality, a critical quality metric
- Community-driven benchmark with transparent methodology
Cons
- Limited to the specific tasks and datasets in the benchmark
- May not cover all real-world factuality challenges
- Requires familiarity with benchmarking tools and setup
Indexed from awesome-llm and enriched against its public facts.
Pairs with
Other entries in the index that connect to this one.
OpenAI Evals
Community
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
lm-evaluation-harness
Community
A framework for few-shot evaluation of language models.
Ragas
Community
Supercharge Your LLM Application Evaluations 🚀
promptfoo
Community
Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, DeepSeek, and more. Simple declarative config