Reproducibility and detail, not just headline scores
OneEval is a repository-first response to two recurring gaps in open LLM evaluation: under-specified reproducibility and benchmark reporting that hides internal detail.
The benchmark may be public, but the exact setup often is not
Open-source evaluation frameworks regularly accumulate issue threads from users who cannot match paper or official numbers. In many cases, the missing piece is the concrete evaluation setup rather than the benchmark itself.
Rich benchmarks are frequently reduced to one headline score
MMLU-Pro contains many subsets but is often reported as a single average. AIME-style reasoning benchmarks are commonly reduced to pass@64 or a single mean, which hides the internal distribution.
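To make the point concrete, here is a minimal sketch of what a single headline number hides. The subset names and accuracy values are hypothetical placeholders, not real MMLU-Pro results; the only point is that two models with the same mean can have very different per-subset profiles.

```python
# Sketch: one headline mean vs. the per-subset distribution it hides.
# All subset names and scores below are hypothetical illustrations.
from statistics import mean, stdev

subset_scores = {
    "math": 0.42,
    "law": 0.71,
    "biology": 0.68,
    "engineering": 0.39,
}

headline = mean(subset_scores.values())  # the single number usually reported
spread = stdev(subset_scores.values())   # the detail the average throws away

print(f"headline mean: {headline:.3f}")
print(f"subset spread (stdev): {spread:.3f}")
for name, acc in sorted(subset_scores.items(), key=lambda kv: kv[1]):
    print(f"  {name:<12} {acc:.2f}")
```

Reporting the full `subset_scores` mapping alongside the mean is cheap, and it is exactly the kind of internal detail that a repository-first export can preserve.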
Sampling and run setup
Protocols are grouped by model family, then split into CoT vs NoCoT. We only surface the shared sampling knobs that matter here: Temperature, Top-p, Top-k, plus a benchmark-specific repeat count. Qwen-style rows (including DeepSeek-R1-Qwen3) use the normalized repeat policy, while Llama rows stay at repeat 1.
Choose a reading path
Browse the published files in the GitHub project
These links point to files that live inside the repository. On GitHub Pages and on the GitHub project itself, readers can inspect the raw-safe JSON exports and artifact folders directly.