Reproducibility and detail, not just headline scores
OneEval is a repository-first response to two recurring gaps in open LLM evaluation: under-specified reproducibility and benchmark reporting that hides internal detail.
The benchmark may be public, but the exact setup often is not
Open-source evaluation frameworks regularly accumulate issue threads from users who cannot match paper or official numbers. In many cases, the missing piece is the concrete evaluation setup rather than the benchmark itself.
Rich benchmarks are frequently reduced to one headline score
MMLU-Pro contains many subsets but is often reported as a single average. AIME-style reasoning benchmarks are commonly reduced to pass@64 or a single mean, which hides the internal distribution.
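To make the point concrete, here is a minimal sketch of what a single headline number hides. The subset names and accuracy values are hypothetical placeholders, not real MMLU-Pro results; the only point is that two models with the same mean can have very different per-subset profiles.

```python
# Sketch: one headline mean vs. the per-subset distribution it hides.
# All subset names and scores below are hypothetical illustrations.
from statistics import mean, stdev

subset_scores = {
    "math": 0.42,
    "law": 0.71,
    "biology": 0.68,
    "engineering": 0.39,
}

headline = mean(subset_scores.values())  # the single number usually reported
spread = stdev(subset_scores.values())   # the detail the average throws away

print(f"headline mean: {headline:.3f}")
print(f"subset spread (stdev): {spread:.3f}")
for name, acc in sorted(subset_scores.items(), key=lambda kv: kv[1]):
    print(f"  {name:<12} {acc:.2f}")
```

Reporting the full `subset_scores` mapping alongside the mean is cheap, and it is exactly the kind of internal detail that a repository-first export can preserve.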
Sampling and run setup
Protocols are grouped by model family, then split into CoT vs NoCoT. We only surface the shared sampling knobs that matter here: Temperature, Top-p, Top-k, plus a benchmark-specific repeat count. Qwen-style rows (including DeepSeek-R1-Qwen3) use the normalized repeat policy, while Llama rows stay at repeat 1.
Choose a reading path
Browse the published files in the GitHub project
These links point to files that live inside the repository. On GitHub Pages and on the GitHub project itself, readers can inspect the raw-safe JSON exports and artifact folders directly.