Open Evaluation Artifacts

Open-model evaluation artifacts, reorganized for inspection instead of ranking.

OneEval publishes public-safe EvalScope artifacts for Llama, Qwen, and DeepSeek families. It exposes subset-level results, pass@k reasoning views, and sanitized protocol summaries, all repacked into a stable Model → Benchmark → Mode → Run release layout.
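To make the Model → Benchmark → Mode → Run layout concrete, here is a minimal sketch of how a reader might enumerate runs from such a tree. The directory names and the `index_runs` helper are illustrative assumptions, not OneEval's actual tooling:

```python
from pathlib import Path

def index_runs(root: str) -> list[tuple[str, str, str, str]]:
    """Enumerate (model, benchmark, mode, run) tuples from a
    Model -> Benchmark -> Mode -> Run directory tree."""
    runs = []
    for run_dir in Path(root).glob("*/*/*/*"):
        if run_dir.is_dir():
            model, benchmark, mode, run = run_dir.relative_to(root).parts
            runs.append((model, benchmark, mode, run))
    return sorted(runs)
```

Because the layout is stable, a four-level glob like this is all that downstream inspection scripts need to discover every published run.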

- No leaderboard or composite score
- Subset-level results, not just averages
- Interactive pass@k views for reasoning
- Sanitized configs only, never raw internal paths

Authors: Xuan Chen, Qiuxuan Chen, Bo Liu

Why OneEval Exists

Reproducibility and detail, not just headline scores

OneEval is a repository-first response to two recurring gaps in open LLM evaluation: reproducibility that is under-specified and benchmark reporting that hides internal detail.

Problem 1

The benchmark may be public, but the exact setup often is not

Open-source evaluation frameworks regularly accumulate issue threads from users who cannot match paper or official numbers. In many cases, the missing piece is the concrete evaluation setup rather than the benchmark itself.

Problem 2

Rich benchmarks are frequently reduced to one headline score

MMLU-Pro contains many subsets but is often reported as a single average. AIME-style reasoning benchmarks are commonly reduced to pass@64 or a single mean, which hides the internal distribution.


Evaluation Protocol

Sampling and run setup

Protocols are grouped by model family, then split into CoT vs. NoCoT. We surface only the shared sampling knobs that matter here: temperature, top-p, top-k, plus a benchmark-specific repeat count. Qwen-style rows (including DeepSeek-R1-Qwen3) use the normalized repeat policy, while Llama rows stay at repeat 1.

- CoT and NoCoT are normalized public labels
- BFCL v3 uses YaRN context extension to 131072 tokens
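As an illustration of what a sanitized protocol summary might look like, here is a hypothetical record holding only the shared sampling knobs, plus a check that no path-like values leak through. The field names and values are assumptions for illustration, not OneEval's actual schema:

```python
# Hypothetical sanitized protocol record: shared sampling knobs only,
# no raw internal paths. Values shown are placeholders.
protocol = {
    "family": "Qwen",
    "mode": "CoT",
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "repeat": 8,  # benchmark-specific repeat count
    "context_extension": {"method": "YaRN", "max_len": 131072},  # BFCL v3 only
}

def validate(record: dict) -> None:
    """Reject records whose top-level string values look like filesystem paths."""
    for value in record.values():
        if isinstance(value, str) and ("/" in value or "\\" in value):
            raise ValueError(f"path-like value not allowed: {value!r}")

validate(protocol)
```

The validation step reflects the "sanitized configs only" policy: anything resembling an internal path is rejected before the record is published.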
Category Entry

Choose a reading path

Repository Files

Browse the published files in the GitHub project

These links point to files that live inside the repository. On GitHub Pages and on the GitHub project itself, readers can inspect the public-safe JSON exports and artifact folders directly.
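A reader inspecting a subset-level JSON export could aggregate per-subset scores with a few lines of Python. The record shape assumed here (`{"subset": ..., "score": ...}`) is a guess for illustration; the real export fields may differ:

```python
import json
from collections import defaultdict
from pathlib import Path

def subset_scores(export_path: str) -> dict[str, float]:
    """Average score per subset from a subset-level JSON export.
    Assumes a list of records like {"subset": ..., "score": ...}."""
    records = json.loads(Path(export_path).read_text())
    totals: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for record in records:
        totals[record["subset"]] += record["score"]
        counts[record["subset"]] += 1
    return {subset: totals[subset] / counts[subset] for subset in totals}
```

Because the exports are plain JSON, this kind of inspection needs no framework: the per-subset breakdown that a headline average hides is one `groupby` away.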