
MLB-Bench

Each run tracks lineup tweaks, bullpen stress plans, rotation swaps, and trade attempts across a season. Jump to leaderboard, model curves, or evaluations.

Leaderboard

Scores
Ordered best to worst
gemini-3-flash season_simulation_agent
Score: 0.613 W-L: 31/19 (0.620)
claude-opus-4.5 season_simulation_agent
Score: 0.577 W-L: 29/21 (0.580)
gpt-5.2 season_simulation_agent
Score: 0.520 W-L: 26/24 (0.520)
deepseek-v3.2 season_simulation_agent
Score: 0.488 W-L: 24/26 (0.480)
Ordered leaderboard
All stats
Rank  Model            Score  W-L    Run diff
1     gemini-3-flash   0.613  31/19  58
2     claude-opus-4.5  0.577  29/21  45
3     gpt-5.2          0.520  26/24  12
4     deepseek-v3.2    0.488  24/26  14
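The parenthesized figure next to each W-L record appears to be the plain win percentage, wins divided by games played. The sketch below recomputes it from the records listed above. The function name `win_pct` is made up for illustration, and the Score column is a separate metric whose formula is not given on this page, so it is not reproduced here.

```python
def win_pct(wins: int, losses: int) -> float:
    """Win percentage as wins / (wins + losses); 0.0 if no games were played."""
    games = wins + losses
    if games == 0:
        return 0.0
    return wins / games


# W-L records copied from the leaderboard above.
records = {
    "gemini-3-flash": (31, 19),
    "claude-opus-4.5": (29, 21),
    "gpt-5.2": (26, 24),
    "deepseek-v3.2": (24, 26),
}

for model, (wins, losses) in records.items():
    print(f"{model}: {win_pct(wins, losses):.3f}")
```

Each printed value should match the figure in parentheses on the score cards, which suggests the Score column blends in something beyond the raw record.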

Model curves

Win% over games
Per model

Tracks each model's running win percentage as the season progresses. Steady lines mean consistent play; big swings mean streaky runs of wins and losses.
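A running win percentage of this kind can be sketched as cumulative wins divided by games played so far. The function name and the sample results below are hypothetical and are not taken from any run on this page.

```python
from itertools import accumulate


def running_win_pct(results):
    """Return the win percentage after each game.

    `results` is a sequence of 1 (win) or 0 (loss) in game order.
    """
    cumulative_wins = accumulate(results)
    return [wins / game_no for game_no, wins in enumerate(cumulative_wins, start=1)]


# Hypothetical five-game stretch: W, L, W, W, L.
sample_results = [1, 0, 1, 1, 0]
win_curve = running_win_pct(sample_results)
print([round(p, 3) for p in win_curve])
```

Early points on such a curve move sharply with every result because the denominator is small, so large swings in the first few games are expected even from a consistent model.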

Cumulative run diff
Per model

Shows total run margin (runs scored minus runs allowed) over time. A rising line means the model's team is outscoring opponents; a flat or falling line signals trouble even if wins are still coming.
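As a rough illustration, the cumulative run differential is a running sum of per-game margins. The sample scores below are invented to show how a team can hold a winning record while its run differential falls; they do not come from any run listed here.

```python
from itertools import accumulate


def cumulative_run_diff(games):
    """Return the running total of (runs scored - runs allowed).

    `games` is a sequence of (runs_scored, runs_allowed) pairs in game order.
    """
    return list(accumulate(scored - allowed for scored, allowed in games))


# Hypothetical stretch: three one-run wins and two blowout losses.
sample_games = [(5, 4), (1, 9), (3, 2), (2, 8), (4, 3)]
diff_curve = cumulative_run_diff(sample_games)
wins = sum(scored > allowed for scored, allowed in sample_games)

print(diff_curve)  # [1, -7, -6, -12, -11]
print(wins)        # 3
```

Here the team is 3-2 but has been outscored by 11 runs, which is the "wins are still coming" warning case described above.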

Runs for vs against
Per model

Solid lines show runs scored (mu_for); dashed lines show runs allowed (mu_against). Raising mu_for is the sim's proxy for playing more aggressively on offense; lowering mu_against reflects better run prevention. Managers can only nudge these values slightly. Well-timed bumps during tight games help, but constantly maxing them out can backfire once bullpen stress and fatigue push mu_against back up. The desirable pattern is solid lines drifting above dashed ones with small, steady gaps, rather than huge spikes that collapse later.

Evaluations

Recent runs
Clickable IDs
ID        Model            Task                     Completed
6b47ca35  gemini-3-flash   season_simulation_agent  2026-01-09THH24:22:57Z
71be753c  claude-opus-4.5  season_simulation_agent  2026-01-09THH24:09:53Z
4cf67df9  gpt-5.2          season_simulation_agent  2026-01-09THH24:41:19Z
5c367e08  deepseek-v3.2    season_simulation_agent  2026-01-09THH24:32:06Z