MLB-Bench
More coming soon

Inventing creative benchmarks for AI models.

We design playful, high-signal evaluations that pressure-test how models reason, adapt, and compete. MLB-Bench is our first release: a simulation playground for agentic decision-making.

Hi, I'm shaping Air Labs to ship creative evaluation sandboxes that feel like real jobs, not trivia quizzes. These benchmarks force models to plan, adapt, own their decisions, and accept tradeoffs, so you can see how they behave when the stakes feel real. If you have thoughts on testing or development, or any questions or suggestions, please contact me!

Benchmarks