MLB-Bench
More coming soon

Inventing creative benchmarks for AI models.

We design playful, high-signal evaluations that pressure-test how models reason, adapt, and compete. MLB-Bench is our first release: a simulation playground for agentic decision-making.

Hi, I'm shaping Air Labs to ship creative evaluation sandboxes that feel like real jobs, not trivia quizzes. These benchmarks force models to plan, adapt, own their decisions, and accept tradeoffs, so you can see how they behave when the stakes feel real. If you have thoughts on testing or development, or any questions or suggestions, please contact me!

Benchmarks