Same input. Same prompt. Different output. That's the reality of testing AI agents that write code, and most teams are shipping without solving it.

Nick Nisi from
WorkOS tackled this by building eval systems for two AI tools:

- npx workos@latest, a CLI agent that installs AuthKit into your project
- WorkOS agent skills that power LLM responses about SSO, directory sync, and RBAC

The post covers how to test against real project structures, score output that's different every time, and catch when your agent makes up methods that don't exist.
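To make the last idea concrete, here is a minimal sketch of one way to catch hallucinated methods, not WorkOS's actual eval harness: statically parse the agent's generated code and flag any attribute access on a known module that the real module doesn't expose. The helper name `hallucinated_calls` is hypothetical.

```python
import ast
import json


def hallucinated_calls(generated_code: str, module) -> list[str]:
    """Flag attribute accesses on `module` that don't actually exist.

    A crude static check: parse the generated code, collect every
    `modname.attr` access, and compare against the real module surface.
    """
    modname = module.__name__
    real = set(dir(module))
    missing = []
    for node in ast.walk(ast.parse(generated_code)):
        if (isinstance(node, ast.Attribute)
                and isinstance(node.value, ast.Name)
                and node.value.id == modname):
            if node.attr not in real:
                missing.append(node.attr)
    return missing


# json.loads is real; json.write_file is invented by the agent.
snippet = "data = json.loads(raw)\njson.write_file('out.json', data)"
print(hallucinated_calls(snippet, json))  # → ['write_file']
```

Because the check inspects structure rather than exact text, it tolerates the run-to-run variation in agent output that the post highlights: the code can differ every time, but a call to a nonexistent method still fails the eval.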
Learn more about evals →