LLM Benchmark

Detailed Eval Results

How well do leading LLMs write SpacetimeDB code? We prompt each model, run the generated code against live modules, and score the results with automated checks.
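The prompt, run, and score loop described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual harness: the `Task` record, the `generate` and `run` callables, and the fraction-of-checks-passed scoring rule are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    # Hypothetical task record; the field names are illustrative only.
    name: str
    category: str
    prompt: str
    checks: List[Callable[[str], bool]]  # each check inspects the run output


def score_task(task: Task,
               generate: Callable[[str], str],
               run: Callable[[str], str]) -> float:
    """Return the fraction of automated checks that pass for one task."""
    code = generate(task.prompt)   # 1. prompt the model (assumed callable)
    output = run(code)             # 2. execute the generated code (assumed callable)
    if not task.checks:
        return 0.0
    passed = sum(1 for check in task.checks if check(output))  # 3. score
    return passed / len(task.checks)


def score_by_category(tasks: List[Task],
                      generate: Callable[[str], str],
                      run: Callable[[str], str]) -> Dict[str, float]:
    """Average the per-task scores within each category."""
    per_category: Dict[str, List[float]] = {}
    for task in tasks:
        per_category.setdefault(task.category, []).append(
            score_task(task, generate, run)
        )
    return {cat: sum(s) / len(s) for cat, s in per_category.items()}
```

In a real harness, `generate` would wrap a model API call and `run` would publish the generated module and exercise it; here both are left as plain callables so they can be stubbed in isolation.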

Evals

47 tasks · 5 categories