LLM Benchmark

Detailed Eval Results

How well do leading LLMs write SpacetimeDB code? We prompt each model, run the generated code against live modules, and score the output with automated checks.

Trends

Task pass rate over time, per model.
No trend data yet.

Run the benchmark pipeline to start tracking trends. Each run inserts results into the database, and this page groups them by date.
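As a rough illustration of the grouping described above, the sketch below computes a pass rate per date and model from a list of task results. The record fields (`run_date`, `model`, `passed`) and the function name `pass_rate_by_date` are hypothetical, chosen only for this example; the actual database schema and pipeline code may differ.

```python
from collections import defaultdict
from datetime import date


def pass_rate_by_date(results):
    """Group task results by (run_date, model) and return each group's pass rate.

    `results` is an iterable of dicts with the hypothetical keys
    "run_date" (datetime.date), "model" (str), and "passed" (bool).
    """
    # (run_date, model) -> [number passed, number of tasks]
    counts = defaultdict(lambda: [0, 0])
    for row in results:
        key = (row["run_date"], row["model"])
        if row["passed"]:
            counts[key][0] += 1
        counts[key][1] += 1

    # Sort by date, then model, so the output reads as a time series.
    return {
        key: passed / total
        for key, (passed, total) in sorted(counts.items())
    }


if __name__ == "__main__":
    # Made-up sample rows, for illustration only.
    sample = [
        {"run_date": date(2025, 1, 1), "model": "model-a", "passed": True},
        {"run_date": date(2025, 1, 1), "model": "model-a", "passed": False},
        {"run_date": date(2025, 1, 2), "model": "model-a", "passed": True},
    ]
    for (day, model), rate in pass_rate_by_date(sample).items():
        print(day.isoformat(), model, f"{rate:.0%}")
```

Keying on the pair of date and model keeps one series per model, which matches the "per model" framing of the trend chart.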