The success rates of leading large language models when running generated code against live modules, both with and without SpacetimeDB.
These scores reflect LLM performance across two categories of common SpacetimeDB coding patterns. We prompt each model for code, run the generated code against live SpacetimeDB modules, and score each model with automated checks. Each category percentage is the average pass rate across the tests in that category. See all of the benchmarks on GitHub. A sketch of the kind of module these tests exercise follows the table below.
| Model | Average (Overall Task Pass %) | Basics (Reducers, Tables, CRUD, Index, Helpers) | Schema (Types, Columns, Constraints, Relations, ECS) |
|---|---|---|---|
| o4-mini | 63.64% | 66.67% | 60.00% |
| Claude 4 Sonnet | 63.64% | 75.00% | 50.00% |
| Claude 4.5 Sonnet | 59.09% | 58.33% | 60.00% |
| Claude 4.5 Haiku | 58.18% | 66.67% | 48.00% |
| GPT-4o | 54.55% | 66.67% | 40.00% |
| Gemini 2.5 Pro | 40.91% | 50.00% | 30.00% |
| DeepSeek V3 | 40.91% | 41.67% | 40.00% |
| GPT-4.1 | 38.18% | 58.33% | 14.00% |
| Gemini 2.5 Flash | 36.36% | 50.00% | 20.00% |
| DeepSeek R1 | 36.36% | 50.00% | 20.00% |
| Grok 4 | 27.27% | 50.00% | 0.00% |
| Grok 3 Mini (Beta) | 18.18% | 33.33% | 0.00% |
| GPT-5 | 16.74% | 11.11% | 23.50% |
| Meta Llama 3.1 405B | 9.09% | 16.67% | 0.00% |
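
To illustrate what the Basics category covers (reducers, tables, CRUD, and indexes), here is a minimal sketch of the kind of module the generated code is checked against. It uses the SpacetimeDB Rust module SDK; the names (`Player`, `add_player`, `add_score`) are hypothetical and not taken from the actual benchmark tasks, and exact attribute syntax may vary between SDK versions.

```rust
// Hypothetical "basics"-style SpacetimeDB module: one table, two reducers, simple CRUD.
use spacetimedb::{reducer, table, ReducerContext, Table};

// A table with an auto-incrementing primary key and a btree-indexed column.
#[table(name = player, public)]
pub struct Player {
    #[primary_key]
    #[auto_inc]
    id: u64,
    #[index(btree)]
    name: String,
    score: i64,
}

// Insert a new row; the runtime assigns `id` because of #[auto_inc].
#[reducer]
pub fn add_player(ctx: &ReducerContext, name: String) {
    ctx.db.player().insert(Player { id: 0, name, score: 0 });
}

// Look up a row through the primary-key index and update it.
#[reducer]
pub fn add_score(ctx: &ReducerContext, id: u64, points: i64) -> Result<(), String> {
    match ctx.db.player().id().find(id) {
        Some(mut player) => {
            player.score += points;
            ctx.db.player().id().update(player);
            Ok(())
        }
        None => Err(format!("no player with id {id}")),
    }
}
```

The Schema category tests the same surface area at the data-model level: column types, constraints, relations between tables, and ECS-style layouts rather than reducer logic.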