LLM Code Run on Live Modules

LLM Benchmarks

The success rates of leading large language models when running code against live modules, both with and without SpacetimeDB.

[Chart: success rate (0–100%) for GPT 5.2, Claude Opus 4.5, Gemini 3 Pro, and Grok Code, with and without SpacetimeDB]
Compare Top LLMs on SpacetimeDB Tasks

Task Performance

These scores reflect LLM performance across two categories of common SpacetimeDB coding patterns. We prompt each model for code, run the generated code against live SpacetimeDB modules, and score each model with automated checks. Each category's percentage is the average pass rate across that category's tests. See all of the benchmarks on GitHub.

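To make the two categories concrete, here is a minimal sketch of the kind of module code the Basics tasks (reducers, tables, CRUD) exercise, written against the SpacetimeDB Rust module API. The table and reducer names (Player, add_player, rename_player) are illustrative, not taken from the benchmark suite itself.

```rust
use spacetimedb::{reducer, table, ReducerContext, Table};

// A public table: each row is one player record.
#[table(name = player, public)]
pub struct Player {
    #[primary_key]
    id: u64,
    name: String,
}

// Reducers are the module's transactional entry points (the "CRUD" in Basics).
#[reducer]
pub fn add_player(ctx: &ReducerContext, id: u64, name: String) {
    ctx.db.player().insert(Player { id, name });
}

#[reducer]
pub fn rename_player(ctx: &ReducerContext, id: u64, name: String) {
    // Update through the primary-key index: find the row, mutate it, write it back.
    if let Some(mut p) = ctx.db.player().id().find(id) {
        p.name = name;
        ctx.db.player().id().update(p);
    }
}
```
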
| Model | Average (Overall Task Pass %) | Basics (Reducers, Tables, CRUD, Index, Helpers) | Schema (Types, Columns, Constraints, Relations, ECS) |
| --- | --- | --- | --- |
| o4-mini | 63.64% | 66.67% | 60.00% |
| Claude 4 Sonnet | 63.64% | 75.00% | 50.00% |
| Claude 4.5 Sonnet | 59.09% | 58.33% | 60.00% |
| Claude 4.5 Haiku | 58.18% | 66.67% | 48.00% |
| GPT-4o | 54.55% | 66.67% | 40.00% |
| Gemini 2.5 Pro | 40.91% | 50.00% | 30.00% |
| DeepSeek V3 | 40.91% | 41.67% | 40.00% |
| GPT-4.1 | 38.18% | 58.33% | 14.00% |
| Gemini 2.5 Flash | 36.36% | 50.00% | 20.00% |
| DeepSeek R1 | 36.36% | 50.00% | 20.00% |
| Grok 4 | 27.27% | 50.00% | 0.00% |
| Grok 3 Mini (Beta) | 18.18% | 33.33% | 0.00% |
| GPT-5 | 16.74% | 11.11% | 23.50% |
| Meta Llama 3.1 405B | 9.09% | 16.67% | 0.00% |
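
The Schema tasks, by contrast, test whether a model can express types, column constraints, and cross-table relations. Below is a hedged sketch of those patterns in the same Rust module API; Team, Member, and add_member are hypothetical names, and the relation is modeled the usual SpacetimeDB way, by storing the related row's id alongside a B-tree index rather than a declared foreign key.

```rust
use spacetimedb::{reducer, table, ReducerContext, Table};

#[table(name = team, public)]
pub struct Team {
    #[primary_key]
    #[auto_inc]
    id: u64,
    #[unique] // constraint: no two teams may share a name
    name: String,
}

#[table(name = member, public)]
pub struct Member {
    #[primary_key]
    #[auto_inc]
    id: u64,
    #[index(btree)] // relation: members point at their team's id
    team_id: u64,
    name: String,
}

#[reducer]
pub fn add_member(ctx: &ReducerContext, team_id: u64, name: String) {
    // Enforce the relation by hand: only insert if the referenced team exists.
    if ctx.db.team().id().find(team_id).is_some() {
        // id: 0 lets the auto_inc column assign the real value on insert.
        ctx.db.member().insert(Member { id: 0, team_id, name });
    }
}
```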