LLM Code Run on Live Modules

LLM Benchmarks

The success rates of leading large language models when running code against live modules, both with and without SpacetimeDB.

[Chart: success rate (0–100%) for GPT 5.2, Claude Opus 4.5, Gemini 3 Pro, and Grok Code, with and without SpacetimeDB]
Compare Top LLMs on SpacetimeDB Tasks

Task Performance

These scores reflect LLM performance across two categories of common SpacetimeDB coding patterns. We prompt each model for code, run the generated code against live SpacetimeDB modules, and score each model with automated checks. Each category's percentage is the average pass rate across that category's tests. See all of the benchmarks on GitHub.

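To make the two categories concrete, here is a minimal sketch of the kind of module code the Basics tasks (reducers, tables, CRUD) exercise, written against the SpacetimeDB Rust module API. The table and reducer names (Player, add_player, rename_player) are illustrative, not taken from the benchmark suite itself.

```rust
use spacetimedb::{reducer, table, ReducerContext, Table};

// A public table: each row is one player record.
#[table(name = player, public)]
pub struct Player {
    #[primary_key]
    id: u64,
    name: String,
}

// Reducers are the module's transactional entry points (the "CRUD" in Basics).
#[reducer]
pub fn add_player(ctx: &ReducerContext, id: u64, name: String) {
    ctx.db.player().insert(Player { id, name });
}

#[reducer]
pub fn rename_player(ctx: &ReducerContext, id: u64, name: String) {
    // Update through the primary-key index: find the row, mutate it, write it back.
    if let Some(mut p) = ctx.db.player().id().find(id) {
        p.name = name;
        ctx.db.player().id().update(p);
    }
}
```
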
| Model | Average (Overall Task Pass %) | Basics (Reducers, Tables, CRUD, Index, Helpers) | Schema (Types, Columns, Constraints, Relations, ECS) |
| --- | --- | --- | --- |
| o4-mini | 63.64% | 66.67% | 60.00% |
| Claude 4 Sonnet | 63.64% | 75.00% | 50.00% |
| Claude 4.5 Sonnet | 59.09% | 58.33% | 60.00% |
| Claude 4.5 Haiku | 58.18% | 66.67% | 48.00% |
| GPT-4o | 54.55% | 66.67% | 40.00% |
| Gemini 2.5 Pro | 40.91% | 50.00% | 30.00% |
| DeepSeek V3 | 40.91% | 41.67% | 40.00% |
| GPT-4.1 | 38.18% | 58.33% | 14.00% |
| Gemini 2.5 Flash | 36.36% | 50.00% | 20.00% |
| DeepSeek R1 | 36.36% | 50.00% | 20.00% |
| Grok 4 | 27.27% | 50.00% | 0.00% |
| Grok 3 Mini (Beta) | 18.18% | 33.33% | 0.00% |
| GPT-5 | 16.74% | 11.11% | 23.50% |
| Meta Llama 3.1 405B | 9.09% | 16.67% | 0.00% |
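
The Schema tasks, by contrast, test whether a model can express types, column constraints, and cross-table relations. Below is a hedged sketch of those patterns in the same Rust module API; Team, Member, and add_member are hypothetical names, and the relation is modeled the usual SpacetimeDB way, by storing the related row's id alongside a B-tree index rather than a declared foreign key.

```rust
use spacetimedb::{reducer, table, ReducerContext, Table};

#[table(name = team, public)]
pub struct Team {
    #[primary_key]
    #[auto_inc]
    id: u64,
    #[unique] // constraint: no two teams may share a name
    name: String,
}

#[table(name = member, public)]
pub struct Member {
    #[primary_key]
    #[auto_inc]
    id: u64,
    #[index(btree)] // relation: members point at their team's id
    team_id: u64,
    name: String,
}

#[reducer]
pub fn add_member(ctx: &ReducerContext, team_id: u64, name: String) {
    // Enforce the relation by hand: only insert if the referenced team exists.
    if ctx.db.team().id().find(team_id).is_some() {
        // id: 0 lets the auto_inc column assign the real value on insert.
        ctx.db.member().insert(Member { id: 0, team_id, name });
    }
}
```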