LLM Benchmark

We gave Claude two backends.
One shipped, one debugged.

Same AI. Same prompts. Same real-time chat app. One built on SpacetimeDB, one on a Postgres stack.

SpacetimeDB vs. Postgres

3.9× fewer bugs (3.5 vs. 13.5)

31% lower cost ($12.98 vs. $18.74)

46% less backend code (777 vs. 1,451 lines)

34% faster build (55 min vs. 83 min)

The Setup

What we built and how

A real-time chat app, built one feature at a time, by the same AI, against two different backends. Here's what that actually means.

The App

A real-time chat web app

Each feature layers on top of the last. By the final level, the app has to keep presence, threads, private rooms, and drafts all in sync across clients in real time.

The Process

Build, test, fix, repeat

Claude works on one feature at a time and can't move on until it passes. Bugs are tracked, fix iterations are counted, and cost is tallied as it goes.

Claude builds the next feature

We test it against the spec

Claude fixes any bugs found

Move to the next feature
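
The loop above can be sketched in TypeScript. This is a minimal sketch, not the benchmark's actual harness: `runLevels`, `LevelResult`, the three callback types, the `maxFixes` cap, and the sample bug strings are all hypothetical names and values chosen for illustration.

```typescript
// Hypothetical sketch of the build/test/fix loop; not the benchmark's real harness.
interface LevelResult {
  level: number;
  bugsFound: number;     // bugs reported by the first test pass
  fixIterations: number; // repair loops needed before the level passed
}

type BuildStep = (level: number) => void;
type TestStep = (level: number) => string[]; // returns the bugs still open
type FixStep = (level: number, bugs: string[]) => void;

function runLevels(
  levels: number,
  build: BuildStep,
  test: TestStep,
  fix: FixStep,
  maxFixes: number = 10, // assumed safety cap so a stuck level cannot loop forever
): LevelResult[] {
  const results: LevelResult[] = [];
  for (let level = 1; level <= levels; level++) {
    build(level);
    let bugs = test(level);
    const bugsFound = bugs.length;
    let fixIterations = 0;
    // The next level is not started until this one passes.
    while (bugs.length > 0) {
      if (fixIterations >= maxFixes) {
        throw new Error(`level ${level} still failing after ${maxFixes} fixes`);
      }
      fix(level, bugs);
      fixIterations++;
      bugs = test(level);
    }
    results.push({ level, bugsFound, fixIterations });
  }
  return results;
}

// Made-up example: level 2 ships with two bugs that a single fix pass clears.
const openBugs = new Map<number, string[]>([
  [2, ["typing indicator never clears", "unread count is stale"]],
]);
const levelResults = runLevels(
  3,
  () => {},
  (level) => openBugs.get(level) ?? [],
  (level) => { openBugs.set(level, []); },
);
console.log(levelResults);
```

In this sketch one fix iteration can clear more than one bug, which is one way a bug count and a fix-iteration count can differ.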

The Backends

A database vs. a stack

Both backends have to deliver the same real-time features to the same React client. They just get there differently.

SpacetimeDB

database + reducers

Postgres stack

Postgres + Express + Socket.io + Drizzle

Cost Over Time

Postgres costs compound.
SpacetimeDB stays linear.

SpacetimeDB costs scale linearly with features added. Postgres doesn't. The gap widens as features interact.

[Chart: cost over time, SpacetimeDB vs. Postgres. Run 1 dashed, Run 2 solid.]

The Repair Tax

Postgres burns 6× more on fixes

The Postgres stack has more wiring to get right. Every missed emit is a bug. Every bug is another fix loop. Every fix loop eats the budget meant for new features.

Postgres

38% of spend on fixes ($7.09 of $18.74 total)

62% building features

SpacetimeDB

9% of spend on fixes ($1.14 of $12.98 total)

91% building features

For every $1 SpacetimeDB spent on repairs, Postgres spent $6.22.
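
The percentages and the $6.22 figure follow directly from the dollar amounts quoted in this section. A short sketch of the arithmetic (the type and variable names are ours, not part of the benchmark):

```typescript
// Dollar figures are taken from this section; names are illustrative.
interface Spend {
  fixes: number; // USD spent on fix iterations
  total: number; // USD spent overall
}

const pgSpend: Spend = { fixes: 7.09, total: 18.74 };
const stdbSpend: Spend = { fixes: 1.14, total: 12.98 };

// Share of total spend that went to fixes, as a whole percentage.
const fixSharePct = (s: Spend): number => Math.round((s.fixes / s.total) * 100);

// Postgres fix dollars per SpacetimeDB fix dollar, to two decimals.
const repairRatio = Number((pgSpend.fixes / stdbSpend.fixes).toFixed(2));

console.log(fixSharePct(pgSpend), fixSharePct(stdbSpend), repairRatio); // 38 9 6.22
```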

Bug Distribution

New features kept breaking Postgres

SpacetimeDB shipped 75% of features working on the first try, against 46% for Postgres. Postgres bugs clustered where features had to interact with each other.

[Chart: bug distribution, SpacetimeDB vs. Postgres.]

Quality Analysis

Real-time failures were the largest class of Postgres bugs

One in three Postgres bugs (9 of 27) was state failing to sync across clients. SpacetimeDB's subscription model removes the manual emit step behind that category; only 2 such bugs appeared across its two runs.

SpacetimeDB

7 total bugs across 2 runs (3.5 bugs / run)

Real-time state not updating: 2 bugs

SDK API misuse: 1 bug

Logic / other: 4 bugs

Postgres

27 total bugs across 2 runs (13.5 bugs / run)

Real-time state not updating: 9 bugs

Missing UI element: 5 bugs

Data not persisted: 5 bugs

Logic / other: 8 bugs

Backend Code

46% less backend. Zero wiring.

Declarative tables and reducers replace the Express, Socket.io, and Drizzle scaffolding Postgres requires. Fewer moving parts means fewer places for the AI to miss a connection.

Postgres: 1,451 lines

SpacetimeDB: 777 lines

The SpacetimeDB backend is 46% smaller.

Postgres Backend

~1,451 lines

− SQL schema migrations (Drizzle)
− Express REST endpoints per feature
− Manual Socket.io room management
− Per-event emit calls; miss one, break real-time
− Auth middleware wired per endpoint

SpacetimeDB Backend

~777 lines

+ Declare tables as structs
+ Write reducers (functions that mutate state)
+ Clients subscribe to queries; updates are automatic
+ No WebSocket emit boilerplate
+ No SQL query strings

Why this matters for AI: One in three Postgres bugs was real-time state failing to sync across clients. SpacetimeDB's subscription model removes the manual emit step where this class of error originates.
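
To make the "miss one emit" failure mode concrete, here is a small in-memory model of the two styles. It is a sketch under stated assumptions: it uses neither the Socket.io API nor the SpacetimeDB SDK, and `ManualEmitStore`, `SubscribedStore`, and their methods are hypothetical names invented for this illustration.

```typescript
// In-memory illustration only; not Socket.io or SpacetimeDB code.
type Listener = (rows: readonly string[]) => void;

// Style A: every write path must remember to broadcast.
class ManualEmitStore {
  private rows: string[] = [];
  private listeners: Listener[] = [];

  onUpdate(listener: Listener): void {
    this.listeners.push(listener);
  }
  private emit(): void {
    for (const l of this.listeners) l([...this.rows]);
  }
  sendMessage(text: string): void {
    this.rows.push(text);
    this.emit();
  }
  editMessage(index: number, text: string): void {
    this.rows[index] = text;
    // Deliberate bug: the emit() call was forgotten, so clients keep stale text.
  }
}

// Style B: all writes go through one commit path that notifies subscribers.
class SubscribedStore {
  private rows: string[] = [];
  private listeners: Listener[] = [];

  subscribe(listener: Listener): void {
    this.listeners.push(listener);
    listener([...this.rows]); // initial snapshot
  }
  private commit(mutate: (rows: string[]) => void): void {
    mutate(this.rows);
    for (const l of this.listeners) l([...this.rows]);
  }
  sendMessage(text: string): void {
    this.commit((rows) => { rows.push(text); });
  }
  editMessage(index: number, text: string): void {
    this.commit((rows) => { rows[index] = text; });
  }
}

let viewA: readonly string[] = [];
const storeA = new ManualEmitStore();
storeA.onUpdate((rows) => { viewA = rows; });
storeA.sendMessage("helo");
storeA.editMessage(0, "hello");

let viewB: readonly string[] = [];
const storeB = new SubscribedStore();
storeB.subscribe((rows) => { viewB = rows; });
storeB.sendMessage("helo");
storeB.editMessage(0, "hello");

console.log(viewA, viewB); // the client of store A never sees the edit
```

In the second style a new write method has no broadcast step to forget, because notification lives in the single `commit` path rather than in each feature.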

Runtime Performance

The AI-generated code runs faster too

Same chat app, same hardware, both pushed to peak throughput. Raw and optimized. SpacetimeDB leads in both.

As-shipped (msgs/sec)

SpacetimeDB: 5.3k

Postgres: 694

7.6× higher throughput for SpacetimeDB

AI-optimized (msgs/sec)

SpacetimeDB: 25.3k

Postgres: 1.1k

22× higher throughput for SpacetimeDB

Why the difference: SpacetimeDB processes each message in a single in-process transaction. The Postgres app makes several sequential network round-trips per message. Optimization helps both, but it can't eliminate network latency.
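
A rough latency model shows why sequential round-trips cap throughput. Everything here is an assumption made for illustration: the `MessagePath` shape, the closed-loop model (each writer has one message in flight at a time), and every number are placeholders, not measurements from the benchmark.

```typescript
// Illustrative model only; all numbers are placeholders, not benchmark data.
interface MessagePath {
  roundTrips: number;  // sequential network hops per message
  roundTripMs: number; // assumed cost of one hop
  computeMs: number;   // assumed in-process work per message
}

// Sequential hops add up; they cannot overlap within one message.
const latencyMs = (p: MessagePath): number =>
  p.roundTrips * p.roundTripMs + p.computeMs;

// Closed-loop upper bound: each writer waits for its previous message to finish.
const maxMsgsPerSec = (p: MessagePath, writers: number): number =>
  (writers * 1000) / latencyMs(p);

const inProcess: MessagePath = { roundTrips: 0, roundTripMs: 0.5, computeMs: 0.2 };
const multiHop: MessagePath = { roundTrips: 4, roundTripMs: 0.5, computeMs: 0.2 };

console.log(maxMsgsPerSec(inProcess, 20), maxMsgsPerSec(multiHop, 20));
```

Under this model, shrinking `computeMs` helps both paths, but the `roundTrips * roundTripMs` term stays as a floor on the multi-hop path.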

Head-to-Head Results

The numbers, side by side

12 feature levels. Same AI model. Same prompts. Same app requirements. Figures averaged across 2 runs. The only variable was the backend.

Metric	SpacetimeDB	Postgres
Total AI cost to build (averaged across 2 runs)	$12.98	$18.74
Features working first try (no fix iterations needed)	75%	46%
Bugs found per run (averaged across 2 runs)	3.5	13.5
Fix iterations per run (repair loops required)	2.5	13.5
Cost spent on fixes (share of total budget on repairs)	$1.14 (9%)	$7.09 (38%)
Total lines of code (AI-generated, client + server, excl. CSS)	2,304	3,288
Backend lines of code (AI-generated, server-side only)	777	1,451
LLM API calls per run (total prompts sent)	~395	~666
Total build time (wall-clock, averaged)	~55 min	~83 min
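
The headline figures at the top of the page can be re-derived from this table. A short sketch of that arithmetic (the interface and variable names are ours):

```typescript
// Values copied from the results table; names are illustrative.
interface RunAverages {
  bugs: number;
  costUsd: number;
  backendLoc: number;
  buildMin: number;
}

const stdbAvg: RunAverages = { bugs: 3.5, costUsd: 12.98, backendLoc: 777, buildMin: 55 };
const pgAvg: RunAverages = { bugs: 13.5, costUsd: 18.74, backendLoc: 1451, buildMin: 83 };

// Percentage by which `a` is lower than `b`, rounded to a whole number.
const pctLower = (a: number, b: number): number => Math.round((1 - a / b) * 100);

const headline = {
  bugRatio: (pgAvg.bugs / stdbAvg.bugs).toFixed(1),                  // "3.9"
  costPctLower: pctLower(stdbAvg.costUsd, pgAvg.costUsd),            // 31
  backendPctSmaller: pctLower(stdbAvg.backendLoc, pgAvg.backendLoc), // 46
  buildPctFaster: pctLower(stdbAvg.buildMin, pgAvg.buildMin),        // 34
};

console.log(headline);
```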

Methodology

How we ran this benchmark

Model

Claude Sonnet 4.6 for both backends, both runs

Task

Build a real-time chat app via sequential upgrade: 12 feature levels added one at a time

Controls

Identical prompts and bug-fix steps. Each backend received tailored setup guidelines.

Measurement

Cost tracked via OpenTelemetry instrumentation of the Claude API. All figures averaged across 2 runs. LOC counts exclude CSS and generated bindings.

Grading

Each level was manually tested and graded after generation. Bugs were identified through functional testing of the running app, then fixed via a structured fix prompt before proceeding to the next level.

Runtime

Peak saturated throughput measured over 30 seconds with 20 concurrent writers on the same dev machine. Writers fire as fast as each backend can accept. Two tiers: as-shipped AI output and one-pass AI optimization applied to both. All features preserved across tiers.
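
A saturation test of this shape can be sketched as follows. This is not the benchmark's harness: `runLoad`, `msgsPerSec`, and the `Send` callback are hypothetical names, and the caller is assumed to supply a `send` function that resolves once the backend has accepted the message.

```typescript
// Hypothetical load-harness sketch; the benchmark's own tooling is not shown in this text.
type Send = (writerId: number, seq: number) => Promise<void>;

// Convert an accepted-message count and elapsed time into messages per second.
function msgsPerSec(accepted: number, elapsedMs: number): number {
  return accepted / (elapsedMs / 1000);
}

// Each writer sends its next message as soon as the previous one is accepted,
// so the offered load tracks whatever the backend can absorb.
async function runLoad(send: Send, writers: number, durationMs: number): Promise<number> {
  const start = Date.now();
  const deadline = start + durationMs;
  let accepted = 0;

  const writer = async (id: number): Promise<void> => {
    for (let seq = 0; Date.now() < deadline; seq++) {
      await send(id, seq);
      accepted++;
    }
  };

  await Promise.all(Array.from({ length: writers }, (_, id) => writer(id)));
  return msgsPerSec(accepted, Date.now() - start);
}
```

Calling `runLoad(send, 20, 30000)` would mirror the 20 writers and 30-second window described above.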

The backend you choose determines whether your AI ships features or debugs them.

SpacetimeDB is a database with real-time subscriptions and server-side logic built in. No WebSocket glue, no ORM, no event routing layer.
