Microsoft Open-Sourced Durable Execution in Postgres

Microsoft open-sourced a durable execution engine inside Postgres. If your job state is already there, maybe your orchestration should be too.

At some point in the last year, I found myself with a Postgres table called background_jobs — status column, timestamps, retry counts — and a separate Redis-backed job queue also tracking whether those jobs were pending, running, failed, or done.

Two systems. Both storing state. Perpetually slightly out of sync.

I assumed that was just how you did it. Database is for data. Queue is for queues. Keep them separate, accept the coordination tax.

What Microsoft Dropped This Week

pg_durable is a PostgreSQL extension that asks the obvious follow-up question: what if the orchestration just lived inside Postgres too?

The short version: it brings durable execution directly into the database as a native extension. No Temporal cluster to manage. No Redis. No external workers to babysit. You define multi-step workflows in SQL using composable operators — ~> for sequential steps, |=> for parallel execution — call df.start(), and the runtime handles fault tolerance from there.
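
To make the sequential-versus-parallel idea concrete without guessing at pg_durable's actual SQL syntax, here is a rough analogy in plain Python. Everything in it is an assumption for illustration: the Step class is hypothetical, and Python's >> and | stand in for ~> and |=>. It shows only the composition semantics, not durability.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Step:
    """Hypothetical unit of work: takes the previous result, returns a new one."""

    run: Callable[[Any], Any]

    def __rshift__(self, other: "Step") -> "Step":
        # a >> b: run a, then feed its result into b (sequential composition).
        return Step(lambda x: other.run(self.run(x)))

    def __or__(self, other: "Step") -> "Step":
        # a | b: run both on the same input concurrently, return results in order.
        def both(x: Any) -> tuple:
            with ThreadPoolExecutor(max_workers=2) as pool:
                left = pool.submit(self.run, x)
                right = pool.submit(other.run, x)
                return (left.result(), right.result())

        return Step(both)


fetch = Step(lambda doc_id: f"doc-{doc_id}")
count_chars = Step(len)
upper = Step(str.upper)

# Fetch first, then fan out to two independent steps.
workflow = fetch >> (count_chars | upper)
result = workflow.run(7)
print(result)
```

The parentheses matter in this analogy because Python binds >> more tightly than |. How the real operators group is a question for the pg_durable README, not this sketch.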

If a step fails, the workflow resumes from the last checkpoint, not from the beginning. If the database restarts mid-workflow, it picks back up. The state that tracks progress lives in Postgres, because where else would it go?

The project is currently in preview and supports PostgreSQL 17 and 18.

What Durable Execution Actually Means

The term gets thrown around, so let's be precise. A normal background job is fire-and-forget: push a task to a queue, a worker picks it up, runs it, marks it done. If the worker crashes mid-task, you retry from scratch and handle partial state yourself.

Durable execution is for workflows with multiple distinct steps that need to survive interruptions. You want step 3 to resume from where step 3 left off — not restart from step 1. Temporal is the well-known implementation: powerful, battle-tested in production, and a cluster you have to provision and operate.
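
The core mechanic can be sketched in a few lines. This is a minimal illustration, not pg_durable's implementation: it uses Python's built-in sqlite3 as a stand-in for Postgres, and the checkpoints table, run_workflow function, and three toy steps are all hypothetical. Each completed step's output is committed before the next step starts, so a rerun skips whatever already finished. Note that in this simplified version a failed step restarts from its own beginning; only the completed steps are skipped.

```python
import sqlite3
from typing import Callable, List

# Hypothetical schema: one row per completed step of a workflow run.
SCHEMA = """
CREATE TABLE IF NOT EXISTS checkpoints (
    workflow_id TEXT NOT NULL,
    step_index  INTEGER NOT NULL,
    output      TEXT NOT NULL,
    PRIMARY KEY (workflow_id, step_index)
)
"""


def run_workflow(conn: sqlite3.Connection, workflow_id: str,
                 steps: List[Callable[[str], str]], initial: str) -> str:
    """Run steps in order, skipping any step that already has a checkpoint."""
    conn.execute(SCHEMA)
    value = initial
    for index, step in enumerate(steps):
        row = conn.execute(
            "SELECT output FROM checkpoints WHERE workflow_id = ? AND step_index = ?",
            (workflow_id, index),
        ).fetchone()
        if row is not None:
            value = row[0]  # finished in an earlier attempt: reuse the saved output
            continue
        value = step(value)  # may raise; a failed step records nothing
        with conn:  # commit the checkpoint before moving on
            conn.execute(
                "INSERT INTO checkpoints (workflow_id, step_index, output) VALUES (?, ?, ?)",
                (workflow_id, index, value),
            )
    return value


calls = []
fail_once = {"armed": True}


def extract(text: str) -> str:
    calls.append("extract")
    return text.strip()


def transform(text: str) -> str:
    calls.append("transform")
    if fail_once["armed"]:
        fail_once["armed"] = False
        raise RuntimeError("simulated crash in step 2")
    return text.upper()


def load(text: str) -> str:
    calls.append("load")
    return f"stored:{text}"


conn = sqlite3.connect(":memory:")
steps = [extract, transform, load]
try:
    run_workflow(conn, "wf-1", steps, "  hello ")  # first attempt dies in transform
except RuntimeError:
    pass
result = run_workflow(conn, "wf-1", steps, "  hello ")  # second attempt resumes
print(result, calls)
```

On the second attempt, extract is not called again: its output is read back from the checkpoint row, and execution picks up at transform.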

The use cases in the pg_durable README are instantly recognizable: vector embedding pipelines, data ingest with staging and deduplication, scheduled maintenance with approval steps, fan-out aggregation with parallel queries, external API enrichment. These are exactly the workflows that end up as a mixture of cron jobs, status tables, retry logic bolted onto a queue consumer, and hope that none of it drifts out of sync.

One practical point from the Hacker News (HN) discussion that I hadn't thought of: snapshot and point-in-time recovery (PITR) backups cover your workflow checkpoints automatically. Your Postgres backup includes the job state. With an external orchestration service, you're running two backup strategies and hoping they stay aligned. One coordination problem just disappears.

The Architectural Turn I'm Watching

The dominant backend philosophy for a while has been: keep the database thin. Postgres is for data. Queues are for queues. Orchestration — Temporal, Airflow, and their cousins — is for orchestration. Each concern gets its own service; services scale independently.

There are real reasons for this. Separate systems can scale separately. Failure in one layer doesn't take down another. It's also a direct reaction to the stored-procedure era, where business logic ended up buried in SQL functions with no version control, no test framework, and no one who remembered writing them. That reaction was justified.

But pg_durable surfaced a cost of strict separation I didn't have clean language for before: every handoff between Postgres and an external system is a consistency boundary. When the queue says "running" and the Postgres status table says "queued," that discrepancy has to be handled somewhere — and that somewhere is usually application code, and that's usually where the subtle bugs live.
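
Here is a small sketch of that boundary. It is an illustration under loud assumptions: a Python dict plays the external queue, sqlite3 plays Postgres, and the queue_state table and both helper functions are hypothetical. A simulated crash between two writes leaves the two-system version disagreeing with itself, while the single-transaction version rolls back cleanly.

```python
import sqlite3


class Crash(Exception):
    """Stands in for a process dying between two writes."""


def setup() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE background_jobs (id INTEGER PRIMARY KEY, status TEXT NOT NULL)")
    conn.execute("CREATE TABLE queue_state (job_id INTEGER PRIMARY KEY, status TEXT NOT NULL)")
    conn.execute("INSERT INTO background_jobs VALUES (1, 'queued')")
    conn.execute("INSERT INTO queue_state VALUES (1, 'queued')")
    conn.commit()
    return conn


def mark_running_two_systems(queue: dict, conn: sqlite3.Connection,
                             job_id: int, crash_between: bool) -> None:
    queue[job_id] = "running"  # write 1 goes to the external queue
    if crash_between:
        raise Crash()
    with conn:  # write 2 goes to the database, in its own transaction
        conn.execute("UPDATE background_jobs SET status = 'running' WHERE id = ?", (job_id,))


def mark_running_one_transaction(conn: sqlite3.Connection,
                                 job_id: int, crash_between: bool) -> None:
    with conn:  # both writes commit together or roll back together
        conn.execute("UPDATE queue_state SET status = 'running' WHERE job_id = ?", (job_id,))
        if crash_between:
            raise Crash()
        conn.execute("UPDATE background_jobs SET status = 'running' WHERE id = ?", (job_id,))


def job_status(conn: sqlite3.Connection) -> str:
    return conn.execute("SELECT status FROM background_jobs WHERE id = 1").fetchone()[0]


def queue_status(conn: sqlite3.Connection) -> str:
    return conn.execute("SELECT status FROM queue_state WHERE job_id = 1").fetchone()[0]


# Two systems: the crash lands between the writes, so they disagree.
queue = {1: "queued"}
conn_a = setup()
try:
    mark_running_two_systems(queue, conn_a, 1, crash_between=True)
except Crash:
    pass
split_view = (queue[1], job_status(conn_a))

# One database: the crash rolls back the first write, so both still agree.
conn_b = setup()
try:
    mark_running_one_transaction(conn_b, 1, crash_between=True)
except Crash:
    pass
single_view = (queue_status(conn_b), job_status(conn_b))

print(split_view, single_view)
```

The first pair comes out mismatched and the second does not. The reconciliation code that would have to detect and repair the first case is exactly the application code where the subtle bugs tend to live.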

The HN thread surfaces the expected pushback: this is how we end up with unmaintainable database logic all over again. I don't think that concern is wrong. I've seen load-bearing Postgres functions with no tests and no documentation. The risk is real.

But a detail from the comments stuck with me: contributors mentioned they're using pg_durable internally at Microsoft for AI workflows, and that it substantially reduced code complexity. For a tool that's been public for about a week, having practitioners say the net effect was positive is interesting. It's not a guarantee, since this is still a preview, but it's data from people who actually shipped it.

Where I Actually Stand

I've been building real things for about a year now — TypeScript, Postgres, a few projects that needed background jobs. I've built the coordination mess: the status table in Postgres, the separate job queue, the application code that tries to keep them consistent, the bugs in that code.

What I notice from here is that the "thin database" orthodoxy made sense when the alternative was stored procedures as the primary development surface — writing business logic in SQL, committed nowhere, tested nowhere. That was a real problem and people were right to react against it.

But the choice today, for a lot of teams, isn't "put logic in SQL or don't." It's "coordinate between Postgres and three external services, or keep more of it in one place." That's a different tradeoff than the one the orthodoxy was originally responding to. The context has shifted; maybe the rule should too, at least in some cases.

pg_durable is also deliberately bounded about what it's not for: arbitrary application logic, sub-millisecond latency, workflows spanning many heterogeneous systems. It's not trying to replace Temporal for complex distributed orchestration. The scope in the README is unusually honest.

For the middle tier — the pipelines, the scheduled jobs, the workflows that already have a status column in Postgres tracking whether they worked — it's worth asking whether the external complexity is actually load-bearing, or just habit from a problem the architecture doesn't quite have anymore.

Maybe I'm reading too much into a one-week-old preview extension that might quietly stall there.

Probably.

I'm going to be watching whether this pattern spreads or stays a Microsoft-internal curiosity.

The stored-procedure debate is decades old. pg_durable is a new entry in it, with cleaner ergonomics and a clearer scope.

My database already knows more about my data than my job queue does.

That used to feel like the natural order of things.

Now it mostly just feels like a design decision someone made.
