Swytch Documentation

Designing for Partitions

Swytch is a distributed database, which means you have to think about what happens when nodes can’t reach each other. This page walks through how Swytch handles partitions today, the fundamental race at the heart of partition detection, a specific failure mode worth knowing about, and the design patterns that keep partitions from becoming incidents.


What partitions look like in Swytch today

By default, Swytch runs in safe mode: writes to data whose subscribers can’t all be reached return an error to the application, and the partition shows up as unavailability on the unreachable data. Reads still serve from local state (the node has what it’s subscribed to, and that subscription was already synchronous), so read paths keep working while writes on affected keys stall.
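From the application’s side, safe mode looks like a write that errors while reads keep serving. A minimal sketch of that contract, where `FakeClient`, `WriteUnavailable`, and the method names are illustrative stand-ins, not the actual Swytch client API:

```python
class WriteUnavailable(Exception):
    """Write rejected because a subscriber for the key is unreachable."""

class FakeClient:
    """Illustrative stand-in for a Swytch client, not the real API."""
    def __init__(self):
        self.store = {}           # local subscribed state (reads always work)
        self.partitioned = set()  # keys whose subscribers are unreachable

    def read(self, key):
        return self.store.get(key)  # reads serve from local state

    def write(self, key, value):
        if key in self.partitioned:
            raise WriteUnavailable(key)  # safe mode: error instead of diverging
        self.store[key] = value

def write_or_defer(client, key, value, retry_queue):
    """Attempt a write; on unavailability, defer it for a later retry."""
    try:
        client.write(key, value)
        return True
    except WriteUnavailable:
        retry_queue.append((key, value))
        return False
```

The point of the sketch is the shape of the failure: an application in safe mode needs a policy (retry, queue, surface the error) for exactly this exception path.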

This is the standard trade-off. Treating a partition as a write-availability failure is the conservative choice, and it’s what a reader familiar with Raft-based systems, primary-replica Postgres, or any other leader-style distributed database would expect.

The less-conservative choice (keeping writes available through the partition, letting the two sides diverge, reconciling on heal) is holographic divergence. The implementation exists; the user-facing configuration does not yet. More on that below.


The partition detection race

Every distributed system has the same fundamental problem: detecting a partition means detecting the absence of expected messages, and the absence of messages is indistinguishable from a node that’s slow, a link that’s congested, or a peer that’s genuinely unreachable. You cannot tell these apart from the information available to the detector. You can only set a threshold at which “not yet” becomes “probably won’t.”

Different systems pick different thresholds:

  • Raft-based systems (Cockroach, etcd, Consul) use an election timeout, typically 150–300ms base, randomized to avoid split-vote storms. If a follower hasn’t heard a heartbeat by that threshold, it starts an election.
  • Primary-replica PostgreSQL and MySQL don’t detect partitions themselves; external tools (Patroni, pg_auto_failover, orchestrator) watch replication lag and decide when to fail over, typically at the seconds-to-tens-of-seconds scale.
  • Spanner uses TrueTime’s bounded clock uncertainty, collapsing a lot of the race into clock-interval math; typical uncertainty is under 10ms on Google’s hardware.
  • Dynamo-style systems (Cassandra, DynamoDB, Riak) use gossip-based failure detection with configurable suspicion thresholds, typically 5–10 seconds.
  • Redis Cluster defaults to a 15-second node-timeout before declaring a node failing.

Swytch’s threshold is roughly 2× the round-trip time to the furthest subscriber. That could mean 1 second or 5 seconds depending on your cluster geography. The principle: the threshold needs to be high enough that “slow” doesn’t get mistaken for “unreachable,” and low enough that partitions are detected before they cause too much damage. 2× RTT to the furthest subscriber is a reasonable choice in that tradeoff space, landing on the tighter end compared to most of the systems above.

None of these thresholds are magic. They’re all bets about how long “just slow” is allowed to look like “unreachable” before the system decides. A partition can always land inside the window and produce a race with causality. Swytch’s handling of that race is what the rest of this page is about.


Inside the race window: a specific failure mode

The commit path in Swytch is straightforward:

  1. Preflight. The committing node checks that all subscribers are reachable and that no competing local commit is in flight. If both pass, proceed.
  2. Announce. The node broadcasts the commit envelope. At this point, the commit is in the DAG; there’s no back-and-forth confirmation, no two-phase handshake.
  3. Listen for 1 RTT. During this window, a competing envelope from a peer could still arrive. If one does, the node handles the conflict. If none arrives by the end of the window, the commit stands.

Simple enough. The failure mode appears when a partition lands in step 3.

Here’s the sequence:

  • Node A completes preflight successfully. All subscribers were reachable a moment ago.
  • Node A announces the commit. The envelope propagates to every peer Node A can currently reach. Those peers record the commit in their DAG.
  • A partition occurs, isolating Node A from some of its peers.
  • Node A waits through its 1 RTT listen window. No competing envelope arrives (the partition is silencing the peers who would otherwise have sent one).
  • After the suspicion threshold passes, Node A detects the partition and returns ABORT to the application.
  • Meanwhile, on the other side of the partition, a different node may commit something that conflicts with A’s envelope — because that side doesn’t know A’s envelope committed either.

When the partition heals, both sides walk the merged DAG and find two committed envelopes touching the same data. Both are valid by their own side’s history. The application on Node A was told the commit aborted, but every peer that received the announcement before the partition has the envelope recorded as committed. The application’s belief and the database’s state no longer agree.
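The post-heal check described above amounts to walking the merged history and flagging keys that committed independently on both sides. A sketch, with an invented envelope shape (`side`, `keys`) that stands in for whatever the real DAG records:

```python
def find_divergent_keys(envelopes):
    """Return keys committed by more than one side of a healed partition.

    `envelopes` is a list of dicts with a `side` label and the `keys`
    each commit touched -- an illustrative stand-in for DAG entries."""
    sides_by_key = {}
    for env in envelopes:
        for key in env["keys"]:
            sides_by_key.setdefault(key, set()).add(env["side"])
    # A key touched by committed envelopes from two sides has two valid
    # histories; which one is canonical is not decidable from the DAG alone.
    return {k for k, sides in sides_by_key.items() if len(sides) > 1}
```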

This race is not unique to Swytch. It’s the same race every distributed database has, just with different specific shapes. A Raft-based system can return success to a client and then lose leadership before the write replicates to a majority; a primary-replica system can acknowledge a write that never reaches the replica before failover. The shape differs; the race doesn’t go away.


Detection and recovery, today

When holographic divergence occurs in the current implementation (whether from the race above or any other cause), the only signal is log output. Both sides of the now-healed partition log that divergence has been detected, with enough detail to identify which keys are affected. There is no programmatic detection API. There is no recovery tooling.
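Since the only signal today is log output, monitoring for divergence means scanning logs. The log-line format below is invented for illustration only; match the pattern against whatever your Swytch build actually emits:

```python
import re

# HYPOTHETICAL log format -- adjust the pattern to your real log lines.
DIVERGENCE_RE = re.compile(
    r"holographic divergence detected.*keys=\[([^\]]*)\]")

def affected_keys(log_lines):
    """Collect keys mentioned in (hypothetical) divergence log lines."""
    keys = set()
    for line in log_lines:
        m = DIVERGENCE_RE.search(line)
        if m:
            keys.update(k.strip() for k in m.group(1).split(",") if k.strip())
    return keys
```

Until a programmatic detection API exists, something like this wired into your log pipeline is the practical way to learn that divergence happened at all.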

Recovery, right now, is: stop the cluster, delete the database, restart from whatever upstream source of truth your application has. This is not a production-grade recovery path. It is a pre-production acknowledgment that the tooling for this case doesn’t exist yet.

Swytch Cloud will provide the recovery path. Cloud holds the authoritative causal log across regions, can arbitrate between divergent branches, and gives you tools for reconciliation. Until Cloud ships, any deployment that encounters holographic divergence is effectively done.

This is the main reason Swytch is pre-production today. The architecture is sound, the Jepsen tests pass, and the race window is narrow — but without durability and without reconciliation tooling, a rare failure in a running system means starting over. That bar is not compatible with production workloads, and we don’t pretend otherwise.


Holographic mode as a future design axis

Safe mode is conservative and conventional: partitions cost write availability on affected keys, and you get back what Raft users are used to. Holographic mode is the other direction: both sides of a partition keep writing, diverge cleanly, and reconcile when the network heals.

The use case for holographic mode is offline-capable nodes. Field equipment on remote sites. Edge devices on unreliable links. Anywhere the network going down is a scheduled part of the workflow rather than an incident. In those shapes, “writes stall until the network returns” is unacceptable; holographic divergence lets the disconnected side keep working and hands you a reconciliation problem afterward. For those workloads, that’s the better tradeoff.

Holographic mode is implemented but not user-configurable in the current release. If your workload requires offline operation and you want to talk about running holographic mode, email us at holographic@getswytch.com. The capability is real; the self-service configuration for it is a future deliverable.

When holographic mode does become generally available, designing for it will mean designing for the reconciliation. What does “both sides made a valid decision” look like for your data? What does the application do when it discovers two valid histories? Those are questions the application has to answer; the database can show you the divergence precisely, but deciding which branch is canonical is a business problem.
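One possible shape for that application-level answer is a per-key merge policy. All names here are illustrative; Swytch does not (yet) expose a reconciliation API, so this models the business decision, not a database feature:

```python
def reconcile(branch_a, branch_b, policy):
    """Merge two divergent key->value states under a per-key policy.

    `policy(key, a, b)` encodes the business rule for a genuine conflict:
    last-writer-wins, max, manual queue -- whatever your domain requires."""
    merged = {}
    for key in branch_a.keys() | branch_b.keys():
        a, b = branch_a.get(key), branch_b.get(key)
        if a is None or b is None or a == b:
            merged[key] = a if a is not None else b  # no real conflict
        else:
            merged[key] = policy(key, a, b)  # the business decision lives here
    return merged
```

The database can hand you `branch_a` and `branch_b` precisely; only the `policy` argument, which is yours to write, can say which value survives.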


Designing around contention

Whether you’re in safe mode today or holographic mode later, the best mitigation for partition-related drama is to arrange your writes so that contested data has a natural owner. This is a design discipline, not a Swytch feature:

  • Per-region ownership. Data that primarily belongs to one geographic region (a user’s profile if your users don’t roam, regional operational data) gets written from that region. Other regions subscribe and read, but don’t write. Partitions separate the writer from the readers, which costs read freshness but can’t produce divergence.
  • Per-tenant ownership. A multi-tenant system where each tenant has a home region writes tenant data from that region. Partitions between tenants are harmless; partitions within a tenant’s region are handled by whatever your single-region write path does.
  • Per-device ownership. Devices that generate their own data (IoT telemetry, field devices, mobile clients) write their own records. A partition between a device and the central cluster just means the device queues locally and syncs when the link returns — no conflict because no two writers ever touched the same data.

The principle: minimize the amount of data where two different writers might collide across a partition. Safe mode then has fewer opportunities to fire the race described above, and holographic mode (when you get it) has fewer conflicts to reconcile. This is true of every distributed database; Swytch just makes the reconciliation structure explicit.
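The ownership patterns above live entirely in the application: route every write to the key’s natural owner and refuse cross-owner writes. A sketch, where the prefix-based routing table and node names are illustrative, not a Swytch mechanism:

```python
# HYPOTHETICAL routing table: key prefix -> the node allowed to write it.
OWNER_BY_PREFIX = {
    "eu:":      "eu-west",  # per-region ownership
    "tenant7:": "us-east",  # per-tenant home region
    "dev42:":   "dev42",    # per-device ownership
}

def owner_of(key):
    """Find the single writer responsible for a key, if any."""
    for prefix, owner in OWNER_BY_PREFIX.items():
        if key.startswith(prefix):
            return owner
    return None

def may_write(key, local_node):
    """Only the owner writes; every other node subscribes and reads."""
    return owner_of(key) == local_node
```

With this discipline, a partition can separate readers from the writer (stale reads) but never put two writers on opposite sides of the same key.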


Where Swytch actually is

Swytch is not production-ready today, and we’re honest about that:

  • No durability without Cloud. Data lives in RAM across subscribed nodes. Simultaneous loss of all subscribers for a piece of data means that data is gone.
  • No recovery from holographic divergence without Cloud. The rare but real case described above ends in delete-and-restart.
  • Safe mode is the default. The offline-capable mode exists but isn’t self-service.

Swytch Cloud is the path to production: durable storage, cross-region reconciliation, the tools you need when a rare case fires on a running system. Until Cloud ships, Swytch is a pre-production system (useful for understanding the architecture, useful for workloads where the failure modes above are acceptable, not appropriate as a system of record).