Swytch Documentation

Jepsen Testing

Swytch is tested with Jepsen, the industry-standard framework for verifying the correctness of distributed systems under fault conditions. The test suite lives in a dedicated repository alongside the main codebase.

What Jepsen Tests

Jepsen deploys a multi-node Swytch cluster, runs concurrent workloads against it, injects faults (network partitions, process crashes, clock skew), then checks the resulting operation history for consistency violations.

Workloads

| Workload | What it exercises |
| --- | --- |
| counter | INCR operations across nodes. Verifies that the sum of all successful increments equals the final value on every node after convergence |
| set | SADD / SMEMBERS operations. Verifies set membership convergence across partitions |
| sorted-set | ZADD / ZRANGE operations. Verifies sorted set convergence |
| elle-causal | Transactional reads and writes checked by Elle for causal consistency violations |
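
The counter workload's check can be illustrated with a minimal sketch. The operation tuple format and the function names below are hypothetical and are not taken from the Swytch test suite or from Jepsen's history format. The sketch also ignores indeterminate operations (requests whose outcome is unknown), which a real checker has to account for.

```python
from typing import Dict, List, Tuple

# Hypothetical op record: (kind, status, amount). Not Jepsen's real history format.
Op = Tuple[str, str, int]


def expected_counter(history: List[Op]) -> int:
    """Sum the amounts of all increments that were acknowledged as successful."""
    return sum(amount for kind, status, amount in history
               if kind == "incr" and status == "ok")


def check_counter(history: List[Op], final_reads: Dict[str, int]) -> dict:
    """Compare each node's final value with the sum of successful increments."""
    expected = expected_counter(history)
    mismatched = {node: value for node, value in final_reads.items()
                  if value != expected}
    return {"valid": not mismatched, "expected": expected, "mismatched": mismatched}


history = [("incr", "ok", 1), ("incr", "fail", 1), ("incr", "ok", 2)]
result = check_counter(history, {"n1": 3, "n2": 3, "n3": 2})
print(result["valid"], result["mismatched"])
```

Here node `n3` is reported as mismatched because it did not converge to the expected total.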

Fault Injection (Nemesis)

The test suite uses phased fault injection to model realistic failure scenarios:

  1. Normal operation — client ops with no faults (baseline)
  2. Process kill — a random node is killed and restarted (crash recovery)
  3. Settle — normal ops while the restarted node catches up
  4. Network partition + clock skew — majority/minority splits, single-node isolation, and HLC-absorbing clock drift
  5. Heal — partitions removed, clocks restored
  6. Settle — normal ops + anti-entropy window for convergence
  7. Final reads — read from every node to verify agreement

Partition types include:

| Type | Description |
| --- | --- |
| :one | Isolate a single node |
| :majority | Split into majority and minority |
| :majorities-ring | Overlapping majority partitions |
| :island | Every node isolated from every other |
| :asymmetric | A cannot reach C, but both can reach B |
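
To make the partition shapes concrete, the sketch below models a partition as a set of blocked directed links `(src, dst)`. This representation and every function name are assumptions made for illustration, not the test suite's implementation. The `:asymmetric` case is modelled as blocking only the A to C direction, which the table does not specify, and `:majorities-ring` is omitted for brevity.

```python
from itertools import permutations
from typing import List, Set, Tuple

Link = Tuple[str, str]  # (src, dst): traffic from src to dst is dropped


def isolate_one(nodes: List[str], victim: str) -> Set[Link]:
    """:one - cut every link that touches the victim node."""
    return {(a, b) for a, b in permutations(nodes, 2) if victim in (a, b)}


def majority_split(nodes: List[str]) -> Set[Link]:
    """:majority - the first floor(n/2)+1 nodes form the majority side."""
    majority = set(nodes[: len(nodes) // 2 + 1])
    return {(a, b) for a, b in permutations(nodes, 2)
            if (a in majority) != (b in majority)}


def island(nodes: List[str]) -> Set[Link]:
    """:island - every node is cut off from every other node."""
    return set(permutations(nodes, 2))


def asymmetric(a: str, c: str) -> Set[Link]:
    """:asymmetric - assumed here to block only the a -> c direction."""
    return {(a, c)}


def can_reach(blocked: Set[Link], src: str, dst: str) -> bool:
    return (src, dst) not in blocked


nodes = ["n1", "n2", "n3", "n4", "n5"]
print(len(isolate_one(nodes, "n1")), len(majority_split(nodes)), len(island(nodes)))
```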

Checkers

The test suite validates these properties:

| Checker | Property |
| --- | --- |
| Convergence | After heal + settle, all nodes return the same value for every key |
| CRDT counter | Sum of successful INCRs equals final counter value on every node |
| Availability | Commutative operations (reads, writes, INCR, SADD) succeed on the majority side during partitions (fail rate < 50%) |
| Partition effectiveness | Both sides of a partition mutated independently (at least 2 nodes wrote during the partition) |
| Safe transactions | Exactly-once commit: no duplicate transaction effects after merge |
| Partition blocks transactions | In safe mode, zero transactions commit during a partition |
| Fork-choice | After heal, deterministic fork-choice produces identical state on all nodes |
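
Two of these properties, convergence and availability, are simple enough to sketch. The data shapes below (a per-node dictionary of final reads, and `(side, status)` pairs for operations issued during a partition) are hypothetical and chosen only for illustration. The 50% threshold comes from the table above.

```python
from typing import Any, Dict, List, Tuple


def check_convergence(final_reads: Dict[str, Dict[str, Any]]) -> dict:
    """Valid when every node returns the same value for every key."""
    keys = set()
    for reads in final_reads.values():
        keys.update(reads)
    divergent = {}
    for key in sorted(keys):
        values = {node: reads.get(key) for node, reads in final_reads.items()}
        seen = list(values.values())
        if any(v != seen[0] for v in seen[1:]):
            divergent[key] = values
    return {"valid": not divergent, "divergent": divergent}


def check_availability(ops: List[Tuple[str, str]], threshold: float = 0.5) -> dict:
    """Valid when the majority side's failure rate stays below the threshold."""
    majority = [status for side, status in ops if side == "majority"]
    if not majority:
        # Nothing was observed, so availability cannot be claimed.
        return {"valid": False, "fail_rate": None}
    fail_rate = sum(status == "fail" for status in majority) / len(majority)
    return {"valid": fail_rate < threshold, "fail_rate": fail_rate}


reads = {"n1": {"k": {1, 2}}, "n2": {"k": {1, 2}}, "n3": {"k": {1}}}
print(check_convergence(reads)["valid"])
print(check_availability([("majority", "ok"), ("majority", "fail"),
                          ("majority", "ok"), ("minority", "fail")]))
```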

Design Constraints

The test suite enforces constraints specific to Swytch’s architecture:

  • Kill before partition, never during. Swytch is a cache — killing a partitioned node loses its in-memory data permanently. The test only kills nodes during the normal phase (before partitions start).
  • Kill one node at a time. Rebooting the entire cluster while partitioned is not a supported failure mode.
  • Grace period for partition detection. After a partition starts, there is a 5-second grace period before checkers expect nodes to have detected the partition via heartbeat timeouts.
  • Settle window for anti-entropy. After healing, the test waits for anti-entropy sweeps (every 3 seconds) to propagate any effects missed during the partition.
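
The two timing constraints can be expressed as small helpers. This is a sketch under stated assumptions: the constants mirror the 5-second grace period and the 3-second sweep interval given above, while the function names and the idea of filtering operations by timestamp are illustrative, not the test suite's code.

```python
GRACE_SECS = 5.0           # grace period before partition detection is expected
SWEEP_INTERVAL_SECS = 3.0  # anti-entropy sweep interval


def past_grace(op_time: float, partition_start: float,
               grace: float = GRACE_SECS) -> bool:
    """True when an op is late enough that checkers may assume the
    partition has been detected via heartbeat timeouts."""
    return op_time >= partition_start + grace


def sweeps_in_window(settle_secs: float,
                     interval: float = SWEEP_INTERVAL_SECS) -> int:
    """Number of complete anti-entropy intervals that fit in the settle window."""
    return int(settle_secs // interval)


print(past_grace(op_time=12.0, partition_start=10.0))
print(past_grace(op_time=15.0, partition_start=10.0))
print(sweeps_in_window(30))
```

With a 30-second settle window, roughly ten sweep intervals fit. The exact number of sweeps a node performs depends on where the window falls relative to its sweep timer.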

Running the Tests

Prerequisites: JDK 21+, Leiningen, a Jepsen cluster (5 DB nodes accessible via SSH).

```shell
cd swytch.jepsen

# Counter workload with safe-mode nemesis (default)
lein run test --workload counter --nemesis-config safe

# Set workload, no faults (correctness baseline)
lein run test --workload set --nemesis-config none

# Elle causal consistency checker
lein run test --workload elle-causal --nemesis-config safe

# Custom timing
lein run test --workload counter \
  --normal-secs 10 \
  --fault-secs 30 \
  --settle-secs 30 \
  --rate 100
```

CLI Options

| Flag | Default | Description |
| --- | --- | --- |
| --workload | counter | Workload: counter, set, sorted-set, elle-causal |
| --nemesis-config | safe | Nemesis: none (no faults), safe (partitions + kill + clock) |
| --rate | 100 | Operations per second |
| --normal-secs | 10 | Seconds of normal operation before faults |
| --fault-secs | 30 | Seconds of fault injection |
| --settle-secs | 30 | Seconds to settle after healing |
| --debug | false | Enable debug logging on Swytch nodes |
| --swytch-source | ../cloxcache | Path to Swytch source (builds automatically) |
| --swytch-binary | - | Path to pre-built binary (skips build) |
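
For scripted runs, such as sweeping over every workload, it can help to assemble the command line programmatically. The helper below is a hypothetical wrapper, not part of the test suite. It uses only the flags and defaults listed in the table and rejects values the table does not list.

```python
from typing import List

WORKLOADS = {"counter", "set", "sorted-set", "elle-causal"}
NEMESIS_CONFIGS = {"none", "safe"}


def build_command(workload: str = "counter", nemesis_config: str = "safe",
                  rate: int = 100, normal_secs: int = 10,
                  fault_secs: int = 30, settle_secs: int = 30) -> List[str]:
    """Build the argv list for one `lein run test` invocation."""
    if workload not in WORKLOADS:
        raise ValueError(f"unknown workload: {workload}")
    if nemesis_config not in NEMESIS_CONFIGS:
        raise ValueError(f"unknown nemesis config: {nemesis_config}")
    return [
        "lein", "run", "test",
        "--workload", workload,
        "--nemesis-config", nemesis_config,
        "--rate", str(rate),
        "--normal-secs", str(normal_secs),
        "--fault-secs", str(fault_secs),
        "--settle-secs", str(settle_secs),
    ]


for name in sorted(WORKLOADS):
    print(" ".join(build_command(workload=name)))
```

Each resulting list can be passed to `subprocess.run(cmd, cwd="swytch.jepsen", check=True)` to launch the run from the test suite directory.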

Results are written to store/ with timelines, performance plots, and checker output.

Regression Tests

Bugs found during Jepsen testing are captured as deterministic unit tests in the Swytch codebase. These reproduce the exact causal DAGs observed during Jepsen runs:

  • Snapshot reconstruction — cross-branch snapshots, chained compaction, four-way merges
  • Ordered merge — linear chain reordering, divergent index states across nodes
  • Counter semantics — INCR convergence across partition/heal cycles
  • G0 anomalies — transaction bind ordering under concurrent writes

The effects engine’s core safety properties are also formally verified via TLA+ specifications (CausalEffectLog.tla, ExactlyOnce.tla).