Jepsen Testing
Swytch is tested with Jepsen, the industry-standard framework for verifying the correctness of distributed systems under fault conditions. The test suite lives in a dedicated repository alongside the main codebase.
Jepsen deploys a multi-node Swytch cluster, runs concurrent workloads against it, injects faults (network partitions, process crashes, clock skew), then checks the resulting operation history for consistency violations.
| Workload | What it exercises |
|---|---|
| counter | INCR operations across nodes. Verifies that the sum of all successful increments equals the final value on every node after convergence |
| set | SADD / SMEMBERS operations. Verifies set membership convergence across partitions |
| sorted-set | ZADD / ZRANGE operations. Verifies sorted set convergence |
| elle-causal | Transactional reads and writes checked by Elle for causal consistency violations |
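The counter workload's invariant can be sketched in a few lines of Python. This is an illustration only, not the suite's actual checker: the `check_counter` function, the `(op_type, amount)` history shape, and the node names are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Each history entry is (op_type, amount). "ok" means the increment was
# acknowledged; "fail" means it definitely did not apply. (Hypothetical shape.)
Op = Tuple[str, int]


def check_counter(history: List[Op], final_reads: Dict[str, int]) -> dict:
    """Compare the sum of acknowledged increments with each node's final read."""
    expected = sum(amount for op_type, amount in history if op_type == "ok")
    mismatched = {node: value for node, value in final_reads.items()
                  if value != expected}
    return {"valid": not mismatched, "expected": expected,
            "mismatched": mismatched}


history = [("ok", 1), ("ok", 2), ("fail", 5), ("ok", 1)]
reads = {"n1": 4, "n2": 4, "n3": 3}
result = check_counter(history, reads)
print(result)
```

Here `n3` is reported as mismatched because it read 3 while the acknowledged increments sum to 4. A real Jepsen history also contains indeterminate operations, which may or may not have applied, so a production checker would need to accept a range of final values rather than a single sum.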
The test suite uses phased fault injection to model realistic failure scenarios:
- Normal operation: client ops with no faults (baseline)
- Process kill: a random node is killed and restarted (crash recovery)
- Settle: normal ops while the restarted node catches up
- Network partition + clock skew: majority/minority splits, single-node isolation, and clock drift that the hybrid logical clock (HLC) is expected to absorb
- Heal: partitions removed, clocks restored
- Settle: normal ops plus an anti-entropy window for convergence
- Final reads: read from every node to verify agreement
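As a rough illustration of how these phases line up in time, the sketch below builds a timeline from the three timing flags. How the suite actually maps flags to phases is not stated here, so the mapping is an assumption: kill, heal, and final reads are treated as instantaneous, and both settle phases are given `settle_secs`. The function and phase names are hypothetical.

```python
from typing import List, Tuple


def phase_schedule(normal_secs: int = 10, fault_secs: int = 30,
                   settle_secs: int = 30) -> List[Tuple[int, str]]:
    """Return (start_second, phase_name) pairs in execution order.

    Assumptions (not confirmed by the source): kill, heal, and final reads
    take zero time, and both settle phases last settle_secs.
    """
    phases = [
        ("normal", normal_secs),
        ("kill", 0),
        ("settle-after-kill", settle_secs),
        ("partition+clock-skew", fault_secs),
        ("heal", 0),
        ("settle-after-heal", settle_secs),
        ("final-reads", 0),
    ]
    schedule, clock = [], 0
    for name, duration in phases:
        schedule.append((clock, name))
        clock += duration
    return schedule


for start, name in phase_schedule():
    print(f"t={start:>3}s  {name}")
```

Under these assumptions the default flags give a run of about 100 seconds, with the partition window covering seconds 40 to 70.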
Partition types include:
| Type | Description |
|---|---|
| :one | Isolate a single node |
| :majority | Split into majority and minority |
| :majorities-ring | Overlapping majority partitions |
| :island | Every node isolated from every other |
| :asymmetric | A cannot reach C, but both can reach B |
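One way to picture these partition types is as sets of blocked directed links. The sketch below is a simplified model, not the nemesis implementation: `blocked_links` and `cut` are hypothetical helpers, node selection is deterministic rather than random, the ring case is sized for the 5-node cluster, and `:asymmetric` is assumed to be a one-way block from A to C.

```python
from itertools import permutations
from typing import List, Set, Tuple

# (source, destination): traffic from source to destination is dropped.
Link = Tuple[str, str]


def cut(side_a: List[str], side_b: List[str]) -> Set[Link]:
    """Block traffic in both directions between two groups of nodes."""
    forward = {(a, b) for a in side_a for b in side_b}
    return forward | {(b, a) for a, b in forward}


def blocked_links(kind: str, nodes: List[str]) -> Set[Link]:
    """Return the directed links to block for one partition type."""
    n = len(nodes)
    if kind == "one":
        return cut(nodes[:1], nodes[1:])
    if kind == "majority":
        majority = n // 2 + 1
        return cut(nodes[:majority], nodes[majority:])
    if kind == "majorities-ring":
        # Each node keeps only its two ring neighbours. With 5 nodes that is
        # a majority of 3 (self included), and no two nodes see the same one.
        def ring_distance(i: int, j: int) -> int:
            return min((i - j) % n, (j - i) % n)
        return {(nodes[i], nodes[j]) for i in range(n) for j in range(n)
                if i != j and ring_distance(i, j) > 1}
    if kind == "island":
        return set(permutations(nodes, 2))
    if kind == "asymmetric":
        # Assumption: one-way block from A to C; B stays reachable by both.
        a, _b, c = nodes[:3]
        return {(a, c)}
    raise ValueError(f"unknown partition type: {kind}")


nodes = ["n1", "n2", "n3", "n4", "n5"]
for kind in ("one", "majority", "majorities-ring", "island", "asymmetric"):
    print(kind, len(blocked_links(kind, nodes)))
```

Modeling links as directed pairs keeps the asymmetric case expressible in the same structure as the symmetric ones.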
The test suite validates these properties:
| Checker | Property |
|---|---|
| Convergence | After heal + settle, all nodes return the same value for every key |
| CRDT counter | Sum of successful INCRs equals final counter value on every node |
| Availability | Commutative operations (reads, writes, INCR, SADD) succeed on the majority side during partitions (fail rate < 50%) |
| Partition effectiveness | Both sides of a partition mutated independently (at least 2 nodes wrote during the partition) |
| Safe transactions | Exactly-once commit — no duplicate transaction effects after merge |
| Partition blocks transactions | In safe mode, zero transactions commit during a partition |
| Fork-choice | After heal, deterministic fork-choice produces identical state on all nodes |
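Two of these checkers, convergence and availability, are simple enough to sketch directly. The code below is a hedged illustration under assumed data shapes (a per-node map of final reads, and a list of `"ok"`/`"fail"` outcomes for majority-side operations); the function names are hypothetical and the real checkers operate on Jepsen histories.

```python
from typing import Dict, List


def check_convergence(final_reads: Dict[str, Dict[str, object]]) -> dict:
    """final_reads maps node -> {key: value}; valid when all nodes agree."""
    keys = set()
    for kv in final_reads.values():
        keys.update(kv)
    diverged = {}
    for key in sorted(keys):
        values = {node: kv.get(key) for node, kv in final_reads.items()}
        first = next(iter(values.values()))
        if any(value != first for value in values.values()):
            diverged[key] = values
    return {"valid": not diverged, "diverged": diverged}


def check_availability(majority_ops: List[str],
                       max_fail_rate: float = 0.5) -> dict:
    """majority_ops holds outcomes of ops sent to the majority side."""
    if not majority_ops:
        # No operations observed, so availability cannot be demonstrated.
        return {"valid": False, "fail_rate": None}
    fail_rate = majority_ops.count("fail") / len(majority_ops)
    return {"valid": fail_rate < max_fail_rate, "fail_rate": fail_rate}


reads = {
    "n1": {"counter": 4, "members": [1, 2]},
    "n2": {"counter": 4, "members": [1, 2]},
    "n3": {"counter": 3, "members": [1, 2]},
}
print(check_convergence(reads))
print(check_availability(["ok", "ok", "fail", "ok"]))
```

The convergence check reports `counter` as diverged because `n3` disagrees, while the availability check passes with a 25% failure rate, below the 50% threshold from the table.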
The test suite enforces constraints specific to Swytch’s architecture:
- Kill before partition, never during. Swytch is a cache — killing a partitioned node loses its in-memory data permanently. The test only kills nodes during the normal phase (before partitions start).
- Kill one node at a time. Rebooting the entire cluster while partitioned is not a supported failure mode.
- Grace period for partition detection. After a partition starts, there is a 5-second grace period before checkers expect nodes to have detected the partition via heartbeat timeouts.
- Settle window for anti-entropy. After healing, the test waits for anti-entropy sweeps (every 3 seconds) to propagate any effects missed during the partition.
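The two timing constraints above can be made concrete with a small sketch. The 5-second grace period and 3-second sweep interval come from the text; the helper names and the `(timestamp, outcome)` operation shape are assumptions for illustration.

```python
from typing import List, Tuple

GRACE_SECS = 5.0         # grace period for partition detection (from the text)
SWEEP_INTERVAL_SECS = 3  # anti-entropy sweep interval (from the text)

TimedOp = Tuple[float, str]  # (seconds since test start, outcome)


def ops_to_check(ops: List[TimedOp], partition_start: float,
                 grace: float = GRACE_SECS) -> List[TimedOp]:
    """Keep only ops issued once the grace period has elapsed."""
    return [(t, outcome) for t, outcome in ops
            if t >= partition_start + grace]


def full_sweeps(settle_secs: int,
                interval: int = SWEEP_INTERVAL_SECS) -> int:
    """Number of complete sweep intervals that fit in the settle window."""
    return settle_secs // interval


ops = [(10.0, "ok"), (12.5, "ok"), (15.0, "fail"), (20.0, "fail")]
print(ops_to_check(ops, partition_start=10.0))
print(full_sweeps(30))
```

With the default `--settle-secs 30`, ten complete sweep intervals fit in the settle window. Shortening the window toward a single interval would leave little margin for convergence.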
Prerequisites: JDK 21+, Leiningen, a Jepsen cluster (5 DB nodes accessible via SSH).
cd swytch.jepsen
# Counter workload with safe-mode nemesis (default)
lein run test --workload counter --nemesis-config safe
# Set workload, no faults (correctness baseline)
lein run test --workload set --nemesis-config none
# Elle causal consistency checker
lein run test --workload elle-causal --nemesis-config safe
# Custom timing
lein run test --workload counter \
--normal-secs 10 \
--fault-secs 30 \
--settle-secs 30 \
--rate 100
| Flag | Default | Description |
|---|---|---|
| --workload | counter | Workload: counter, set, sorted-set, elle-causal |
| --nemesis-config | safe | Nemesis: none (no faults), safe (partitions + kill + clock) |
| --rate | 100 | Operations per second |
| --normal-secs | 10 | Seconds of normal operation before faults |
| --fault-secs | 30 | Seconds of fault injection |
| --settle-secs | 30 | Seconds to settle after healing |
| --debug | false | Enable debug logging on Swytch nodes |
| --swytch-source | ../cloxcache | Path to Swytch source (builds automatically) |
| --swytch-binary | - | Path to pre-built binary (skips build) |
Results are written to store/ with timelines, performance plots, and checker output.
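To run every workload in sequence, the invocations can be generated from a small script. The sketch below only builds and prints the command lines, using the flags and defaults documented above; the `lein_command` helper is hypothetical and not part of the test suite.

```python
import shlex
from typing import List

WORKLOADS = ("counter", "set", "sorted-set", "elle-causal")
NEMESIS_CONFIGS = ("none", "safe")


def lein_command(workload: str = "counter", nemesis: str = "safe",
                 rate: int = 100, normal_secs: int = 10,
                 fault_secs: int = 30, settle_secs: int = 30) -> List[str]:
    """Build the argument list for one test run from the documented flags."""
    if workload not in WORKLOADS:
        raise ValueError(f"unknown workload: {workload}")
    if nemesis not in NEMESIS_CONFIGS:
        raise ValueError(f"unknown nemesis config: {nemesis}")
    return [
        "lein", "run", "test",
        "--workload", workload,
        "--nemesis-config", nemesis,
        "--rate", str(rate),
        "--normal-secs", str(normal_secs),
        "--fault-secs", str(fault_secs),
        "--settle-secs", str(settle_secs),
    ]


for workload in WORKLOADS:
    print(shlex.join(lein_command(workload)))
```

Each list can be passed to `subprocess.run(cmd, cwd="swytch.jepsen", check=True)` to execute the run, assuming the prerequisites above are in place.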
Bugs found during Jepsen testing are captured as deterministic unit tests in the Swytch codebase. These reproduce the exact causal DAGs observed during Jepsen runs:
- Snapshot reconstruction — cross-branch snapshots, chained compaction, four-way merges
- Ordered merge — linear chain reordering, divergent index states across nodes
- Counter semantics — INCR convergence across partition/heal cycles
- G0 anomalies — transaction bind ordering under concurrent writes
The effects engine’s core safety properties are also formally verified via TLA+ specifications (CausalEffectLog.tla, ExactlyOnce.tla).