Jepsen Testing
Swytch is tested with Jepsen, the industry-standard framework for verifying the correctness of distributed systems under fault conditions. The test suite lives in a dedicated repository alongside the main codebase.
Jepsen deploys a multi-node Swytch cluster, runs concurrent workloads against it, injects faults (network partitions, process crashes, clock skew), then checks the resulting operation history for consistency violations.
| Workload | What it exercises |
|---|---|
| counter | INCR operations across nodes. Verifies that the sum of all successful increments equals the final value on every node after convergence |
| set | SADD / SMEMBERS operations. Verifies set membership convergence across partitions |
| sorted-set | ZADD / ZRANGE operations. Verifies sorted set convergence |
| elle-causal | Transactional reads and writes checked by Elle for causal consistency violations |
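
The counter check described in the table can be sketched in a few lines. This is an illustrative model only, not the suite's actual checker: the `Op` record, its field names, and `check_counter` are hypothetical, and indeterminate operations (for example timeouts) are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """One completed client operation (hypothetical record shape)."""
    node: str
    f: str       # "incr" in this sketch
    value: int   # increment amount
    ok: bool     # True if the server acknowledged the operation


def check_counter(history, final_reads):
    """Compare the sum of acknowledged increments with each node's final read.

    history: list of Op
    final_reads: dict of node name -> counter value read after convergence
    """
    expected = sum(op.value for op in history if op.f == "incr" and op.ok)
    # Any node whose final value differs from the expected sum is a mismatch.
    mismatches = {n: v for n, v in final_reads.items() if v != expected}
    return {"valid": not mismatches, "expected": expected, "mismatches": mismatches}


history = [
    Op("n1", "incr", 1, True),
    Op("n2", "incr", 2, True),
    Op("n3", "incr", 5, False),  # not acknowledged, so not counted
    Op("n1", "incr", 3, True),
]
result = check_counter(history, {"n1": 6, "n2": 6, "n3": 6})
```

Only acknowledged increments contribute to the expected sum, so a node that reports any other final value shows up in `mismatches`.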
The test suite uses phased fault injection to model realistic failure scenarios:
- Normal operation — client ops with no faults (baseline)
- Process kill — a random node is killed and restarted (crash recovery)
- Settle — normal ops while the restarted node catches up
- Network partition + clock skew — majority/minority splits, single-node isolation, and HLC-absorbing clock drift
- Heal — partitions removed, clocks restored
- Settle — normal ops + anti-entropy window for convergence
- Final reads — read from every node to verify agreement
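
One way to picture the phases above is as an ordered schedule. The sketch below is an assumption-laden model, not the suite's configuration: the phase names, durations, and fault labels are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str
    seconds: int    # hypothetical duration, not taken from the real suite
    faults: tuple   # fault kinds the nemesis may inject during this phase


SCHEDULE = [
    Phase("normal", 30, ()),
    Phase("kill", 10, ("kill",)),
    Phase("settle-after-kill", 15, ()),
    Phase("partition+skew", 60, ("partition", "clock-skew")),
    Phase("heal", 5, ("heal",)),
    Phase("settle-after-heal", 20, ()),
    Phase("final-reads", 5, ()),
]


def timeline(schedule):
    """Return (start, end, phase) tuples with cumulative offsets in seconds."""
    out, t = [], 0
    for phase in schedule:
        out.append((t, t + phase.seconds, phase))
        t += phase.seconds
    return out


def phase_at(schedule, t):
    """Return the phase active at offset t, or None once the run has ended."""
    for start, end, phase in timeline(schedule):
        if start <= t < end:
            return phase
    return None
```

Keeping the schedule as data makes the ordering explicit: the kill phase ends and the cluster settles before any partition or clock skew is introduced.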
Partition types include:
| Type | Description |
|---|---|
| :one | Isolate a single node |
| :majority | Split into majority and minority |
| :majorities-ring | Overlapping majority partitions |
| :island | Every node isolated from every other |
| :asymmetric | A cannot reach C, but both can reach B |
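
To make the partition types concrete, the sketch below computes which directed links each type would block. It is a simplified model with hypothetical function names. The majority split is deterministic here rather than random, `:asymmetric` is assumed to block only the A-to-C direction (the table does not say whether the reverse is blocked too), and `:majorities-ring` is omitted because overlapping majorities cannot be expressed as disjoint components.

```python
from itertools import permutations


def blocked_links(components, nodes):
    """Directed (src, dst) pairs that cannot communicate, given disjoint components."""
    comp_of = {n: i for i, comp in enumerate(components) for n in comp}
    return {(a, b) for a, b in permutations(nodes, 2) if comp_of[a] != comp_of[b]}


def partition_one(nodes, victim):
    """:one - isolate a single node from the rest."""
    return [[victim], [n for n in nodes if n != victim]]


def partition_majority(nodes):
    """:majority - the first floor(n/2)+1 nodes form the majority (simplification)."""
    k = len(nodes) // 2 + 1
    return [nodes[:k], nodes[k:]]


def partition_island(nodes):
    """:island - every node is its own component."""
    return [[n] for n in nodes]


def asymmetric_links(a, c):
    """:asymmetric - assumed one-way block: a cannot reach c; a third node B is unaffected."""
    return {(a, c)}


nodes = ["n1", "n2", "n3", "n4", "n5"]
one = blocked_links(partition_one(nodes, "n3"), nodes)
majority = blocked_links(partition_majority(nodes), nodes)
island = blocked_links(partition_island(nodes), nodes)
```

With five nodes, isolating one node blocks 8 directed links, a 3/2 split blocks 12, and the island partition blocks all 20.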
The test suite validates these properties:
| Checker | Property |
|---|---|
| Convergence | After heal + settle, all nodes return the same value for every key |
| CRDT counter | Sum of successful INCRs equals final counter value on every node |
| Availability | Commutative operations (reads, writes, INCR, SADD) succeed on the majority side during partitions (fail rate < 50%) |
| Partition effectiveness | Both sides of a partition mutated independently (at least 2 nodes wrote during the partition) |
| Safe transactions | Exactly-once commit — no duplicate transaction effects after merge |
| Partition blocks transactions | In safe mode, zero transactions commit during a partition |
| Fork-choice | After heal, deterministic fork-choice produces identical state on all nodes |
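
Two of the checkers above, convergence and availability, reduce to simple comparisons over recorded data. The following is a minimal sketch under assumed input shapes (the function names and argument formats are hypothetical, not the suite's API). Values are compared by their `repr`, so callers would need to normalize unordered results such as set members beforehand.

```python
def check_convergence(final_reads):
    """final_reads: dict of node -> dict of key -> value, read after heal + settle."""
    keys = set().union(*(kv.keys() for kv in final_reads.values()))
    divergent = {}
    for key in sorted(keys):
        values = {node: kv.get(key) for node, kv in final_reads.items()}
        # More than one distinct value for a key means the nodes disagree.
        if len({repr(v) for v in values.values()}) > 1:
            divergent[key] = values
    return {"valid": not divergent, "divergent": divergent}


def check_availability(ops, majority, max_fail_rate=0.5):
    """ops: list of (node, ok) pairs issued during the partition.

    Only operations sent to nodes on the majority side are considered.
    """
    relevant = [ok for node, ok in ops if node in majority]
    if not relevant:
        return {"valid": False, "fail_rate": None}
    fail_rate = relevant.count(False) / len(relevant)
    return {"valid": fail_rate < max_fail_rate, "fail_rate": fail_rate}


reads = {
    "n1": {"counter": 6, "tags": ["x", "y"]},
    "n2": {"counter": 6, "tags": ["x", "y"]},
}
converged = check_convergence(reads)
available = check_availability(
    [("n1", True), ("n1", False), ("n2", True), ("n4", False)],
    majority={"n1", "n2", "n3"},
)
```

A key that is missing on one node counts as divergence, because `kv.get(key)` yields `None` for that node.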
The test suite enforces constraints specific to Swytch’s architecture:
- Kill before partition, never during. Swytch is a cache — killing a partitioned node loses its in-memory data permanently. The test only kills nodes during the normal phase (before partitions start).
- Kill one node at a time. Rebooting the entire cluster while partitioned is not a supported failure mode.
- Grace period for partition detection. After a partition starts, there is a 5-second grace period before checkers expect nodes to have detected the partition via heartbeat timeouts.
- Settle window for anti-entropy. After healing, the test waits for anti-entropy sweeps (every 3 seconds) to propagate any effects missed during the partition.
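
The first three constraints above can be expressed as checks over a nemesis event log. This is a hedged sketch: the event tuple format, the event kind names, and both function names are hypothetical. Only the 5-second grace period comes from the text.

```python
GRACE_SECONDS = 5  # grace period before partition detection is expected


def validate_nemesis_schedule(events):
    """events: list of (time_seconds, kind, node), sorted by time.

    kind is one of "kill", "restart", "partition-start", "partition-heal".
    Returns a list of violation messages (empty if both kill rules hold).
    """
    violations = []
    partitioned = False
    down = set()
    for t, kind, node in events:
        if kind == "partition-start":
            partitioned = True
        elif kind == "partition-heal":
            partitioned = False
        elif kind == "kill":
            if partitioned:
                violations.append(f"t={t}: kill of {node} during a partition")
            if down:
                violations.append(f"t={t}: kill of {node} while {sorted(down)} still down")
            down.add(node)
        elif kind == "restart":
            down.discard(node)
    return violations


def detection_expected(partition_start, now):
    """True once the grace period after the partition start has elapsed."""
    return now - partition_start >= GRACE_SECONDS


good = [
    (30, "kill", "n2"),
    (35, "restart", "n2"),
    (55, "partition-start", None),
    (115, "partition-heal", None),
]
bad = [
    (30, "kill", "n2"),
    (32, "kill", "n3"),              # second node killed before n2 restarted
    (55, "partition-start", None),
    (60, "kill", "n4"),              # kill during a partition
]
```

Running the validator on `good` returns no violations, while `bad` trips both the one-node-at-a-time rule and the no-kill-during-partition rule.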
If you are a contributor to Swytch, you can run the Jepsen tests on any PR by commenting /jepsen on it. Otherwise, ask a contributor to trigger a run for you.
All historical runs are available on GitHub Actions.
Bugs found during Jepsen testing are captured as deterministic unit tests in the Swytch codebase. These reproduce the exact causal DAGs observed during Jepsen runs:
- Snapshot reconstruction — cross-branch snapshots, chained compaction, four-way merges
- Ordered merge — linear chain reordering, divergent index states across nodes
- Counter semantics — INCR convergence across partition/heal cycles
- G0 anomalies — transaction bind ordering under concurrent writes
The effects engine’s core safety properties are also formally verified via TLA+ specifications (CausalEffectLog.tla, ExactlyOnce.tla).