Tiered Storage
Swytch’s tiered storage transforms Redis from a cache into a fully durable database. Every write is persisted to the OS page cache immediately, and fsync runs every 10ms. Process crashes lose nothing; power loss loses at most 10 ms of writes. This page explains how tiered storage works and when to use it.
Important: High-throughput warning. Tiered storage can saturate disk throughput. Under sustained write loads exceeding 1 GB/s, you may experience stalls while waiting for the disk to catch up.
Disk recommendations:
- Recommended: NVMe SSD for write-heavy workloads
- Acceptable: SATA SSD for read-heavy or moderate write workloads
- Not recommended: Spinning disks (random L2 reads will bottleneck)
Traditional Redis offers two persistence options, both with significant trade-offs:
| Feature | Redis RDB | Redis AOF | Swytch Tiered |
|---|---|---|---|
| Durability | Minutes of loss | Seconds of loss | 10ms of loss |
| Write latency | Unaffected | fsync overhead | Batched fsync |
| Recovery time | Fast (snapshot) | Slow (replay) | Fast (indexed) |
| Memory usage | 2x during save | 1x + AOF buffer | 1x (passthrough) |
| Disk usage | Compact | Large (rewrite needed) | Compact (auto-compaction) |
Swytch’s tiered storage is not a cache with persistence bolted on—it’s a database that happens to keep hot data in memory.
Tiered storage uses a two-level architecture:
┌─────────────────────────────────────────────────────────┐
│ Client Request │
└────────────────────────────┬────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ L1: Memory Cache │
│ • Lock-free design │
│ • Self-tuning frequency-based eviction │
│ • Hot data stays in memory │
└────────────────────────────┬────────────────────────────┘
│
┌──────────────┴──────────────┐
│ │
▼ ▼
┌─────────────────────────┐ ┌─────────────────────────┐
│ Write Path │ │ Read Path │
│ (Write-through) │ │ (L1 miss → L2 lookup) │
└────────────┬────────────┘ └────────────┬────────────┘
│ │
▼ ▼
┌─────────────────────────────────────────────────────────┐
│ L2: FASTER Log │
│ • Append-only log with indexed lookups │
│ • Memory-mapped for lock-free reads │
│ • Batched fsync every 10ms │
│ • Automatic compaction │
└─────────────────────────────────────────────────────────┘
The L2 layer uses a log structure inspired by Microsoft FASTER—an append-only log with an in-memory hash index for O(1) lookups. Unlike a traditional write-ahead log that replays commands, this stores values directly and rebuilds the index on startup by scanning entry metadata.
In the default write-through mode, every write follows this path:
1. Client sends SET command
2. Write to L1 cache – immediate, lock-free
3. Write to L2 log – append to the FASTER log
4. Batched fsync – a background thread syncs every 10 ms
5. Return OK to client – after the L2 write completes
The write is durable once step 4 completes. In the worst case (power loss immediately after step 3), you lose at most 10 ms of writes.
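The steps above can be sketched in a few lines of Python: a dict for L1, an append-only file for L2, and a background thread batching fsync every 10 ms. The class name and on-disk entry format here are illustrative assumptions, not Swytch's actual implementation.

```python
import os
import threading
import time

class WriteThroughLog:
    """Toy model of the write-through path: every SET lands in L1 and is
    appended to the L2 log; a background thread batches fsync."""

    def __init__(self, path, fsync_interval=0.010):
        self.l1 = {}                        # step 2: in-memory cache
        self.log = open(path, "ab")         # step 3: append-only L2 log
        self.lock = threading.Lock()
        threading.Thread(target=self._fsync_loop,
                         args=(fsync_interval,), daemon=True).start()

    def set(self, key, value):
        with self.lock:
            self.l1[key] = value            # write to L1
            # illustrative entry format: "<klen> <vlen> <key><value>"
            self.log.write(b"%d %d %s%s" % (len(key), len(value), key, value))
            self.log.flush()                # into the OS page cache
        return "OK"                         # step 5: ack after the L2 write

    def _fsync_loop(self, interval):
        while True:                         # step 4: one fsync per batch
            time.sleep(interval)
            with self.lock:
                os.fsync(self.log.fileno())
```

A crash between the `flush()` and the next `fsync()` is exactly the "at most 10 ms of writes" window described above.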
Reads check L1 first, falling back to L2:
- Client sends GET command
- Check L1 cache - Lock-free lookup
- If L1 hit – Return immediately
- If L1 miss – Check L2 log (memory-mapped, lock-free)
- If L2 hit – Promote to L1, return value
- If L2 miss – Return nil
This is passthrough semantics: L1 acts as a transparent cache over L2. Data evicted from L1 is never lost—it remains in L2 and can be retrieved on the next access.
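The passthrough read path can be sketched the same way. `TieredReader` and its dict-backed tiers are hypothetical stand-ins for illustration, not Swytch's API.

```python
class TieredReader:
    """Sketch of passthrough reads: check L1, fall back to L2,
    promote on an L2 hit."""

    def __init__(self, l1, l2):
        self.l1 = l1              # dict acting as the in-memory cache
        self.l2 = l2              # dict standing in for the indexed L2 log

    def get(self, key):
        if key in self.l1:        # L1 hit: return immediately
            return self.l1[key]
        value = self.l2.get(key)  # L1 miss: indexed L2 lookup
        if value is not None:
            self.l1[key] = value  # promote to L1 for future reads
        return value              # None models the Redis nil on an L2 miss
```

Note that eviction from L1 never discards data: a later `get` simply takes the L2 branch and re-promotes the value.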
Durability guarantees:
- Writes are durable within 10 ms of the write completing
- All acknowledged writes survive restart (except the 10 ms window)
- Reads always see the latest write (strong consistency)
On process crash (the process dies but the machine stays up):
- No data loss – writes are in the OS page cache and will be flushed to disk
- No data corruption – the FASTER log uses checksums
- Fast recovery – index rebuild from the log, not a full replay
On power loss (the whole machine goes down):
- At most 10 ms of writes may be lost (the unflushed batch)
- The OS page cache is lost, but fsync runs every 10 ms to persist data
- No data corruption – partial writes are detected and truncated on recovery
| Scenario | Redis (AOF everysec) | Swytch Tiered |
|---|---|---|
| Normal write | Buffered ~1s | Durable in 10ms |
| Power loss | Lose ~1s of writes | Lose ~10ms of writes |
| Process crash | Lose buffered writes | No data loss |
| Recovery time | Replay entire AOF | Index scan (faster) |
# Basic persistent mode
swytch redis --persistent --db-path=/data/redis.db
# With 4GB memory limit
swytch redis --persistent --db-path=/data/redis.db --maxmemory=4gb
# Defragment log on startup (reclaims space from deleted keys)
swytch redis --persistent --db-path=/data/redis.db --defragment
Warning: Disk usage is unbounded. Unlike Redis, --maxmemory only limits the L1 memory cache; it does not limit disk usage. The L2 log file will grow as you write data, just like any database. Swytch does not automatically evict data to free disk space. Monitor disk usage and add storage or delete old data before the disk fills; if it does fill, writes fail silently (data stays in L1 memory but is not persisted to disk).
This is a fundamental difference from Redis's eviction model. In Redis, maxmemory caps total data. In Swytch tiered mode, maxmemory only caps what's kept hot in memory; all data lives on disk.
Note: File locking prevents accidents. Swytch acquires an exclusive lock on the database file at startup. If you accidentally try to start a second instance pointing at the same file, it will fail immediately with:
database is locked by another process
This protects against data corruption during deployments or misconfiguration.
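A common way to implement this kind of exclusive lock is POSIX flock. The sketch below is an assumption about the mechanism (the docs above don't specify how Swytch takes the lock), but it shows the behavior: the second attempt fails immediately rather than corrupting the file.

```python
import fcntl

def acquire_exclusive(path):
    """Take an exclusive, non-blocking lock on the database file.
    Raises BlockingIOError if another process already holds the lock."""
    f = open(path, "a+b")
    fcntl.flock(f.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
    return f  # keep the handle open for the lifetime of the process
```

With flock, the lock is tied to the open file description, so it is released automatically if the process dies; no stale lock files to clean up after a crash.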
Ghost mode changes the write path to write-back semantics:
- Writes go to L1 only – No immediate L2 write
- L2 write on eviction – When L1 evicts an entry, it’s written to L2
- L2 write on shutdown – Clean shutdown flushes all L1 entries
swytch redis --persistent --ghost --db-path=/data/redis.db
Trade-offs:
| Aspect | Write-Through (default) | Ghost Mode |
|---|---|---|
| Write latency | Higher (L2 write) | Lower (L1 only) |
| Durability | No data loss on crash | May lose unflushed L1 |
| Best for | Databases | Caches with persistence |
Note: Ghost mode rarely provides benefits. It only helps when you’re saturating disk I/O throughput. In practice, most workloads are CPU-bound, not disk-bound. The L2 write in write-through mode is a sequential append that modern SSDs handle with minimal overhead. Unless you’ve profiled and confirmed disk is your bottleneck, use the default write-through mode.
Use ghost mode only when:
- You’ve confirmed disk I/O is your bottleneck (not CPU)
- You can tolerate data loss on unclean shutdown
- Data can be reconstructed from another source
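Ghost mode's write-back behavior can be sketched with a small capacity-bounded L1. The class and its oldest-first eviction policy are illustrative assumptions, not Swytch's actual eviction logic (which is frequency-based).

```python
from collections import OrderedDict

class GhostL1:
    """Sketch of ghost-mode write-back: writes land in L1 only; L2 sees a
    value only when it is evicted from L1 or flushed at clean shutdown."""

    def __init__(self, capacity, l2):
        self.capacity = capacity
        self.l1 = OrderedDict()
        self.l2 = l2                       # dict standing in for the L2 log

    def set(self, key, value):
        self.l1[key] = value               # L1 only: no immediate L2 write
        self.l1.move_to_end(key)
        if len(self.l1) > self.capacity:   # evict the oldest entry to L2
            old_key, old_val = self.l1.popitem(last=False)
            self.l2[old_key] = old_val

    def shutdown(self):                    # clean shutdown flushes all of L1
        self.l2.update(self.l1)
        self.l1.clear()
```

The durability gap is visible in the sketch: kill the process before `shutdown()` and everything still sitting in `l1` is gone.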
On startup, Swytch:
- Opens the FASTER log file
- Rebuilds the index by scanning committed entries
- Validates checksums to detect corruption
- Truncates corrupted tail if any (crash recovery)
This is faster than Redis AOF replay because:
- Index is rebuilt from metadata, not by re-executing commands
- Only committed entries are scanned
- No command parsing or execution needed
Recovery time: Approximately 10 seconds per GB of data. A 10GB database recovers in under 2 minutes. Plan maintenance windows accordingly.
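To make the recovery steps concrete, here is a minimal sketch of an index rebuild over a checksummed append-only log. The entry layout (a length-prefixed header carrying a CRC32) is an assumption for illustration, not the FASTER log's actual format.

```python
import struct
import zlib

HEADER = struct.Struct("<III")  # key length, value length, CRC32 of key+value

def append_entry(log, key, value):
    """Append one checksummed entry to the (in-memory) log."""
    header = HEADER.pack(len(key), len(value), zlib.crc32(key + value))
    return log + header + key + value

def recover(log):
    """Scan the log, rebuild the key -> value-offset index, and truncate a
    corrupted tail (e.g. a partial write from a crash)."""
    index, pos = {}, 0
    while pos + HEADER.size <= len(log):
        klen, vlen, crc = HEADER.unpack_from(log, pos)
        start = pos + HEADER.size
        end = start + klen + vlen
        if end > len(log):                  # partial write at the tail
            break
        key = log[start:start + klen]
        value = log[start + klen:end]
        if zlib.crc32(key + value) != crc:  # corruption detected
            break
        index[key] = start + klen           # offset of the value
        pos = end
    return index, log[:pos]                 # truncated, consistent log
```

No commands are parsed or re-executed; the scan only reads entry metadata and verifies checksums, which is why this is faster than an AOF replay.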
Over time, deleted keys leave holes in the log. How Swytch handles this depends on your operating system.
On operating systems that support hole punching (fallocate with FALLOC_FL_PUNCH_HOLE), Swytch automatically
reclaims space from deleted keys by punching holes in the log file. The file’s apparent size stays the same, but
actual disk usage shrinks.
You can see the difference:
# Apparent size (includes holes)
$ ls -lh redis.db
-rw-r--r-- 1 user user 1.2G Jan 15 10:00 redis.db
# Actual disk usage (excludes holes)
$ du -h redis.db
245M redis.db
In this example, the log has 1.2GB of data written over time, but only 245MB is actually used on disk.
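You can reproduce the apparent-size vs. actual-usage gap with an ordinary sparse file. This demo seeks past a hole rather than punching one, but on filesystems that support sparse files, `st_size` and `st_blocks` show the same discrepancy that `ls` and `du` report.

```python
import os
import tempfile

# Create a sparse file: seek past a 1 MiB hole, then write 4 bytes.
path = os.path.join(tempfile.mkdtemp(), "sparse.db")
with open(path, "wb") as f:
    f.seek(1024 * 1024)   # the skipped range becomes a hole
    f.write(b"data")

apparent = os.path.getsize(path)         # what `ls -l` reports
actual = os.stat(path).st_blocks * 512   # what `du` reports
print(apparent, actual)  # apparent is just over 1 MiB; actual is far smaller
```

Hole punching works the same way in reverse: FALLOC_FL_PUNCH_HOLE deallocates blocks inside an existing file, shrinking `du` output without changing `ls` output.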
On operating systems without hole punching support, the log file grows unbounded as keys are deleted and new keys are written. Disk space is not reclaimed automatically.
To reclaim space, restart the server with the --defragment flag:
swytch redis --persistent --db-path=/data/redis.db --defragment
This compacts the log in-place, removing deleted entries and reclaiming space. Plan for downtime proportional to your data size.
Even on Linux/macOS, you may want to run manual defragmentation occasionally to:
- Consolidate data for faster sequential reads
- Prepare for backup (smaller file to copy)
- Reset after a large bulk delete operation
# Stop the server, defragment, restart
swytch redis --persistent --db-path=/data/redis.db --defragment
Operators can safely copy the database file while the server is running. The FASTER log format is append-only with checksummed entries, so a copy made during writes will be consistent up to the point of the copy—any partial write at the end is detected and truncated on recovery.
# Copy the database file while server is running
cp /data/redis.db /backup/redis-$(date +%Y%m%d-%H%M%S).db
For the smallest possible backup, defragment first (this compacts the file in-place):
# Stop the server, then start with --defragment (compacts the log on startup)
swytch redis --persistent --db-path=/data/redis.db --defragment
# Stop the server again and copy the compacted file
cp /data/redis.db /backup/redis-$(date +%Y%m%d-%H%M%S).db
# Restart for production
swytch redis --persistent --db-path=/data/redis.db
To restore from a backup, simply replace the database file:
# Stop the server
# Replace the database file
cp /backup/redis-20240115-100000.db /data/redis.db
# Start the server
swytch redis --persistent --db-path=/data/redis.db
The server will validate checksums and rebuild its index on startup.
Enable metrics to monitor tiered storage:
swytch redis --persistent --db-path=/data/redis.db --metrics-port=9090
Key metrics for tiered storage:
| Metric | Description |
|---|---|
| swytch_redis_l2_hits_total | L2 (disk) cache hits |
| swytch_redis_l2_misses_total | L2 cache misses |
| swytch_redis_l2_writes_total | L2 writes |
| swytch_redis_cache_hits_total | L1 cache hits |
| swytch_redis_cache_misses_total | L1 cache misses |
| swytch_redis_evictions_total | L1 evictions |
| swytch_redis_memory_bytes | Current memory usage |
| swytch_redis_memory_max_bytes | Configured memory limit |
L2 Hit Rate = l2_hits / (l2_hits + l2_misses)
- High L2 hit rate (>90%): Working set exceeds L1 but fits in L2. Consider increasing memory.
- Low L2 hit rate (<50%): Many requests for non-existent keys, or very large working set.
- L2 hits = 0: All data fits in L1 (ideal for cache workloads).
L2 Write Rate = l2_writes / total_writes
- Should be ~100% in write-through mode
- Lower in ghost mode (writes only on eviction)
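Both ratios are easy to derive from the counters above; a couple of small helpers, for illustration:

```python
def l2_hit_rate(l2_hits, l2_misses):
    """L2 hit rate from the swytch_redis_l2_* counters."""
    total = l2_hits + l2_misses
    return l2_hits / total if total else 0.0

def l2_write_rate(l2_writes, total_writes):
    """Fraction of writes reaching L2 (~1.0 in write-through mode,
    lower in ghost mode)."""
    return l2_writes / total_writes if total_writes else 0.0
```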
Benchmarked on AMD Ryzen 5 3600 (6 cores), 64GB RAM, Samsung NVMe RAID0, Ubuntu 24.04. Using Unix socket, memtier_benchmark with 4 threads, 50 connections, 256-byte values.
| Workload | Write-Through | Ghost Mode |
|---|---|---|
| 100% writes | 247k ops/sec | 397k ops/sec |
| 100% reads | 418-427k ops/sec | - |
| 50/50 mixed | 336k ops/sec | - |
| Workload | Mode | p50 | p99 | p99.9 |
|---|---|---|---|---|
| 100% writes | Write-through | 0.52ms | 3.36ms | 6.50ms |
| 100% writes | Ghost | 0.43ms | 2.19ms | 5.41ms |
| 100% reads | Write-through | 0.42ms | 1.45-1.77ms | 4.19-4.51ms |
| 50/50 mixed | Write-through | 0.44ms | 2.98ms | 5.44ms |
The p99 latency spikes in write workloads reflect SSD write buffer flushes. Under sustained writes, the NVMe’s internal buffer fills and must flush to NAND, causing brief stalls.
The disk I/O pattern is SSD-friendly:
- Sequential writes only – no random I/O for writes
- Random reads for L2 lookups – memory-mapped, the OS handles caching
- Batched fsync – one fsync per 10 ms, not per write
Use tiered storage when:
- You need durability – data must survive restarts
- You’re using Redis as a database – primary data store, not just a cache
- Your working set exceeds memory – L2 extends effective capacity
- You need fast recovery – indexed recovery beats AOF replay
Stick with plain in-memory mode when:
- Pure caching – data can be regenerated from the source
- Maximum throughput – no persistence overhead
- Ephemeral data – sessions, rate limits, temporary state
Using Swytch as a primary database for user sessions:
# Start with persistence and monitoring
swytch redis \
--persistent \
--db-path=/data/sessions.db \
--maxmemory=2gb \
--metrics-port=9090
import redis
r = redis.Redis(host='localhost', port=6379)
# Store session - durable within 10ms
r.hset('session:abc123', mapping={
'user_id': '42',
'created_at': '2024-01-15T10:00:00Z',
'permissions': 'read,write'
})
r.expire('session:abc123', 86400) # 24 hour TTL
# Read session - from L1 or L2
session = r.hgetall('session:abc123')
Even if Swytch crashes and restarts, the session data survives.
| Aspect | Redis AOF | Swytch Tiered |
|---|---|---|
| Max data loss | 1 second (everysec) | 10ms |
| Write amplification | High (full commands) | Low (values only) |
| Recovery | Replay commands | Index rebuild |
| Compaction | Manual BGREWRITEAOF | Automatic |
| Aspect | Redis RDB | Swytch Tiered |
|---|---|---|
| Max data loss | Minutes | 10ms |
| Memory during save | 2x (fork) | 1x |
| Point-in-time backup | Yes | No (continuous) |
| Aspect | KeyDB/Dragonfly | Swytch Tiered |
|---|---|---|
| Persistence model | Redis-compatible | FASTER log |
| Durability | Same as Redis | 10ms batched |
| Multi-threaded | Yes | Yes |
| Memory efficiency | Similar to Redis | Passthrough (no duplication) |