# Observability
Swytch provides Prometheus metrics, OpenTelemetry tracing, and structured logging for production observability.
Enable the metrics endpoint with `--metrics-port`:

```sh
swytch redis --metrics-port=9090
```
This exposes two HTTP endpoints:
| Endpoint | Description |
|---|---|
| `/metrics` | Prometheus scrape endpoint (OpenMetrics format) |
| `/health` | Returns HTTP 200 with body `ok` |

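A minimal Prometheus scrape config for this endpoint might look like the following (the job name and target are illustrative placeholders, not defaults shipped with Swytch):

```yaml
scrape_configs:
  - job_name: swytch
    static_configs:
      - targets: ["localhost:9090"]
```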
These metrics are collected from the Redis server and cache engine:
| Metric | Type | Labels | Description |
|---|---|---|---|
| `swytch_redis_connections_current` | Gauge | | Current client connections |
| `swytch_redis_connections_total` | Counter | | Total connections since start |
| `swytch_redis_commands_total` | Counter | `command` | Commands processed by type |
| `swytch_redis_cache_hits_total` | Counter | | Cache hits |
| `swytch_redis_cache_misses_total` | Counter | | Cache misses |
| `swytch_redis_cache_hit_rate` | Gauge | | Hit rate (0-1) |
| `swytch_redis_evictions_total` | Counter | | Cache evictions |
| `swytch_redis_items_count` | Gauge | | Items in cache |
| `swytch_redis_memory_bytes` | Gauge | | Memory used by cache (bytes) |
| `swytch_redis_memory_max_bytes` | Gauge | | Configured max memory (bytes) |
| `swytch_redis_latency_seconds` | Gauge | `operation`, `quantile` | Latency p50/p99 for get, set, cmd |
| `swytch_redis_adaptive_k_threshold` | Gauge | `shard` | Per-shard eviction protection threshold |
| `swytch_redis_bytes_read_total` | Counter | | Network bytes read |
| `swytch_redis_bytes_written_total` | Counter | | Network bytes written |
| `swytch_redis_uptime_seconds` | Gauge | | Server uptime |
| `swytch_redis_blocked_clients` | Gauge | | Clients blocked on BLPOP, BRPOP, etc. |
| `swytch_redis_command_errors_total` | Counter | `command` | Command errors by type |

The `swytch_redis_latency_seconds` metric reports six time series:

| `operation` | `quantile` | Meaning |
|---|---|---|
| get | 0.5 | GET p50 latency |
| get | 0.99 | GET p99 latency |
| set | 0.5 | SET p50 latency |
| set | 0.99 | SET p99 latency |
| cmd | 0.5 | All commands, p50 latency |
| cmd | 0.99 | All commands, p99 latency |

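For intuition about what these quantile series report, here is a nearest-rank quantile computed over a finite sample window. This is an illustrative sketch only; it is not necessarily how Swytch estimates its latency quantiles internally:

```python
import math

def quantile(samples, q):
    """Nearest-rank quantile of a finite sample window (0 < q <= 1)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Nearest-rank definition: the ceil(q * n)-th smallest sample.
    idx = min(len(ordered) - 1, max(0, math.ceil(q * len(ordered)) - 1))
    return ordered[idx]

window = [0.0001, 0.0002, 0.0002, 0.0003, 0.0100]  # latencies in seconds
p50, p99 = quantile(window, 0.5), quantile(window, 0.99)
# p50 -> 0.0002, p99 -> 0.01
```

Note how a single slow request dominates p99 while leaving p50 untouched, which is why both series are exported.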
When running in cluster mode, additional metrics are exposed via the default Prometheus registry:
| Metric | Type | Labels | Description |
|---|---|---|---|
| `cluster_lsl_ms` | Gauge | `peer` | Light-speed latency: minimum observed HLC delta (ms) |
| `cluster_protocol_overhead_ms` | Gauge | `peer` | Extra latency beyond LSL |
| `cluster_causality_violations_total` | Counter | `peer` | Effects arriving with an HLC beyond the causal horizon |
| `cluster_causal_horizon_ms` | Gauge | `peer` | Causal horizon width per peer (ms) |
| `cluster_max_hlc_drift_ms` | Gauge | `peer` | Maximum observed HLC drift (ms) |

| Metric | Type | Description |
|---|---|---|
| `cluster_writes_total` | Counter | Local write effects emitted |
| `cluster_reads_total` | Counter | Total reads served |
| `cluster_reads_local_total` | Counter | Reads served from the local log/cache |
| `cluster_reads_remote_fetch_total` | Counter | Reads requiring a remote fetch |
| `cluster_notifications_sent_total` | Counter | OffsetNotify messages broadcast |
| `cluster_notifications_received_total` | Counter | OffsetNotify messages received |
| `cluster_fetches_served_total` | Counter | Fetch RPCs served to peers |
| `cluster_binds_emitted_total` | Counter | Bind effects (concurrent writes detected) |
| `cluster_snapshots_emitted_total` | Counter | Snapshot effects (bind resolution) |

| Metric | Type | Description |
|---|---|---|
| `cluster_segment_active_bytes` | Gauge | Bytes in the current live segment |
| `cluster_segment_active_slots` | Gauge | Slots in the current segment (out of 1M) |
| `cluster_segments_sealed_total` | Gauge | Sealed segments on this node |
| `cluster_disk_used_bytes` | Gauge | Total disk used by log segments |
| `cluster_disk_capacity_bytes` | Gauge | Total disk capacity |
| `cluster_disk_usage_ratio` | Gauge | Disk usage ratio (used / capacity) |

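Disk pressure is worth watching alongside memory; a suggested PromQL expression (our suggestion, not an alert shipped with Swytch) that fires when log segments fill most of the disk:

```promql
cluster_disk_usage_ratio > 0.8
```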
| Metric | Type | Labels | Description |
|---|---|---|---|
| `cluster_peer_connected` | Gauge | `peer` | 1 if the QUIC stream is up, 0 if down |
| `cluster_peer_reconnects_total` | Counter | `peer` | Reconnection count |
| `cluster_peer_notifications_dropped_total` | Counter | `peer` | Notifications dropped (buffer full or disconnected) |
| `cluster_peer_symmetric` | Gauge | `peer` | 1 if the path is symmetric, 0 if asymmetric |
| `cluster_peer_alive` | Gauge | `peer` | 1 if alive (heartbeat within timeout), 0 if dead |
| `cluster_peer_rtt_ms` | Gauge | `peer` | Estimated RTT from heartbeats (ms) |

| Metric | Type | Labels | Description |
|---|---|---|---|
| `cluster_heartbeats_sent_total` | Counter | | Heartbeat packets sent |
| `cluster_heartbeats_received_total` | Counter | | Heartbeat packets received |
| `cluster_udp_notify_ack_latency_ms` | Histogram | | UDP notification ACK latency (ms) |
| `cluster_retransmission_giveups_total` | Counter | `peer` | Retransmission failures (max retries exhausted) |
| `cluster_quic_streams_opened_total` | Counter | | QUIC uni-streams opened |
| `cluster_quic_stream_errors_total` | Counter | | QUIC stream errors |

Enable distributed tracing with OTLP HTTP export:

```sh
swytch redis --otel-endpoint=localhost:4318

# Use HTTP instead of HTTPS
swytch redis --otel-endpoint=localhost:4318 --otel-insecure
```
When enabled:

- Traces are exported via OTLP HTTP to the configured endpoint
- The service name is `swytch`
- Trace context is propagated in binary format across cluster nodes
- `trace_id` and `span_id` are injected into structured log records automatically
When `--otel-endpoint` is not set, tracing is disabled entirely, with zero overhead.
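If you route traces through an OpenTelemetry Collector rather than sending them directly to a backend, a minimal Collector config receiving OTLP over HTTP on the port used above could look like this (a sketch; swap the `debug` exporter for your real tracing backend):

```yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
exporters:
  debug: {}
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug]
```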
When both tracing and JSON logging are enabled, every log line includes `trace_id` and `span_id` fields:

```sh
swytch redis --otel-endpoint=localhost:4318 --log-format=json
```

```json
{
  "time": "2026-04-15T10:30:00Z",
  "level": "INFO",
  "msg": "command processed",
  "trace_id": "abc123...",
  "span_id": "def456..."
}
```
This allows correlating traces in Jaeger/Tempo with log entries in your log aggregator.
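As a sketch of that correlation on the log side, a few lines of Python (a hypothetical helper; the field names are taken from the record above) can group JSON log lines by `trace_id`:

```python
import json

def index_by_trace(lines):
    """Group structured JSON log records by their trace_id field."""
    by_trace = {}
    for line in lines:
        record = json.loads(line)
        trace_id = record.get("trace_id")
        if trace_id:  # records logged outside any span carry no trace_id
            by_trace.setdefault(trace_id, []).append(record)
    return by_trace

logs = [
    '{"time":"2026-04-15T10:30:00Z","level":"INFO","msg":"command processed",'
    '"trace_id":"abc123","span_id":"def456"}',
    '{"time":"2026-04-15T10:30:01Z","level":"INFO","msg":"startup"}',
]
grouped = index_by_trace(logs)
```

Log aggregators like Loki or Elasticsearch do the same grouping with a field filter on `trace_id`.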
Swytch uses Go's `slog` for structured logging:

```sh
# Text format (default, human-readable)
swytch redis -v

# JSON format (machine-parseable)
swytch redis --log-format=json -v
```
| Flag | Level | What's logged |
|---|---|---|
| (none) | INFO | Startup, shutdown, errors |
| `-v` | DEBUG | Detailed operational info |
| `--debug` | DEBUG | All commands processed (very verbose) |

Some useful PromQL queries:

Cache hit rate:

```promql
swytch_redis_cache_hit_rate
```

Hit rate computed from counters (more accurate over time windows):

```promql
rate(swytch_redis_cache_hits_total[5m])
  / (rate(swytch_redis_cache_hits_total[5m]) + rate(swytch_redis_cache_misses_total[5m]))
```
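The same window-based hit rate can also be computed from two raw counter scrapes, e.g. in a health-check script (a sketch of the arithmetic, not part of Swytch):

```python
def hit_rate(hits_then, hits_now, misses_then, misses_now):
    """Hit rate over a window, given two samples of the monotonic
    hits/misses counters. Mirrors the rate()-based PromQL query:
    delta(hits) / (delta(hits) + delta(misses))."""
    hits = hits_now - hits_then
    misses = misses_now - misses_then
    total = hits + misses
    return hits / total if total else 0.0
```

For example, `hit_rate(0, 90, 0, 10)` reports 0.9 for a window with 90 hits and 10 misses.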
Memory pressure:

```promql
swytch_redis_memory_bytes / swytch_redis_memory_max_bytes
```

Eviction rate:

```promql
rate(swytch_redis_evictions_total[5m])
```

Command latency p99:

```promql
swytch_redis_latency_seconds{operation="cmd", quantile="0.99"}
```

Commands per second by type:

```promql
rate(swytch_redis_commands_total[1m])
```

Cluster: peer connectivity:

```promql
cluster_peer_alive
```

Cluster: replication lag:

```promql
cluster_lsl_ms + cluster_protocol_overhead_ms
```

Cluster: remote fetch ratio (higher = more cross-node reads):

```promql
rate(cluster_reads_remote_fetch_total[5m])
  / (rate(cluster_reads_local_total[5m]) + rate(cluster_reads_remote_fetch_total[5m]))
```
Example Prometheus alerting rules for these metrics:

```yaml
groups:
  - name: swytch
    rules:
      - alert: SwytchMemoryPressure
        expr: swytch_redis_memory_bytes / swytch_redis_memory_max_bytes > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Swytch memory usage above 90%"
      - alert: SwytchHighEvictionRate
        expr: rate(swytch_redis_evictions_total[5m]) > 1000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High eviction rate indicates memory pressure"
      - alert: SwytchLowHitRate
        expr: swytch_redis_cache_hit_rate < 0.8
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Cache hit rate below 80%"
      - alert: SwytchPeerDown
        expr: cluster_peer_alive == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Cluster peer unreachable"
      - alert: SwytchCausalityViolation
        expr: rate(cluster_causality_violations_total[5m]) > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Causality violations detected (possible clock divergence)"
```
The standard Redis `INFO` command also reports server statistics:

```sh
redis-cli INFO
redis-cli INFO server
redis-cli INFO memory
redis-cli INFO stats
```
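`INFO` output is plain `key:value` lines grouped under `# Section` headers, so it is easy to post-process. A small parser sketch (not part of Swytch; the sample text uses standard Redis field names for illustration):

```python
def parse_info(text):
    """Parse Redis INFO output into {section: {key: value}}."""
    sections, current = {}, {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Start a new section, keyed by its lowercased name.
            current = sections.setdefault(line[1:].strip().lower(), {})
        elif ":" in line:
            key, _, value = line.partition(":")
            current[key] = value
    return sections

sample = "# Server\r\nredis_version:7.0.0\r\n\r\n# Stats\r\nkeyspace_hits:42\r\n"
info = parse_info(sample)
# info["stats"]["keyspace_hits"] == "42"
```

Values are left as strings because INFO mixes numbers, version strings, and flags; convert per field as needed.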