Observability

Swytch provides Prometheus metrics, OpenTelemetry tracing, and structured logging for production observability.

Prometheus Metrics

Enable the metrics endpoint with --metrics-port:

swytch redis --metrics-port=9090

This exposes two HTTP endpoints on the configured port:

Endpoint	Description
`/metrics`	Prometheus scrape endpoint (OpenMetrics format)
`/health`	Returns HTTP 200 with body `ok`
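
As a rough illustration of how a deployment script or readiness probe might use `/health`, here is a minimal Python sketch. The helper name `check_health`, the 2-second timeout, and the tolerance for trailing whitespace in the body are assumptions made for this example, not part of Swytch.

```python
import urllib.error
import urllib.request


def check_health(base_url, timeout=2.0, opener=urllib.request.urlopen):
    """Return True if GET <base_url>/health answers 200 with body "ok".

    `opener` is injectable so the function can be exercised without a
    running server; by default it is urllib.request.urlopen.
    """
    url = base_url.rstrip("/") + "/health"
    try:
        with opener(url, timeout=timeout) as resp:
            # Assumption: a trailing newline after "ok" is acceptable.
            body = resp.read().decode("utf-8", errors="replace").strip()
            return resp.status == 200 and body == "ok"
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status (HTTPError).
        return False
```

With the server started as shown above, `check_health("http://localhost:9090")` would be the call to make; any connection error or unexpected body is reported as unhealthy rather than raised.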

Redis-Level Metrics

These metrics are collected from the Redis server and cache engine:

Metric	Type	Labels	Description
`swytch_redis_connections_current`	Gauge		Current client connections
`swytch_redis_connections_total`	Counter		Total connections since start
`swytch_redis_commands_total`	Counter	`command`	Commands processed by type
`swytch_redis_cache_hits_total`	Counter		Cache hits
`swytch_redis_cache_misses_total`	Counter		Cache misses
`swytch_redis_cache_hit_rate`	Gauge		Hit rate (0-1)
`swytch_redis_evictions_total`	Counter		Cache evictions
`swytch_redis_items_count`	Gauge		Items in cache
`swytch_redis_memory_bytes`	Gauge		Memory used by cache (bytes)
`swytch_redis_memory_max_bytes`	Gauge		Configured max memory (bytes)
`swytch_redis_latency_seconds`	Gauge	`operation`, `quantile`	Latency p50/p99 for get, set, cmd
`swytch_redis_adaptive_k_threshold`	Gauge	`shard`	Per-shard eviction protection threshold
`swytch_redis_bytes_read_total`	Counter		Network bytes read
`swytch_redis_bytes_written_total`	Counter		Network bytes written
`swytch_redis_uptime_seconds`	Gauge		Server uptime
`swytch_redis_blocked_clients`	Gauge		Clients blocked on BLPOP, BRPOP, etc.
`swytch_redis_command_errors_total`	Counter	`command`	Command errors by type

The swytch_redis_latency_seconds metric reports six time series, one per combination of operation and quantile:

`operation`	`quantile`	Meaning
`get`	`0.5`	GET p50 latency
`get`	`0.99`	GET p99 latency
`set`	`0.5`	SET p50 latency
`set`	`0.99`	SET p99 latency
`cmd`	`0.5`	All commands p50 latency
`cmd`	`0.99`	All commands p99 latency
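
To make the label layout concrete, the sketch below pulls these series out of a scrape body with a small Python parser. It handles only the simple `name{label="value"} number` line shape; the sample text and its values are invented for illustration, and a real consumer would more likely use a Prometheus client library's parser.

```python
import re

# One sample line of the text exposition format:
#   name{label="value",...} 0.00012
# Simplification: label values containing "}" or escaped quotes are not handled.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def latency_series(text):
    """Map (operation, quantile) -> seconds for swytch_redis_latency_seconds."""
    out = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks, HELP/TYPE comments, and "# EOF"
        m = SAMPLE_RE.match(line)
        if not m or m.group("name") != "swytch_redis_latency_seconds":
            continue
        labels = dict(LABEL_RE.findall(m.group("labels") or ""))
        key = (labels.get("operation"), labels.get("quantile"))
        out[key] = float(m.group("value"))
    return out


# Hypothetical scrape excerpt; the values are made up for illustration.
SAMPLE = """\
# TYPE swytch_redis_latency_seconds gauge
swytch_redis_latency_seconds{operation="get",quantile="0.5"} 0.00004
swytch_redis_latency_seconds{operation="get",quantile="0.99"} 0.00031
swytch_redis_latency_seconds{operation="cmd",quantile="0.99"} 0.00052
swytch_redis_items_count 1024
# EOF
"""
```

Calling `latency_series(SAMPLE)` returns a dictionary keyed by `(operation, quantile)`, so the p99 for all commands is the entry under `("cmd", "0.99")`.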

Cluster Metrics

When running in cluster mode, additional metrics are exposed via the default Prometheus registry:

Replication Latency

Metric	Type	Labels	Description
`cluster_lsl_ms`	Gauge	`peer`	Light-speed latency: minimum observed HLC delta (ms)
`cluster_protocol_overhead_ms`	Gauge	`peer`	Extra latency beyond LSL (ms)

Causality

Metric	Type	Labels	Description
`cluster_causality_violations_total`	Counter	`peer`	Effects arriving with HLC beyond causal horizon
`cluster_causal_horizon_ms`	Gauge	`peer`	Causal horizon width per peer (ms)
`cluster_max_hlc_drift_ms`	Gauge	`peer`	Maximum observed HLC drift (ms)

Throughput

Metric	Type	Description
`cluster_writes_total`	Counter	Local write effects emitted
`cluster_reads_total`	Counter	Total reads served
`cluster_reads_local_total`	Counter	Reads from local log/cache
`cluster_reads_remote_fetch_total`	Counter	Reads requiring remote fetch
`cluster_notifications_sent_total`	Counter	OffsetNotify messages broadcast
`cluster_notifications_received_total`	Counter	OffsetNotify messages received
`cluster_fetches_served_total`	Counter	Fetch RPCs served to peers
`cluster_binds_emitted_total`	Counter	Bind effects (concurrent writes detected)
`cluster_snapshots_emitted_total`	Counter	Snapshot effects (bind resolution)

Disk

Metric	Type	Description
`cluster_segment_active_bytes`	Gauge	Bytes in current live segment
`cluster_segment_active_slots`	Gauge	Slots in current segment (out of 1M)
`cluster_segments_sealed_total`	Gauge	Sealed segments on this node
`cluster_disk_used_bytes`	Gauge	Total disk used by log segments
`cluster_disk_capacity_bytes`	Gauge	Total disk capacity
`cluster_disk_usage_ratio`	Gauge	Disk usage ratio (used / capacity)

Peer Health

Metric	Type	Labels	Description
`cluster_peer_connected`	Gauge	`peer`	1 if QUIC stream up, 0 if down
`cluster_peer_reconnects_total`	Counter	`peer`	Reconnection count
`cluster_peer_notifications_dropped_total`	Counter	`peer`	Notifications dropped (buffer full or disconnected)
`cluster_peer_symmetric`	Gauge	`peer`	1 if path symmetric, 0 if asymmetric
`cluster_peer_alive`	Gauge	`peer`	1 if alive (heartbeat within timeout), 0 if dead
`cluster_peer_rtt_ms`	Gauge	`peer`	Estimated RTT from heartbeat (ms)

Heartbeat and Transport

Metric	Type	Labels	Description
`cluster_heartbeats_sent_total`	Counter		Heartbeat packets sent
`cluster_heartbeats_received_total`	Counter		Heartbeat packets received
`cluster_udp_notify_ack_latency_ms`	Histogram		UDP notification ACK latency (ms)
`cluster_retransmission_giveups_total`	Counter	`peer`	Retransmission failures (max retries exhausted)
`cluster_quic_streams_opened_total`	Counter		QUIC uni-streams opened
`cluster_quic_stream_errors_total`	Counter		QUIC stream errors

OpenTelemetry Tracing

Enable distributed tracing with OTLP HTTP export:

swytch redis --otel-endpoint=localhost:4318

# Use HTTP instead of HTTPS
swytch redis --otel-endpoint=localhost:4318 --otel-insecure

When enabled:

Traces are exported via OTLP HTTP to the configured endpoint
The service name is swytch
Trace context is propagated in binary format across cluster nodes
trace_id and span_id are injected into structured log records automatically

When --otel-endpoint is not set, tracing is disabled entirely and adds no overhead.

Trace-Log Correlation

When both tracing and JSON logging are enabled, every log line includes trace_id and span_id fields:

swytch redis --otel-endpoint=localhost:4318 --log-format=json

{
    "time": "2026-04-15T10:30:00Z",
    "level": "INFO",
    "msg": "command processed",
    "trace_id": "abc123...",
    "span_id": "def456..."
}

This lets you correlate traces in Jaeger or Tempo with the matching entries in your log aggregator.
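
Outside of a log aggregator, the same correlation can be done by hand. The following Python sketch groups JSON log records by `trace_id`, assuming one JSON object per line (which is how Go's slog JSON handler writes records). The function name, the IDs, and the "remote fetch" message are invented for this example.

```python
import json
from collections import defaultdict


def group_by_trace(lines):
    """Group JSON log records by their trace_id field."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. a text-format line mixed into the stream
        if not isinstance(record, dict):
            continue
        trace_id = record.get("trace_id")
        if trace_id:  # records logged outside a trace carry no trace_id
            groups[trace_id].append(record)
    return dict(groups)


# Hypothetical log lines; the IDs and the "remote fetch" message are invented.
LOGS = [
    '{"time":"2026-04-15T10:30:00Z","level":"INFO","msg":"command processed","trace_id":"aaa111","span_id":"s1"}',
    '{"time":"2026-04-15T10:30:00Z","level":"DEBUG","msg":"remote fetch","trace_id":"aaa111","span_id":"s2"}',
    '{"time":"2026-04-15T10:30:01Z","level":"INFO","msg":"command processed","trace_id":"bbb222","span_id":"s3"}',
    'time=2026-04-15T10:30:02Z level=INFO msg="server started"',
]
```

`group_by_trace(LOGS)` yields one list of records per trace, which can then be looked up by the trace ID shown in Jaeger or Tempo. Lines that are not JSON, such as the text-format line at the end, are skipped.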

Structured Logging

Swytch uses Go’s slog for structured logging:

# Text format (default, human-readable)
swytch redis -v

# JSON format (machine-parseable)
swytch redis --log-format=json -v

Log Levels

Flag	Level	What’s logged
(none)	INFO	Startup, shutdown, errors
`-v`	DEBUG	Detailed operational info
`--debug`	DEBUG	All commands processed (very verbose)

Example Queries

Grafana / PromQL

Cache hit rate:

swytch_redis_cache_hit_rate

Hit rate computed from counters (more accurate over time windows):

rate(swytch_redis_cache_hits_total[5m])
/ (rate(swytch_redis_cache_hits_total[5m]) + rate(swytch_redis_cache_misses_total[5m]))
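
The reason the counter-based query reflects a time window is that each `rate()` is a counter increase divided by the window length, and the window length cancels in the ratio, leaving hits gained over lookups gained. A minimal Python sketch of that arithmetic between two scrapes follows; the function name and sample numbers are illustrative, and unlike PromQL's `rate()` it neither extrapolates nor compensates for counter resets.

```python
def windowed_hit_rate(prev, curr):
    """Hit rate over the interval between two scrapes of the counters.

    prev and curr are (hits_total, misses_total) tuples. Returns None when
    there were no lookups in the window or a counter went backwards.
    """
    d_hits = curr[0] - prev[0]
    d_misses = curr[1] - prev[1]
    if d_hits < 0 or d_misses < 0:
        return None  # counter reset (e.g. restart); not handled in this sketch
    lookups = d_hits + d_misses
    if lookups == 0:
        return None  # PromQL would yield NaN here (0 / 0)
    return d_hits / lookups
```

For example, going from `(1000, 200)` to `(1900, 300)` gives 900 hits out of 1000 lookups in the window, so a rate of 0.9, even though the ratio of the lifetime totals is lower (1900 of 2200).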

Memory pressure:

swytch_redis_memory_bytes / swytch_redis_memory_max_bytes

Eviction rate:

rate(swytch_redis_evictions_total[5m])

Command latency p99:

swytch_redis_latency_seconds{operation="cmd", quantile="0.99"}

Commands per second by type (one series per `command` label value):

rate(swytch_redis_commands_total[1m])

Cluster: peer connectivity:

cluster_peer_alive

Cluster: replication lag:

cluster_lsl_ms + cluster_protocol_overhead_ms

Cluster: remote fetch ratio (higher = more cross-node reads):

rate(cluster_reads_remote_fetch_total[5m])
/ (rate(cluster_reads_local_total[5m]) + rate(cluster_reads_remote_fetch_total[5m]))

Prometheus Alerting Rules

groups:
  - name: swytch
    rules:
      - alert: SwytchMemoryPressure
        expr: swytch_redis_memory_bytes / swytch_redis_memory_max_bytes > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Swytch memory usage above 90%"

      - alert: SwytchHighEvictionRate
        expr: rate(swytch_redis_evictions_total[5m]) > 1000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High eviction rate indicates memory pressure"

      - alert: SwytchLowHitRate
        expr: swytch_redis_cache_hit_rate < 0.8
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Cache hit rate below 80%"

      - alert: SwytchPeerDown
        expr: cluster_peer_alive == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Cluster peer unreachable"

      - alert: SwytchCausalityViolation
        expr: rate(cluster_causality_violations_total[5m]) > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Causality violations detected — possible clock divergence"

INFO Command

The standard Redis INFO command also reports server statistics:

redis-cli INFO
redis-cli INFO server
redis-cli INFO memory
redis-cli INFO stats
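
INFO replies use the standard Redis layout: `# Section` header lines followed by `field:value` lines. The Python sketch below turns that text into nested dictionaries. The sample uses common Redis field names purely for illustration; which fields Swytch actually reports is not specified here, so treat them as assumptions.

```python
def parse_info(text):
    """Parse Redis INFO output into {section: {field: value}}.

    Section names are lowercased; values are kept as strings.
    """
    sections = {}
    current = sections.setdefault("", {})  # fields before any section header
    for raw in text.splitlines():  # splitlines() also handles \r\n endings
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            name = line.lstrip("#").strip().lower()
            current = sections.setdefault(name, {})
            continue
        key, sep, value = line.partition(":")
        if sep:  # ignore lines without a "field:value" shape
            current[key] = value
    # Drop sections that ended up empty (including the unnamed one).
    return {name: fields for name, fields in sections.items() if fields}


# Hypothetical INFO output; field names follow Redis conventions and the
# values are made up.
SAMPLE_INFO = (
    "# Server\r\nredis_version:7.0.0\r\n\r\n"
    "# Stats\r\nkeyspace_hits:900\r\nkeyspace_misses:100\r\n"
)
```

`parse_info(SAMPLE_INFO)["stats"]["keyspace_hits"]` returns the string `"900"`; converting values to numbers is left to the caller because not every INFO field is numeric.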