Swytch Documentation
Observability

Swytch provides Prometheus metrics, OpenTelemetry tracing, and structured logging for production observability.

Prometheus Metrics

Enable the metrics endpoint with --metrics-port:

swytch redis --metrics-port=9090

This exposes two HTTP endpoints:

| Endpoint | Description |
| --- | --- |
| /metrics | Prometheus scrape endpoint (OpenMetrics format) |
| /health | Returns HTTP 200 with body ok |
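
As a rough illustration of consuming the scrape endpoint, the sketch below parses sample lines of the Prometheus text exposition format into (name, labels, value) tuples. The sample scrape text and its values are made up, the helper names `fetch` and `parse_metrics` are hypothetical, and the parser handles only simple sample lines (no timestamps or exemplars), so treat it as a starting point rather than a complete client.

```python
import re
from urllib.request import urlopen

# One sample line looks like: metric_name{label="value",...} 1.23
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def fetch(url, timeout=2.0):
    """Return (HTTP status, body text) for a URL such as the /health endpoint."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8")


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples, skipping comment lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # HELP, TYPE, and EOF lines carry no sample
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, raw_value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


# Hypothetical scrape output; the numbers are illustrative only.
SAMPLE_SCRAPE = """\
# TYPE swytch_redis_connections_current gauge
swytch_redis_connections_current 12
# TYPE swytch_redis_latency_seconds gauge
swytch_redis_latency_seconds{operation="get",quantile="0.99"} 0.00042
# EOF
"""

samples = parse_metrics(SAMPLE_SCRAPE)
for name, labels, value in samples:
    print(name, labels, value)
```

Against a running instance started with --metrics-port=9090, something like `parse_metrics(fetch("http://localhost:9090/metrics")[1])` should work, assuming the server is reachable on localhost.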

Redis-Level Metrics

These metrics are collected from the Redis server and cache engine:

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| swytch_redis_connections_current | Gauge | | Current client connections |
| swytch_redis_connections_total | Counter | | Total connections since start |
| swytch_redis_commands_total | Counter | command | Commands processed by type |
| swytch_redis_cache_hits_total | Counter | | Cache hits |
| swytch_redis_cache_misses_total | Counter | | Cache misses |
| swytch_redis_cache_hit_rate | Gauge | | Hit rate (0-1) |
| swytch_redis_evictions_total | Counter | | Cache evictions |
| swytch_redis_items_count | Gauge | | Items in cache |
| swytch_redis_memory_bytes | Gauge | | Memory used by cache (bytes) |
| swytch_redis_memory_max_bytes | Gauge | | Configured max memory (bytes) |
| swytch_redis_latency_seconds | Gauge | operation, quantile | Latency p50/p99 for get, set, cmd |
| swytch_redis_adaptive_k_threshold | Gauge | shard | Per-shard eviction protection threshold |
| swytch_redis_bytes_read_total | Counter | | Network bytes read |
| swytch_redis_bytes_written_total | Counter | | Network bytes written |
| swytch_redis_uptime_seconds | Gauge | | Server uptime |
| swytch_redis_blocked_clients | Gauge | | Clients blocked on BLPOP, BRPOP, etc. |
| swytch_redis_command_errors_total | Counter | command | Command errors by type |

The swytch_redis_latency_seconds metric reports six time series, one for each combination of operation and quantile:

| operation | quantile | Meaning |
| --- | --- | --- |
| get | 0.5 | GET p50 latency |
| get | 0.99 | GET p99 latency |
| set | 0.5 | SET p50 latency |
| set | 0.99 | SET p99 latency |
| cmd | 0.5 | All commands p50 latency |
| cmd | 0.99 | All commands p99 latency |

Cluster Metrics

When running in cluster mode, additional metrics are exposed via the default Prometheus registry:

Replication Latency

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| cluster_lsl_ms | Gauge | peer | Light-speed latency: minimum observed HLC delta (ms) |
| cluster_protocol_overhead_ms | Gauge | peer | Extra latency beyond LSL |

Causality

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| cluster_causality_violations_total | Counter | peer | Effects arriving with HLC beyond causal horizon |
| cluster_causal_horizon_ms | Gauge | peer | Causal horizon width per peer (ms) |
| cluster_max_hlc_drift_ms | Gauge | peer | Maximum observed HLC drift |

Throughput

| Metric | Type | Description |
| --- | --- | --- |
| cluster_writes_total | Counter | Local write effects emitted |
| cluster_reads_total | Counter | Total reads served |
| cluster_reads_local_total | Counter | Reads from local log/cache |
| cluster_reads_remote_fetch_total | Counter | Reads requiring remote fetch |
| cluster_notifications_sent_total | Counter | OffsetNotify messages broadcast |
| cluster_notifications_received_total | Counter | OffsetNotify messages received |
| cluster_fetches_served_total | Counter | Fetch RPCs served to peers |
| cluster_binds_emitted_total | Counter | Bind effects (concurrent writes detected) |
| cluster_snapshots_emitted_total | Counter | Snapshot effects (bind resolution) |

Disk

| Metric | Type | Description |
| --- | --- | --- |
| cluster_segment_active_bytes | Gauge | Bytes in current live segment |
| cluster_segment_active_slots | Gauge | Slots in current segment (out of 1M) |
| cluster_segments_sealed_total | Gauge | Sealed segments on this node |
| cluster_disk_used_bytes | Gauge | Total disk used by log segments |
| cluster_disk_capacity_bytes | Gauge | Total disk capacity |
| cluster_disk_usage_ratio | Gauge | Disk usage ratio (used / capacity) |

Peer Health

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| cluster_peer_connected | Gauge | peer | 1 if QUIC stream up, 0 if down |
| cluster_peer_reconnects_total | Counter | peer | Reconnection count |
| cluster_peer_notifications_dropped_total | Counter | peer | Notifications dropped (buffer full or disconnected) |
| cluster_peer_symmetric | Gauge | peer | 1 if path symmetric, 0 if asymmetric |
| cluster_peer_alive | Gauge | peer | 1 if alive (heartbeat within timeout), 0 if dead |
| cluster_peer_rtt_ms | Gauge | peer | Estimated RTT from heartbeat (ms) |

Heartbeat and Transport

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| cluster_heartbeats_sent_total | Counter | | Heartbeat packets sent |
| cluster_heartbeats_received_total | Counter | | Heartbeat packets received |
| cluster_udp_notify_ack_latency_ms | Histogram | | UDP notification ACK latency (ms) |
| cluster_retransmission_giveups_total | Counter | peer | Retransmission failures (max retries exhausted) |
| cluster_quic_streams_opened_total | Counter | | QUIC uni-streams opened |
| cluster_quic_stream_errors_total | Counter | | QUIC stream errors |

OpenTelemetry Tracing

Enable distributed tracing with OTLP HTTP export:

swytch redis --otel-endpoint=localhost:4318

# Use HTTP instead of HTTPS
swytch redis --otel-endpoint=localhost:4318 --otel-insecure

When enabled:

  • Traces are exported via OTLP HTTP to the configured endpoint
  • The service name is swytch
  • Trace context is propagated in binary format across cluster nodes
  • trace_id and span_id are injected into structured log records automatically

When --otel-endpoint is not set, tracing is completely disabled with zero overhead.

Trace-Log Correlation

When both tracing and JSON logging are enabled, every log line includes trace_id and span_id fields:

swytch redis --otel-endpoint=localhost:4318 --log-format=json

Example log record:

{
    "time": "2026-04-15T10:30:00Z",
    "level": "INFO",
    "msg": "command processed",
    "trace_id": "abc123...",
    "span_id": "def456..."
}

This allows correlating traces in Jaeger/Tempo with log entries in your log aggregator.
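
To make the correlation step concrete, here is a minimal sketch that pulls every log record belonging to one trace out of a stream of JSON log lines. It assumes one JSON object per line (the record above is pretty-printed for readability); the function name `records_for_trace`, the sample IDs, and the "reply sent" message are hypothetical.

```python
import json


def records_for_trace(log_lines, trace_id):
    """Return the parsed log records whose trace_id field equals trace_id."""
    matches = []
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON, e.g. text-format output
        if isinstance(record, dict) and record.get("trace_id") == trace_id:
            matches.append(record)
    return matches


# Hypothetical log lines; real trace and span IDs are longer hex strings.
SAMPLE_LOGS = [
    '{"time":"2026-04-15T10:30:00Z","level":"INFO","msg":"command processed","trace_id":"aaa111","span_id":"s1"}',
    'plain text line that is not JSON',
    '{"time":"2026-04-15T10:30:01Z","level":"INFO","msg":"command processed","trace_id":"bbb222","span_id":"s2"}',
    '{"time":"2026-04-15T10:30:02Z","level":"DEBUG","msg":"reply sent","trace_id":"aaa111","span_id":"s3"}',
]

matched = records_for_trace(SAMPLE_LOGS, "aaa111")
for record in matched:
    print(record["time"], record["span_id"], record["msg"])
```

In practice a log aggregator query on the trace_id field does the same job; the script is mainly useful for ad hoc inspection of a local log file.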

Structured Logging

Swytch uses Go’s slog for structured logging:

# Text format (default, human-readable)
swytch redis -v

# JSON format (machine-parseable)
swytch redis --log-format=json -v

Log Levels

| Flag | Level | What’s logged |
| --- | --- | --- |
| (none) | INFO | Startup, shutdown, errors |
| -v | DEBUG | Detailed operational info |
| --debug | DEBUG | All commands processed (very verbose) |

Example Queries

Grafana / PromQL

Cache hit rate:

swytch_redis_cache_hit_rate

Hit rate computed from counters (more accurate over time windows):

rate(swytch_redis_cache_hits_total[5m])
/ (rate(swytch_redis_cache_hits_total[5m]) + rate(swytch_redis_cache_misses_total[5m]))
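
The arithmetic behind the counter-based query can be sketched by hand: because both rates use the same window, the ratio reduces to the increase in hits divided by the increase in hits plus misses. The sketch below works from two hypothetical counter snapshots and ignores counter resets and the extrapolation Prometheus applies inside rate(), so it approximates the query rather than reproducing it exactly.

```python
def windowed_hit_rate(hits_start, hits_end, misses_start, misses_end):
    """Hit rate over a window, from counter values at its start and end.

    Returns None when no lookups happened in the window, which avoids
    a division by zero.
    """
    hits = hits_end - hits_start
    misses = misses_end - misses_start
    total = hits + misses
    if total <= 0:
        return None
    return hits / total


# Hypothetical snapshots of swytch_redis_cache_hits_total and
# swytch_redis_cache_misses_total taken five minutes apart.
example = windowed_hit_rate(hits_start=1000, hits_end=1900,
                            misses_start=200, misses_end=300)
print(example)  # 900 hits out of 1000 lookups -> 0.9
```

A quiet window shows the difference from the gauge: with no lookups this function returns None, whereas swytch_redis_cache_hit_rate still reports a value.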

Memory pressure:

swytch_redis_memory_bytes / swytch_redis_memory_max_bytes

Eviction rate:

rate(swytch_redis_evictions_total[5m])

Command latency p99:

swytch_redis_latency_seconds{operation="cmd", quantile="0.99"}

Commands per second by type:

rate(swytch_redis_commands_total[1m])

Cluster: peer connectivity:

cluster_peer_alive

Cluster: replication lag:

cluster_lsl_ms + cluster_protocol_overhead_ms

Cluster: remote fetch ratio (a higher value means more reads are served from other nodes):

rate(cluster_reads_remote_fetch_total[5m])
/ (rate(cluster_reads_local_total[5m]) + rate(cluster_reads_remote_fetch_total[5m]))

Prometheus Alerting Rules

groups:
  - name: swytch
    rules:
      - alert: SwytchMemoryPressure
        expr: swytch_redis_memory_bytes / swytch_redis_memory_max_bytes > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Swytch memory usage above 90%"

      - alert: SwytchHighEvictionRate
        expr: rate(swytch_redis_evictions_total[5m]) > 1000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High eviction rate indicates memory pressure"

      - alert: SwytchLowHitRate
        expr: swytch_redis_cache_hit_rate < 0.8
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Cache hit rate below 80%"

      - alert: SwytchPeerDown
        expr: cluster_peer_alive == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Cluster peer unreachable"

      - alert: SwytchCausalityViolation
        expr: rate(cluster_causality_violations_total[5m]) > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Causality violations detected — possible clock divergence"
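
The `for:` clause in these rules means a condition must hold at every evaluation for the whole duration before the alert fires. The sketch below is a simplified model of that behaviour, not Prometheus code: it assumes a fixed 60-second evaluation interval and uses made-up memory ratios to trace SwytchMemoryPressure (threshold 0.9, for: 5m).

```python
def alert_states(condition_by_eval, eval_interval_s, for_s):
    """Return 'inactive', 'pending', or 'firing' for each rule evaluation."""
    states = []
    active_since = None  # time at which the condition most recently became true
    for i, condition in enumerate(condition_by_eval):
        now = i * eval_interval_s
        if not condition:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_s else "pending")
    return states


# Hypothetical memory_bytes / memory_max_bytes ratios, one per evaluation.
ratios = [0.85, 0.92, 0.93, 0.95, 0.91, 0.94, 0.96, 0.88, 0.93]
states = alert_states([r > 0.9 for r in ratios], eval_interval_s=60, for_s=300)
for ratio, state in zip(ratios, states):
    print(f"{ratio:.2f} {state}")
```

In this trace the alert stays pending for five evaluations and fires on the sixth consecutive breach; the single dip to 0.88 resets it to inactive, which is why longer `for:` values suppress flapping at the cost of slower notification.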

INFO Command

The standard Redis INFO command also reports server statistics:

redis-cli INFO
redis-cli INFO server
redis-cli INFO memory
redis-cli INFO stats
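
For scripting against INFO output, the sketch below parses the standard Redis INFO layout (lines of key:value grouped under "# Section" headers) into nested dictionaries. The sample text uses common Redis field names with invented values; which fields Swytch actually reports is an assumption here, and `parse_info` is a hypothetical helper name.

```python
def parse_info(text):
    """Parse Redis INFO text into {section_name: {key: value}}."""
    sections = {}
    current = sections.setdefault("", {})  # fields seen before any header
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            current = sections.setdefault(line.lstrip("#").strip(), {})
            continue
        key, sep, value = line.partition(":")
        if sep:
            current[key] = value
    # Drop sections that ended up with no fields.
    return {name: fields for name, fields in sections.items() if fields}


# Hypothetical INFO output; field names follow upstream Redis conventions.
SAMPLE_INFO = (
    "# Server\r\n"
    "uptime_in_seconds:3600\r\n"
    "\r\n"
    "# Stats\r\n"
    "keyspace_hits:900\r\n"
    "keyspace_misses:100\r\n"
)

info = parse_info(SAMPLE_INFO)
hits = int(info["Stats"]["keyspace_hits"])
misses = int(info["Stats"]["keyspace_misses"])
print(info["Server"]["uptime_in_seconds"], hits / (hits + misses))
```

Values are kept as strings because INFO mixes integers, floats, and free-form text; convert individual fields as needed.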