Observability

This guide covers monitoring Swytch in production, including all available metrics, recommended alerts, and dashboards.

Enabling Metrics

Swytch exposes Prometheus metrics over HTTP when started with the --metrics-port flag:

# Redis mode
swytch redis --metrics-port 9090

# Memcached mode
swytch memcached --metrics-port 9090

Scrape metrics from http://localhost:9090/metrics.
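To check the endpoint by hand before wiring up Prometheus, a small script can fetch and parse the text exposition format. The following is a minimal sketch using only the Python standard library; the function names parse_metrics and fetch_metrics are illustrative and not part of Swytch, and the URL assumes the --metrics-port 9090 setting shown above.

```python
import urllib.request


def parse_metrics(text):
    """Parse Prometheus text exposition into {sample name (with labels): value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Labelled sample: everything up to the closing brace is the name.
            end = line.rfind("}") + 1
            name, rest = line[:end], line[end:].split()
        else:
            parts = line.split()
            name, rest = parts[0], parts[1:]
        if not rest:
            continue  # malformed line without a value
        # rest[0] is the value; an optional timestamp may follow and is ignored.
        samples[name] = float(rest[0])
    return samples


def fetch_metrics(url="http://localhost:9090/metrics", timeout=5.0):
    """Fetch the metrics endpoint and return the parsed samples."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))
```

Calling fetch_metrics() against a running instance should return a dictionary keyed by names such as swytch_redis_cache_hits_total.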

Metrics Reference

Cache Metrics

Metric	Type	Description
`swytch_redis_cache_hits_total`	Counter	L1 (memory) cache hits
`swytch_redis_cache_misses_total`	Counter	L1 cache misses
`swytch_redis_evictions_total`	Counter	Keys evicted from L1 cache
`swytch_redis_memory_bytes`	Gauge	Current memory usage
`swytch_redis_memory_max_bytes`	Gauge	Configured memory limit (`--maxmemory`)
`swytch_redis_items_count`	Gauge	Total number of items stored

Tiered Storage Metrics (Persistent Mode)

Metric	Type	Description
`swytch_redis_l2_hits_total`	Counter	L2 (disk) cache hits
`swytch_redis_l2_misses_total`	Counter	L2 cache misses (key doesn’t exist)
`swytch_redis_l2_writes_total`	Counter	Writes to L2 storage

Command Metrics

Metric	Type	Labels	Description
`swytch_redis_commands_total`	Counter	`command`	Commands processed by type
`swytch_redis_latency_seconds`	Histogram	`command`	Command latency distribution
`swytch_redis_command_errors_total`	Counter	`command`, `error`	Command errors by type

Connection Metrics

Metric	Type	Description
`swytch_redis_connections_total`	Counter	Total connections accepted
`swytch_redis_connections_current`	Gauge	Current active connections

Go Runtime Metrics

Standard Go metrics are also exposed:

Metric	Type	Description
`go_goroutines`	Gauge	Number of goroutines
`go_memstats_alloc_bytes`	Gauge	Bytes allocated and in use
`go_memstats_heap_inuse_bytes`	Gauge	Heap memory in use
`go_gc_duration_seconds`	Summary	GC pause duration

Key Performance Indicators

Hit Rate

L1 (Memory) Hit Rate:

rate(swytch_redis_cache_hits_total[5m]) /
(rate(swytch_redis_cache_hits_total[5m]) + rate(swytch_redis_cache_misses_total[5m]))

Overall Hit Rate (with L2):

(rate(swytch_redis_cache_hits_total[5m]) + rate(swytch_redis_l2_hits_total[5m])) /
(rate(swytch_redis_cache_hits_total[5m]) + rate(swytch_redis_l2_hits_total[5m]) + rate(swytch_redis_l2_misses_total[5m]))

Target: >95% for cache workloads, >99% for session stores.
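As a worked example of the two formulas above, the same arithmetic can be applied to counter increases observed over one window. This is a sketch only: hit_rates is a hypothetical helper, and the sample numbers are made up for illustration.

```python
def hit_rates(l1_hits, l1_misses, l2_hits=0.0, l2_misses=0.0):
    """Return (l1_hit_rate, overall_hit_rate) from counter increases in one window.

    Either value is None when there was no traffic, which avoids dividing by zero.
    """
    l1_total = l1_hits + l1_misses
    l1_rate = l1_hits / l1_total if l1_total else None

    # Overall: a request counts as a hit if either tier served it.
    overall_total = l1_hits + l2_hits + l2_misses
    overall = (l1_hits + l2_hits) / overall_total if overall_total else None
    return l1_rate, overall


if __name__ == "__main__":
    # Illustrative numbers: 900 L1 hits, 100 L1 misses, of which 80 were found in L2.
    print(hit_rates(900, 100, l2_hits=80, l2_misses=20))  # (0.9, 0.98)
```

In this example the L1 hit rate alone (90%) would miss the cache target, while the overall rate (98%) meets it, which is why tiered deployments should track both.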

Memory Utilisation

swytch_redis_memory_bytes / swytch_redis_memory_max_bytes

Target: 70–90%. Sustained usage below 70% suggests the instance is over-provisioned; usage above 90% risks eviction pressure.
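The target band can be expressed as a small check, for example in a capacity-review script. This is a sketch under the thresholds stated above; memory_status and its status strings are hypothetical names chosen for illustration.

```python
def memory_status(used_bytes, max_bytes):
    """Return (utilisation ratio, status) using the 70-90% target band."""
    if max_bytes <= 0:
        raise ValueError("max_bytes must be positive (is a memory limit configured?)")
    ratio = used_bytes / max_bytes
    if ratio < 0.70:
        status = "over-provisioned"
    elif ratio <= 0.90:
        status = "on target"
    else:
        status = "eviction risk"
    return ratio, status


if __name__ == "__main__":
    # 1 GiB used of a 4 GiB limit.
    print(memory_status(1073741824, 4294967296))  # (0.25, 'over-provisioned')
```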

Eviction Rate

rate(swytch_redis_evictions_total[5m])

Target: Near zero for database workloads. Some eviction is normal for cache workloads.

Command Throughput

sum(rate(swytch_redis_commands_total[5m]))

Command Latency

p50:

histogram_quantile(0.5, rate(swytch_redis_latency_seconds_bucket[5m]))

p99:

histogram_quantile(0.99, rate(swytch_redis_latency_seconds_bucket[5m]))

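Target: p50 < 1ms, p99 < 5ms for in-memory; p99 < 10ms for tiered. As written, these queries return one series per command label value; wrap the rate in sum by (le) (...) to get a single overall figure.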
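To make the quantile queries less opaque, the sketch below re-implements the linear interpolation that histogram_quantile performs over cumulative buckets. It is a simplified illustration for classic histograms, not Prometheus's own code, and the bucket boundaries and counts are invented for the example rather than taken from Swytch.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    The list must include the (math.inf, total_count) bucket.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be in [0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2 or buckets[-1][0] != math.inf:
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, cumulative = buckets[index]

    # A rank that lands in the +Inf bucket is capped at the highest finite bound.
    if index == len(buckets) - 1:
        return buckets[-2][0]

    # The first bucket is assumed to start at zero.
    lower, below = (0.0, 0.0) if index == 0 else buckets[index - 1]
    if index == 0 and upper <= 0:
        return upper
    in_bucket = cumulative - below
    if in_bucket == 0:
        return math.nan
    # Linear interpolation inside the chosen bucket.
    return lower + (upper - lower) * (rank - below) / in_bucket


if __name__ == "__main__":
    # Illustrative latency buckets in seconds: 1 ms, 5 ms, 10 ms, +Inf.
    sample = [(0.001, 600), (0.005, 950), (0.01, 990), (math.inf, 1000)]
    print(histogram_quantile(0.5, sample))
    print(histogram_quantile(0.99, sample))
```

With these sample buckets the p50 lands inside the first bucket (a little under 1 ms) and the p99 sits at the 10 ms boundary. Because the estimate is interpolated within a bucket, its precision is limited by how finely the bucket boundaries are spaced around your targets.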

Recommended Alerts

Critical Alerts

groups:
  - name: swytch-critical
    rules:
      - alert: SwytchDown
        expr: up{job="swytch"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Swytch instance is down"
          description: "{{ $labels.instance }} has been down for more than 1 minute."

      - alert: SwytchOutOfMemory
        expr: swytch_redis_memory_bytes / swytch_redis_memory_max_bytes > 0.95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Swytch memory usage critical"
          description: "Memory usage is {{ $value | humanizePercentage }} on {{ $labels.instance }}."

      - alert: SwytchHighErrorRate
        expr: |
          sum(rate(swytch_redis_command_errors_total[5m])) /
          sum(rate(swytch_redis_commands_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate in Swytch"
          description: "Error rate is {{ $value | humanizePercentage }}."

Warning Alerts

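  # Continues the `groups:` list from the critical rules above;
  # the group name is a suggestion.
  - name: swytch-warning
    rules:
      - alert: SwytchMemoryPressure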
        expr: swytch_redis_memory_bytes / swytch_redis_memory_max_bytes > 0.85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Swytch memory usage high"
          description: "Memory usage is {{ $value | humanizePercentage }} on {{ $labels.instance }}."

      - alert: SwytchHighEvictionRate
        expr: rate(swytch_redis_evictions_total[5m]) > 1000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High eviction rate"
          description: "Eviction rate is {{ $value }}/sec on {{ $labels.instance }}."

      - alert: SwytchLowHitRate
        expr: |
          rate(swytch_redis_cache_hits_total[5m]) /
          (rate(swytch_redis_cache_hits_total[5m]) + rate(swytch_redis_cache_misses_total[5m]) + 0.001) < 0.8
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Low cache hit rate"
          description: "Hit rate is {{ $value | humanizePercentage }} on {{ $labels.instance }}."

      - alert: SwytchHighLatency
        expr: |
          histogram_quantile(0.99, rate(swytch_redis_latency_seconds_bucket[5m])) > 0.01
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High p99 latency"
          description: "p99 latency is {{ $value | humanizeDuration }} on {{ $labels.instance }}."

      - alert: SwytchConnectionsHigh
        expr: swytch_redis_connections_current > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High connection count"
          description: "{{ $value }} active connections on {{ $labels.instance }}."

Info Alerts

  # Continues the `groups:` list; the group name is a suggestion.
  - name: swytch-info
    rules:
      - alert: SwytchRestarted
        expr: changes(process_start_time_seconds{job="swytch"}[10m]) > 0
        labels:
          severity: info
        annotations:
          summary: "Swytch instance restarted"
          description: "{{ $labels.instance }} has restarted."

Grafana Dashboard

Overview Panel

{
    "title": "Swytch Overview",
    "panels": [
        {
            "title": "Hit Rate",
            "type": "gauge",
            "targets": [
                {
                    "expr": "rate(swytch_redis_cache_hits_total[5m]) / (rate(swytch_redis_cache_hits_total[5m]) + rate(swytch_redis_cache_misses_total[5m]))"
                }
            ]
        },
        {
            "title": "Memory Usage",
            "type": "gauge",
            "targets": [
                {
                    "expr": "swytch_redis_memory_bytes / swytch_redis_memory_max_bytes"
                }
            ]
        },
        {
            "title": "Throughput",
            "type": "stat",
            "targets": [
                {
                    "expr": "sum(rate(swytch_redis_commands_total[5m]))"
                }
            ]
        },
        {
            "title": "Connections",
            "type": "stat",
            "targets": [
                {
                    "expr": "swytch_redis_connections_current"
                }
            ]
        }
    ]
}

Key Graphs

Commands Over Time:

sum by (command) (rate(swytch_redis_commands_total[5m]))

Latency Heatmap:

sum(rate(swytch_redis_latency_seconds_bucket[1m])) by (le)

Memory and Evictions:

# Left axis
swytch_redis_memory_bytes

# Right axis
rate(swytch_redis_evictions_total[5m])

L1 vs L2 Traffic (Tiered Mode):

rate(swytch_redis_cache_hits_total[5m])    # L1 hits
rate(swytch_redis_l2_hits_total[5m])       # L2 hits
rate(swytch_redis_l2_misses_total[5m])     # Total misses

Redis INFO Command

The standard INFO command also provides statistics:

redis-cli INFO

Key sections:

# Server
redis_version:8.4.0-swytch
uptime_in_seconds:86400

# Memory
used_memory:1073741824
maxmemory:4294967296

# Stats
total_commands_processed:1234567890
keyspace_hits:1000000000
keyspace_misses:50000000

Note: Tiered storage statistics (L2 hits/misses/writes) are available via Prometheus metrics, not the INFO command.
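Where Prometheus is not available, the INFO output can be parsed directly. The sketch below assumes only the key:value layout shown above and uses hypothetical helper names (parse_info, keyspace_hit_rate); feed it the text captured from redis-cli INFO.

```python
def parse_info(text):
    """Parse INFO output into a {key: value} dictionary of strings."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# Section" headers.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            info[key] = value
    return info


def keyspace_hit_rate(info):
    """Return hits / (hits + misses), or None when there has been no traffic."""
    hits = int(info.get("keyspace_hits", 0))
    misses = int(info.get("keyspace_misses", 0))
    total = hits + misses
    return hits / total if total else None


if __name__ == "__main__":
    sample = """# Stats
total_commands_processed:1234567890
keyspace_hits:1000000000
keyspace_misses:50000000
"""
    print(round(keyspace_hit_rate(parse_info(sample)), 4))  # 0.9524
```

Note that these counters are totals since startup, so the result is a lifetime average rather than the 5-minute rate used in the Prometheus queries.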

Logging

Log Levels

Control verbosity with the -v and --debug flags:

swytch redis                    # Normal (errors and startup)
swytch redis -v                 # Verbose (warnings)
swytch redis --debug            # Debug (all commands logged)

Log Format

Logs are written to stderr as timestamped plain-text lines:

2024/01/15 10:00:00 redis server listening on 127.0.0.1:6379
2024/01/15 10:00:05 client connected from 192.168.1.100:45678
2024/01/15 10:00:10 WARNING: memory usage at 85%
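If your aggregation system needs structured fields, lines in this layout can be split with a small parser. The sketch below is based only on the three sample lines above: the WARNING: prefix appears in the sample, while the ERROR prefix and the default level of INFO are assumptions, and parse_log_line is a hypothetical helper.

```python
import re
from datetime import datetime

# Timestamp, optional "LEVEL: " prefix, then the free-form message.
LOG_LINE = re.compile(
    r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (?:(WARNING|ERROR): )?(.*)$"
)


def parse_log_line(line):
    """Return {"time", "level", "message"} for a log line, or None if it does not match."""
    match = LOG_LINE.match(line.rstrip("\n"))
    if match is None:
        return None
    return {
        "time": datetime.strptime(match.group(1), "%Y/%m/%d %H:%M:%S"),
        # Lines without a prefix are treated as INFO (an assumption).
        "level": match.group(2) or "INFO",
        "message": match.group(3),
    }


if __name__ == "__main__":
    print(parse_log_line("2024/01/15 10:00:10 WARNING: memory usage at 85%"))
```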

Log Aggregation

In production, collect logs through your platform's log pipeline. The commands below follow the live log stream on common platforms:

# Systemd captures stdout/stderr automatically
journalctl -u swytch -f

# Docker
docker logs -f swytch-redis

# Kubernetes
kubectl logs -f deployment/swytch

Tracing

Swytch does not currently support distributed tracing (OpenTelemetry/Jaeger). Instead, trace at the application level by instrumenting your Redis client calls with your existing tracing infrastructure.
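One way to instrument client calls is a small timing wrapper whose output is handed to whatever tracing system you already run. The sketch below uses only the Python standard library; traced_call and the record callback are hypothetical names, and the span fields are illustrative rather than any tracing library's schema.

```python
import time
from contextlib import contextmanager


@contextmanager
def traced_call(name, record):
    """Time the enclosed block and pass a span-like dict to `record`.

    `record` is any callable, e.g. a function that forwards to your tracer.
    """
    start = time.perf_counter()
    error = None
    try:
        yield
    except Exception as exc:
        # Note the failure type, then let the exception propagate unchanged.
        error = type(exc).__name__
        raise
    finally:
        record({
            "name": name,
            "duration_s": time.perf_counter() - start,
            "error": error,
        })


if __name__ == "__main__":
    spans = []
    with traced_call("redis.GET", spans.append):
        pass  # place the Redis client call here
    print(spans[0]["name"], spans[0]["error"])
```

Keeping the wrapper independent of the Redis client means the same code works whichever client library the application uses.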