Observability
This guide covers monitoring Swytch in production, including all available metrics, recommended alerts, and dashboards.
Prometheus metrics are exposed via HTTP:
# Redis mode
swytch redis --metrics-port 9090
# Memcached mode
swytch memcached --metrics-port 9090
Scrape metrics from http://localhost:9090/metrics.
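For a quick check outside Prometheus, the endpoint can also be read directly. The following is a minimal sketch, assuming the endpoint serves the standard Prometheus text exposition format. `parse_metrics` and `fetch_metrics` are illustrative helper names, not part of Swytch, and the parser is deliberately naive: it does not handle `}` inside quoted label values.

```python
import urllib.request


def parse_metrics(text):
    """Parse Prometheus text exposition into {series: value}.

    Naive sketch: assumes label values contain no '}' characters.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        # The series (name plus optional labels) ends at '}' or at the first space.
        end = line.rfind("}") + 1 if "{" in line else line.find(" ")
        if end <= 0:
            continue
        fields = line[end:].split()  # value, then an optional timestamp
        if fields:
            samples[line[:end]] = float(fields[0])
    return samples


def fetch_metrics(url="http://localhost:9090/metrics", timeout=5):
    """Fetch and parse the metrics page (URL taken from the example above)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))
```

Calling `fetch_metrics()` against a running instance returns a dictionary keyed by series, for example `swytch_redis_memory_bytes`.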
| Metric | Type | Description |
|---|---|---|
| swytch_redis_cache_hits_total | Counter | L1 (memory) cache hits |
| swytch_redis_cache_misses_total | Counter | L1 cache misses |
| swytch_redis_evictions_total | Counter | Keys evicted from L1 cache |
| swytch_redis_memory_bytes | Gauge | Current memory usage |
| swytch_redis_memory_max_bytes | Gauge | Configured memory limit (--maxmemory) |
| swytch_redis_items_count | Gauge | Total number of items stored |
| Metric | Type | Description |
|---|---|---|
| swytch_redis_l2_hits_total | Counter | L2 (disk) cache hits |
| swytch_redis_l2_misses_total | Counter | L2 cache misses (key doesn’t exist) |
| swytch_redis_l2_writes_total | Counter | Writes to L2 storage |
| Metric | Type | Labels | Description |
|---|---|---|---|
| swytch_redis_commands_total | Counter | command | Commands processed by type |
| swytch_redis_latency_seconds | Histogram | command | Command latency distribution |
| swytch_redis_command_errors_total | Counter | command, error | Command errors by type |
| Metric | Type | Description |
|---|---|---|
| swytch_redis_connections_total | Counter | Total connections accepted |
| swytch_redis_connections_current | Gauge | Current active connections |
Standard Go metrics are also exposed:
| Metric | Type | Description |
|---|---|---|
| go_goroutines | Gauge | Number of goroutines |
| go_memstats_alloc_bytes | Gauge | Bytes allocated and in use |
| go_memstats_heap_inuse_bytes | Gauge | Heap memory in use |
| go_gc_duration_seconds | Summary | GC pause duration |
L1 (Memory) Hit Rate:
rate(swytch_redis_cache_hits_total[5m]) /
(rate(swytch_redis_cache_hits_total[5m]) + rate(swytch_redis_cache_misses_total[5m]))
Overall Hit Rate (with L2):
(rate(swytch_redis_cache_hits_total[5m]) + rate(swytch_redis_l2_hits_total[5m])) /
(rate(swytch_redis_cache_hits_total[5m]) + rate(swytch_redis_l2_hits_total[5m]) + rate(swytch_redis_l2_misses_total[5m]))
Target: >95% for cache workloads, >99% for session stores.
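As a worked example of the two ratios above, the sketch below applies the same arithmetic to counter increases over one window. The numbers are made up for illustration, and the function name `hit_rates` is hypothetical. It assumes, as the overall formula implies, that L2 is consulted only after an L1 miss.

```python
def hit_rates(l1_hits, l1_misses, l2_hits, l2_misses):
    """Return (l1_rate, overall_rate) from counter increases over one window."""
    l1_total = l1_hits + l1_misses
    overall_total = l1_hits + l2_hits + l2_misses
    l1_rate = l1_hits / l1_total if l1_total else float("nan")
    overall = (l1_hits + l2_hits) / overall_total if overall_total else float("nan")
    return l1_rate, overall


# Made-up window: 900 L1 hits and 100 L1 misses; of those misses,
# 60 are served from L2 and 40 miss entirely.
l1, overall = hit_rates(900, 100, 60, 40)
print(f"L1 hit rate: {l1:.1%}, overall hit rate: {overall:.1%}")
```

With these inputs the L1 rate is 900/1000 and the overall rate is 960/1000, so a tiered deployment can meet the 95% cache target even when L1 alone does not.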
Memory Usage:
swytch_redis_memory_bytes / swytch_redis_memory_max_bytes
Target: 70–90%. Below 70% means over-provisioned; above 90% risks eviction pressure.
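The 70–90% band can be expressed as a small check, for example in a capacity report. This is only a sketch: `classify_memory` and its status strings are hypothetical names, and the thresholds are simply the targets stated above.

```python
def classify_memory(used_bytes, max_bytes, low=0.70, high=0.90):
    """Return (ratio, status) for memory usage against the configured limit."""
    if max_bytes <= 0:
        raise ValueError("max_bytes must be positive (is a memory limit configured?)")
    ratio = used_bytes / max_bytes
    if ratio < low:
        status = "over-provisioned"
    elif ratio > high:
        status = "eviction pressure"
    else:
        status = "on target"
    return ratio, status


# 1 GiB used out of a 4 GiB limit
print(classify_memory(1073741824, 4294967296))
```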
Eviction Rate:
rate(swytch_redis_evictions_total[5m])
Target: Near zero for database workloads. Some eviction is normal for cache workloads.
Throughput:
sum(rate(swytch_redis_commands_total[5m]))
Latency p50:
histogram_quantile(0.5, rate(swytch_redis_latency_seconds_bucket[5m]))
Latency p99:
histogram_quantile(0.99, rate(swytch_redis_latency_seconds_bucket[5m]))
Target: p50 < 1ms, p99 < 5ms for in-memory; p99 < 10ms for tiered.
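To make clear what these quantiles mean, the sketch below reimplements the core idea of `histogram_quantile`: find the cumulative bucket that contains the target rank and interpolate linearly inside it. It is a simplified illustration, not Prometheus's exact implementation, and the bucket boundaries and counts are invented; Swytch's real bucket layout is not stated here.

```python
INF = float("inf")


def histogram_quantile(q, buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by bound and end with an infinite bound.
    """
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # the lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if bound == INF or count == prev_count:
                # Rank falls in the open-ended bucket (or an empty one):
                # the best available answer is the last finite bound.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Invented cumulative counts for buckets of 1 ms, 5 ms, 10 ms and +Inf.
buckets = [(0.001, 50), (0.005, 90), (0.010, 99), (INF, 100)]
p50 = histogram_quantile(0.5, buckets)
p99 = histogram_quantile(0.99, buckets)
print(f"p50={p50 * 1000:.1f}ms p99={p99 * 1000:.1f}ms")
```

Because the result is interpolated, its precision is bounded by the bucket widths: a p99 reported near a bucket edge may really lie anywhere inside that bucket.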
groups:
  - name: swytch-critical
    rules:
      - alert: SwytchDown
        expr: up{job="swytch"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Swytch instance is down"
          description: "{{ $labels.instance }} has been down for more than 1 minute."
      - alert: SwytchOutOfMemory
        expr: swytch_redis_memory_bytes / swytch_redis_memory_max_bytes > 0.95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Swytch memory usage critical"
          description: "Memory usage is {{ $value | humanizePercentage }} on {{ $labels.instance }}."
      - alert: SwytchHighErrorRate
        expr: |
          sum(rate(swytch_redis_command_errors_total[5m])) /
          sum(rate(swytch_redis_commands_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate in Swytch"
          description: "Error rate is {{ $value | humanizePercentage }}."
      - alert: SwytchMemoryPressure
        expr: swytch_redis_memory_bytes / swytch_redis_memory_max_bytes > 0.85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Swytch memory usage high"
          description: "Memory usage is {{ $value | humanizePercentage }} on {{ $labels.instance }}."
      - alert: SwytchHighEvictionRate
        expr: rate(swytch_redis_evictions_total[5m]) > 1000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High eviction rate"
          description: "Eviction rate is {{ $value }}/sec on {{ $labels.instance }}."
      - alert: SwytchLowHitRate
        expr: |
          rate(swytch_redis_cache_hits_total[5m]) /
          (rate(swytch_redis_cache_hits_total[5m]) + rate(swytch_redis_cache_misses_total[5m]) + 0.001) < 0.8
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Low cache hit rate"
          description: "Hit rate is {{ $value | humanizePercentage }} on {{ $labels.instance }}."
      - alert: SwytchHighLatency
        expr: |
          histogram_quantile(0.99, rate(swytch_redis_latency_seconds_bucket[5m])) > 0.01
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High p99 latency"
          description: "p99 latency is {{ $value | humanizeDuration }} on {{ $labels.instance }}."
      - alert: SwytchConnectionsHigh
        expr: swytch_redis_connections_current > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High connection count"
          description: "{{ $value }} active connections on {{ $labels.instance }}."
      - alert: SwytchRestarted
        expr: changes(process_start_time_seconds{job="swytch"}[10m]) > 0
        labels:
          severity: info
        annotations:
          summary: "Swytch instance restarted"
          description: "{{ $labels.instance }} has restarted."
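One detail of the SwytchLowHitRate expression is worth spelling out: the `+ 0.001` term keeps the denominator non-zero, but it also means an idle instance (counters present, no traffic) evaluates to a hit rate of 0, which is below the 0.8 threshold. The sketch below reproduces that arithmetic; the function names are hypothetical and the input rates are invented.

```python
def guarded_hit_rate(hits_per_sec, misses_per_sec, epsilon=0.001):
    """Hit rate with the same division-by-zero guard as the alert expression."""
    return hits_per_sec / (hits_per_sec + misses_per_sec + epsilon)


def low_hit_rate(hits_per_sec, misses_per_sec, threshold=0.8):
    """True when the alert condition would hold for these per-second rates."""
    return guarded_hit_rate(hits_per_sec, misses_per_sec) < threshold


for hits, misses in [(900.0, 100.0), (700.0, 300.0), (0.0, 0.0)]:
    print(hits, misses, round(guarded_hit_rate(hits, misses), 4), low_hit_rate(hits, misses))
```

If idle periods are expected, one option is to combine the rule with a minimum-throughput condition so that it only evaluates when there is traffic.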
{
  "title": "Swytch Overview",
  "panels": [
    {
      "title": "Hit Rate",
      "type": "gauge",
      "targets": [
        {
          "expr": "rate(swytch_redis_cache_hits_total[5m]) / (rate(swytch_redis_cache_hits_total[5m]) + rate(swytch_redis_cache_misses_total[5m]))"
        }
      ]
    },
    {
      "title": "Memory Usage",
      "type": "gauge",
      "targets": [
        {
          "expr": "swytch_redis_memory_bytes / swytch_redis_memory_max_bytes"
        }
      ]
    },
    {
      "title": "Throughput",
      "type": "stat",
      "targets": [
        {
          "expr": "sum(rate(swytch_redis_commands_total[5m]))"
        }
      ]
    },
    {
      "title": "Connections",
      "type": "stat",
      "targets": [
        {
          "expr": "swytch_redis_connections_current"
        }
      ]
    }
  ]
}
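If the panel list grows, it can be easier to generate this JSON than to edit it by hand. The sketch below builds the same simplified structure from a list of (title, type, expression) tuples; `build_dashboard` is a hypothetical helper, and a dashboard imported into Grafana will likely need additional fields (data source, layout) that are not shown in this guide.

```python
import json

HIT_RATE = (
    "rate(swytch_redis_cache_hits_total[5m]) / "
    "(rate(swytch_redis_cache_hits_total[5m]) + rate(swytch_redis_cache_misses_total[5m]))"
)

# (title, panel type, PromQL expression), copied from the JSON above
PANELS = [
    ("Hit Rate", "gauge", HIT_RATE),
    ("Memory Usage", "gauge", "swytch_redis_memory_bytes / swytch_redis_memory_max_bytes"),
    ("Throughput", "stat", "sum(rate(swytch_redis_commands_total[5m]))"),
    ("Connections", "stat", "swytch_redis_connections_current"),
]


def build_dashboard(title, panels):
    """Return a dict with the same shape as the simplified dashboard JSON."""
    return {
        "title": title,
        "panels": [
            {"title": name, "type": kind, "targets": [{"expr": expr}]}
            for name, kind, expr in panels
        ],
    }


dashboard = build_dashboard("Swytch Overview", PANELS)
print(json.dumps(dashboard, indent=2))
```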
Commands Over Time:
sum by (command) (rate(swytch_redis_commands_total[5m]))
Latency Heatmap:
sum(rate(swytch_redis_latency_seconds_bucket[1m])) by (le)
Memory and Evictions:
# Left axis
swytch_redis_memory_bytes
# Right axis
rate(swytch_redis_evictions_total[5m])
L1 vs L2 Traffic (Tiered Mode):
rate(swytch_redis_cache_hits_total[5m]) # L1 hits
rate(swytch_redis_l2_hits_total[5m]) # L2 hits
rate(swytch_redis_l2_misses_total[5m]) # Total misses
The standard INFO command also provides statistics:
redis-cli INFO
Key sections:
# Server
redis_version:8.4.0-swytch
uptime_in_seconds:86400
# Memory
used_memory:1073741824
maxmemory:4294967296
# Stats
total_commands_processed:1234567890
keyspace_hits:1000000000
keyspace_misses:50000000
Note: Tiered storage statistics (L2 hits/misses/writes) are available via Prometheus metrics, not the INFO command.
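Where Prometheus is not available, the same headline ratios can be derived from INFO output. This is a minimal sketch that parses the `key:value` lines shown above; `parse_info` and `summarize` are illustrative names, and only the fields listed in this guide are assumed to exist.

```python
def parse_info(text):
    """Parse INFO output into a dict, skipping blank lines and '# Section' headers."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            info[key] = value
    return info


def summarize(info):
    """Compute keyspace hit rate and memory ratio from parsed INFO fields."""
    hits = int(info["keyspace_hits"])
    misses = int(info["keyspace_misses"])
    used = int(info["used_memory"])
    limit = int(info["maxmemory"])
    nan = float("nan")
    return {
        "hit_rate": hits / (hits + misses) if hits + misses else nan,
        "memory_ratio": used / limit if limit else nan,
    }


SAMPLE = """# Memory
used_memory:1073741824
maxmemory:4294967296
# Stats
keyspace_hits:1000000000
keyspace_misses:50000000
"""
print(summarize(parse_info(SAMPLE)))
```

For the sample values this gives a hit rate of roughly 95.2% and a memory ratio of 25%. Note that these are lifetime totals since startup, unlike the 5-minute rates used in the PromQL queries.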
Control verbosity with the -v and --debug flags:
swytch redis # Normal (errors and startup)
swytch redis -v # Verbose (warnings)
swytch redis --debug # Debug (all commands logged)
Logs are written to stderr as timestamped plain-text lines:
2024/01/15 10:00:00 redis server listening on 127.0.0.1:6379
2024/01/15 10:00:05 client connected from 192.168.1.100:45678
2024/01/15 10:00:10 WARNING: memory usage at 85%
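Lines in this shape are straightforward to split into fields before shipping them to an aggregator. The sketch below is inferred only from the three sample lines above: it assumes a `YYYY/MM/DD HH:MM:SS` timestamp followed by an optional upper-case `LEVEL:` prefix. Treating unprefixed lines as `INFO` is an assumption made for this example, not documented Swytch behaviour.

```python
import re
from datetime import datetime

# timestamp, optional "LEVEL: " prefix, then the message
LINE_RE = re.compile(r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (?:([A-Z]+): )?(.*)$")


def parse_log_line(line):
    """Return a dict with time, level and message, or None if the line does not match."""
    match = LINE_RE.match(line.rstrip())
    if match is None:
        return None
    stamp, level, message = match.groups()
    return {
        "time": datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S"),
        "level": level or "INFO",  # assumed default for lines without a prefix
        "message": message,
    }


print(parse_log_line("2024/01/15 10:00:10 WARNING: memory usage at 85%"))
```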
For production, pipe logs to your aggregation system:
# Systemd captures stdout/stderr automatically
journalctl -u swytch -f
# Docker
docker logs -f swytch-redis
# Kubernetes
kubectl logs -f deployment/swytch
Swytch does not currently support distributed tracing (OpenTelemetry/Jaeger). Instead, trace at the application level by instrumenting Redis client calls with your existing tracing infrastructure.
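As an illustration of application-level instrumentation, the sketch below times an arbitrary block and hands the result to a callback. Everything here is hypothetical: `timed_call` is not a Swytch or Redis client API, and `record` stands in for whatever span or metric call your tracing library provides.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed_call(name, record):
    """Time the enclosed block and report (name, seconds, error_type_or_None)."""
    start = time.perf_counter()
    error = None
    try:
        yield
    except Exception as exc:
        error = type(exc).__name__
        raise  # never swallow the caller's exception
    finally:
        record(name, time.perf_counter() - start, error)


spans = []


def record_span(name, seconds, error):
    # Stand-in for a tracer: a real integration would open and close a span here.
    spans.append((name, seconds, error))


with timed_call("GET user:42", record_span):
    time.sleep(0.001)  # stand-in for a Redis client call

print(spans[0][0], spans[0][2])
```

Wrapping calls at the client boundary keeps the measurement independent of the server, so the recorded durations include network time as well as Swytch's own command latency.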