## Observability

FerroCache exposes a hand-written Prometheus `/metrics` endpoint (no `prometheus` crate dependency) and ships a Grafana dashboard.
### Scraping

```yaml
# prometheus.yml
scrape_configs:
  - job_name: ferrocache
    metrics_path: /metrics
    static_configs:
      - targets:
          - ferrocache-1:3000
          - ferrocache-2:3000
          - ferrocache-3:3000
```
`/metrics` and `/health` stay open even with auth enabled — no bearer token required.
### Counters

| Metric | Meaning |
|---|---|
| `ferrocache_queries_total` | All `/query` requests received |
| `ferrocache_hits_total` | Queries that returned a cached entry |
| `ferrocache_misses_total` | Queries that fell through |
| `ferrocache_exact_match_hits_total` | Hits via the M27 pre-filter |
| `ferrocache_inserts_total` | All `/insert` requests received |
| `ferrocache_evictions_total` | LRU evictions |
| `ferrocache_expirations_total` | TTL expiries |
| `ferrocache_deletions_total` | Explicit `DELETE /entry/:uuid` |
| `ferrocache_invalidations_total` | Radius invalidations |
| `ferrocache_index_rebuilds_total` | HNSW rebuilds (ghost-ratio threshold) |
| `ferrocache_replication_retries_total` | Inter-node forward retries |
| `ferrocache_replication_failures_total` | Inter-node forward final failures |
| `ferrocache_read_repair_inserts_total` | Background read-repair re-inserts |
Each counter exists both at the top level and per namespace (label `namespace="..."`).
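For illustration, this is how a per-namespace series sits alongside its top-level counter in the exposition output (the values and namespace here are made up):

```text
# TYPE ferrocache_hits_total counter
ferrocache_hits_total 2226
ferrocache_hits_total{namespace="all-MiniLM-L6-v2::384"} 1284
```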
### Gauges

| Metric | Meaning |
|---|---|
| `ferrocache_entry_count` | Total entries across namespaces |
| `ferrocache_namespace_entry_count{namespace="..."}` | Entries per namespace |
| `ferrocache_peer_phi{peer="..."}` | Phi-accrual score per peer |
| `ferrocache_peers_alive` | Count of Alive peers |
| `ferrocache_peers_suspected` | Count of Suspected peers |
| `ferrocache_peers_dead` | Count of Dead peers |
| `ferrocache_ring_size` | Ring entries (`node_count` × `virtual_nodes`) |
### Histograms

`ferrocache_query_latency_seconds` and `ferrocache_insert_latency_seconds` are exposed as fixed 16-bucket histograms covering 100 µs to 10 s.
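With bucketed histograms, tail latency can be estimated in Prometheus via `histogram_quantile`. A sketch of a p99 query-latency expression, assuming the conventional `_bucket`/`le` series names of the Prometheus text format (worth confirming, since the endpoint is hand-written):

```promql
histogram_quantile(0.99, sum by (le) (rate(ferrocache_query_latency_seconds_bucket[5m])))
```

The `sum by (le)` aggregates across instances while keeping the bucket boundaries `histogram_quantile` needs.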
### What to alert on

| Alert | Condition | Why it matters |
|---|---|---|
| Hit rate dropped | `rate(ferrocache_hits_total[5m]) / rate(ferrocache_queries_total[5m]) < 0.5` for 10m | Cache thresholds may be too strict, or the workload shifted |
| Replication failures | `rate(ferrocache_replication_failures_total[5m]) > 0` | Cluster degraded — entries may be under-replicated |
| Dead peers | `ferrocache_peers_dead > 0` | A node is offline; the cluster is `replication_factor - 1` further failures from data loss |
| Eviction storm | `rate(ferrocache_evictions_total[1m]) > 100` | Working set exceeds `max_entries_per_namespace`; raise the cap or shed load |
| Suspected stuck | `ferrocache_peers_suspected > 0` for 5m | Network blip didn't clear; investigate |
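The conditions above translate directly into Prometheus alerting rules. A sketch of the hit-rate alert (file name, alert name, and label values are illustrative):

```yaml
# alerts.yml (illustrative)
groups:
  - name: ferrocache
    rules:
      - alert: FerroCacheHitRateDropped
        expr: rate(ferrocache_hits_total[5m]) / rate(ferrocache_queries_total[5m]) < 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "FerroCache hit rate below 50% for 10 minutes"
```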
### Grafana dashboard

The repo's `monitoring/` overlay starts Prometheus + Grafana alongside a 3-node FerroCache cluster:

```sh
docker compose -f docker-compose.yml -f monitoring/compose.overlay.yml up -d
# Grafana at http://localhost:3030 (admin/admin)
```
The dashboard has 8 panels:
- Hit rate (last 5m)
- Query throughput (queries/s, hits/s, misses/s)
- Insert throughput
- Per-namespace entry counts
- Cluster: peer phi values
- Cluster: alive / suspected / dead counts
- Eviction + expiration rates
- Replication: retries vs failures
Edit `monitoring/grafana/dashboards/ferrocache.json` to customize.
### Per-namespace insight

`/admin/entry-stats` returns the top-10 most-accessed entries per namespace:

```json
{
  "namespaces": {
    "all-MiniLM-L6-v2::384": [
      { "uuid": "...", "access_count": 1284, "last_accessed_at": 1714857600 },
      { "uuid": "...", "access_count": 942, "last_accessed_at": 1714857512 }
    ]
  }
}
```
Use this to identify hot keys, validate that high-traffic queries are actually cached, and find candidates for explicit pinning (via TTL or per-entry warmup).
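A minimal sketch of turning that response into a hot-key report, assuming only the response shape shown above (the threshold and sample data are illustrative):

```python
# Flatten an /admin/entry-stats response into a ranked hot-key list.
# Assumes the {"namespaces": {ns: [{"uuid", "access_count", ...}]}} shape above.

def hot_keys(stats: dict, min_access_count: int = 1000) -> list[tuple[str, str, int]]:
    """Return (namespace, uuid, access_count) rows at or above a traffic
    threshold, sorted by access_count descending."""
    rows = [
        (ns, entry["uuid"], entry["access_count"])
        for ns, entries in stats["namespaces"].items()
        for entry in entries
        if entry["access_count"] >= min_access_count
    ]
    return sorted(rows, key=lambda r: r[2], reverse=True)

# Sample response mirroring the JSON example (uuids shortened for clarity).
stats = {
    "namespaces": {
        "all-MiniLM-L6-v2::384": [
            {"uuid": "aaa", "access_count": 1284, "last_accessed_at": 1714857600},
            {"uuid": "bbb", "access_count": 942, "last_accessed_at": 1714857512},
        ]
    }
}
print(hot_keys(stats))  # only the 1284-count entry clears the default threshold
```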
### Logging

FerroCache uses `tracing` for structured logging. Configure verbosity via the `RUST_LOG` environment variable.
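For example, using the standard `tracing_subscriber::EnvFilter` directive syntax (the `ferrocache::cluster` target name is an assumption; match it to the actual crate and module names):

```sh
# Default to info, with extra detail from the cluster-membership module.
# `ferrocache::cluster` as the target is an assumption, not a confirmed name.
RUST_LOG=info,ferrocache::cluster=debug ./ferrocache
```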
Useful events to watch:

- `cluster: peer transition Alive → Suspected → Dead` — the failure detector firing.
- `replication: degraded` — a write fan-out lost a replica; reports `effective_replicas`.
- `index: rebuilding namespace` — ghost ratio crossed 20%.