Benchmarks

The right comparison for FerroCache is Redis, not GPTCache.

GPTCache is a Python library — it runs inside your process and its "latency" is a function call, not a network call. FerroCache is a service — like Redis, it has a network boundary by design, which is what lets it be shared across your entire fleet.

Single-operation latency

Measured on Apple M4 Pro, release build, 384-dim unit vectors, 1k pre-populated entries where applicable. Reproduce with cargo bench.

| Benchmark | p50 | p95 | p99 |
|---|---|---|---|
| Index insert (in-memory only) | 21.6 µs | | |
| Query hit (HTTP round-trip) | 0.44ms | 0.51ms | 0.54ms |
| Query miss (HTTP round-trip) | 0.42ms | 0.48ms | 0.50ms |
| Insert (includes WAL fsync) | 7.95ms | 8.36ms | 8.71ms |
| Exact-match pre-filter | 0.38ms | | |
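
The single-operation rows come from Criterion microbenchmarks (cargo bench). A rough sketch of the shape of such a bench, using a Vec-backed stand-in index for illustration only (this is not FerroCache's HNSW index and will not reproduce the numbers above):

```rust
// Illustrative Criterion bench (lives under benches/); the Vec stands in for the real index.
use criterion::{black_box, criterion_group, criterion_main, Criterion};

/// Build a 384-dim unit vector, matching the setup described above.
fn unit_vector(dim: usize) -> Vec<f32> {
    let v = vec![1.0f32; dim];
    let norm = (dim as f32).sqrt();
    v.into_iter().map(|x| x / norm).collect()
}

fn bench_index_insert(c: &mut Criterion) {
    let mut index: Vec<(u64, Vec<f32>)> = Vec::with_capacity(1_000_000);
    let mut next_id = 0u64;
    c.bench_function("index_insert_384d", |b| {
        b.iter(|| {
            next_id += 1;
            index.push((next_id, black_box(unit_vector(384))));
        })
    });
}

criterion_group!(benches, bench_index_insert);
criterion_main!(benches);
```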

Throughput under concurrency

Apple Silicon, cargo run --release, 384-dim embeddings, 5s per cell. Reproduce with make bench-concurrent.
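
The cells below were produced by make bench-concurrent. As a rough sketch of what a "5s per cell" measurement involves (N client threads driving an operation for five seconds, then aggregating ops/s and a p99), assuming a placeholder do_op in place of the real FerroCache HTTP requests, which are not shown:

```rust
// Illustrative throughput-cell harness; do_op stands in for a real insert/query.
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::time::{Duration, Instant};

fn run_cell<F>(clients: usize, do_op: F) -> (f64, Duration)
where
    F: Fn() + Clone + Send + 'static,
{
    let stop = Arc::new(AtomicBool::new(false));
    let handles: Vec<_> = (0..clients)
        .map(|_| {
            let stop = Arc::clone(&stop);
            let do_op = do_op.clone();
            std::thread::spawn(move || {
                let mut latencies = Vec::new();
                while !stop.load(Ordering::Relaxed) {
                    let t0 = Instant::now();
                    do_op();
                    latencies.push(t0.elapsed());
                }
                latencies
            })
        })
        .collect();

    std::thread::sleep(Duration::from_secs(5)); // 5s per cell, as above
    stop.store(true, Ordering::Relaxed);

    // Aggregate all client latencies into ops/s and a p99.
    let mut all: Vec<Duration> = handles.into_iter().flat_map(|h| h.join().unwrap()).collect();
    all.sort();
    let ops_per_sec = all.len() as f64 / 5.0;
    let p99_idx = ((all.len() as f64 * 0.99).ceil() as usize).saturating_sub(1);
    (ops_per_sec, all[p99_idx])
}

fn main() {
    // Placeholder operation: simulate ~0.5 ms of work per request.
    let (ops, p99) = run_cell(50, || std::thread::sleep(Duration::from_micros(500)));
    println!("{ops:.0} ops/s, p99 {p99:?}");
}
```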

The pre-M20 path took the WAL mutex for every insert, so each writer was serialised behind its own fsync(2). Group-commit instead coalesces concurrent inserts into a single batched write and one fsync.
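
A minimal sketch of the group-commit idea, assuming a leader-style flush: whichever writer reaches the flush lock first writes and fsyncs everything queued so far. This is illustrative, not FerroCache's actual WAL code, and it omits the batch-size cap implied by the "Group-commit (256)" row below.

```rust
// Illustrative group-commit WAL, not FerroCache's implementation.
use std::fs::{File, OpenOptions};
use std::io::Write;
use std::sync::Mutex;

struct Wal {
    pending: Mutex<Vec<Vec<u8>>>, // records queued by concurrent inserts
    flush: Mutex<File>,           // serialises the actual write + fsync
}

impl Wal {
    fn open(path: &str) -> std::io::Result<Self> {
        let file = OpenOptions::new().create(true).append(true).open(path)?;
        Ok(Self { pending: Mutex::new(Vec::new()), flush: Mutex::new(file) })
    }

    /// Called once per insert; returns when the record is durable.
    fn append(&self, record: Vec<u8>) -> std::io::Result<()> {
        self.pending.lock().unwrap().push(record);

        // Whoever gets the flush lock first drains the queue for everyone.
        let mut file = self.flush.lock().unwrap();
        let batch: Vec<Vec<u8>> = std::mem::take(&mut *self.pending.lock().unwrap());
        if batch.is_empty() {
            // Another writer already flushed (and fsynced) our record.
            return Ok(());
        }
        for rec in &batch {
            file.write_all(rec)?;
        }
        file.sync_data() // one fsync covers every record in the batch
    }
}
```

With a single client there is nothing to coalesce, which is consistent with the 0.7× single-client figure in the table below: the extra queueing costs a little without buying a shared fsync.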

Insert throughput (per-insert fsync vs default group-commit):

| Mode | 1 client | 10 clients | 50 clients | 100 clients |
|---|---|---|---|---|
| Per-insert fsync | 169/s | 183/s | 167/s | 183/s |
| Group-commit (256) | 122/s | 682/s | 1057/s | 1900/s |
| Speedup | 0.7× | 3.7× | 6.3× | 10.4× |

p99 insert latency at concurrency 100: 675ms (no group-commit) → 70ms (group-commit).

Query throughput (read path — no fsync):

| Workload | 1 client | 10 clients | 50 clients | 100 clients |
|---|---|---|---|---|
| Query (hit) | 1568/s | 3529/s | 3513/s | 3491/s |
| Query (miss) | 2632/s | 3498/s | 3513/s | 3579/s |

Eviction overhead (concurrency 100, 5K entry cap forcing rebuild on every batch):

| Setting | Insert ops/s | p99 insert |
|---|---|---|
| max_entries_per_namespace = None | 1900/s | 70ms |
| max_entries_per_namespace = 5000 | 1313/s | 1824ms |

LRU eviction adds ~30% throughput overhead under sustained pressure. The p99 spike comes from periodic HNSW rebuilds (20% ghost-ratio threshold) running under the index write lock — a known trade-off of inline rebuild for graph quality. See Eviction & TTL.
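
As a sketch of that mechanism (assumed shape only; FerroCache's real HNSW index and rebuild path are not shown): evicted entries stay in the graph as ghosts, and once ghosts exceed 20% of the index, the graph is rebuilt from the live entries while the write lock is still held.

```rust
// Illustrative ghost-ratio rebuild; the Vec is a stand-in for the HNSW graph.
use std::sync::RwLock;

struct Entry {
    id: u64,
    vector: Vec<f32>,
    ghost: bool, // evicted/expired but still present in the graph
}

struct Index {
    entries: Vec<Entry>,
    ghosts: usize,
}

struct Cache {
    index: RwLock<Index>,
    ghost_ratio_threshold: f64, // 0.20 per the note above
}

impl Cache {
    fn evict(&self, id: u64) {
        let mut guard = self.index.write().unwrap();
        let idx = &mut *guard; // plain &mut Index so disjoint fields can be borrowed

        // Logical delete: mark the entry as a ghost, leave the graph intact.
        if let Some(e) = idx.entries.iter_mut().find(|e| e.id == id && !e.ghost) {
            e.ghost = true;
            idx.ghosts += 1;
        }

        // Inline rebuild once ghosts exceed the threshold. Queries block on the
        // write lock for the duration, which is the p99 spike measured above.
        let ratio = idx.ghosts as f64 / idx.entries.len().max(1) as f64;
        if ratio > self.ghost_ratio_threshold {
            idx.entries.retain(|e| !e.ghost); // stand-in for re-linking a fresh graph
            idx.ghosts = 0;
        }
    }
}
```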

Comparison vs GPTCache

Same workload (200 inserts, 600 expected-hit queries, 50 unrelated; 384-dim embeddings via sentence-transformers all-MiniLM-L6-v2; threshold 0.90; Apple Silicon). Reproduce with make benchmark-vs-gptcache.

| Metric | FerroCache | GPTCache | Notes |
|---|---|---|---|
| Hit rate (threshold 0.90) | 99.8% | N/A † | Same embedding model |
| False hits on unrelated | 0 | N/A † | |
| Query latency p50 | 0.44ms | N/A † | FerroCache: HTTP round-trip |
| Query latency p99 | 0.84ms | N/A † | |
| Insert latency p50 | 8.05ms | N/A † | FerroCache: WAL fsync |
| Insert latency p99 | 10.88ms | N/A † | |
| RSS after 200-entry seed | 14.1 MB | N/A † | Resident set size |
| Insert throughput (concurrency 50) | 2476/s | N/A | GPTCache is single-threaded |

† GPTCache requires faiss-cpu + onnxruntime, which currently lack wheels for Python 3.13+. The benchmark harness runs GPTCache in a child process so a native segfault doesn't take down the parent — on this machine (Python 3.14) the child SIGSEGVs at faiss init. The script reports N/A and continues; on Python ≤ 3.12 it produces real comparison numbers.
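
The crash-isolation idea from that footnote, sketched in Rust for illustration only (the real harness is whatever make benchmark-vs-gptcache invokes, and the Python script path below is hypothetical):

```rust
// Run the GPTCache half of the comparison in a child process so a native
// segfault (e.g. in faiss) cannot take down the harness.
use std::process::Command;

fn gptcache_numbers() -> Option<String> {
    let out = Command::new("python3")
        .arg("benchmarks/gptcache_side.py") // hypothetical child script
        .output()
        .ok()?;
    if out.status.success() {
        Some(String::from_utf8_lossy(&out.stdout).into_owned())
    } else {
        // Killed by a signal (or any non-zero exit): report N/A and keep going.
        None
    }
}

fn main() {
    match gptcache_numbers() {
        Some(report) => println!("GPTCache results:\n{report}"),
        None => println!("GPTCache results: N/A (child process failed)"),
    }
}
```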

Feature comparison vs GPTCache

| Feature | FerroCache | GPTCache |
|---|---|---|
| Architecture | Service (HTTP) | Library (in-process) |
| Multi-node cluster | ✅ | ❌ |
| Shared across fleet | ✅ | ❌ (per-process) |
| WAL durability | ✅ (fsync) | ❌ (in-memory) |
| Survives app restart | ✅ | ❌ |
| Tenant isolation | ✅ (cache_scope) | |
| Conversation scoping | ✅ | |
| Exact-match pre-filter | ✅ | |
| TTL per entry | ✅ | ⚠️ partial |
| LRU eviction | ✅ | |
| Any language client | ✅ | ❌ (Python only) |
| Prometheus metrics | ✅ | |

Reproducing benchmarks

# Criterion microbenchmarks
cargo bench

# Concurrent HTTP throughput
make bench-concurrent

# vs GPTCache (requires Python ≤ 3.12 for working faiss/onnxruntime)
make benchmark-vs-gptcache

# 44-assertion cluster integration suite (sanity, not throughput)
make cluster-test