Redis — Caching¶
OWNER: gerco (infrastructure)
ALSO_USED_BY: urszula, maxim (application usage)
LAST_VERIFIED: 2026-03-26
GE_STACK_VERSION: Redis 7.4
Overview¶
Redis serves as GE's ephemeral cache layer alongside its role as message broker. This page covers caching strategies, TTL management, invalidation, and stampede prevention. PostgreSQL is always the SSOT — Redis cache is a performance optimization, never a data store.
Cache Strategy Decision Tree¶
Is the data read-heavy with infrequent writes?
YES → Cache-Aside (lazy loading)
Is strong consistency required (write must be immediately visible)?
YES → Write-Through
Is the workload write-heavy with tolerable staleness?
YES → Write-Behind (write-back)
Is the data expensive to compute but rarely changes?
YES → Cache-Aside with long TTL + event-based invalidation
CHECK: Agent is adding caching to a feature. IF: Agent caches without defining an invalidation strategy. THEN: STOP. Every cache MUST have a defined invalidation path. Stale data is a bug.
Cache-Aside (Default Pattern)¶
Cache-aside is GE's default caching strategy for read-heavy data. The application manages both the cache and the database.
```python
async def get_agent(agent_id: str) -> Agent:
    # 1. Check cache
    cached = await redis.get(f"agent:{agent_id}")
    if cached:
        return Agent.model_validate_json(cached)

    # 2. Cache miss — fetch from PostgreSQL
    agent = await db.fetch_agent(agent_id)
    if agent is None:
        # Negative cache — prevent repeated DB hits for missing data
        await redis.set(f"agent:{agent_id}:null", "1", ex=60)
        raise AgentNotFound(agent_id)

    # 3. Store in cache with TTL
    await redis.set(
        f"agent:{agent_id}",
        agent.model_dump_json(),
        ex=300,  # 5 minutes
    )
    return agent
```
Cache-Aside Invalidation¶
```python
async def update_agent(agent_id: str, data: dict) -> Agent:
    # 1. Update PostgreSQL (SSOT)
    agent = await db.update_agent(agent_id, data)

    # 2. Invalidate cache (delete, not update)
    await redis.delete(f"agent:{agent_id}")
    return agent
```
CHECK: Agent is invalidating cache. IF: Agent updates the cache value instead of deleting it. THEN: Delete the key. Let the next read repopulate. Update-on-write is error-prone (race conditions between concurrent writes).
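To see why update-on-write races, here is a toy in-memory simulation (plain dicts standing in for PostgreSQL and Redis, not GE code): two writers commit to the database in one order but their cache writes land in the opposite order, leaving the cache stale until its TTL expires.

```python
# Toy simulation of the update-on-write race. `db` and `cache` are plain
# dicts standing in for PostgreSQL and Redis; the interleaving below is
# one that real concurrent writers can produce.
db, cache = {}, {}

def write_db(value):
    db["agent:martijn"] = value

def write_cache(value):
    cache["agent:martijn"] = value

# Writer A and writer B race. PostgreSQL applies A then B, but their
# cache writes arrive in the opposite order:
write_db("v1")      # A updates PostgreSQL
write_db("v2")      # B updates PostgreSQL (B is the final state)
write_cache("v2")   # B updates the cache
write_cache("v1")   # A's delayed cache write clobbers B's

assert db["agent:martijn"] == "v2"      # SSOT is correct
assert cache["agent:martijn"] == "v1"   # cache is stale until TTL

# Delete-on-invalidate avoids this: deletes are idempotent, so any
# interleaving of deletes leaves the key absent and the next read
# repopulates from PostgreSQL.
```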
Write-Through¶
Use write-through when the cached value MUST reflect the latest write immediately. GE uses this for session data and real-time agent status.
```python
async def update_agent_status(agent_id: str, status: str) -> None:
    # 1. Write to PostgreSQL
    await db.update_agent_status(agent_id, status)

    # 2. Write to cache (atomic with DB write intent)
    await redis.set(
        f"agent:{agent_id}:status",
        status,
        ex=600,  # 10 minutes — safety TTL even for write-through
    )
```
ANTI_PATTERN: Write-through without a TTL. FIX: Always set a TTL, even on write-through keys. If the invalidation path breaks, TTL is the safety net.
TTL Strategies¶
Default TTLs by Data Type¶
| Data Type | TTL | Rationale |
|---|---|---|
| Agent registry data | 300s (5 min) | Changes infrequently, agents read often |
| Agent status | 600s (10 min) | Write-through, safety TTL |
| Work item metadata | 120s (2 min) | Short-lived, frequently updated |
| API response cache | 60s (1 min) | External data, freshness matters |
| Session tokens | 3600s (1 hr) | Matches auth session lifetime |
| Negative cache (missing data) | 60s (1 min) | Re-check DB frequently |
| Rate limit counters | Varies | Matches window size (1s, 1min, 1hr) |
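The "Varies" row can be made concrete: a rate-limit counter's TTL should equal its window. A minimal sketch below (the `WINDOW_SECONDS` mapping and `ratelimit_key` helper are illustrative, not existing GE code):

```python
# Window sizes for rate-limit counters; the TTL always equals the window
# so the counter resets itself when the window rolls over.
WINDOW_SECONDS = {"1s": 1, "1min": 60, "1hr": 3600}

def ratelimit_key(scope: str, identifier: str, window: str) -> tuple[str, int]:
    """Build the counter key and its TTL per the GE naming convention."""
    if window not in WINDOW_SECONDS:
        raise ValueError(f"unknown window: {window}")
    return f"ratelimit:{scope}:{identifier}:{window}", WINDOW_SECONDS[window]

key, ttl = ratelimit_key("api", "martijn", "1min")
assert key == "ratelimit:api:martijn:1min"
assert ttl == 60
```

On the Redis side this pairs with `INCR` plus `EXPIRE` (set only on the first increment) so the counter and its window expire together.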
TTL Jitter¶
ANTI_PATTERN: Setting identical TTLs on many related keys. FIX: Add random jitter to prevent synchronized expiration (cache stampede).
```python
import random

BASE_TTL = 300  # 5 minutes
JITTER = 30     # +/- 30 seconds

ttl = BASE_TTL + random.randint(-JITTER, JITTER)
await redis.set(f"agent:{agent_id}", data, ex=ttl)
```
Cache Stampede Prevention¶
A cache stampede occurs when a popular key expires and hundreds of concurrent requests all miss the cache simultaneously, overwhelming the database.
Technique 1: Mutex Lock (GE Default)¶
Only one request rebuilds the cache; others wait or get stale data.
```python
import asyncio
import json
import random
from typing import Any

async def get_with_mutex(key: str, fetch_fn, ttl: int = 300) -> Any:
    # 1. Try cache
    cached = await redis.get(key)
    if cached:
        return json.loads(cached)

    # 2. Try to acquire rebuild lock
    lock_key = f"lock:{key}"
    acquired = await redis.set(lock_key, "1", nx=True, ex=10)
    if acquired:
        try:
            # 3. Rebuild cache
            data = await fetch_fn()
            jitter = random.randint(-30, 30)
            await redis.set(key, json.dumps(data), ex=ttl + jitter)
            return data
        finally:
            await redis.delete(lock_key)
    else:
        # 4. Another process is rebuilding — wait briefly and retry
        await asyncio.sleep(0.1)
        cached = await redis.get(key)
        if cached:
            return json.loads(cached)
        # Fallback to direct fetch if still no cache
        return await fetch_fn()
```
Technique 2: Probabilistic Early Refresh (High-Traffic Keys)¶
For extremely high-traffic keys where even brief lock contention is unacceptable.
```python
import json
import math
import random
from typing import Any

async def get_with_early_refresh(
    key: str, fetch_fn, ttl: int = 300, beta: float = 1.0
) -> Any:
    # Read value + remaining TTL in one round trip
    pipe = redis.pipeline()
    pipe.get(key)
    pipe.ttl(key)
    cached, remaining_ttl = await pipe.execute()

    if cached and remaining_ttl > 0:
        # Probabilistic early refresh: -beta * log(rand) is a positive,
        # exponentially distributed draw, so the chance of refreshing
        # rises as the remaining TTL approaches 0.
        if remaining_ttl + beta * math.log(random.random()) > 0:
            return json.loads(cached)

    # Cache miss or early refresh triggered
    data = await fetch_fn()
    jitter = random.randint(-30, 30)
    await redis.set(key, json.dumps(data), ex=ttl + jitter)
    return data
```
Technique 3: Negative Caching¶
ANTI_PATTERN: Not caching "not found" results. FIX: Cache negative results with a short TTL to prevent stampedes on missing data.
```python
async def get_agent(agent_id: str) -> Agent | None:
    null_key = f"agent:{agent_id}:null"
    if await redis.exists(null_key):
        return None  # Known missing — skip DB

    cached = await redis.get(f"agent:{agent_id}")
    if cached:
        return Agent.model_validate_json(cached)

    agent = await db.fetch_agent(agent_id)
    if agent is None:
        await redis.set(null_key, "1", ex=60)  # Cache the miss for 1 minute
        return None

    await redis.set(f"agent:{agent_id}", agent.model_dump_json(), ex=300)
    return agent
```
Cache Key Naming¶
GE convention for cache keys:
```
{entity}:{identifier}                    → agent:martijn
{entity}:{identifier}:{field}            → agent:martijn:status
{entity}:{identifier}:null               → agent:nonexistent:null (negative cache)
{scope}:{entity}:{identifier}            → cache:api:weather:amsterdam
lock:{key}                               → lock:agent:martijn (mutex)
dedup:{work_item_id}                     → dedup:abc123 (stream dedup)
ratelimit:{scope}:{identifier}:{window}  → ratelimit:api:martijn:1min
```
ANTI_PATTERN: Using unstructured or inconsistent key names.
FIX: Follow the {entity}:{identifier}:{qualifier} pattern. Consistent naming enables monitoring and bulk operations with SCAN.
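A small helper can enforce the convention at one choke point instead of scattering f-strings through the codebase (the `cache_key` name is illustrative, not an existing GE helper):

```python
def cache_key(entity: str, identifier: str, *qualifiers: str) -> str:
    """Build a key following the {entity}:{identifier}:{qualifier} convention."""
    parts = (entity, identifier, *qualifiers)
    for part in parts:
        # Reject empty or colon-containing segments so SCAN patterns
        # like "agent:*" stay unambiguous.
        if not part or ":" in part:
            raise ValueError(f"invalid key segment: {part!r}")
    return ":".join(parts)

assert cache_key("agent", "martijn") == "agent:martijn"
assert cache_key("agent", "martijn", "status") == "agent:martijn:status"
```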
GE-Specific Conventions¶
- PostgreSQL is always SSOT — Cache is optimization only.
- Every cache key has a TTL — No permanent cache entries.
- TTL jitter on batch operations — Prevent synchronized expiration.
- Mutex lock for expensive rebuilds — Prevent stampede.
- Negative caching — Cache "not found" results to protect DB.
- Delete on invalidation — Never update cache directly on write.
- Max memory 256MB — `allkeys-lru` eviction. Do not cache large payloads.
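The memory bound in the last bullet corresponds to a redis.conf fragment like the following (a sketch of the stated convention; verify the values against the deployed configuration):

```
maxmemory 256mb
maxmemory-policy allkeys-lru
```

Under `allkeys-lru`, Redis evicts the least recently used keys once the bound is hit, which is safe here precisely because PostgreSQL is the SSOT and every cached value can be rebuilt.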
Cross-References¶
READ_ALSO: wiki/docs/stack/redis/index.md
READ_ALSO: wiki/docs/stack/redis/streams.md
READ_ALSO: wiki/docs/stack/redis/patterns.md
READ_ALSO: wiki/docs/stack/redis/pitfalls.md