Redis — Caching

OWNER: gerco (infrastructure) ALSO_USED_BY: urszula, maxim (application usage) LAST_VERIFIED: 2026-03-26 GE_STACK_VERSION: Redis 7.4


Overview

Redis serves as GE's ephemeral cache layer alongside its role as message broker. This page covers caching strategies, TTL management, invalidation, and stampede prevention. PostgreSQL is always the SSOT — Redis cache is a performance optimization, never a data store.

Cache Strategy Decision Tree

Is the data read-heavy with infrequent writes?
  YES → Cache-Aside (lazy loading)

Is strong consistency required (write must be immediately visible)?
  YES → Write-Through

Is the workload write-heavy with tolerable staleness?
  YES → Write-Behind (write-back)

Is the data expensive to compute but rarely changes?
  YES → Cache-Aside with long TTL + event-based invalidation

CHECK: Agent is adding caching to a feature. IF: Agent caches without defining an invalidation strategy. THEN: STOP. Every cache MUST have a defined invalidation path. Stale data is a bug.

Cache-Aside (Default Pattern)

Cache-aside is GE's default caching strategy for read-heavy data. The application manages both the cache and the database.

async def get_agent(agent_id: str) -> Agent:
    # 1. Check negative cache — known-missing IDs skip the DB entirely
    if await redis.exists(f"agent:{agent_id}:null"):
        raise AgentNotFound(agent_id)

    # 2. Check cache
    cached = await redis.get(f"agent:{agent_id}")
    if cached:
        return Agent.model_validate_json(cached)

    # 3. Cache miss — fetch from PostgreSQL
    agent = await db.fetch_agent(agent_id)
    if agent is None:
        # Negative cache — prevent repeated DB hits for missing data
        await redis.set(f"agent:{agent_id}:null", "1", ex=60)
        raise AgentNotFound(agent_id)

    # 4. Store in cache with TTL
    await redis.set(
        f"agent:{agent_id}",
        agent.model_dump_json(),
        ex=300,  # 5 minutes
    )
    return agent

Cache-Aside Invalidation

async def update_agent(agent_id: str, data: dict) -> Agent:
    # 1. Update PostgreSQL (SSOT)
    agent = await db.update_agent(agent_id, data)

    # 2. Invalidate cache (delete, not update)
    await redis.delete(f"agent:{agent_id}")

    return agent

CHECK: Agent is invalidating cache. IF: Agent updates the cache value instead of deleting it. THEN: Delete the key. Let the next read repopulate. Update-on-write is error-prone (race conditions between concurrent writes).

Write-Through

Use write-through when the cached value MUST reflect the latest write immediately. GE uses this for session data and real-time agent status.

async def update_agent_status(agent_id: str, status: str) -> None:
    # 1. Write to PostgreSQL
    await db.update_agent_status(agent_id, status)

    # 2. Write to cache (atomic with DB write intent)
    await redis.set(
        f"agent:{agent_id}:status",
        status,
        ex=600  # 10 minutes — safety TTL even for write-through
    )

ANTI_PATTERN: Write-through without a TTL. FIX: Always set a TTL, even on write-through keys. If the invalidation path breaks, TTL is the safety net.

TTL Strategies

Default TTLs by Data Type

Data Type                      TTL            Rationale
Agent registry data            300s (5 min)   Changes infrequently, agents read often
Agent status                   600s (10 min)  Write-through, safety TTL
Work item metadata             120s (2 min)   Short-lived, frequently updated
API response cache             60s (1 min)    External data, freshness matters
Session tokens                 3600s (1 hr)   Matches auth session lifetime
Negative cache (missing data)  60s (1 min)    Re-check DB frequently
Rate limit counters            Varies         Matches window size (1s, 1min, 1hr)
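Keeping these defaults in one place stops call sites from scattering magic numbers. A minimal sketch — the registry name, keys, and fallback value are illustrative assumptions, not GE's actual module:

```python
# Central TTL registry (illustrative — names are assumptions).
# Values mirror the table above, in seconds.
DEFAULT_TTLS: dict[str, int] = {
    "agent": 300,         # registry data, 5 min
    "agent_status": 600,  # write-through safety TTL, 10 min
    "work_item": 120,     # 2 min
    "api_response": 60,   # 1 min
    "session": 3600,      # 1 hr, matches auth session lifetime
    "negative": 60,       # 1 min, re-check DB for missing data
}


def ttl_for(data_type: str) -> int:
    """Look up the default TTL, falling back to a conservative 60s."""
    return DEFAULT_TTLS.get(data_type, 60)
```

A lookup function (rather than bare constants) makes it easy to log or alert when an unknown data type falls through to the default.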

TTL Jitter

ANTI_PATTERN: Setting identical TTLs on many related keys. FIX: Add random jitter to prevent synchronized expiration (cache stampede).

import random

BASE_TTL = 300  # 5 minutes
JITTER = 30     # +/- 30 seconds

ttl = BASE_TTL + random.randint(-JITTER, JITTER)
await redis.set(f"agent:{agent_id}", data, ex=ttl)

Cache Stampede Prevention

A cache stampede occurs when a popular key expires and hundreds of concurrent requests all miss the cache simultaneously, overwhelming the database.

Technique 1: Mutex Lock (GE Default)

Only one request rebuilds the cache; the others wait briefly and retry, falling back to a direct fetch if the cache is still empty.

import asyncio
import json
import random
from typing import Any


async def get_with_mutex(key: str, fetch_fn, ttl: int = 300) -> Any:
    # 1. Try cache
    cached = await redis.get(key)
    if cached:
        return json.loads(cached)

    # 2. Try to acquire rebuild lock
    lock_key = f"lock:{key}"
    acquired = await redis.set(lock_key, "1", nx=True, ex=10)

    if acquired:
        try:
            # 3. Rebuild cache
            data = await fetch_fn()
            jitter = random.randint(-30, 30)
            await redis.set(key, json.dumps(data), ex=ttl + jitter)
            return data
        finally:
            await redis.delete(lock_key)
    else:
        # 4. Another process is rebuilding — wait briefly and retry
        await asyncio.sleep(0.1)
        cached = await redis.get(key)
        if cached:
            return json.loads(cached)
        # Fallback to direct fetch if still no cache
        return await fetch_fn()

Technique 2: Probabilistic Early Refresh (High-Traffic Keys)

For extremely high-traffic keys where even brief lock contention is unacceptable.

import json
import math
import random
from typing import Any


async def get_with_early_refresh(
    key: str, fetch_fn, ttl: int = 300, beta: float = 1.0
) -> Any:
    # Read value + remaining TTL in one round trip
    pipe = redis.pipeline()
    pipe.get(key)
    pipe.ttl(key)
    cached, remaining_ttl = await pipe.execute()

    if cached and remaining_ttl > 0:
        # XFetch-style check: -beta * log(U) is a positive random gap.
        # Serve the cached value unless that gap exceeds the remaining
        # TTL — so the chance of refreshing rises as expiry approaches.
        if remaining_ttl + beta * math.log(random.random()) > 0:
            return json.loads(cached)

    # Cache miss or early refresh triggered
    data = await fetch_fn()
    jitter = random.randint(-30, 30)
    await redis.set(key, json.dumps(data), ex=ttl + jitter)
    return data
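For this simplified XFetch variant (beta only, no compute-time term), the early-refresh probability has a closed form: since -log(U) is exponentially distributed, P(refresh) = exp(-remaining_ttl / beta). A quick sketch (helper name is illustrative) to see how beta tunes it:

```python
import math


def early_refresh_probability(remaining_ttl: float, beta: float) -> float:
    """P(-beta * log(U) >= remaining_ttl) for U ~ Uniform(0, 1)."""
    return math.exp(-remaining_ttl / beta)


# With beta=1, refresh is vanishingly rare until the last seconds:
print(round(early_refresh_probability(10, 1.0), 6))  # ~0.000045
print(round(early_refresh_probability(1, 1.0), 3))   # ~0.368
# Raising beta starts refreshes earlier in the key's lifetime:
print(round(early_refresh_probability(10, 5.0), 3))  # ~0.135
```

In practice this means beta=1 only spreads rebuilds over the final few seconds of the TTL; raise beta for keys whose rebuild is slow enough to need a longer head start.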

Technique 3: Negative Caching

ANTI_PATTERN: Not caching "not found" results. FIX: Cache negative results with a short TTL to prevent stampedes on missing data.

async def get_agent(agent_id: str) -> Agent | None:
    null_key = f"agent:{agent_id}:null"
    if await redis.exists(null_key):
        return None  # Known missing — skip DB

    cached = await redis.get(f"agent:{agent_id}")
    if cached:
        return Agent.model_validate_json(cached)

    agent = await db.fetch_agent(agent_id)
    if agent is None:
        await redis.set(null_key, "1", ex=60)  # Cache miss for 1 minute
        return None

    await redis.set(f"agent:{agent_id}", agent.model_dump_json(), ex=300)
    return agent

Cache Key Naming

GE convention for cache keys:

{entity}:{identifier}                    → agent:martijn
{entity}:{identifier}:{field}            → agent:martijn:status
{entity}:{identifier}:null               → agent:nonexistent:null (negative cache)
{scope}:{entity}:{identifier}            → cache:api:weather:amsterdam
lock:{key}                               → lock:agent:martijn (mutex)
dedup:{work_item_id}                     → dedup:abc123 (stream dedup)
ratelimit:{scope}:{identifier}:{window}  → ratelimit:api:martijn:1min

ANTI_PATTERN: Using unstructured or inconsistent key names. FIX: Follow the {entity}:{identifier}:{qualifier} pattern. Consistent naming enables monitoring and bulk operations with SCAN.
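The convention is easier to enforce with a tiny builder than with hand-written f-strings at every call site. A sketch, assuming a helper like this exists (the function name is an assumption):

```python
def cache_key(entity: str, identifier: str, *qualifiers: str) -> str:
    """Build a colon-delimited key per the {entity}:{identifier}:{qualifier} convention."""
    parts = [entity, identifier, *qualifiers]
    for part in parts:
        # A stray ':' inside a segment would silently change the key hierarchy
        if ":" in part:
            raise ValueError(f"key segment may not contain ':': {part!r}")
    return ":".join(parts)


# cache_key("agent", "martijn")            -> "agent:martijn"
# cache_key("agent", "martijn", "status")  -> "agent:martijn:status"
```

Centralizing key construction also gives one place to add a namespace prefix later without touching every caller.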

GE-Specific Conventions

  • PostgreSQL is always SSOT — Cache is optimization only.
  • Every cache key has a TTL — No permanent cache entries.
  • TTL jitter on batch operations — Prevent synchronized expiration.
  • Mutex lock for expensive rebuilds — Prevent stampede.
  • Negative caching — Cache "not found" results to protect DB.
  • Delete on invalidation — Never update cache directly on write.
  • Max memory 256MB, allkeys-lru eviction. Do not cache large payloads.
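The memory cap in the last bullet corresponds to two redis.conf directives. A sketch of the relevant fragment — the deployed instance may set these via command-line flags or CONFIG SET instead:

```
maxmemory 256mb
maxmemory-policy allkeys-lru
```

With allkeys-lru, any key (TTL or not) can be evicted under memory pressure, which is another reason the cache must never be the only copy of a value.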

Cross-References

READ_ALSO: wiki/docs/stack/redis/index.md
READ_ALSO: wiki/docs/stack/redis/streams.md
READ_ALSO: wiki/docs/stack/redis/patterns.md
READ_ALSO: wiki/docs/stack/redis/pitfalls.md