Redis — Pitfalls¶
OWNER: gerco (infrastructure) ALSO_USED_BY: urszula, maxim (application usage) LAST_VERIFIED: 2026-03-26 GE_STACK_VERSION: Redis 7.4
Overview¶
Known failure modes in Redis usage within GE. Agents MUST check this page before writing code that interacts with Redis. New pitfalls are added here when discovered — items are NEVER removed.
Memory Exhaustion¶
Unbounded Streams¶
ANTI_PATTERN: XADD without MAXLEN — streams grow without limit.
FIX: Every XADD MUST include MAXLEN ~ {limit}. Per-agent: 100. System: 1000.
This is GE's most expensive Redis mistake to date. An unbounded `triggers.*` stream grew to 180MB before it was detected, pushing Redis past its 256MB `maxmemory` limit and triggering `allkeys-lru` eviction of active cache keys.
ADDED_FROM: ge-infrastructure-2026-02, unbounded stream incident
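A minimal sketch of the FIX, assuming the async redis-py/aioredis client used elsewhere on this page. The prefix-based limit selection is illustrative, not a documented GE naming rule:

```python
# GE limits from this page: per-agent streams 100, system streams 1000.
PER_AGENT_MAXLEN = 100
SYSTEM_MAXLEN = 1000

def stream_maxlen(stream: str) -> int:
    """Pick the GE MAXLEN cap for a stream (prefix check is illustrative)."""
    return SYSTEM_MAXLEN if stream.startswith("ge:") else PER_AGENT_MAXLEN

async def publish(redis, stream: str, fields: dict) -> str:
    # approximate=True emits MAXLEN ~ {limit}: trims lazily at node
    # boundaries, which is far cheaper than an exact trim.
    return await redis.xadd(
        stream, fields,
        maxlen=stream_maxlen(stream),
        approximate=True,
    )
```

Routing every XADD through one helper like this makes "forgot MAXLEN" structurally impossible rather than a code-review item.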
Keys Without TTL¶
ANTI_PATTERN: Setting cache keys without EXPIRE/EX. FIX: Every cache key MUST have a TTL. Even write-through keys get a safety TTL.
# Audit: find keys without TTL
# Run from redis-cli (not in application code)
# WARNING: SCAN only — never KEYS
redis-cli -p 6381 --scan --pattern "cache:*" | while read -r key; do
  ttl=$(redis-cli -p 6381 TTL "$key")
  if [ "$ttl" = "-1" ]; then
    echo "NO TTL: $key"
  fi
done
Big Keys¶
ANTI_PATTERN: Storing large values (>1MB) in a single key. FIX: Break large data into smaller keys or store in PostgreSQL with a Redis pointer.
Big keys cause:
- Slow operations (serialization/deserialization on every read/write).
- Memory fragmentation.
- Blocking during deletion (DEL on a 10MB key blocks the event loop).
CHECK: Agent is caching a large payload. IF: Payload size exceeds 100KB. THEN: Reconsider. Cache only the fields needed, or store in PostgreSQL with a lightweight Redis pointer.
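A sketch of that CHECK as code, assuming the async Redis client used elsewhere on this page; `pg.store_blob` is a hypothetical PostgreSQL helper, not a real API:

```python
INLINE_CACHE_LIMIT = 100 * 1024  # 100KB threshold from the CHECK above

def cache_plan(payload: bytes) -> str:
    """Decide between caching inline and storing a lightweight pointer."""
    return "inline" if len(payload) <= INLINE_CACHE_LIMIT else "pointer"

async def cache_payload(redis, pg, key: str, payload: bytes) -> None:
    if cache_plan(payload) == "inline":
        await redis.set(key, payload, ex=3600)
    else:
        row_id = await pg.store_blob(payload)           # hypothetical PG helper
        await redis.set(key, f"pg:{row_id}", ex=3600)   # pointer only in Redis
```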
No maxmemory Configuration¶
GE sets maxmemory 256mb with maxmemory-policy allkeys-lru.
If maxmemory is not set, Redis grows until the OS kills it (OOM).
CHECK: Agent is deploying Redis configuration.
IF: maxmemory is not set.
THEN: ALWAYS set maxmemory. GE standard: 256MB.
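As a config fragment, the GE standard stated above looks like:

```
# redis.conf — GE standard memory bounds (values from this page)
maxmemory 256mb
maxmemory-policy allkeys-lru
```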
The KEYS Command¶
ANTI_PATTERN: Using KEYS * or KEYS pattern:* in application code.
FIX: Use SCAN with a cursor. KEYS performs an O(N) scan of the entire keyspace, blocking the single-threaded event loop for the duration. On a large keyspace, this freezes ALL Redis operations for seconds.
# WRONG — blocks Redis
keys = await redis.keys("agent:*")
# CORRECT — iterates without blocking
keys = []
async for key in redis.scan_iter(match="agent:*", count=100):
    keys.append(key)
CHECK: Agent is searching for keys matching a pattern. IF: Code uses the KEYS command. THEN: Replace with SCAN. No exceptions, not even in admin tools or one-off scripts that run against production Redis.
ADDED_FROM: ge-infrastructure-2026-01, KEYS caused 3-second freeze during health check
Slow Commands¶
Commands to Avoid Under Load¶
| Command | Problem | Alternative |
|---|---|---|
| `KEYS *` | O(N) full keyspace scan, blocks event loop | `SCAN` |
| `SMEMBERS` on large sets | Returns all members at once | `SSCAN` |
| `HGETALL` on large hashes | Returns all fields at once | `HSCAN` or `HMGET` specific fields |
| `LRANGE 0 -1` on large lists | Returns entire list | Paginate with `LRANGE 0 99` |
| `ZRANGEBYSCORE` without `LIMIT` | Returns unbounded results | Add `LIMIT offset count` |
| `FLUSHALL` / `FLUSHDB` | Blocks until complete | `FLUSHALL ASYNC` (Redis 4.0+) |
| `DEL` on large keys | Blocks during free | `UNLINK` (async delete) |
CHECK: Agent is reading from a Redis collection (set, hash, list, sorted set).
IF: Collection may contain more than 100 elements.
THEN: Use the *SCAN variant or paginate. Never fetch all at once.
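Hedged sketches of the table's alternatives, using the async redis-py API assumed elsewhere on this page (`lrange_pages` is an illustrative helper, not a library call):

```python
PAGE = 100  # fetch at most 100 elements per round trip

async def hash_fields(redis, key: str, fields: list[str]) -> dict:
    """HMGET the named fields instead of HGETALL on a large hash."""
    values = await redis.hmget(key, fields)
    return dict(zip(fields, values))

async def set_members(redis, key: str):
    """SSCAN iterates the set in cursor-sized chunks; never SMEMBERS."""
    async for member in redis.sscan_iter(key, count=PAGE):
        yield member

def lrange_pages(length: int, page: int = PAGE) -> list[tuple[int, int]]:
    """Paginated LRANGE windows instead of LRANGE 0 -1 (pure helper)."""
    return [(i, min(i + page, length) - 1) for i in range(0, length, page)]
```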
Monitoring Slow Commands¶
# Enable slow log (commands taking >10ms)
CONFIG SET slowlog-log-slower-than 10000
CONFIG SET slowlog-max-len 128
# View slow log
SLOWLOG GET 10
Connection Pooling¶
ANTI_PATTERN: Creating a new Redis connection per request. FIX: Use a connection pool. Opening TCP connections is expensive.
# CORRECT — connection pool (Python aioredis/redis-py)
redis = aioredis.from_url(
    "redis://:password@localhost:6381",
    max_connections=20,
    decode_responses=True,
)

// CORRECT — connection pool (Node.js ioredis)
const redis = new Redis({
  host: 'localhost',
  port: 6381,
  password: process.env.REDIS_PASSWORD,
  maxRetriesPerRequest: 3,
  retryStrategy: (times) => Math.min(times * 200, 2000),
  lazyConnect: true,
});
ANTI_PATTERN: Sharing a single connection across multiple async tasks without pipelining.
FIX: Use a connection pool with at least max(concurrent_tasks, 10) connections. A single connection serializes all commands.
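A sketch of both points, assuming redis-py's async pipeline API; `pool_size` is an illustrative helper combining this rule with the per-service 20-connection cap stated in the next subsection:

```python
def pool_size(concurrent_tasks: int, floor: int = 10, ceiling: int = 20) -> int:
    """GE rule of thumb: at least max(tasks, 10), capped at 20 per service."""
    return min(max(concurrent_tasks, floor), ceiling)

async def warm_counters(redis, keys: list[str]) -> list[int]:
    # Pipelining batches many commands into one round trip over a single
    # connection, instead of serializing one round trip per command.
    pipe = redis.pipeline(transaction=False)
    for key in keys:
        pipe.incr(key)
    return await pipe.execute()
```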
Connection Exhaustion¶
GE Redis allows 100 max connections. Each service should use a pool of 10-20 connections, not more.
CHECK: Agent is configuring a Redis connection pool. IF: Pool size exceeds 20 per service. THEN: Reduce. With 5+ services, large pools exhaust the 100-connection limit.
Persistence (RDB vs AOF)¶
GE Configuration¶
GE uses AOF (appendonly yes) for persistence.
Stream data survives Redis restarts. RDB snapshots are also enabled as a backup.
| Setting | Value | Purpose |
|---|---|---|
| `appendonly` | `yes` | Persist every write to AOF |
| `appendfsync` | `everysec` | Balance durability and performance |
| `save 900 1` | RDB backup | Snapshot if 1+ key changed in 15 min |
| `save 300 10` | RDB backup | Snapshot if 10+ keys changed in 5 min |
AOF Pitfalls¶
ANTI_PATTERN: Setting appendfsync always — fsync on every write.
FIX: Use appendfsync everysec. "Always" reduces throughput by 10-100x and is unnecessary when PostgreSQL is SSOT.
ANTI_PATTERN: Never running BGREWRITEAOF.
FIX: Redis auto-rewrites AOF when it doubles in size, but monitor disk usage. A runaway AOF can fill the disk.
Replication Lag¶
GE currently runs a single Redis instance (no replication). If replication is added in the future:
- Reads from replicas may return stale data.
- Writes MUST go to the primary.
- Monitor `repl_backlog_size`, and `master_repl_offset` vs `slave_repl_offset`.
Pub/Sub Pitfalls¶
ANTI_PATTERN: Using Pub/Sub for work that must be processed exactly once. FIX: Use Streams with consumer groups. Pub/Sub messages are lost if no subscriber is listening at the moment of publish.
ANTI_PATTERN: Subscriber that processes slowly, causing the output buffer to grow.
FIX: Set client-output-buffer-limit pubsub to prevent slow subscribers from exhausting memory. GE default: 32mb 8mb 60.
GE-Specific Pitfalls (Historical)¶
Token Burn — Double Delivery¶
ANTI_PATTERN: XADDing to BOTH triggers.{agent} AND ge:work:incoming for the same task.
FIX: Publish to ONE stream only. The orchestrator routes from incoming to per-agent triggers. Double-publishing caused 2x execution cost.
ADDED_FROM: ge-token-burn-2026-02-13, task-service.ts double XADD
Hook Cross-Trigger Loop¶
ANTI_PATTERN: Post-completion hook with condition "always" at no_block tier. FIX: Never use "always" condition on no_block hooks. This caused an infinite Annegreet-Eltjo loop that burned tokens until the rate limiter kicked in. ADDED_FROM: ge-token-burn-2026-02-13, annegreet-eltjo loop
Port Confusion¶
ANTI_PATTERN: Connecting to Redis on port 6379 (the default).
FIX: GE Redis is on port 6381. Read config/ports.yaml. This has bitten every new agent at least once.
ADDED_FROM: ge-infrastructure-2026-01, multiple incidents
Redis Auth in k8s Probes¶
ANTI_PATTERN: k8s liveness/readiness probes that connect to Redis without authentication.
FIX: Probes must include the password from ge-secrets. Unauthenticated probes get rejected, pod enters CrashLoopBackOff.
ADDED_FROM: ge-orchestrator-deploy-2026-03-19, probe failure
Cross-References¶
READ_ALSO: wiki/docs/stack/redis/index.md
READ_ALSO: wiki/docs/stack/redis/streams.md
READ_ALSO: wiki/docs/stack/redis/caching.md
READ_ALSO: wiki/docs/stack/redis/checklist.md