Redis — Pitfalls¶
OWNER: gerco (infrastructure) ALSO_USED_BY: urszula, maxim (application usage) LAST_VERIFIED: 2026-03-26 GE_STACK_VERSION: Redis 7.4
Overview¶
Known failure modes in Redis usage within GE. Agents MUST check this page before writing code that interacts with Redis. New pitfalls are added here when discovered — items are NEVER removed.
Memory Exhaustion¶
Unbounded Streams¶
ANTI_PATTERN: XADD without MAXLEN — streams grow without limit.
FIX: Every XADD MUST include MAXLEN ~ {limit}. Per-agent: 100. System: 1000.
This is GE's most expensive Redis mistake to date. An unbounded `triggers.*` stream grew to 180MB before it was detected, pushing Redis past its 256MB `maxmemory` limit and triggering `allkeys-lru` eviction of active cache keys.
ADDED_FROM: ge-infrastructure-2026-02, unbounded stream incident
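A minimal sketch of the FIX, assuming the async redis-py/aioredis client used elsewhere on this page. The prefix-based limit selection is illustrative, not a documented GE naming rule:

```python
# GE limits from this page: per-agent streams 100, system streams 1000.
PER_AGENT_MAXLEN = 100
SYSTEM_MAXLEN = 1000

def stream_maxlen(stream: str) -> int:
    """Pick the GE MAXLEN cap for a stream (prefix check is illustrative)."""
    return SYSTEM_MAXLEN if stream.startswith("ge:") else PER_AGENT_MAXLEN

async def publish(redis, stream: str, fields: dict) -> str:
    # approximate=True emits MAXLEN ~ {limit}: trims lazily at node
    # boundaries, which is far cheaper than an exact trim.
    return await redis.xadd(
        stream, fields,
        maxlen=stream_maxlen(stream),
        approximate=True,
    )
```

Routing every XADD through one helper like this makes "forgot MAXLEN" structurally impossible rather than a code-review item.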
Keys Without TTL¶
ANTI_PATTERN: Setting cache keys without EXPIRE/EX. FIX: Every cache key MUST have a TTL. Even write-through keys get a safety TTL.
# Audit: find keys without TTL
# Run from redis-cli (not in application code)
# WARNING: SCAN only — never KEYS
redis-cli -p 6381 --scan --pattern "cache:*" | while read -r key; do
  ttl=$(redis-cli -p 6381 TTL "$key")
  if [ "$ttl" = "-1" ]; then
    echo "NO TTL: $key"
  fi
done
Big Keys¶
ANTI_PATTERN: Storing large values (>1MB) in a single key. FIX: Break large data into smaller keys or store in PostgreSQL with a Redis pointer.
Big keys cause:
- Slow operations (serialization/deserialization on every read/write).
- Memory fragmentation.
- Blocking during deletion (DEL on a 10MB key blocks the event loop).
CHECK: Agent is caching a large payload. IF: Payload size exceeds 100KB. THEN: Reconsider. Cache only the fields needed, or store in PostgreSQL with a lightweight Redis pointer.
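A sketch of that CHECK as code, assuming the async Redis client used elsewhere on this page; `pg.store_blob` is a hypothetical PostgreSQL helper, not a real API:

```python
INLINE_CACHE_LIMIT = 100 * 1024  # 100KB threshold from the CHECK above

def cache_plan(payload: bytes) -> str:
    """Decide between caching inline and storing a lightweight pointer."""
    return "inline" if len(payload) <= INLINE_CACHE_LIMIT else "pointer"

async def cache_payload(redis, pg, key: str, payload: bytes) -> None:
    if cache_plan(payload) == "inline":
        await redis.set(key, payload, ex=3600)
    else:
        row_id = await pg.store_blob(payload)           # hypothetical PG helper
        await redis.set(key, f"pg:{row_id}", ex=3600)   # pointer only in Redis
```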
No maxmemory Configuration¶
GE sets maxmemory 256mb with maxmemory-policy allkeys-lru.
If maxmemory is not set, Redis grows until the OS kills it (OOM).
CHECK: Agent is deploying Redis configuration.
IF: maxmemory is not set.
THEN: ALWAYS set maxmemory. GE standard: 256MB.
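As a config fragment, the GE standard stated above looks like:

```
# redis.conf — GE standard memory bounds (values from this page)
maxmemory 256mb
maxmemory-policy allkeys-lru
```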
The KEYS Command¶
ANTI_PATTERN: Using KEYS * or KEYS pattern:* in application code.
FIX: Use SCAN with a cursor. KEYS performs an O(N) scan of the entire keyspace, blocking the single-threaded event loop for the duration. On a large keyspace, this freezes ALL Redis operations for seconds.
# WRONG — blocks Redis
keys = await redis.keys("agent:*")
# CORRECT — iterates without blocking
keys = []
async for key in redis.scan_iter(match="agent:*", count=100):
    keys.append(key)
CHECK: Agent is searching for keys matching a pattern. IF: Code uses the KEYS command. THEN: Replace with SCAN. No exceptions, not even in admin tools or one-off scripts that run against production Redis.
ADDED_FROM: ge-infrastructure-2026-01, KEYS caused 3-second freeze during health check
Slow Commands¶
Commands to Avoid Under Load¶
| Command | Problem | Alternative |
|---|---|---|
| `KEYS *` | O(N) full keyspace scan, blocks event loop | `SCAN` |
| `SMEMBERS` on large sets | Returns all members at once | `SSCAN` |
| `HGETALL` on large hashes | Returns all fields at once | `HSCAN` or `HMGET` specific fields |
| `LRANGE 0 -1` on large lists | Returns entire list | Paginate with `LRANGE 0 99` |
| `ZRANGEBYSCORE` without `LIMIT` | Returns unbounded results | Add `LIMIT offset count` |
| `FLUSHALL` / `FLUSHDB` | Blocks until complete | `FLUSHALL ASYNC` (Redis 4.0+) |
| `DEL` on large keys | Blocks during free | `UNLINK` (async delete) |
CHECK: Agent is reading from a Redis collection (set, hash, list, sorted set).
IF: Collection may contain more than 100 elements.
THEN: Use the *SCAN variant or paginate. Never fetch all at once.
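Hedged sketches of the table's alternatives, using the async redis-py API assumed elsewhere on this page (`lrange_pages` is an illustrative helper, not a library call):

```python
PAGE = 100  # fetch at most 100 elements per round trip

async def hash_fields(redis, key: str, fields: list[str]) -> dict:
    """HMGET the named fields instead of HGETALL on a large hash."""
    values = await redis.hmget(key, fields)
    return dict(zip(fields, values))

async def set_members(redis, key: str):
    """SSCAN iterates the set in cursor-sized chunks; never SMEMBERS."""
    async for member in redis.sscan_iter(key, count=PAGE):
        yield member

def lrange_pages(length: int, page: int = PAGE) -> list[tuple[int, int]]:
    """Paginated LRANGE windows instead of LRANGE 0 -1 (pure helper)."""
    return [(i, min(i + page, length) - 1) for i in range(0, length, page)]
```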
Monitoring Slow Commands¶
# Enable slow log (commands taking >10ms)
CONFIG SET slowlog-log-slower-than 10000
CONFIG SET slowlog-max-len 128
# View slow log
SLOWLOG GET 10
Connection Pooling¶
ANTI_PATTERN: Creating a new Redis connection per request. FIX: Use a connection pool. Opening TCP connections is expensive.
# CORRECT — connection pool (Python aioredis/redis-py)
redis = aioredis.from_url(
    "redis://:password@localhost:6381",
    max_connections=20,
    decode_responses=True,
)

// CORRECT — connection pool (Node.js ioredis)
const redis = new Redis({
  host: 'localhost',
  port: 6381,
  password: process.env.REDIS_PASSWORD,
  maxRetriesPerRequest: 3,
  retryStrategy: (times) => Math.min(times * 200, 2000),
  lazyConnect: true,
});
ANTI_PATTERN: Sharing a single connection across multiple async tasks without pipelining.
FIX: Use a connection pool with at least max(concurrent_tasks, 10) connections. A single connection serializes all commands.
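A sketch of both points, assuming redis-py's async pipeline API; `pool_size` is an illustrative helper combining this rule with the per-service 20-connection cap stated in the next subsection:

```python
def pool_size(concurrent_tasks: int, floor: int = 10, ceiling: int = 20) -> int:
    """GE rule of thumb: at least max(tasks, 10), capped at 20 per service."""
    return min(max(concurrent_tasks, floor), ceiling)

async def warm_counters(redis, keys: list[str]) -> list[int]:
    # Pipelining batches many commands into one round trip over a single
    # connection, instead of serializing one round trip per command.
    pipe = redis.pipeline(transaction=False)
    for key in keys:
        pipe.incr(key)
    return await pipe.execute()
```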
Connection Exhaustion¶
GE Redis allows 100 max connections. Each service should use a pool of 10-20 connections, not more.
CHECK: Agent is configuring a Redis connection pool. IF: Pool size exceeds 20 per service. THEN: Reduce. With 5+ services, large pools exhaust the 100-connection limit.
Persistence (RDB vs AOF)¶
GE Configuration¶
GE uses AOF (appendonly yes) for persistence.
Stream data survives Redis restarts. RDB snapshots are also enabled as a backup.
| Setting | Value | Purpose |
|---|---|---|
| `appendonly` | `yes` | Persist every write to AOF |
| `appendfsync` | `everysec` | Balance durability and performance |
| `save 900 1` | RDB backup | Snapshot if 1+ key changed in 15 min |
| `save 300 10` | RDB backup | Snapshot if 10+ keys changed in 5 min |
AOF Pitfalls¶
ANTI_PATTERN: Setting appendfsync always — fsync on every write.
FIX: Use appendfsync everysec. "Always" reduces throughput by 10-100x and is unnecessary when PostgreSQL is SSOT.
ANTI_PATTERN: Never running BGREWRITEAOF.
FIX: Redis auto-rewrites AOF when it doubles in size, but monitor disk usage. A runaway AOF can fill the disk.
Replication Lag¶
GE currently runs a single Redis instance (no replication). If replication is added in the future:
- Reads from replicas may return stale data.
- Writes MUST go to the primary.
- Monitor `repl_backlog_size`, and `master_repl_offset` vs `slave_repl_offset`.
Pub/Sub Pitfalls¶
ANTI_PATTERN: Using Pub/Sub for work that must be processed exactly once. FIX: Use Streams with consumer groups. Pub/Sub messages are lost if no subscriber is listening at the moment of publish.
ANTI_PATTERN: Subscriber that processes slowly, causing the output buffer to grow.
FIX: Set client-output-buffer-limit pubsub to prevent slow subscribers from exhausting memory. GE default: 32mb 8mb 60.
GE-Specific Pitfalls (Historical)¶
Token Burn — Double Delivery¶
ANTI_PATTERN: XADDing to BOTH triggers.{agent} AND ge:work:incoming for the same task.
FIX: Publish to ONE stream only. The orchestrator routes from incoming to per-agent triggers. Double-publishing caused 2x execution cost.
ADDED_FROM: ge-token-burn-2026-02-13, task-service.ts double XADD
Hook Cross-Trigger Loop¶
ANTI_PATTERN: Post-completion hook with condition "always" at no_block tier. FIX: Never use "always" condition on no_block hooks. This caused an infinite Annegreet-Eltjo loop that burned tokens until the rate limiter kicked in. ADDED_FROM: ge-token-burn-2026-02-13, annegreet-eltjo loop
Port Confusion¶
ANTI_PATTERN: Connecting to Redis on port 6379 (the default).
FIX: GE Redis is on port 6381. Read config/ports.yaml. This has bitten every new agent at least once.
ADDED_FROM: ge-infrastructure-2026-01, multiple incidents
Redis Auth in k8s Probes¶
ANTI_PATTERN: k8s liveness/readiness probes that connect to Redis without authentication.
FIX: Probes must include the password from ge-secrets. Unauthenticated probes get rejected, pod enters CrashLoopBackOff.
ADDED_FROM: ge-orchestrator-deploy-2026-03-19, probe failure
Cross-References¶
READ_ALSO: wiki/docs/stack/redis/index.md
READ_ALSO: wiki/docs/stack/redis/streams.md
READ_ALSO: wiki/docs/stack/redis/caching.md
READ_ALSO: wiki/docs/stack/redis/checklist.md