
Redis — Pitfalls

OWNER: gerco (infrastructure)
ALSO_USED_BY: urszula, maxim (application usage)
LAST_VERIFIED: 2026-03-26
GE_STACK_VERSION: Redis 7.4


Overview

Known failure modes in Redis usage within GE. Agents MUST check this page before writing code that interacts with Redis. New pitfalls are added here when discovered — items are NEVER removed.

Memory Exhaustion

Unbounded Streams

ANTI_PATTERN: XADD without MAXLEN — streams grow without limit. FIX: Every XADD MUST include MAXLEN ~ {limit}. Per-agent: 100. System: 1000.

This is GE's most expensive Redis mistake to date. An unbounded triggers.* stream once consumed 180MB before detection, pushing Redis past its 256MB limit and triggering allkeys-lru eviction of active cache keys. ADDED_FROM: ge-infrastructure-2026-02, unbounded stream incident
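The fix above can be sketched as a small wrapper, assuming the redis-py asyncio client; the function name and constants are illustrative, but the limits come from the rule above:

```python
PER_AGENT_MAXLEN = 100   # limit for per-agent triggers.{agent} streams
SYSTEM_MAXLEN = 1000     # limit for system-wide streams

async def xadd_trimmed(r, stream, fields, maxlen=PER_AGENT_MAXLEN):
    """XADD with MAXLEN ~ trimming; r is a redis.asyncio.Redis client.

    approximate=True emits MAXLEN ~ <limit>, letting Redis trim at
    macro-node boundaries instead of exactly; far cheaper per call.
    """
    return await r.xadd(stream, fields, maxlen=maxlen, approximate=True)
```

Routing every XADD through a wrapper like this makes the MAXLEN rule hard to forget, rather than relying on each call site to remember it.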

Keys Without TTL

ANTI_PATTERN: Setting cache keys without EXPIRE/EX. FIX: Every cache key MUST have a TTL. Even write-through keys get a safety TTL.

# Audit: find keys without TTL
# Run from redis-cli (not in application code)
# WARNING: SCAN only — never KEYS
# NOTE: GE Redis requires auth; export REDIS_PASSWORD from ge-secrets first
redis-cli -p 6381 -a "$REDIS_PASSWORD" --scan --pattern "cache:*" | while read -r key; do
    ttl=$(redis-cli -p 6381 -a "$REDIS_PASSWORD" TTL "$key")
    if [ "$ttl" = "-1" ]; then
        echo "NO TTL: $key"
    fi
done
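On the application side, the rule can be enforced with a thin wrapper that refuses to set a key without an expiry. A minimal sketch, assuming the redis-py asyncio client; the function name and the default safety TTL value are illustrative:

```python
DEFAULT_SAFETY_TTL = 3600  # illustrative safety TTL for write-through keys

async def cache_set(r, key, value, ttl=DEFAULT_SAFETY_TTL):
    """SET with a mandatory expiry; r is a redis.asyncio.Redis client."""
    if ttl is None or ttl <= 0:
        raise ValueError(f"refusing to set {key!r} without a TTL")
    # ex=ttl maps to the EX option of SET (expiry in seconds)
    return await r.set(key, value, ex=ttl)
```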

Big Keys

ANTI_PATTERN: Storing large values (>1MB) in a single key. FIX: Break large data into smaller keys or store in PostgreSQL with a Redis pointer.

Big keys cause:

  • Slow operations (serialization/deserialization on every read/write).
  • Memory fragmentation.
  • Blocking during deletion (DEL on a 10MB key blocks the event loop).

CHECK: Agent is caching a large payload. IF: Payload size exceeds 100KB. THEN: Reconsider. Cache only the fields needed, or store in PostgreSQL with a lightweight Redis pointer.
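The pointer pattern from the FIX above can be sketched as follows, assuming a redis.asyncio client and a hypothetical PostgreSQL helper (`pg.store_blob`, the `pg:` pointer prefix, and the threshold constant are illustrative; the 100KB cutoff is the CHECK above):

```python
POINTER_THRESHOLD = 100_000  # bytes; the 100KB "reconsider" threshold

async def cache_payload(r, pg, key, payload: bytes, ttl=300):
    """Cache small payloads inline; store big ones in PostgreSQL
    and keep only a lightweight pointer in Redis."""
    if len(payload) <= POINTER_THRESHOLD:
        await r.set(key, payload, ex=ttl)
        return "inline"
    row_id = await pg.store_blob(payload)     # hypothetical PG helper
    await r.set(key, f"pg:{row_id}", ex=ttl)  # pointer, not the blob
    return "pointer"
```

Reads then dereference the pointer on a cache hit; the big value never touches Redis memory, and deleting the key is always cheap.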

No maxmemory Configuration

GE sets maxmemory 256mb with maxmemory-policy allkeys-lru. If maxmemory is not set, Redis grows until the OS kills it (OOM).

CHECK: Agent is deploying Redis configuration. IF: maxmemory is not set. THEN: ALWAYS set maxmemory. GE standard: 256MB.

The KEYS Command

ANTI_PATTERN: Using KEYS * or KEYS pattern:* in application code. FIX: Use SCAN with a cursor. KEYS performs an O(N) scan of the entire keyspace, blocking the single-threaded event loop for the duration. On a large keyspace, this freezes ALL Redis operations for seconds.

# WRONG — blocks Redis
keys = await redis.keys("agent:*")

# CORRECT — iterates without blocking
keys = []
async for key in redis.scan_iter(match="agent:*", count=100):
    keys.append(key)

CHECK: Agent is searching for keys matching a pattern. IF: Code uses the KEYS command. THEN: Replace with SCAN. No exceptions, not even in admin tools or one-off scripts that run against production Redis.

ADDED_FROM: ge-infrastructure-2026-01, KEYS caused 3-second freeze during health check

Slow Commands

Commands to Avoid Under Load

| Command | Problem | Alternative |
| --- | --- | --- |
| KEYS * | O(N) full keyspace scan, blocks event loop | SCAN |
| SMEMBERS on large sets | Returns all members at once | SSCAN |
| HGETALL on large hashes | Returns all fields at once | HSCAN or HMGET specific fields |
| LRANGE 0 -1 on large lists | Returns entire list | Paginate with LRANGE 0 99 |
| ZRANGEBYSCORE without LIMIT | Returns unbounded results | Add LIMIT offset count |
| FLUSHALL / FLUSHDB | Blocks until complete | FLUSHALL ASYNC (Redis 4.0+) |
| DEL on large keys | Blocks during free | UNLINK (async delete) |

CHECK: Agent is reading from a Redis collection (set, hash, list, sorted set). IF: Collection may contain more than 100 elements. THEN: Use the *SCAN variant or paginate. Never fetch all at once.
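The SSCAN variant looks like this in application code, assuming the redis-py asyncio client (the function name is illustrative; `sscan_iter` is the client's cursor-based iterator, and HSCAN/ZSCAN have analogous `hscan_iter`/`zscan_iter` forms):

```python
async def set_members(r, key, count=100):
    """Stream members of a large set with SSCAN instead of SMEMBERS.

    Each round trip fetches roughly `count` elements, so Redis is
    never asked to serialize the whole collection in one reply.
    """
    members = []
    async for m in r.sscan_iter(key, count=count):
        members.append(m)
    return members
```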

Monitoring Slow Commands

# Enable slow log (commands taking >10ms)
CONFIG SET slowlog-log-slower-than 10000
CONFIG SET slowlog-max-len 128

# View slow log
SLOWLOG GET 10

Connection Pooling

ANTI_PATTERN: Creating a new Redis connection per request. FIX: Use a connection pool. Opening TCP connections is expensive.

# CORRECT — connection pool (redis-py asyncio; aioredis merged into redis-py 4.2+)
import redis.asyncio as aioredis

redis = aioredis.from_url(
    "redis://:password@localhost:6381",
    max_connections=20,
    decode_responses=True
)

// CORRECT — connection pool (Node.js ioredis)
const redis = new Redis({
    host: 'localhost',
    port: 6381,
    password: process.env.REDIS_PASSWORD,
    maxRetriesPerRequest: 3,
    retryStrategy: (times) => Math.min(times * 200, 2000),
    lazyConnect: true,
});

ANTI_PATTERN: Sharing a single connection across multiple async tasks without pipelining. FIX: Use a connection pool with at least max(concurrent_tasks, 10) connections. A single connection serializes all commands.
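When many commands do have to travel over one connection, pipelining batches them into a single round trip instead of serializing N awaits. A minimal sketch, assuming the redis-py asyncio client (the function name is illustrative):

```python
async def warm_counters(r, keys):
    """Batch N INCRs into one round trip via a pipeline.

    pipe.incr() only queues the command; nothing is sent until
    execute(), which returns one result per queued command.
    """
    pipe = r.pipeline()
    for k in keys:
        pipe.incr(k)
    return await pipe.execute()
```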

Connection Exhaustion

GE Redis allows 100 max connections. Each service should use a pool of 10-20 connections, not more.

CHECK: Agent is configuring a Redis connection pool. IF: Pool size exceeds 20 per service. THEN: Reduce. With 5+ services, large pools exhaust the 100-connection limit.

Persistence (RDB vs AOF)

GE Configuration

GE uses AOF (appendonly yes) for persistence. Stream data survives Redis restarts. RDB snapshots are also enabled as a backup.

| Setting | Value | Purpose |
| --- | --- | --- |
| appendonly | yes | Persist every write to AOF |
| appendfsync | everysec | Balance durability and performance |
| save | 900 1 | RDB backup: snapshot if 1+ key changed in 15 min |
| save | 300 10 | RDB backup: snapshot if 10+ keys changed in 5 min |

AOF Pitfalls

ANTI_PATTERN: Setting appendfsync always — fsync on every write. FIX: Use appendfsync everysec. "Always" reduces throughput by 10-100x and is unnecessary when PostgreSQL is SSOT.

ANTI_PATTERN: Never running BGREWRITEAOF. FIX: Redis auto-rewrites AOF when it doubles in size, but monitor disk usage. A runaway AOF can fill the disk.
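AOF growth can be watched from application-side monitoring via INFO persistence, which (with AOF enabled) reports the current file size and the post-rewrite baseline. A sketch, assuming the redis-py asyncio client; the function name and the ratio interpretation are illustrative:

```python
async def aof_growth(r):
    """Report AOF size vs its post-rewrite baseline from INFO persistence."""
    info = await r.info("persistence")
    current = info["aof_current_size"]
    base = info["aof_base_size"]
    # a ratio near 2.0 means Redis is approaching its auto-rewrite trigger
    return current, base, current / base if base else float("inf")
```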

Replication Lag

GE currently runs a single Redis instance (no replication). If replication is added in the future:

  • Reads from replicas may return stale data.
  • Writes MUST go to the primary.
  • Monitor repl_backlog_size and master_repl_offset vs slave_repl_offset.

Pub/Sub Pitfalls

ANTI_PATTERN: Using Pub/Sub for work that must be processed exactly once. FIX: Use Streams with consumer groups. Pub/Sub messages are lost if no subscriber is listening at the moment of publish.
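The Streams replacement looks like this, assuming the redis-py asyncio client and an already-created consumer group (the function name, group, and consumer names are illustrative; the `">"` ID requests messages never delivered to this group):

```python
async def consume_once(r, stream, group, consumer, count=10):
    """Read work via a consumer group and ACK after processing.

    Unlike Pub/Sub, entries persist until acknowledged, so a consumer
    that was down at publish time still receives them, and each entry
    goes to exactly one consumer in the group.
    """
    entries = await r.xreadgroup(group, consumer, {stream: ">"}, count=count)
    processed = []
    for _stream, messages in entries:
        for msg_id, fields in messages:
            processed.append(fields)          # stand-in for real work
            await r.xack(stream, group, msg_id)  # ACK only after success
    return processed
```

ACKing after processing (not before) means a crash mid-task leaves the entry pending for redelivery via XAUTOCLAIM.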

ANTI_PATTERN: Subscriber that processes slowly, causing the output buffer to grow. FIX: Set client-output-buffer-limit pubsub to prevent slow subscribers from exhausting memory. GE default: 32mb 8mb 60.

GE-Specific Pitfalls (Historical)

Token Burn — Double Delivery

ANTI_PATTERN: XADDing to BOTH triggers.{agent} AND ge:work:incoming for the same task. FIX: Publish to ONE stream only. The orchestrator routes from incoming to per-agent triggers. Double-publishing caused 2x execution cost. ADDED_FROM: ge-token-burn-2026-02-13, task-service.ts double XADD

Hook Cross-Trigger Loop

ANTI_PATTERN: Post-completion hook with condition "always" at no_block tier. FIX: Never use "always" condition on no_block hooks. This caused an infinite Annegreet-Eltjo loop that burned tokens until the rate limiter kicked in. ADDED_FROM: ge-token-burn-2026-02-13, annegreet-eltjo loop

Port Confusion

ANTI_PATTERN: Connecting to Redis on port 6379 (the default). FIX: GE Redis is on port 6381. Read config/ports.yaml. This has bitten every new agent at least once. ADDED_FROM: ge-infrastructure-2026-01, multiple incidents

Redis Auth in k8s Probes

ANTI_PATTERN: k8s liveness/readiness probes that connect to Redis without authentication. FIX: Probes must include the password from ge-secrets. Unauthenticated probes get rejected, pod enters CrashLoopBackOff. ADDED_FROM: ge-orchestrator-deploy-2026-03-19, probe failure

Cross-References

READ_ALSO: wiki/docs/stack/redis/index.md
READ_ALSO: wiki/docs/stack/redis/streams.md
READ_ALSO: wiki/docs/stack/redis/caching.md
READ_ALSO: wiki/docs/stack/redis/checklist.md