INC-20260403: DNS Resolution Infrastructure Failure

Classification: Incident (recurring, systemic)
Severity: HIGH (blocked CI/CD, DAST scanning, agent execution, and host-to-cluster communication across 10+ sessions)
Duration: ~5 weeks (Feb 2026 – Apr 3 2026)
Resolved: 2026-04-03

Impact

150+ pipeline runs affected by DNS-related failures
DAST scanning (ZAP, Nuclei) stayed broken until a workaround was applied
Agent executor pods intermittently failed DNS resolution
Vault connectivity required repeated debugging (HTTP vs HTTPS)
ArgoCD, Kaniko, and the GitLab runner all hit DNS issues at different points
Significant debugging time across multiple sessions (estimated 8+ hours cumulative)

Root Cause Analysis

Primary: CoreDNS NodeHosts was incomplete AND silently overwritten by k3s

k3s manages the coredns ConfigMap via its Addon controller (objectset.rio.cattle.io/owner-name: coredns). Manual edits to this ConfigMap to add hostnames like admin-ui.ge.internal were silently reverted on k3s restart or addon reconciliation.
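One way to tell whether a ConfigMap is subject to this reconciliation is to look for the ownership marker before editing it. The sketch below assumes the marker appears literally in the manifest YAML; is_addon_managed is a hypothetical helper name, not part of k3s or kubectl.

```shell
#!/bin/sh
# Hypothetical helper: reads a ConfigMap manifest (YAML) on stdin and
# succeeds if it carries the ownership marker named in this report.
is_addon_managed() {
    grep -qF 'objectset.rio.cattle.io/owner-name:'
}

# Intended usage (needs cluster access, so it is left commented out):
#   kubectl get configmap coredns -n kube-system -o yaml | is_addon_managed \
#       && echo "managed by k3s: manual edits will be reverted"
```

Running the same check against coredns-custom should fail to match, which is consistent with that ConfigMap surviving restarts.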

NodeHosts had only 3 of 13 required Ingress hostnames. The missing 10 hostnames could not be resolved from inside pods, including CI runner pods.

Secondary: Host had no DNS path to CoreDNS

systemd-resolved on the host forwarded all queries to 192.168.1.1 (router). There was no configuration to route *.ge.internal or *.svc.cluster.local queries to CoreDNS (10.43.0.10). Host processes could only resolve entries manually added to /etc/hosts.

Contributing: /etc/hosts was inconsistent with CoreDNS

/etc/hosts had gitlab.ge.internal → 192.168.1.85 (host IP)
CoreDNS NodeHosts had gitlab.ge.internal → 10.43.71.244 (Traefik ClusterIP)
A hardcoded ClusterIP (10.43.37.214) for gitlab-minio would break on service recreation
Only 3 of 13 hostnames were listed
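Drift of this kind can be detected mechanically, since both sources use hosts-file format. The sketch below compares two such files and prints every hostname that maps to a different address in each. hosts_mismatches is a hypothetical helper, and the way the NodeHosts data is exported in the usage comment is an assumption about the ConfigMap layout.

```shell
#!/bin/sh
# hosts_mismatches FILE_A FILE_B
# Prints "hostname addr_in_A addr_in_B" for names present in both files
# with different addresses. Both files use hosts-file format.
hosts_mismatches() {
    awk '
        { sub(/#.*/, "") }          # drop comments, full-line or trailing
        NF < 2 { next }             # skip blank or address-only lines
        # First file: remember the address of every hostname on the line.
        NR == FNR { for (i = 2; i <= NF; i++) first[$i] = $1; next }
        # Second file: report hostnames whose address differs.
        {
            for (i = 2; i <= NF; i++)
                if (($i in first) && first[$i] != $1)
                    print $i, first[$i], $1
        }
    ' "$1" "$2"
}

# Possible usage (assumes the NodeHosts key holds the hosts-format data):
#   kubectl get configmap coredns -n kube-system \
#       -o jsonpath='{.data.NodeHosts}' > /tmp/nodehosts
#   hosts_mismatches /etc/hosts /tmp/nodehosts
```

For the gitlab.ge.internal entries listed above, this would print one line containing both 192.168.1.85 and 10.43.71.244.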

Contributing: Vault TLS confusion

Vault runs HTTPS with self-signed certs (VAULT_ADDR=https://127.0.0.1:8200, VAULT_SKIP_VERIFY=1). Code repeatedly tried http:// which returned "Client sent an HTTP request to an HTTPS server."
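A small guard at the top of scripts that talk to Vault can catch the scheme mistake before any request is sent. This is a minimal sketch; check_vault_addr is a hypothetical function name, not something provided by Vault.

```shell
#!/bin/sh
# check_vault_addr ADDR: succeeds only for https:// addresses.
check_vault_addr() {
    case "$1" in
        https://*)
            return 0 ;;
        http://*)
            echo "VAULT_ADDR must use https:// (got: $1)" >&2
            return 1 ;;
        *)
            echo "VAULT_ADDR has no recognised scheme: $1" >&2
            return 1 ;;
    esac
}

# Possible usage at the start of a script:
#   check_vault_addr "$VAULT_ADDR" || exit 1
```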

Resolution

R1: CoreDNS Custom Server Block (the permanent fix)

Created coredns-custom ConfigMap in kube-system namespace. This is the k3s-supported extension point — imported via import /etc/coredns/custom/*.server in the Corefile. k3s's Addon controller does NOT manage this ConfigMap, so it survives restarts.

Most Ingress hostnames point to Traefik ClusterIP 10.43.71.244. Exception: registry.ge.internal points to its own ClusterIP 10.43.148.219 because Kaniko pushes directly on port 5000 (not through Traefik).

apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-custom
  namespace: kube-system
data:
  ge-internal.server: |
    ge.internal:53 { hosts { 10.43.71.244 ... } }
    internal.ge-ops.local:53 { hosts { 10.43.71.244 ... } }
    internal.growing-europe.com:53 { hosts { 10.43.71.244 ... } }
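The ConfigMap above is abbreviated. Corefile blocks are normally written with one directive per line, so a generator helps keep the expanded form consistent. The sketch below renders one server block from "address hostname" pairs on stdin; render_server_block is a hypothetical helper, and the output covers only the hosts entries shown in this report, not any other directives the deployed file may contain.

```shell
#!/bin/sh
# render_server_block ZONE
# Reads "address hostname" pairs on stdin and prints a CoreDNS server
# block for ZONE on port 53 with one hosts entry per pair.
render_server_block() {
    printf '%s:53 {\n    hosts {\n' "$1"
    while read -r addr host || [ -n "$addr" ]; do
        # Ignore blank or incomplete lines.
        if [ -n "$addr" ] && [ -n "$host" ]; then
            printf '        %s %s\n' "$addr" "$host"
        fi
    done
    printf '    }\n}\n'
}

# Possible usage, with the Traefik address and the registry exception:
#   printf '%s\n' \
#       '10.43.71.244 admin-ui.ge.internal' \
#       '10.43.148.219 registry.ge.internal' \
#     | render_server_block ge.internal
```

Taking pairs rather than a single address keeps the registry.ge.internal exception in the same block as the Traefik-backed names.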

R2: systemd-resolved Split DNS

Added drop-in config at /etc/systemd/resolved.conf.d/ge-internal.conf:

[Resolve]
DNS=10.43.0.10
Domains=~ge.internal ~internal.ge-ops.local ~internal.growing-europe.com ~svc.cluster.local

The ~ prefix means "routing domain" — only queries for those domains go to CoreDNS. Everything else goes to the default upstream DNS.
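The routing decision amounts to a suffix match on the query name. The sketch below mimics that rule for the four domains in the drop-in. It is an illustration only, not systemd-resolved's implementation; dns_server_for and the "default-upstream" label are hypothetical.

```shell
#!/bin/sh
# Routing domains from the drop-in, without the ~ prefix.
ROUTING_DOMAINS="ge.internal internal.ge-ops.local internal.growing-europe.com svc.cluster.local"

# dns_server_for NAME: prints the CoreDNS address if NAME equals or is a
# subdomain of a routing domain, otherwise the label "default-upstream".
dns_server_for() {
    name=${1%.}   # drop a trailing dot, if any
    for domain in $ROUTING_DOMAINS; do
        case "$name" in
            "$domain"|*".$domain")
                echo "10.43.0.10"
                return 0 ;;
        esac
    done
    echo "default-upstream"
}

# Possible usage:
#   dns_server_for admin-ui.ge.internal
#   dns_server_for example.com
```

The match is on whole labels, so a name such as notge.internal would not be routed to CoreDNS.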

R3: /etc/hosts cleanup

Removed all *.ge.internal and *.svc.cluster.local entries from /etc/hosts. DNS handles everything now. Added comment explaining this.
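The cleanup can be expressed as a filter, which makes it easy to review the result before replacing the file. This is a minimal sketch under the assumption that a line should be dropped entirely if any hostname on it is an internal one; strip_internal_hosts is a hypothetical helper.

```shell
#!/bin/sh
# strip_internal_hosts: reads a hosts file on stdin and prints it without
# lines that name a *.ge.internal or *.svc.cluster.local host.
strip_internal_hosts() {
    awk '
        # Keep comments and blank lines untouched.
        /^[ \t]*#/ || /^[ \t]*$/ { print; next }
        {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break   # rest of the line is a comment
                if ($i ~ /\.ge\.internal$/ || $i ~ /\.svc\.cluster\.local$/) next
            }
            print
        }
    '
}

# Possible usage (review the diff before copying the result into place):
#   strip_internal_hosts < /etc/hosts > /tmp/hosts.cleaned
#   diff /etc/hosts /tmp/hosts.cleaned
```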

Lessons Learned

k3s owns the main coredns ConfigMap. Never edit it directly. Use coredns-custom ConfigMap instead — it's the supported extension point.
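Ingress hostnames should resolve to Traefik's ClusterIP, not to individual service ClusterIPs. Traefik handles Host-header routing, so there is one address to maintain instead of thirteen. The exception is registry.ge.internal, which keeps its own ClusterIP because Kaniko pushes directly on port 5000.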
systemd-resolved split DNS is a well-established way to give the host a path to cluster DNS on single-node k3s. It removes the need for cluster entries in /etc/hosts.
/etc/hosts is a liability, not a solution. It overrides DNS, creates inconsistencies, and requires manual maintenance. Remove entries once proper DNS is in place.
Vault is HTTPS with self-signed certs. Always use https://, with curl -k or VAULT_SKIP_VERIFY=1 to skip certificate verification. The http:// error message ("Client sent an HTTP request to an HTTPS server") is the giveaway.

Verification

# From a pod:
nslookup admin-ui.ge.internal  # → 10.43.71.244

# From the host:
nslookup admin-ui.ge.internal  # → 10.43.71.244
nslookup vault.ge-system.svc.cluster.local  # → 10.43.192.40

# Vault:
curl -sk https://10.43.192.40:8200/v1/sys/health  # → {"initialized":true, ...}
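The manual lookups above can be wrapped in a repeatable check. The sketch below assumes getent is available on the host (it follows the host's normal resolver path); resolve_name and check_host are hypothetical helper names.

```shell
#!/bin/sh
# resolve_name NAME: prints the first address returned for NAME.
resolve_name() {
    getent hosts "$1" | awk 'NR == 1 { print $1 }'
}

# check_host NAME EXPECTED_IP: succeeds only if NAME resolves to EXPECTED_IP.
check_host() {
    actual=$(resolve_name "$1")
    if [ "$actual" = "$2" ]; then
        echo "OK   $1 -> $actual"
    else
        echo "FAIL $1 -> ${actual:-no answer} (expected $2)"
        return 1
    fi
}

# Possible usage from the host, with the addresses listed above:
#   check_host admin-ui.ge.internal 10.43.71.244
#   check_host vault.ge-system.svc.cluster.local 10.43.192.40
```

Because check_host returns non-zero on a mismatch, it can also gate a CI job or a post-restart smoke test.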

Future Maintenance

When adding a new Ingress hostname:
1. Add it to the coredns-custom ConfigMap: kubectl edit cm coredns-custom -n kube-system
2. Restart CoreDNS: kubectl rollout restart deployment coredns -n kube-system
3. No /etc/hosts changes needed. No systemd-resolved changes needed.