
INC-20260403: DNS Resolution Infrastructure Failure

Classification: Incident (recurring, systemic)
Severity: HIGH (blocked CI/CD, DAST scanning, agent execution, and host-to-cluster communication across 10+ sessions)
Duration: ~5 weeks (Feb 2026 – Apr 3 2026)
Resolved: 2026-04-03

Impact

  • 150+ pipeline runs affected by DNS-related failures
  • DAST scanning (ZAP, Nuclei) stayed broken until a workaround was applied
  • Agent executor pods intermittently failed DNS resolution
  • Vault connectivity required repeated debugging (HTTP vs HTTPS)
  • ArgoCD, Kaniko, and the GitLab runner each hit DNS issues at different points
  • Significant debugging time across multiple sessions (estimated 8+ hours cumulative)

Root Cause Analysis

Primary: CoreDNS NodeHosts was incomplete AND silently overwritten by k3s

k3s manages the coredns ConfigMap via its Addon controller (objectset.rio.cattle.io/owner-name: coredns). Manual edits to this ConfigMap to add hostnames like admin-ui.ge.internal were silently reverted on k3s restart or addon reconciliation.
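A quick way to tell whether a ConfigMap is at risk of being reverted is to look for that owner marker before editing it. The sketch below assumes the object has been exported with kubectl get cm coredns -n kube-system -o json and parsed into a dict. The helper name is_addon_managed is hypothetical, and it checks both annotations and labels because this report does not say which of the two carries the key.

```python
import json

OWNER_KEY = "objectset.rio.cattle.io/owner-name"  # key quoted in the analysis above


def is_addon_managed(configmap: dict) -> bool:
    """Return True if the object carries the k3s addon owner marker."""
    metadata = configmap.get("metadata", {})
    for field in ("annotations", "labels"):
        # Either field may be absent or null in the exported JSON.
        if OWNER_KEY in (metadata.get(field) or {}):
            return True
    return False


# Hypothetical, trimmed-down exports for illustration only.
managed = json.loads(
    '{"metadata": {"name": "coredns", '
    '"annotations": {"objectset.rio.cattle.io/owner-name": "coredns"}}}'
)
unmanaged = json.loads('{"metadata": {"name": "coredns-custom"}}')

print(is_addon_managed(managed), is_addon_managed(unmanaged))
```

If the check returns True, manual edits to that object should be treated as temporary.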

NodeHosts had only 3 of 13 required Ingress hostnames. The missing 10 hostnames could not be resolved from inside pods, including CI runner pods.

Secondary: Host had no DNS path to CoreDNS

systemd-resolved on the host forwarded all queries to 192.168.1.1 (router). There was no configuration to route *.ge.internal or *.svc.cluster.local queries to CoreDNS (10.43.0.10). Host processes could only resolve entries manually added to /etc/hosts.

Contributing: /etc/hosts was inconsistent with CoreDNS

  • /etc/hosts had gitlab.ge.internal → 192.168.1.85 (host IP)
  • CoreDNS NodeHosts had gitlab.ge.internal → 10.43.71.244 (Traefik ClusterIP)
  • A hardcoded ClusterIP (10.43.37.214) for gitlab-minio would break on service recreation
  • Only 3 of 13 hostnames were listed
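Drift of this kind can be caught mechanically by diffing the two hosts tables. The following is a minimal sketch, assuming both sources are available as hosts-format text (for example the contents of /etc/hosts and the NodeHosts key of the coredns ConfigMap). The function names and the sample required list are hypothetical; only the gitlab.ge.internal addresses come from this incident.

```python
def parse_hosts(text: str) -> dict[str, str]:
    """Parse hosts-format text into {hostname: ip}, keeping the first IP per name."""
    table: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            table.setdefault(name.lower(), ip)
    return table


def compare_hosts(etc_hosts: str, node_hosts: str, required: list[str]) -> dict:
    """Report names that disagree between the two sources or are missing from NodeHosts."""
    host_view = parse_hosts(etc_hosts)
    pod_view = parse_hosts(node_hosts)
    conflicts = {
        name: (host_view[name], pod_view[name])
        for name in sorted(host_view.keys() & pod_view.keys())
        if host_view[name] != pod_view[name]
    }
    missing = sorted(n for n in required if n.lower() not in pod_view)
    return {"conflicts": conflicts, "missing": missing}


# The gitlab entries come from the incident; the rest of the sample is illustrative.
ETC_HOSTS = "192.168.1.85 gitlab.ge.internal  # host IP\n"
NODE_HOSTS = "10.43.71.244 gitlab.ge.internal\n"
REQUIRED = ["gitlab.ge.internal", "admin-ui.ge.internal"]

report = compare_hosts(ETC_HOSTS, NODE_HOSTS, REQUIRED)
print(report)
```

Keeping only the first address per name is a simplification; it is enough to surface the host-IP versus ClusterIP mismatch described above.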

Contributing: Vault TLS confusion

Vault runs HTTPS with self-signed certs (VAULT_ADDR=https://127.0.0.1:8200, VAULT_SKIP_VERIFY=1). Code repeatedly tried http:// which returned "Client sent an HTTP request to an HTTPS server."
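One way to stop the http:// mistake from recurring is to normalise the address in a single helper that every caller goes through. This is a sketch using only the Python standard library, not the code that was actually deployed; force_https, insecure_tls_context, and vault_health are hypothetical names, and skipping certificate verification is only reasonable here because the certificates are self-signed.

```python
import json
import ssl
import urllib.request
from urllib.parse import urlsplit


def force_https(addr: str) -> str:
    """Rewrite an http:// Vault address to https://; reject other schemes."""
    parts = urlsplit(addr)
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"unsupported Vault address: {addr!r}")
    return parts._replace(scheme="https").geturl()


def insecure_tls_context() -> ssl.SSLContext:
    """TLS context that skips verification, like VAULT_SKIP_VERIFY=1 or curl -k."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False  # must be disabled before verify_mode is relaxed
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


def vault_health(addr: str, timeout: float = 5.0) -> dict:
    """GET /v1/sys/health over HTTPS and return the decoded JSON body."""
    url = force_https(addr).rstrip("/") + "/v1/sys/health"
    with urllib.request.urlopen(
        url, context=insecure_tls_context(), timeout=timeout
    ) as resp:
        return json.load(resp)
```

A call such as vault_health("http://127.0.0.1:8200") would then go out over HTTPS. Note that urlopen raises HTTPError for non-2xx responses, so callers should expect an exception rather than a JSON body in that case.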

Resolution

R1: CoreDNS Custom Server Block (the permanent fix)

Created coredns-custom ConfigMap in kube-system namespace. This is the k3s-supported extension point — imported via import /etc/coredns/custom/*.server in the Corefile. k3s's Addon controller does NOT manage this ConfigMap, so it survives restarts.

Most Ingress hostnames point to Traefik ClusterIP 10.43.71.244. Exception: registry.ge.internal points to its own ClusterIP 10.43.148.219 because Kaniko pushes directly on port 5000 (not through Traefik).

apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-custom
  namespace: kube-system
data:
  ge-internal.server: |
    ge.internal:53 { hosts { 10.43.71.244 ... } }
    internal.ge-ops.local:53 { hosts { 10.43.71.244 ... } }
    internal.growing-europe.com:53 { hosts { 10.43.71.244 ... } }
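The server blocks above are abbreviated onto one line each and their host lists are elided. To illustrate the multi-line layout such a block normally takes, here is a small sketch that renders one from a hostname-to-IP mapping. The helper render_server_block is hypothetical, only two hostnames named elsewhere in this report are used, and any further directives the real ConfigMap may contain are not reproduced.

```python
def render_server_block(zone: str, entries: dict[str, str], port: int = 53) -> str:
    """Render one CoreDNS server block with an inline hosts table."""
    lines = [f"{zone}:{port} {{", "    hosts {"]
    for hostname, ip in sorted(entries.items()):
        lines.append(f"        {ip} {hostname}")
    lines += ["    }", "}"]
    return "\n".join(lines)


# IPs are the ones quoted above; the full hostname list is elided in the source,
# so only two names that the report mentions are shown here.
entries = {
    "admin-ui.ge.internal": "10.43.71.244",   # via Traefik
    "registry.ge.internal": "10.43.148.219",  # direct, Kaniko pushes on port 5000
}
print(render_server_block("ge.internal", entries))
```

Generating the block from one mapping keeps the Traefik default and the registry exception in a single reviewable place.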

R2: systemd-resolved Split DNS

Added drop-in config at /etc/systemd/resolved.conf.d/ge-internal.conf:

[Resolve]
DNS=10.43.0.10
Domains=~ge.internal ~internal.ge-ops.local ~internal.growing-europe.com ~svc.cluster.local

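The ~ prefix marks each entry as a routing-only domain: only queries for names under those domains go to CoreDNS, and everything else goes to the default upstream DNS. The drop-in is read when systemd-resolved starts, so it takes effect after the service is restarted.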
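The routing decision can be pictured as a label-aligned suffix match. The sketch below is a simplified model of that behaviour, not systemd-resolved's actual algorithm (which also weighs per-link settings and the best-matching domain). The function names are hypothetical; the domains and server addresses are the ones from this report.

```python
def matches_domain(qname: str, domain: str) -> bool:
    """True if qname equals domain or is a subdomain of it (label-aligned)."""
    q = qname.rstrip(".").lower()
    d = domain.rstrip(".").lower()
    return q == d or q.endswith("." + d)


def pick_dns_server(qname, routing_domains, routed_server, default_server):
    """Send names under a routing domain to routed_server, everything else upstream."""
    if any(matches_domain(qname, d.lstrip("~")) for d in routing_domains):
        return routed_server
    return default_server


DOMAINS = ["~ge.internal", "~internal.ge-ops.local",
           "~internal.growing-europe.com", "~svc.cluster.local"]

for name in ("admin-ui.ge.internal", "vault.ge-system.svc.cluster.local", "example.com"):
    print(name, "->", pick_dns_server(name, DOMAINS, "10.43.0.10", "192.168.1.1"))
```

The match is on whole labels, so a name such as forge.internal is not captured by ~ge.internal and still goes upstream.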

R3: /etc/hosts cleanup

Removed all *.ge.internal and *.svc.cluster.local entries from /etc/hosts, since DNS now handles them, and added a comment explaining why.

Lessons Learned

  1. k3s owns the main coredns ConfigMap. Never edit it directly. Use coredns-custom ConfigMap instead — it's the supported extension point.

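  2. Internal hostnames should resolve to Traefik's ClusterIP rather than to individual service ClusterIPs, except where a client bypasses Traefik (registry.ge.internal, which Kaniko reaches directly on port 5000). Traefik handles Host-header routing, so almost every hostname shares one address instead of thirteen separate ones.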

  3. systemd-resolved split DNS is a well-established way to give a single-node k3s host access to cluster DNS. In this setup it removed the need for internal entries in /etc/hosts.

  4. /etc/hosts is a liability, not a solution. It overrides DNS, creates inconsistencies, and requires manual maintenance. Remove entries once proper DNS is in place.

  5. Vault is HTTPS with self-signed certs. Always use https://, with curl -k or VAULT_SKIP_VERIFY=1 to skip certificate verification. The http:// error message ("Client sent an HTTP request to an HTTPS server") is the giveaway.

Verification

# From a pod:
nslookup admin-ui.ge.internal  # → 10.43.71.244

# From the host:
nslookup admin-ui.ge.internal  # → 10.43.71.244
nslookup vault.ge-system.svc.cluster.local  # → 10.43.192.40

# Vault:
curl -sk https://10.43.192.40:8200/v1/sys/health  # → {"initialized":true, ...}
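The same checks can be turned into a repeatable script. This is a minimal sketch, assuming it runs somewhere the split DNS configuration is active. The expected addresses are the ones listed above, check_resolution and fake_resolver are hypothetical names, and the stand-in resolver exists only so the example runs without network access.

```python
import socket

# Names and addresses taken from the checks above.
EXPECTED = {
    "admin-ui.ge.internal": "10.43.71.244",
    "vault.ge-system.svc.cluster.local": "10.43.192.40",
}


def check_resolution(expected, resolve=socket.gethostbyname):
    """Return {name: (expected_ip, actual)} for every name that does not match."""
    failures = {}
    for name, want in expected.items():
        try:
            got = resolve(name)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            got = f"error: {exc}"
        if got != want:
            failures[name] = (want, got)
    return failures


def fake_resolver(name):
    """Offline stand-in that knows only one name; for demonstration."""
    table = {"admin-ui.ge.internal": "10.43.71.244"}
    if name not in table:
        raise socket.gaierror("name not found")
    return table[name]


# Drop the resolve argument to query real DNS instead of the stand-in.
print(check_resolution(EXPECTED, resolve=fake_resolver))
```

An empty result means every name resolved to the expected address; anything else lists the expected value next to what came back.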

Future Maintenance

When adding a new Ingress hostname:

  1. Add it to the coredns-custom ConfigMap: kubectl edit cm coredns-custom -n kube-system

  2. Restart CoreDNS: kubectl rollout restart deployment coredns -n kube-system

  3. No /etc/hosts changes needed. No systemd-resolved changes needed.