INC-20260403: DNS Resolution Infrastructure Failure
Classification: Incident (recurring, systemic)
Severity: HIGH (blocked CI/CD, DAST scanning, agent execution, and host-to-cluster communication across 10+ sessions)
Duration: ~5 weeks (Feb 2026 – Apr 3 2026)
Resolved: 2026-04-03
Impact
- 150+ pipeline runs affected by DNS-related failures
- DAST scanning (ZAP, Nuclei) failed on every run until a workaround was applied
- Agent executor pods intermittently failed DNS resolution
- Vault connectivity required repeated debugging (HTTP vs HTTPS)
- ArgoCD, Kaniko, GitLab runner all hit DNS issues at different points
- Significant debugging time across multiple sessions (estimated 8+ hours cumulative)
Root Cause Analysis
Primary: CoreDNS NodeHosts was incomplete AND silently overwritten by k3s
k3s manages the coredns ConfigMap via its Addon controller (objectset.rio.cattle.io/owner-name: coredns). Manual edits to this ConfigMap to add hostnames like admin-ui.ge.internal were silently reverted on k3s restart or addon reconciliation.
NodeHosts had only 3 of 13 required Ingress hostnames. The missing 10 hostnames could not be resolved from inside pods, including CI runner pods.
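A gap like this can be surfaced by comparing the list of required Ingress hostnames against what NodeHosts actually contains. The sketch below assumes NodeHosts has already been dumped to a hosts-format file, for example with `kubectl -n kube-system get configmap coredns -o jsonpath='{.data.NodeHosts}'`. The helper name `missing_hosts` and the file path in the usage comment are illustrative, not part of the original tooling.

```shell
# missing_hosts FILE NAME...
# Prints every NAME that has no entry in FILE (hosts format: "IP name [alias...]").
# Comment lines and trailing comments are ignored.
missing_hosts() {
    file=$1
    shift
    for name in "$@"; do
        if ! awk -v want="$name" '
            { sub(/#.*/, "") }                                   # strip comments
            { for (i = 2; i <= NF; i++) if ($i == want) found = 1 }
            END { exit !found }                                  # 0 if seen, 1 if not
        ' "$file"; then
            printf '%s\n' "$name"
        fi
    done
}

# Usage (hypothetical path):
# missing_hosts /tmp/nodehosts admin-ui.ge.internal gitlab.ge.internal registry.ge.internal
```

Any name printed is one that pods would fail to resolve through NodeHosts alone.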
Secondary: Host had no DNS path to CoreDNS
systemd-resolved on the host forwarded all queries to 192.168.1.1 (router). There was no configuration to route *.ge.internal or *.svc.cluster.local queries to CoreDNS (10.43.0.10). Host processes could only resolve entries manually added to /etc/hosts.
Contributing: /etc/hosts was inconsistent with CoreDNS
- /etc/hosts had gitlab.ge.internal → 192.168.1.85 (host IP)
- CoreDNS NodeHosts had gitlab.ge.internal → 10.43.71.244 (Traefik ClusterIP)
- A hardcoded ClusterIP (10.43.37.214) for gitlab-minio would break on service recreation
- Only 3 of 13 hostnames were listed
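Disagreements of this kind can be detected mechanically by comparing two hosts-format sources name by name. This is a minimal sketch that assumes both inputs are hosts-format files, such as /etc/hosts and a dump of NodeHosts. `hosts_conflicts` is a hypothetical helper name.

```shell
# hosts_conflicts FILE_A FILE_B
# For every hostname present in both hosts-format files, prints
# "name ip_in_a ip_in_b" when the two files disagree on its address.
hosts_conflicts() {
    awk '
        { sub(/#.*/, "") }                       # drop comments
        NF < 2 { next }                          # skip blank or malformed lines
        FNR == NR {                              # first file: remember name -> IP
            for (i = 2; i <= NF; i++) a[$i] = $1
            next
        }
        {                                        # second file: compare against the first
            for (i = 2; i <= NF; i++)
                if (($i in a) && a[$i] != $1)
                    print $i, a[$i], $1
        }
    ' "$1" "$2"
}
```

With the two gitlab.ge.internal entries listed above as input, the function would report that name together with 192.168.1.85 and 10.43.71.244.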
Contributing: Vault TLS confusion
Vault runs HTTPS with self-signed certs (VAULT_ADDR=https://127.0.0.1:8200, VAULT_SKIP_VERIFY=1). Code repeatedly tried http:// which returned "Client sent an HTTP request to an HTTPS server."
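One way to stop the http:// mistake from recurring is to check the scheme before any Vault call. The guard below is only a sketch; `require_https` is a hypothetical helper, and the commented usage assumes the `vault` CLI is installed and that VAULT_ADDR is set as quoted above.

```shell
# require_https URL
# Returns 0 if URL starts with https://, otherwise prints a hint to stderr and returns 1.
require_https() {
    case $1 in
        https://*) return 0 ;;
        *)
            printf 'refusing to use %s: Vault here only speaks HTTPS\n' "$1" >&2
            return 1
            ;;
    esac
}

# Usage (not run here): only talk to Vault once the scheme is confirmed.
# require_https "$VAULT_ADDR" && VAULT_SKIP_VERIFY=1 vault status
```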
Resolution
R1: CoreDNS Custom Server Block (the permanent fix)
Created coredns-custom ConfigMap in kube-system namespace. This is the k3s-supported extension point — imported via import /etc/coredns/custom/*.server in the Corefile. k3s's Addon controller does NOT manage this ConfigMap, so it survives restarts.
Most Ingress hostnames point to Traefik ClusterIP 10.43.71.244. Exception: registry.ge.internal points to its own ClusterIP 10.43.148.219 because Kaniko pushes directly on port 5000 (not through Traefik).
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-custom
  namespace: kube-system
data:
  ge-internal.server: |
    ge.internal:53 { hosts { 10.43.71.244 ... } }
    internal.ge-ops.local:53 { hosts { 10.43.71.244 ... } }
    internal.growing-europe.com:53 { hosts { 10.43.71.244 ... } }
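The host entries in the ConfigMap above are elided. To illustrate the general shape of one server block in expanded form, the sketch below renders a CoreDNS `hosts` block from "IP name" pairs. `render_hosts_block` is a hypothetical helper, and the two entries in the usage comment are taken from hostnames and addresses mentioned in this report. It does not reproduce the real entry list.

```shell
# render_hosts_block ZONE "IP NAME"...
# Prints a CoreDNS server block for ZONE whose hosts plugin contains
# one line per "IP NAME" argument.
render_hosts_block() {
    zone=$1
    shift
    printf '%s:53 {\n' "$zone"
    printf '    hosts {\n'
    for entry in "$@"; do
        printf '        %s\n' "$entry"
    done
    printf '    }\n'
    printf '}\n'
}

# Usage: most names map to Traefik, the registry keeps its own ClusterIP.
# render_hosts_block ge.internal \
#     "10.43.71.244 admin-ui.ge.internal" \
#     "10.43.148.219 registry.ge.internal"
```

Passing each entry as a full "IP name" pair keeps the registry.ge.internal exception expressible without special cases.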
R2: systemd-resolved Split DNS
Added drop-in config at /etc/systemd/resolved.conf.d/ge-internal.conf:
[Resolve]
DNS=10.43.0.10
Domains=~ge.internal ~internal.ge-ops.local ~internal.growing-europe.com ~svc.cluster.local
The ~ prefix means "routing domain" — only queries for those domains go to CoreDNS. Everything else goes to the default upstream DNS.
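For repeatability, the drop-in can be written by a small script rather than by hand. This is a sketch under the assumption that the target directory is passed in explicitly, which also lets it be tried against a scratch directory first. `write_resolved_dropin` is a hypothetical name. The follow-up commands in the comments need root on the host.

```shell
# write_resolved_dropin DIR
# Writes ge-internal.conf into DIR (on the host: /etc/systemd/resolved.conf.d).
write_resolved_dropin() {
    dir=$1
    mkdir -p "$dir" || return 1
    cat > "$dir/ge-internal.conf" <<'EOF'
[Resolve]
DNS=10.43.0.10
Domains=~ge.internal ~internal.ge-ops.local ~internal.growing-europe.com ~svc.cluster.local
EOF
}

# On the host (not run here): install, reload, then spot-check one name.
# write_resolved_dropin /etc/systemd/resolved.conf.d
# systemctl restart systemd-resolved
# resolvectl query admin-ui.ge.internal
```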
R3: /etc/hosts cleanup
Removed all *.ge.internal and *.svc.cluster.local entries from /etc/hosts. DNS handles everything now. Added comment explaining this.
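To confirm the cleanup is complete, and to catch entries that creep back in later, a check along these lines could be used. It is a sketch, with `stale_hosts_entries` as a hypothetical helper name. It ignores comments, so the explanatory comment added to /etc/hosts does not trigger it.

```shell
# stale_hosts_entries FILE
# Prints non-comment lines of FILE that still name a *.ge.internal or
# *.svc.cluster.local host. Empty output means the cleanup is complete.
stale_hosts_entries() {
    awk '
        {
            line = $0
            sub(/#.*/, "")                       # ignore comments
            for (i = 2; i <= NF; i++)
                if ($i ~ /\.ge\.internal$/ || $i ~ /\.svc\.cluster\.local$/) {
                    print line
                    next
                }
        }
    ' "$1"
}

# Usage:
# stale_hosts_entries /etc/hosts
```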
Lessons Learned
- k3s owns the main coredns ConfigMap. Never edit it directly. Use the coredns-custom ConfigMap instead; it is the supported extension point.
- All internal hostnames should resolve to Traefik's ClusterIP, not individual service ClusterIPs (registry.ge.internal is the one exception, see R1). Traefik handles Host-header routing. This is one address to maintain, not thirteen.
- systemd-resolved split DNS is the industry standard for single-node k3s. It eliminates the need for /etc/hosts entirely.
- /etc/hosts is a liability, not a solution. It overrides DNS, creates inconsistencies, and requires manual maintenance. Remove entries once proper DNS is in place.
- Vault is HTTPS with self-signed certs. Always use https:// with curl -k or VAULT_SKIP_VERIFY=1. The http:// error message ("Client sent an HTTP request to an HTTPS server") is the giveaway.
Verification
# From a pod:
nslookup admin-ui.ge.internal # → 10.43.71.244
# From the host:
nslookup admin-ui.ge.internal # → 10.43.71.244
nslookup vault.ge-system.svc.cluster.local # → 10.43.192.40
# Vault:
curl -sk https://10.43.192.40:8200/v1/sys/health # → {"initialized":true, ...}
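The host-side checks above can be wrapped in a small pass/fail helper so that several names are verified in one go. This sketch assumes a glibc host where `getent ahostsv4` is available. Both function names are hypothetical, and `resolve_name` can be swapped for another lookup tool where getent is missing (for example inside minimal pod images).

```shell
# resolve_name NAME
# Prints the first IPv4 address the host resolver returns for NAME.
resolve_name() {
    getent ahostsv4 "$1" | awk 'NR == 1 { print $1 }'
}

# check_resolves NAME EXPECTED_IP
# Prints an "ok" or "FAIL" line for NAME and returns non-zero on mismatch.
check_resolves() {
    got=$(resolve_name "$1")
    if [ "$got" = "$2" ]; then
        printf 'ok   %s -> %s\n' "$1" "$got"
    else
        printf 'FAIL %s -> %s (expected %s)\n' "$1" "${got:-no answer}" "$2"
        return 1
    fi
}

# Usage, with the addresses from the checks above:
# check_resolves admin-ui.ge.internal 10.43.71.244
# check_resolves vault.ge-system.svc.cluster.local 10.43.192.40
```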
Future Maintenance
When adding a new Ingress hostname:
1. Add it to coredns-custom ConfigMap: kubectl edit cm coredns-custom -n kube-system
2. Restart CoreDNS: kubectl rollout restart deployment coredns -n kube-system
3. No /etc/hosts changes needed. No systemd-resolved changes needed.