CoreDNS and Kubernetes DNS: What Actually Happens When a Pod Looks Up a Name

Following a single DNS query from a pod's resolv.conf to the answer

Smarc

30-07-2024

At 1am one Tuesday I watched a pod intermittently fail to reach a service it had been talking to happily for months. nslookup returned SERVFAIL maybe one query in five. No deploy, no config change, nothing in the obvious logs. It took me two hours and a fair amount of swearing to discover that one of the two CoreDNS replicas had been quietly OOM-killed and rescheduled onto a node a NetworkPolicy was blocking on UDP 53. Every query that load-balanced onto that pod died. I had been treating cluster DNS as a magic black box for years, and the black box had finally bitten me.

For something so fundamental, Kubernetes DNS is astonishingly easy to take for granted. You write http://my-service in your code, it resolves, traffic flows, everyone goes home. So let’s follow a single DNS lookup end to end. No magic, just a chain of unremarkable Linux mechanics wired together rather cleverly. Understand the chain once and you stop being the person staring blankly at SERVFAIL at 1am — which, in my experience, is the entire return on investment.

A note before we start: the examples below use a cluster whose DNS domain is cluster.example. The well-known Kubernetes default is the cluster.local zone, which you’ll see on almost every real cluster; I’ve used a neutral domain here purely so nothing in this post fingerprints a specific environment. Everything that follows is identical regardless of which suffix your cluster uses; it’s set once at install time via the kubelet’s --cluster-domain flag.

It starts with resolv.conf

When the kubelet starts a pod, it writes an /etc/resolv.conf into it. Exec into almost any pod and you’ll see something like this:

nameserver 10.96.0.10
search default.svc.cluster.example svc.cluster.example cluster.example
options ndots:5

Three lines, and every one of them matters. The nameserver is the cluster DNS service — a ClusterIP (here 10.96.0.10, a private service IP) that fronts the CoreDNS pods. The search list is built from the pod’s namespace, so a pod in default gets default.svc.cluster.example first. And ndots:5 is the line that trips everyone up, which deserves its own paragraph below.

This file is not negotiable from inside the container unless you ask for it. If you need to override it — a different nameserver, a shorter ndots, extra search domains — you set dnsConfig and dnsPolicy on the pod spec, and the kubelet writes what you asked for instead. That hook is the cleanest fix for a whole category of DNS pain, so keep it in your back pocket.
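To make those three directives concrete, here is a minimal sketch of pulling them out of a resolv.conf. The ResolvConf class and parse_resolv_conf function are names invented for this post, not part of any library, and the parser deliberately ignores every directive other than the three discussed above.

```python
from dataclasses import dataclass, field


@dataclass
class ResolvConf:
    nameservers: list = field(default_factory=list)
    search: list = field(default_factory=list)
    ndots: int = 1  # resolver default when no "options ndots:N" is present


def parse_resolv_conf(text):
    """Parse nameserver, search and ndots; ignore everything else."""
    conf = ResolvConf()
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":  # blank line or comment
            continue
        keyword, *values = line.split()
        if keyword == "nameserver" and values:
            conf.nameservers.append(values[0])
        elif keyword == "search":
            conf.search = values  # a later "search" line replaces an earlier one
        elif keyword == "options":
            for opt in values:
                if opt.startswith("ndots:"):
                    conf.ndots = int(opt.split(":", 1)[1])
    return conf


# The sample file from this post.
SAMPLE = """\
nameserver 10.96.0.10
search default.svc.cluster.example svc.cluster.example cluster.example
options ndots:5
"""

if __name__ == "__main__":
    print(parse_resolv_conf(SAMPLE))
```

Run it inside a pod against the real /etc/resolv.conf and you have the inputs for everything the resolver does next.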

The search domain dance

When your app asks for my-service, the C resolver (glibc) does not send my-service straight to the nameserver. The name has fewer than five dots, so ndots:5 tells the resolver to treat it as a relative name and try the search domains first. In order, it queries:

my-service.default.svc.cluster.example
my-service.svc.cluster.example
my-service.cluster.example
and only then my-service as an absolute name

The first one hits, so you never notice the rest. But ask for api.github.com (two dots, still under five) and the resolver dutifully tries api.github.com.default.svc.cluster.example, api.github.com.svc.cluster.example, and api.github.com.cluster.example, all guaranteed to fail, before finally querying api.github.com as written. That’s three pointless round-trips for every external lookup, and because the resolver asks for both A and AAAA records, it can be six.
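The ordering rule is small enough to sketch in a few lines. This is a simplified model of glibc-style behaviour, not the resolver’s actual code: candidate_names is a name made up for this post, and it lists the worst-case sequence, since a real lookup stops at the first name that resolves.

```python
def candidate_names(name, search, ndots):
    """Return the names a glibc-style resolver tries, in order."""
    if name.endswith("."):
        return [name]  # trailing dot: already absolute, search list skipped
    absolute = name + "."
    via_search = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        return [absolute] + via_search  # enough dots: try it as written first
    return via_search + [absolute]      # too few dots: search list first


SEARCH = ["default.svc.cluster.example", "svc.cluster.example", "cluster.example"]

if __name__ == "__main__":
    for name in ("my-service", "api.github.com", "api.github.com."):
        print(name, "->", candidate_names(name, SEARCH, ndots=5))
```

For api.github.com with ndots 5 the real name comes fourth in the list; with the trailing dot it is the only entry.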

This is the single most common cause of “why is my DNS slow in Kubernetes.” Worse, those doomed lookups aren’t always instant: CoreDNS answers the cluster-zone ones itself, but any search domain inherited from the node is forwarded upstream, and if that resolver is sluggish to return NXDOMAIN, every external call your app makes pays for it. I’ve seen a service’s p99 latency halve from a one-character fix.

The fixes, in order of bluntness:

Append a trailing dot to fully-qualified external names in your config (api.github.com.). The dot makes the name absolute, so the resolver skips the search list entirely. This is the surgical fix and it costs nothing.
Lower ndots for that pod via dnsConfig, so that names with at least that many dots are tried as absolute names first instead of going through the search list.

spec:
  dnsConfig:
    options:
      - name: ndots
        value: "1"

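Be careful with the blunt option. With ndots at 1, single-label names like my-service still go through the search list, but any internal name containing a dot, such as my-service.other-namespace, is now tried as an absolute name first. glibc falls back to the search list when that fails, so the name still resolves at the cost of a wasted upstream query; musl-based images such as Alpine skip the search list entirely for such names, so they stop resolving. The trailing-dot approach is safer when you can control the names; the ndots override is for when you can’t and your workload is overwhelmingly external.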

CoreDNS answers the phone

The query lands on CoreDNS, a small Go DNS server built entirely from plugins. Its behaviour lives in a ConfigMap called coredns in kube-system, and the file that ConfigMap holds is the Corefile:

.:53 {
    errors
    health {
        lameduck 5s
    }
    ready
    kubernetes cluster.example in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
        ttl 30
    }
    prometheus :9153
    forward . /etc/resolv.conf {
        max_concurrent 1000
    }
    cache 30
    loop
    reload
    loadbalance
}

Read it as a pipeline, with one caveat: the order in which plugins run is fixed when CoreDNS is built, not by the order of lines in the Corefile (cache, for instance, is consulted before kubernetes and forward even though it is listed after them). The kubernetes plugin is the star: it watches the API server for Services and their endpoints, holds them in memory, and answers any query under your cluster zone directly. For my-service.default.svc.cluster.example it returns the Service’s ClusterIP. For a headless service (clusterIP: None) it returns the individual pod IPs instead, which is exactly what StatefulSets and client-side load balancing rely on.
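The ClusterIP-versus-headless distinction is easier to see as a toy model. This is not CoreDNS code: a_records and the two Service dictionaries are hypothetical, and the sketch only models which A records come back.

```python
def a_records(service):
    """Toy model of the A records returned for a Service name."""
    if service["cluster_ip"] is None:
        # Headless Service (clusterIP: None): one A record per ready pod.
        return sorted(service["ready_pod_ips"])
    # Ordinary Service: a single virtual IP, whatever the pod count.
    return [service["cluster_ip"]]


# Hypothetical Services, for illustration only.
web = {"cluster_ip": "10.96.12.34", "ready_pod_ips": ["10.244.0.8", "10.244.1.9"]}
db = {"cluster_ip": None, "ready_pod_ips": ["10.244.1.21", "10.244.0.20"]}

if __name__ == "__main__":
    print("web:", a_records(web))
    print("db: ", a_records(db))
```

The client of web never sees a pod IP; the client of db sees all of them and has to pick one itself.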

Anything that isn’t a cluster name falls through to forward . /etc/resolv.conf, which hands the query to the upstream resolvers the node uses, so external names go out to your real DNS. The cache 30 plugin keeps answers for at most 30 seconds so the backends behind CoreDNS aren’t hammered, loop detects forwarding loops at startup and halts CoreDNS rather than letting it melt down, reload watches the Corefile so a ConfigMap edit takes effect without a restart, and loadbalance shuffles A-record order for crude round-robin.

If you want to add a stub zone — say, forward everything under internal-svc to an on-prem resolver at 10.0.0.53 — you don’t hack the main block, you add a server block:

internal-svc:53 {
    errors
    cache 30
    forward . 10.0.0.53
}

Edit the ConfigMap, the reload plugin picks it up within a couple of minutes, and you’re done. No pod restart, no downtime.

The bit nobody mentions

CoreDNS doesn’t talk to etcd or scrape the network. It maintains an informer — a watch — against the Kubernetes API. Create a Service and within a second or two its name resolves, because CoreDNS got a watch event and updated its in-memory map. This is elegant, and it’s also the source of the failure modes that catch people out:

When the API server is unhealthy, CoreDNS can’t learn about new Services. Existing names keep resolving from the in-memory map, but freshly-created Services may not appear, which produces the maddening “it works for the old service but not the new one” symptom.
A CrashLoopBackOff on CoreDNS takes the whole cluster’s name resolution down with it. This is precisely why the default deployment runs two replicas and spreads them with anti-affinity. If you’ve scaled it to one to “save resources,” undo that — it’s a false economy.
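Here is a toy sketch of why an unhealthy API server produces the “old service works, new one doesn’t” symptom. ServiceMap is a stand-in invented for this post, not the kubernetes plugin’s real data structure; the event names mirror the Kubernetes watch event types, and the Service names and IPs are made up.

```python
class ServiceMap:
    """Toy stand-in for the in-memory view that a watch keeps up to date."""

    def __init__(self):
        self._ips = {}

    def apply(self, event_type, fqdn, cluster_ip=None):
        # ADDED, MODIFIED and DELETED mirror the Kubernetes watch event types.
        if event_type in ("ADDED", "MODIFIED"):
            self._ips[fqdn] = cluster_ip
        elif event_type == "DELETED":
            self._ips.pop(fqdn, None)

    def resolve(self, fqdn):
        # None stands in for NXDOMAIN.
        return self._ips.get(fqdn)


if __name__ == "__main__":
    dns = ServiceMap()
    dns.apply("ADDED", "old-svc.default.svc.cluster.example", "10.96.0.50")
    # The API server becomes unhealthy here: new-svc is created,
    # but its ADDED event never reaches the watcher.
    print(dns.resolve("old-svc.default.svc.cluster.example"))  # still answers
    print(dns.resolve("new-svc.default.svc.cluster.example"))  # unknown
```

Nothing in the map expires on its own, which is why existing names survive an API outage and new ones simply never show up.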

Troubleshooting: a triage runbook

When DNS goes sideways, resist the urge to restart things randomly. Work the chain in order.

1. Is the plumbing fundamentally sound? Run a throwaway pod and resolve the API service:

$ kubectl run -it --rm dnstest --image=busybox:1.36 --restart=Never -- \
    nslookup kubernetes.default
Server:    10.96.0.10
Address:   10.96.0.10:53
Name:      kubernetes.default.svc.cluster.example
Address:   10.96.0.1

If that resolves, your DNS plumbing is sound and the problem is narrower than you fear. If it doesn’t, keep going.
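One caveat from my own 1am experience: a single successful lookup proves little when the failure is intermittent. A rough sketch of a repeated check, using only the Python standard library, is below. count_failures is a helper invented for this post; it assumes you run it from a pod that has Python available, and because the libc resolver retries on timeout, a dropped packet may show up as a slow lookup rather than a failed one.

```python
import socket


def count_failures(name, attempts=20, resolve=socket.getaddrinfo):
    """Resolve `name` repeatedly and return how many attempts failed."""
    failures = 0
    for _ in range(attempts):
        try:
            resolve(name, None)  # port is irrelevant for a pure name lookup
        except socket.gaierror:
            failures += 1
    return failures


# Example use from inside a pod (not executed here):
#   failed = count_failures("kubernetes.default", attempts=50)
#   print(f"{failed}/50 lookups failed")
```

A result like 10 out of 50 is the fingerprint of one bad replica out of several, which points you at steps 2 and 4 rather than at the Corefile.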

2. Are the CoreDNS pods actually up?

$ kubectl -n kube-system get pods -l k8s-app=kube-dns
NAME                       READY   STATUS    RESTARTS   AGE
coredns-5d78c9869d-7x8mz   1/1     Running   0          6d
coredns-5d78c9869d-q2lkp   1/1     Running   4          6d

Four restarts on one replica is a smell — check its events and logs for OOMKills. CoreDNS is memory-hungry on large clusters because the in-memory Service map scales with the number of Services and Endpoints.

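3. Does the Service have endpoints? Start with the DNS Service itself: if kube-dns has no endpoints, no query can reach CoreDNS, however healthy the pods look.

$ kubectl -n kube-system get endpoints kube-dns
NAME      ENDPOINTS                       AGE
kube-dns  10.244.0.5:53,10.244.1.7:53     6d

Then run the same check against the Service you are trying to reach. A perfectly healthy CoreDNS will still hand back the right ClusterIP for a Service that has no ready pods behind it, so the name resolves and the connection fails anyway. Empty endpoints there means the pods aren’t ready or the selector doesn’t match them: a readiness-probe or selector problem, not a DNS one.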

4. Is something eating UDP 53? This was my 1am culprit. A NetworkPolicy that denies egress by default will silently drop DNS unless you explicitly allow it. Check for policies in the namespace and make sure UDP and TCP 53 to kube-system are permitted. The giveaway is the symptom: intermittent SERVFAIL that correlates with which CoreDNS replica answered.
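To confirm that the failures follow one replica, you can bypass the Service IP and query each CoreDNS pod directly. The sketch below hand-builds a minimal DNS query with the standard library so it needs no extra packages. build_query, rcode_of and probe are names invented for this post, and the pod IPs in the usage comment are the example endpoints from step 3, so substitute your own.

```python
import socket
import struct


def build_query(name, txid=0x1234, qtype=1):
    """Build a minimal DNS query packet (qtype 1 is an A record)."""
    # Header: ID, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)  # class IN


def rcode_of(reply):
    """Extract the RCODE (0 NOERROR, 2 SERVFAIL, 3 NXDOMAIN) from a reply."""
    if len(reply) < 4:
        return None
    return reply[3] & 0x0F  # low four bits of the flags word


def probe(server_ip, name, port=53, timeout=2.0):
    """Send one UDP query straight to server_ip; None means no usable reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (server_ip, port))
            reply, _ = sock.recvfrom(512)
        except OSError:  # includes timeouts
            return None
    return rcode_of(reply)


# Example use from inside a pod (not executed here):
#   for ip in ("10.244.0.5", "10.244.1.7"):
#       print(ip, probe(ip, "kubernetes.default.svc.cluster.example"))
```

A replica that answers 0 next to one that returns None every time tells you the packets to the second pod are being dropped, which is a policy or network question rather than a CoreDNS one.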

5. Turn on the log plugin temporarily. Add log to the Corefile, reload, and you’ll see every query CoreDNS handles. Invaluable for confirming whether a query even reaches CoreDNS or dies before it. Remember to take it back out — at cluster scale it’s a firehose.

A few things worth knowing about scale and tuning

Two extra details separate people who use cluster DNS from people who can operate it.

NodeLocal DNSCache. On busy clusters, every pod hitting the central CoreDNS service over UDP runs into two problems: conntrack table pressure on the nodes, and a well-documented conntrack race in the Linux kernel’s NAT path that intermittently drops UDP DNS packets and shows up as random 5-second resolution stalls. The fix is NodeLocal DNSCache, a small caching agent running as a DaemonSet on every node. Pods talk to the cache on a link-local address that never leaves the node; the cache forwards misses to CoreDNS over TCP, sidestepping both the conntrack churn and the UDP race. If you’ve ever seen suspiciously round 5-second DNS delays under load, this is almost certainly your culprit, and NodeLocal DNSCache is the standard answer.

Cache TTLs are a real lever. The cache 30 in the Corefile and the ttl 30 on the kubernetes plugin are not arbitrary. Lower them and clients re-query more often: fresher answers, more query load on CoreDNS. Raise them and you cut load but propagate Service IP changes more slowly, which matters during rollouts where endpoints churn. The 30 seconds in the stock Corefile is a sane middle ground; change it deliberately, with a reason, not because a blog post told you a bigger number is faster.

Negative caching counts too. A SERVFAIL or NXDOMAIN gets cached as well. If you fix a misconfigured Service but a client still can’t resolve it for half a minute, you’re very likely looking at a cached negative answer expiring, not a real failure. Knowing that saves you from “fixing” something that was already fixed.
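A toy cache makes the “already fixed, still failing” effect easy to reproduce. TinyDnsCache is invented for this post and is far simpler than the real cache plugin, which handles positive and negative answers with separate limits; here one TTL covers both, and None stands in for NXDOMAIN.

```python
import time


class TinyDnsCache:
    """Toy cache: keeps positive and negative answers for `ttl` seconds."""

    def __init__(self, lookup, ttl=30, clock=time.monotonic):
        self._lookup = lookup  # upstream function: name -> IP string, or None
        self._ttl = ttl
        self._clock = clock
        self._entries = {}     # name -> (expires_at, answer)

    def resolve(self, name):
        now = self._clock()
        hit = self._entries.get(name)
        if hit is not None and hit[0] > now:
            return hit[1]      # may be a cached None, i.e. a negative answer
        answer = self._lookup(name)
        self._entries[name] = (now + self._ttl, answer)
        return answer


if __name__ == "__main__":
    records = {}               # hypothetical upstream data
    fake_now = [0.0]
    cache = TinyDnsCache(records.get, ttl=30, clock=lambda: fake_now[0])

    print(cache.resolve("my-service"))   # miss is cached as a negative answer
    records["my-service"] = "10.96.0.77" # the Service gets fixed
    fake_now[0] = 10.0
    print(cache.resolve("my-service"))   # still the cached negative answer
    fake_now[0] = 31.0
    print(cache.resolve("my-service"))   # entry expired, the fix is visible
```

The injected clock is only there so the behaviour can be shown without actually waiting half a minute.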

Is understanding this worth it?

If you only ever run other people’s manifests on a managed cluster where someone else owns the control plane, you can probably coast on faith for years. But the moment you self-host — and especially the moment you debug intermittent timeouts that turn out to be the ndots:5 tax, or a NetworkPolicy quietly eating port 53 — this knowledge pays for itself in a single sitting. DNS is the layer everything else assumes works, which is exactly why its failures are so disorienting: nothing tells you DNS is the problem, because everything that depends on DNS just looks broken.

Spend an afternoon reading your Corefile and tracing one lookup by hand. The same instinct for following a request through the stack is what makes Git internals click once you see what git commit actually does, and it pays off again when you start untangling what Helm charts really do under the hood. It’s the same skill every time: stop trusting the abstraction, follow the bytes, and the magic turns into mechanics you can debug. That alone is worth the price of admission.

Written by Smarc

Founder and editor of vo.rs. A lifelong tinkerer who self-hosts far more than is sensible, hardens Linux boxes for fun, and prods the latest AI tools to see what they can really do. The how-to guides here are the notes Smarc wishes had existed the first time round.

Tagged: #kubernetes #networking #dns #troubleshooting
