Terminating pods can still receive traffic via a service

Based on searching The Internet and asking GitHub Copilot, I have concluded that it is a little-known fact that terminating pods can still receive traffic from a Kubernetes Service (on a sufficiently recent release of Kubernetes).

What do I mean by sufficiently recent? I mean Kubernetes v1.26 (December 2022) or beyond.

Conventional wisdom

Conventional wisdom is that a Service routes traffic to pods listed in the subsets of an Endpoints object. However, since Kubernetes 1.19 (August 2020) this is no longer the case! A Service actually routes traffic to pods listed in EndpointSlice objects! In case you missed it, it’s there in the CHANGELOG for v1.19.

I’ve taken it out of context, but this statement in the CHANGELOG rings true to me:

mostly be an invisible change

[Image: an animated GIF poking fun at the phrase “mostly invisible”]

If you have forgotten the “mostly invisible” change introduced in v1.19, you might notice that traffic is flowing to terminating pods, despite the output of kubectl looking somewhat like this:

# kubectl get pods -n the-namespace | grep the-app
the-app-5968b5d75f-5jdjs                                   2/2     Terminating   0             3h49m
the-app-5968b5d75f-7sfpk                                   2/2     Terminating   0             124m
the-app-5968b5d75f-8fmpm                                   2/2     Terminating   0             4h58m
the-app-5968b5d75f-92qdb                                   2/2     Terminating   0             14h
the-app-5968b5d75f-bxg95                                   2/2     Terminating   0             4h16m
the-app-5968b5d75f-dzgxc                                   2/2     Terminating   0             14h
the-app-5968b5d75f-k6684                                   2/2     Terminating   0             100m

# kubectl get ep -n the-namespace the-endpoint
NAME             ENDPOINTS   AGE
the-endpoint     <none>      47d

Endpoints are <none>, but traffic is flowing? Better have a look at the YAML output to double check:

# kubectl get ep -n the-namespace the-endpoint -o yaml
apiVersion: v1
kind: Endpoints
metadata:
  creationTimestamp: "2024-03-13T05:04:51Z"
  name: the-endpoint
  namespace: the-namespace
  resourceVersion: "2137011176"
  uid: f5ae4db1-3764-406a-8352-d7270bdf412d

Yup, there is definitely not a subsets array. “Mostly invisible” is right, for all the wrong reasons!

However, our trusty curl requests to the Service are definitely working, so what is going on?

Ready vs Serving

EndpointSlices introduced a serving condition in Kubernetes v1.20 (December 2020). The documentation for serving has a description that reminds me of the “mostly invisible” description of EndpointSlices themselves:

The serving condition is almost identical to the ready condition

[Image: an animated GIF poking fun at the phrase “almost identical”]

Great, what is the difference then? The difference is that terminating pods can never be ready, but they can be serving. The documentation goes on to say:

The difference is that consumers of the EndpointSlice API should check the serving condition if they care about pod readiness while the pod is also terminating.
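The relationship between the three conditions can be sketched in a few lines of Python. This is only an illustrative model of the rules quoted above, not the EndpointSlice controller's actual code: the `PodState` and `endpoint_conditions` names are made up for this example, and the model ignores settings such as `publishNotReadyAddresses`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PodState:
    """Hypothetical, simplified view of a pod (not a Kubernetes type)."""
    passes_readiness: bool  # the pod's Ready condition is True
    being_deleted: bool     # the pod has a deletionTimestamp set


def endpoint_conditions(pod: PodState) -> dict:
    """Model the EndpointSlice conditions that correspond to one pod."""
    return {
        # "ready" can never be true once the pod is terminating.
        "ready": pod.passes_readiness and not pod.being_deleted,
        # "serving" tracks readiness and ignores the terminating state.
        "serving": pod.passes_readiness,
        "terminating": pod.being_deleted,
    }


if __name__ == "__main__":
    # A terminating pod that still passes its readiness checks.
    print(endpoint_conditions(PodState(passes_readiness=True, being_deleted=True)))
```

In this model, a terminating pod that still passes its readiness checks ends up with `ready` false and `serving` true, which is exactly the combination a consumer has to look for.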

Ah huh! So that seems interesting. What would care about pod readiness whilst terminating? All signs pointed to kube-proxy.

[Image: a screenshot from Slack after several hours of troubleshooting]

kube-proxy

Trusty old kube-proxy is the component that creates iptables rules (in EKS) or IPVS rules (in our self-hosted clusters) to route Service traffic to pods. And now we know that since Kubernetes v1.19, it uses EndpointSlices to determine which pods to route traffic to. It seems that kube-proxy also cares about the serving condition.

To confirm that, I increased the verbosity of kube-proxy from --v=2 to --v=9. To create a surprising example of something I would not expect to work, but does, I set up a pod to be not ready and terminating. Then I transitioned the pod back to ready (but still terminating) and found the following logs:

I0430 06:41:58.927894       1 config.go:133] "Calling handler.OnEndpointSliceUpdate"
I0430 06:41:58.928139       1 endpointslicecache.go:373] "Setting endpoints for service port name" portName="the-namespace/the-app:http" endpoints=["10.104.88.233:5000"]
I0430 06:41:58.928153       1 endpointslicecache.go:373] "Setting endpoints for service port name" portName="the-namespace/the-app:http" endpoints=["10.104.88.233:5000"]
I0430 06:41:58.928172       1 proxier.go:796] "Syncing iptables rules"

My whole reality was shattered! I had always assumed that kube-proxy would only route traffic to ready pods and that terminating pods could never be ready. But here we are, with a terminating (but ready) pod receiving traffic from a Service.

I was still puzzled as to why kube-proxy would do this, but a fresh search of The Internet turned up KEP-1669 and a Kubernetes blog post that said the behaviour changed in Kubernetes v1.26 (December 2022):

Starting in Kubernetes v1.26, kube-proxy enables the ProxyTerminatingEndpoints feature by default, which adds automatic failover and routing to terminating endpoints in scenarios where the traffic would otherwise be dropped.

Fail open

Now I get it! From Kubernetes 1.26, kube-proxy has a fail open mechanism to try and not drop traffic if it can help it. This is reminiscent of the behaviour of AWS ELBv2 (better known as ALB and NLB).

If kube-proxy has pods that are running and ready, it will send traffic to those pods. However, if there are zero ready pods that are not terminating, but some terminating pods that still pass their readiness checks, it will send traffic to these pods instead of dropping it.

This can be confusing when you see that the EndpointSlice object for a terminating pod has the ready condition set to false, but you need to keep in mind that kube-proxy cares about the serving condition instead.
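The fail open decision described above can be modelled roughly as follows. This is a simplified sketch of the behaviour as I understand it from the blog post and KEP-1669, not kube-proxy's Go implementation: the `select_endpoints` function and the dict layout are assumptions made for this example, and it ignores traffic policies and topology.

```python
def select_endpoints(endpoints):
    """Return the endpoints that would receive traffic in this simplified model.

    Each endpoint is a dict with an "address" string and a "conditions" dict
    holding the ready / serving / terminating booleans from an EndpointSlice.
    """
    # Normal case: use every endpoint that is ready.
    ready = [e for e in endpoints if e["conditions"].get("ready")]
    if ready:
        return ready

    # Fail open: with no ready endpoints, fall back to endpoints that are
    # terminating but still serving, rather than dropping the traffic.
    return [
        e for e in endpoints
        if e["conditions"].get("serving") and e["conditions"].get("terminating")
    ]


if __name__ == "__main__":
    only_terminating = [
        {"address": "10.104.88.233",
         "conditions": {"ready": False, "serving": True, "terminating": True}},
    ]
    print([e["address"] for e in select_endpoints(only_terminating)])
```

The fallback list is only consulted when the ready list is empty, so a terminating pod stops receiving new traffic as soon as at least one ready pod exists.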

Inspecting an EndpointSlice object for a terminating pod that is still ready shows these conditions:

# kubectl get endpointslice -n the-namespace the-app-qgvs7 -o yaml

apiVersion: discovery.k8s.io/v1
endpoints:
- addresses:
  - 10.104.88.233
  conditions:
    ready: false
    serving: true
    terminating: true

Inspecting an EndpointSlice object for a terminating pod that is not ready shows these conditions:

# kubectl get endpointslice -n the-namespace the-app-qgvs7 -o yaml
addressType: IPv4
apiVersion: discovery.k8s.io/v1
endpoints:
- addresses:
  - 10.104.88.233
  conditions:
    ready: false
    serving: false
    terminating: true
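
When a Service has many pods, reading the YAML by eye gets tedious. A small helper like the one below could summarise the conditions per address. It is a sketch that assumes the EndpointSlice has been fetched as JSON (for example with `-o json` instead of `-o yaml`); the `summarize_conditions` name and the `SAMPLE` document are made up for this example, with `SAMPLE` mirroring the first output above.

```python
import json


def summarize_conditions(endpointslice_json: str):
    """Return (address, ready, serving, terminating) for every address in one EndpointSlice."""
    slice_obj = json.loads(endpointslice_json)
    rows = []
    for endpoint in slice_obj.get("endpoints") or []:
        conditions = endpoint.get("conditions") or {}
        for address in endpoint.get("addresses") or []:
            rows.append((
                address,
                conditions.get("ready"),
                conditions.get("serving"),
                conditions.get("terminating"),
            ))
    return rows


# Hypothetical sample input: the terminating-but-serving endpoint, as JSON.
SAMPLE = """
{
  "apiVersion": "discovery.k8s.io/v1",
  "endpoints": [
    {
      "addresses": ["10.104.88.233"],
      "conditions": {"ready": false, "serving": true, "terminating": true}
    }
  ]
}
"""

if __name__ == "__main__":
    for row in summarize_conditions(SAMPLE):
        print(row)
```

Any row where `ready` is false while `serving` and `terminating` are both true is a candidate for the fail open behaviour.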

Things change

So there you go! Between 2020 and 2022, the behaviour of Kubernetes Services, EndpointSlices and kube-proxy changed. It confused me and my team for a while, and it took longer than I expected to find out why traffic was arriving at terminating pods. This blog post is an attempt to distribute the knowledge of this behaviour and I hope that it gets picked up into the indexes of search engines to shortcut the learning process for others!

Minutiae

A few other random facts (as at May 2024!):