Context
It could be that I configured production Kubernetes clusters with cilium in iptables mode and not eBPF mode. But you have no proof…
However, in the highly unlikely event that I had made such a blunder, here’s how I would have gone about fixing it.
#trollface
So let’s imagine that when checking the Cilium configuration, here’s what you found:
kubectl -n cilium exec -it cilium-aaaaa -- cilium status
[...]
Host Routing: Legacy
Masquerading: IPTables [IPv4: Enabled, IPv6: Disabled]
[...]
Damn!
Looking more closely, the cilium-agent containers in your cilium pods are using a lot of CPU and RAM and are starting to clog up your workers…
This is a disaster.
Why?
The first implementations of the Kubernetes virtual network (the famous CNI plugins) relied on iptables. However, we’ve known for a very long time, especially in the Kubernetes use case, that iptables handles large numbers of rules poorly.
In Kubernetes specifically, the number of rules tends to explode with cluster size, and so does the CPU consumed to route network packets through them…
At some point, traversing the rules list takes all the CPU of a given node and makes it unresponsive.
Hello boss? We’re in trouble.
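To get an idea of the scale on your own cluster, counting the rules a worker actually has to traverse is a quick sanity check. Run directly on a node (the chain names below are just the usual suspects and vary with your setup):
# total number of iptables rules programmed on this node
sudo iptables-save | wc -l
# the Kubernetes/CNI related ones (chain names depend on your setup)
sudo iptables-save | grep -c -E 'KUBE|CILIUM'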
This is one of the reasons why I really like cilium as a CNI plugin - the developers were among the first to bet on eBPF as a replacement for iptables (although there are other implementations / other technologies that solve this problem).
How did we get here?
To be honest, I just read the docs and trusted them…
We introduced eBPF-based host-routing in Cilium 1.9 to fully bypass iptables and the upper host stack, and to achieve a faster network namespace switch compared to regular veth device operation. This option is automatically enabled if your kernel supports it. To validate whether your installation is running with eBPF host-routing, run cilium status in any of the Cilium pods and look for the line reporting the status for “Host Routing” which should state “BPF”.
Basically, according to the official docs, if you have the right kernel and the right modules, the cilium installation is supposed to automatically enable eBPF mode… Except that we can clearly see above that this isn’t the case, despite a recent kernel (6.2).
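If you want to double-check the prerequisites on a node yourself, something like this is enough (eBPF host-routing needs a 5.10+ kernel; the path to the kernel config file is an assumption that varies with the distribution):
# kernel version (eBPF host-routing requires >= 5.10)
uname -r
# make sure BPF support is compiled in (file location depends on your distro)
grep -E 'CONFIG_BPF=|CONFIG_BPF_SYSCALL=' /boot/config-$(uname -r)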
Digging a little deeper into the logs, here’s what you can find:
kubectl -n cilium logs cilium-7b5cp
[...]
level=info msg="BPF host routing requires enable-bpf-masquerade. Falling back to legacy host routing (enable-host-legacy-routing=true)." subsys=daemon
Well… apparently, an option is missing from our Helm values.
[...]
kubeProxyReplacement: strict
+ bpf:
+   masquerade: true
[...]
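Rolled out naively, that’s a one-line Helm upgrade. The release name, namespace and chart reference below are assumptions, adjust them to your own install:
# pushes bpf.masquerade=true on top of the existing values, cluster-wide
helm upgrade cilium cilium/cilium \
  --namespace cilium \
  --reuse-values \
  --set bpf.masquerade=true
# the cilium agents then need to restart to pick up the new configuration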
Problem solved, end of article?
Problem
Yes, of course, we could modify the chart like a bull in a china shop and call it a day.
But let’s say we don’t want to cut production traffic… How do we do it?
The cleanest solution that comes to mind is the following:
- we drain a node
- we change its configuration
- we uncordon it to put some traffic back on it
- we check that the new configuration works AND
- we check that the nodes can talk to each other, between those with iptables and those with eBPF
Because yeah, it would be a shame to have half the cluster unable to communicate with the other half…
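Node by node, that boils down to something like this (a sketch, using node-05 as the example node from later in the article):
# take the node out of rotation before touching its configuration
kubectl drain node-05 --ignore-daemonsets --delete-emptydir-data
# ...change the cilium configuration for this node (see below)...
# then put traffic back on it
kubectl uncordon node-05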
Checking connectivity
What’s cool about cilium (did I tell you I love cilium?) is that we already have tooling to test everything.
The cilium CLI has a cilium connectivity test subcommand that launches dozens of internal/external tests to check that everything is OK.
➜ ~ cilium -n cilium connectivity test
ℹ️ Monitor aggregation detected, will skip some flow validation steps
✨ [node] Deploying echo-same-node service...
✨ [node] Deploying DNS test server configmap...
✨ [node] Deploying same-node deployment...
✨ [node] Deploying client deployment...
[...]
✅ All 32 tests (265 actions) successful, 2 tests skipped, 0 scenarios skipped.
There’s also a cilium-health command, embedded in the cilium-agent container, which allows you to get periodic latency statistics between all the cluster nodes. Useful!
kubectl -n cilium exec -it cilium-ccccc -c cilium-agent -- cilium-health status
Probe time: 2023-09-12T14:47:03Z
Nodes:
  node-02 (localhost):
    Host connectivity to 172.31.0.152:
      ICMP to stack: OK, RTT=860.424µs
      HTTP to agent: OK, RTT=110.142µs
    Endpoint connectivity to 10.0.2.56:
      ICMP to stack: OK, RTT=783.861µs
      HTTP to agent: OK, RTT=256.419µs
  node-01:
    Host connectivity to 172.31.0.151:
      ICMP to stack: OK, RTT=813.324µs
      HTTP to agent: OK, RTT=553.445µs
    Endpoint connectivity to 10.0.1.53:
      ICMP to stack: OK, RTT=865.976µs
      HTTP to agent: OK, RTT=3.440655ms
[...]
And finally, we can simply look at the cilium status command, which tells us which nodes are “reachable” (again, from the cilium-agent container):
➜ kubectl -n cilium exec -ti cilium-ddddd -- cilium status
[...]
Host Routing: Legacy
Masquerading: IPTables [IPv4: Enabled, IPv6: Disabled]
[...]
Cluster health: 5/5 reachable (2023-09-14T12:01:51Z)
Changing a node’s configuration
We’re in luck, because since cilium version 1.13 (the latest to date is 1.14), it’s possible to apply different configurations to a subset of nodes (official documentation for Per-node configuration).
Note: before that, it was still possible to do it, as a stopgap, by manually editing the cilium ConfigMap and then restarting the affected pods for the change to take effect.
There’s now a CRD to do this, called CiliumNodeConfig. The only things we have to do are add a label to a node (io.cilium.enable-ebpf: “true”) and add the following manifest to our cluster:
cat > cilium-fix.yaml << EOF
apiVersion: cilium.io/v2alpha1
kind: CiliumNodeConfig
metadata:
  namespace: cilium
  name: cilium-switch-from-iptables-ebpf
spec:
  nodeSelector:
    matchLabels:
      io.cilium.enable-ebpf: "true"
  defaults:
    enable-bpf-masquerade: "true"
EOF
kubectl apply -f cilium-fix.yaml
kubectl label node node-05 --overwrite 'io.cilium.enable-ebpf=true'
If you’re in production, the cleanest approach is to kubectl drain the Node beforehand.
Out of curiosity, I still tried to do it “hot”, for fun.
I deployed a daemonset containing a V(lang) app that simply returns the container name in a web page:
cat > vhelloworld-daemonset.yaml << EOF
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: vhelloworld-daemonset
spec:
  selector:
    matchLabels:
      app: vhelloworld
  template:
    metadata:
      labels:
        app: vhelloworld
    spec:
      containers:
      - name: vhelloworld
        image: zwindler/vhelloworld:latest
        ports:
        - containerPort: 8081
        imagePullPolicy: Always
EOF
kubectl apply -f vhelloworld-daemonset.yaml
kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
vhelloworld-daemonset-q4h44 1/1 Running 0 3m31s 10.0.4.87 nodekube-05 <none> <none>
vhelloworld-daemonset-w97pv 1/1 Running 0 3m31s 10.0.3.85 nodekube-04 <none> <none>
Then I created containers with a shell and curl to periodically check from multiple nodes that the containers were accessible:
kubectl run -it --image curlimages/curl:latest curler -- /bin/sh
If you don't see a command prompt, try pressing enter.
~ $ curl http://10.0.4.87:8081
hello from vhelloworld-daemonset-g2zdw
~ $ curl http://10.0.3.85:8081
hello from vhelloworld-daemonset-65k2n
~ $ while true; do
date
curl http://10.0.4.87:8081
curl http://10.0.3.85:8081
echo
sleep 1
done
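To make the agent on the relabeled node pick up the new per-node configuration, its cilium pod has to be recreated. A sketch of targeting only that node’s agent (k8s-app=cilium is the label the chart puts on agent pods; node-05 is the node labeled earlier):
kubectl -n cilium delete pod -l k8s-app=cilium --field-selector spec.nodeName=node-05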
Once the label was applied to a node and its cilium pod killed (via kubectl delete), the pods remained accessible:
Fri Sep 15 14:05:34 UTC 2023
hello from vhelloworld-daemonset-g2zdw
hello from vhelloworld-daemonset-65k2n
Fri Sep 15 14:05:35 UTC 2023
hello from vhelloworld-daemonset-g2zdw
hello from vhelloworld-daemonset-65k2n
Fri Sep 15 14:05:36 UTC 2023
hello from vhelloworld-daemonset-g2zdw
hello from vhelloworld-daemonset-65k2n
At some point, cilium restarted, noticed the presence of existing pods, and took over with eBPF!
level=info msg="Rewrote endpoint BPF program" containerID=8b7be1b032 datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=1018 identity=4773 ipv4=10.0.3.85 ipv6= k8sPodName=default/vhelloworld-daemonset-w97pv subsys=endpoint
level=info msg="Restored endpoint" endpointID=1018 ipAddr="[10.0.3.85 ]" subsys=endpoint
It works, but is it better in terms of resource consumption?
That was the good news. I was also expecting gains, because the cluster was getting really close to the dreaded death by iptables.
The cilium containers were using 10% of my nodes’ CPU, and several GB of RAM in iptables mode. It could quickly have gone much higher if I had grown the cluster beyond 50 nodes / 2000 pods.
Switching to eBPF mode brought consumption back to levels that are totally painless for my machines’ specs (1% CPU per node, 1-2% of RAM).
Not bad, huh?
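If you want to put your own numbers on it, kubectl top is enough for a rough before/after comparison (assuming metrics-server is installed in the cluster):
# overall node consumption
kubectl top nodes
# resource usage of the cilium agents themselves
kubectl -n cilium top pods -l k8s-app=cilium --containers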
