Context
It could be that I configured production Kubernetes clusters with cilium in iptables mode and not eBPF mode. But you have no proof…
However, in the highly unlikely event that I had made such a blunder, here’s how I would have gone about fixing it.
#trollface
So let’s imagine that when checking the Cilium configuration, here’s what you found:
kubectl -n cilium exec -it cilium-aaaaa -- cilium status
[...]
Host Routing: Legacy
Masquerading: IPTables [IPv4: Enabled, IPv6: Disabled]
[...]
Damn!
Looking more closely, the cilium-agent containers in your cilium pods are using a lot of CPU and RAM and are starting to clog up your workers…
This is a disaster.
Why?
The first implementations of the Kubernetes virtual network (the famous CNI plugins) relied on iptables. However, we’ve known for a very long time, especially in the Kubernetes use case, that iptables handles large numbers of rules poorly.
In Kubernetes specifically, the number of rules tends to explode with cluster size, and so does the CPU consumed to route network packets through them…
At some point, traversing the rules list takes all the CPU of a given node and makes it unresponsive.
Hello boss? We’re in trouble.
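To get an idea of the scale on your own cluster, counting the rules a worker actually has to traverse is a quick sanity check. Run directly on a node (the chain names below are just the usual suspects and vary with your setup):
# total number of iptables rules programmed on this node
sudo iptables-save | wc -l
# the Kubernetes/CNI related ones (chain names depend on your setup)
sudo iptables-save | grep -c -E 'KUBE|CILIUM'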
This is one of the reasons why I really like cilium as a CNI plugin - the developers were among the first to bet on eBPF as a replacement for iptables (although there are other implementations / other technologies that solve this problem).
How did we get here?
To be honest, I just read the docs and trusted them…
We introduced eBPF-based host-routing in Cilium 1.9 to fully bypass iptables and the upper host stack, and to achieve a faster network namespace switch compared to regular veth device operation. This option is automatically enabled if your kernel supports it. To validate whether your installation is running with eBPF host-routing, run cilium status in any of the Cilium pods and look for the line reporting the status for “Host Routing” which should state “BPF”.
Basically, according to the official docs, if you have the right kernel and the right modules, the cilium installation is supposed to automatically enable eBPF mode… Except that we can clearly see above that this isn’t the case, despite a recent kernel (6.2).
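If you want to double-check the prerequisites on a node yourself, something like this is enough (eBPF host-routing needs a 5.10+ kernel; the path to the kernel config file is an assumption that varies with the distribution):
# kernel version (eBPF host-routing requires >= 5.10)
uname -r
# make sure BPF support is compiled in (file location depends on your distro)
grep -E 'CONFIG_BPF=|CONFIG_BPF_SYSCALL=' /boot/config-$(uname -r)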
Digging a little deeper into the logs, here’s what you can find:
kubectl -n cilium logs cilium-7b5cp
[...]
level=info msg="BPF host routing requires enable-bpf-masquerade. Falling back to legacy host routing (enable-host-legacy-routing=true)." subsys=daemon
Well… apparently, an option is missing from our Helm values.
[...]
kubeProxyReplacement: strict
+ bpf:
+   masquerade: true
[...]
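Rolled out naively, that’s a one-line Helm upgrade. The release name, namespace and chart reference below are assumptions, adjust them to your own install:
# pushes bpf.masquerade=true on top of the existing values, cluster-wide
helm upgrade cilium cilium/cilium \
  --namespace cilium \
  --reuse-values \
  --set bpf.masquerade=true
# the cilium agents then need to restart to pick up the new configuration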
Problem solved, end of article?
Problem
Yes, of course, we could modify the chart like a bull in a china shop and call it a day.
But let’s say we don’t want to cut production traffic… How do we do it?
The cleanest solution that comes to mind is the following:
- we drain a node
- we change its configuration
- we uncordon it to put some traffic back on it
- we check that the new configuration works AND
- we check that the nodes can talk to each other, between those with iptables and those with eBPF
Because yeah, it would be a shame to have half the cluster unable to communicate with the other half…
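Node by node, that boils down to something like this (a sketch, using node-05 as the example node from later in the article):
# take the node out of rotation before touching its configuration
kubectl drain node-05 --ignore-daemonsets --delete-emptydir-data
# ...change the cilium configuration for this node (see below)...
# then put traffic back on it
kubectl uncordon node-05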
Checking connectivity
What’s cool about cilium (did I tell you I love cilium?) is that we already have tooling to test everything.
The cilium CLI has a cilium connectivity test subcommand that launches dozens of internal/external tests to check that everything is OK.
➜ ~ cilium -n cilium connectivity test
ℹ️ Monitor aggregation detected, will skip some flow validation steps
✨ [node] Deploying echo-same-node service...
✨ [node] Deploying DNS test server configmap...
✨ [node] Deploying same-node deployment...
✨ [node] Deploying client deployment...
[...]
✅ All 32 tests (265 actions) successful, 2 tests skipped, 0 scenarios skipped.
There’s also a cilium-health command, embedded in the cilium-agent container, which allows you to get periodic latency statistics between all the cluster nodes. Useful!
kubectl -n cilium exec -it cilium-ccccc -c cilium-agent -- cilium-health status
Probe time: 2023-09-12T14:47:03Z
Nodes:
  node-02 (localhost):
    Host connectivity to 172.31.0.152:
      ICMP to stack: OK, RTT=860.424µs
      HTTP to agent: OK, RTT=110.142µs
    Endpoint connectivity to 10.0.2.56:
      ICMP to stack: OK, RTT=783.861µs
      HTTP to agent: OK, RTT=256.419µs
  node-01:
    Host connectivity to 172.31.0.151:
      ICMP to stack: OK, RTT=813.324µs
      HTTP to agent: OK, RTT=553.445µs
    Endpoint connectivity to 10.0.1.53:
      ICMP to stack: OK, RTT=865.976µs
      HTTP to agent: OK, RTT=3.440655ms
[...]
And finally, we can simply look at the cilium status command, which tells us which nodes are “reachable” (again, from the cilium-agent container):
➜ kubectl -n cilium exec -ti cilium-ddddd -- cilium status
[...]
Host Routing: Legacy
Masquerading: IPTables [IPv4: Enabled, IPv6: Disabled]
[...]
Cluster health: 5/5 reachable (2023-09-14T12:01:51Z)
Changing a node’s configuration
We’re in luck, because since cilium version 1.13 (the latest to date is 1.14), it’s possible to apply different configurations to a subset of nodes (official documentation for Per-node configuration).
Note: before that, it was still possible to do it, as a stopgap, by manually editing the cilium ConfigMap and then restarting the affected pods for the change to take effect.
There’s now a CRD to do this, called CiliumNodeConfig. The only things we have to do are add a label to a node (io.cilium.enable-ebpf: “true”) and add the following manifest to our cluster:
cat > cilium-fix.yaml << EOF
apiVersion: cilium.io/v2alpha1
kind: CiliumNodeConfig
metadata:
  namespace: cilium
  name: cilium-switch-from-iptables-ebpf
spec:
  nodeSelector:
    matchLabels:
      io.cilium.enable-ebpf: "true"
  defaults:
    enable-bpf-masquerade: "true"
EOF
kubectl apply -f cilium-fix.yaml
kubectl label node node-05 --overwrite 'io.cilium.enable-ebpf=true'
If you’re in production, the cleanest approach is to kubectl drain the Node beforehand.
Out of curiosity, I still tried to do it “hot”, for fun.
I deployed a daemonset containing a V(lang) app that simply returns the container name in a web page:
cat > vhelloworld-daemonset.yaml << EOF
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: vhelloworld-daemonset
spec:
  selector:
    matchLabels:
      app: vhelloworld
  template:
    metadata:
      labels:
        app: vhelloworld
    spec:
      containers:
      - name: vhelloworld
        image: zwindler/vhelloworld:latest
        ports:
        - containerPort: 8081
        imagePullPolicy: Always
EOF
kubectl apply -f vhelloworld-daemonset.yaml
kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
vhelloworld-daemonset-q4h44 1/1 Running 0 3m31s 10.0.4.87 nodekube-05 <none> <none>
vhelloworld-daemonset-w97pv 1/1 Running 0 3m31s 10.0.3.85 nodekube-04 <none> <none>
Then I created containers with a shell and curl to periodically check from multiple nodes that the containers were accessible:
kubectl run -it --image curlimages/curl:latest curler -- /bin/sh
If you don't see a command prompt, try pressing enter.
~ $ curl http://10.0.4.87:8081
hello from vhelloworld-daemonset-g2zdw
~ $ curl http://10.0.3.85:8081
hello from vhelloworld-daemonset-65k2n
~ $ while true; do
date
curl http://10.0.4.87:8081
curl http://10.0.3.85:8081
echo
sleep 1
done
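To make the agent on the relabeled node pick up the new per-node configuration, its cilium pod has to be recreated. A sketch of targeting only that node’s agent (k8s-app=cilium is the label the chart puts on agent pods; node-05 is the node labeled earlier):
kubectl -n cilium delete pod -l k8s-app=cilium --field-selector spec.nodeName=node-05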
Once the label was applied to a node and its cilium pod killed (via kubectl delete), the pods remained accessible:
Fri Sep 15 14:05:34 UTC 2023
hello from vhelloworld-daemonset-g2zdw
hello from vhelloworld-daemonset-65k2n
Fri Sep 15 14:05:35 UTC 2023
hello from vhelloworld-daemonset-g2zdw
hello from vhelloworld-daemonset-65k2n
Fri Sep 15 14:05:36 UTC 2023
hello from vhelloworld-daemonset-g2zdw
hello from vhelloworld-daemonset-65k2n
At some point, cilium restarted, noticed the presence of existing pods, and took over with eBPF!
level=info msg="Rewrote endpoint BPF program" containerID=8b7be1b032 datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=1018 identity=4773 ipv4=10.0.3.85 ipv6= k8sPodName=default/vhelloworld-daemonset-w97pv subsys=endpoint
level=info msg="Restored endpoint" endpointID=1018 ipAddr="[10.0.3.85 ]" subsys=endpoint
It works, but is it better in terms of resource consumption?
That was the good news. I was also expecting gains, because the cluster was getting really close to the dreaded death by iptables.
The cilium containers were using 10% of my nodes’ CPU, and several GB of RAM in iptables mode. It could quickly have gone much higher if I had grown the cluster beyond 50 nodes / 2000 pods.
Switching to eBPF mode brought consumption back to levels that are totally painless for my machines’ specs (1% CPU per node, 1-2% of RAM).
Not bad, huh?
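If you want to put your own numbers on it, kubectl top is enough for a rough before/after comparison (assuming metrics-server is installed in the cluster):
# overall node consumption
kubectl top nodes
# resource usage of the cilium agents themselves
kubectl -n cilium top pods -l k8s-app=cilium --containers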
