The revenge strikes back
You may have read my previous article about etcd 3 years ago. If you remember, the crash was in etcd, but the real culprit was Kyverno. I love Kyverno. It’s really a piece of software I’m very fond of, mainly because it’s incredibly powerful. I even wrote an introductory article and a second one that goes deeper on the topic (both in French, though).
But the sheer number of incidents and weird side effects it causes… Mamma mia. This isn’t the first incident of the year I’ve had with Kyverno (yes, we’re in February), but since this one is entertaining, I’m sharing it with you.
During a routine maintenance operation to upgrade a Kubernetes cluster to version 1.34 (from 1.32), we ended up facing the dreaded scenario for any kube admin: a completely unreachable API Server after restarting the Control Plane nodes.
What initially looked like a typical network error turned out to be a subtle deadlock between new native Kubernetes networking features and our dear Kyverno 😘.
Spoiler: it wasn’t a network issue. It’s never a network issue. Well, sometimes it is. But not this time.
The upgrade that starts well
Alright, a Kubernetes upgrade has become pretty routine at this point. We do it regularly, we have our procedures, we’re pros (I swear). We jump from 1.32 to 1.34 in a single commit, skipping the hop through 1.33.
YOLO.
In the technical context I’m talking about, everything is managed as code. From machine provisioning all the way to Talos deployment, including MachineConfigs (the CustomResources to modify… well, the machine).
For more details, see the Talos documentation on Machine Configs.
The first cluster we test has only one control plane node (don’t ask me why, it probably wouldn’t have changed anything). Talos restarts the API Server with the new version and then… nothing.
The “weird” API Server logs (technical term) speak for themselves:
I0224 15:32:50.979280 1 default_servicecidr_controller.go:166] Creating default ServiceCIDR with CIDRs: [10.1.0.0/20]
W0224 15:32:50.984784 1 dispatcher.go:225] rejected by webhook "validate.kyverno.svc-fail":
admission webhook "validate.kyverno.svc-fail" denied the request:
Get "https://10.1.0.1:443/api": dial tcp 10.1.0.1:443: connect: operation not permitted
I0224 15:32:50.985342 1 event.go:389] "Event occurred" kind="ServiceCIDR"
apiVersion="networking.k8s.io/v1" type="Warning"
reason="KubernetesDefaultServiceCIDRError"
message="The default ServiceCIDR can not be created"
😬😬😬
The root cause: a magnificent vicious circle
After investigation, we discovered that the incident was the result of a collision between a Kubernetes core evolution and our Kyverno configuration. A textbook deadlock case.
Let’s break down the mechanism:
- The new ServiceCIDR Kind
In recent versions (v1.33+), Kubernetes migrates service IP range management to dedicated objects named ServiceCIDR. On the first boot after the upgrade, the API Server automatically tries to create the default object (e.g., 10.1.0.0/20).
For the curious, KEP-1880 and the official ServiceCIDR documentation detail this evolution.
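For reference, the auto-created default object looks roughly like this (a sketch: the CIDR is the one from our logs, and the default object is named kubernetes):

```yaml
# Default ServiceCIDR the API Server creates on first boot (v1.33+)
apiVersion: networking.k8s.io/v1
kind: ServiceCIDR
metadata:
  name: kubernetes        # reserved name for the default range
spec:
  cidrs:
    - 10.1.0.0/20         # our cluster's service IP range
```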
It’s new, it’s clean, it’s well designed. Except that…
- Interception by the Kyverno Webhook
Kyverno, configured with failurePolicy: Fail (because we’re serious people who don’t let just anything through in prod), is set up to intercept resource creations to validate them, and fail the call if Kyverno doesn’t respond.
Including the ServiceCIDR freshly created by the API Server itself.
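Concretely, the trap looks something like this excerpt of a ValidatingWebhookConfiguration (a sketch: the real object is generated and managed by Kyverno, and the service name/path here are assumptions; the webhook name matches our logs):

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: kyverno-resource-validating-webhook-cfg
webhooks:
  - name: validate.kyverno.svc-fail
    failurePolicy: Fail         # no answer from Kyverno => request denied
    rules:
      - apiGroups: ["*"]
        apiVersions: ["*"]
        resources: ["*"]        # broad match: ServiceCIDR is caught too
        operations: ["CREATE", "UPDATE"]
    clientConfig:
      service:                  # reached via the cluster service network...
        name: kyverno-svc       # (name/namespace assumed)
        namespace: kyverno
        path: /validate/fail
```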
- Deadlock
And this is where it gets beautiful:
- The API Server pauses the ServiceCIDR creation, waiting for Kyverno’s “OK”
- To contact the Kyverno service, the API Server needs to route the request through the Kubernetes service IP (typically 10.1.0.1)
- But the network layer (service routing) can’t initialize until the ServiceCIDR object is validated and created
- Profit.

It’s the chicken and the egg, “I locked my keys inside the car” edition. (PTSD. Yes, that actually happened to me. In the desert. With no cell service.)
The API Server times out or returns a connect: operation not permitted error when trying to reach the webhook, blocking its own initialization. CrashLoopBackOff on the API Server. :D
Breaking out of the deadlock
To escape this deadlock, you need to temporarily bypass the admission layer. Easy, right?
The “usual” workaround: useless
We’re used to Kyverno deadlocks at this point. Normally, since kube-system is ignored, you can simply connect with a break-glass kubeconfig (we normally use OIDC) that has the cluster-admin ClusterRole and delete the Kyverno validating webhooks:
kubectl delete validatingwebhookconfiguration kyverno-resource-validating-webhook-cfg
Except here, the API Server won’t even start. My kubectl isn’t going to work, obviously!
The real workaround: disable webhooks at boot
The solution we chose was to modify the API Server configuration to temporarily disable validation webhooks at startup. My esteemed colleague Maxime hot-edited the machine config (using break-glass talosctl access) to add the following flag directly in the API server’s extraArgs:
--disable-admission-plugins=ValidatingAdmissionWebhook
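In Talos machine config terms, that flag lands roughly here (a sketch of the relevant excerpt; the rest of the config is unchanged):

```yaml
# Talos machine config excerpt (control plane node)
cluster:
  apiServer:
    extraArgs:
      # Temporarily skip validating webhooks so the API Server can
      # bootstrap its default ServiceCIDR without calling Kyverno
      disable-admission-plugins: ValidatingAdmissionWebhook
```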
For those unfamiliar with admission control in Kubernetes, just know that there’s a list of “default” plugins but everything can be toggled off. I might do a deep dive on Kubernetes admission control someday, it’s fascinating ;).
With this flag, the API Server can finally create its default ServiceCIDR object without asking anyone for permission (completely bypassing all validation mechanisms that Kyverno or similar tools enforce), the network initializes, Kyverno starts, and then you can remove the flag and restart cleanly.
The “funny” option we didn’t try
Personally, I thought it would be hilarious to go directly into the etcd database and delete the webhook key causing the issue (also through talosctl):
# Example via etcdctl
etcdctl del /registry/admissionregistration.k8s.io/validatingwebhookconfigurations/kyverno-resource-validating-webhook-cfg
My colleagues were less enthusiastic: “Yeah but you know, if we break etcd it’s gonna be painful”. We played it safe with the flag. I’m deeply disappointed we didn’t try 😂.
The permanent fix: MatchConditions
OK, now that the cluster is back up, how do we make sure this doesn’t happen again on the next upgrade?
The clean solution is to use matchConditions (introduced in Kubernetes 1.27) on the ValidatingWebhookConfiguration. This allows you to exclude critical network bootstrap resources before the request even attempts to leave the API Server toward the Kyverno pod.
See the official documentation on matchConditions.
We were already using this option to throttle the sometimes excessive Kyverno traffic (if you manage Kyverno, you know what I’m talking about) on a number of events (we’d overwhelm the API server or Kyverno, in CPU or RAM, depending on the case). We just had to add exclusions for the new types:
# Exclude network bootstrap resources to prevent the deadlock
matchConditions:
- name: 'exclude-ServiceCIDR'
expression: '!(request.kind.kind == "ServiceCIDR")'
- name: 'exclude-IPAddress'
expression: '!(request.kind.kind == "IPAddress")'
With this, when the API Server creates a ServiceCIDR at boot, the request no longer goes through the Kyverno webhook. No circular dependency, no deadlock, everyone’s happy.
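For placement: matchConditions sits on each entry under webhooks[] in the ValidatingWebhookConfiguration, alongside failurePolicy. A sketch (in practice Kyverno manages this object, so the exclusions have to come from its configuration rather than a hand edit):

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: kyverno-resource-validating-webhook-cfg
webhooks:
  - name: validate.kyverno.svc-fail
    failurePolicy: Fail
    matchConditions:
      # CEL: requests for these kinds are filtered out before the
      # API Server ever tries to call the webhook over the network
      - name: 'exclude-ServiceCIDR'
        expression: '!(request.kind.kind == "ServiceCIDR")'
      - name: 'exclude-IPAddress'
        expression: '!(request.kind.kind == "IPAddress")'
```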
Conclusion
As the current French president would say about something that was painfully predictable:
“Who could have predicted this?”
OK fine, all we had to do was read the Kubernetes 1.33 release notes. That said, we have a staging cluster, that’s what it’s for. We broke staging, no big deal.

(Image: the “Big Deal” TV game show mascot.)
Maybe we’ll actually read them next time?
