<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Webhook on Zwindler's Reflection</title><link>https://blog.zwindler.fr/en/tags/webhook/</link><description>Recent content in Webhook on Zwindler's Reflection</description><generator>Hugo -- gohugo.io</generator><language>en</language><copyright>Licensed under CC BY-SA 4.0</copyright><lastBuildDate>Thu, 26 Feb 2026 08:00:00 +0200</lastBuildDate><atom:link href="https://blog.zwindler.fr/en/tags/webhook/index.xml" rel="self" type="application/rss+xml"/><item><title>Kyverno killed my API Server. Again.</title><link>https://blog.zwindler.fr/en/2026/02/26/kyverno-killed-my-api-server.-again./</link><pubDate>Thu, 26 Feb 2026 08:00:00 +0200</pubDate><guid>https://blog.zwindler.fr/en/2026/02/26/kyverno-killed-my-api-server.-again./</guid><description>&lt;img src="https://blog.zwindler.fr/2026/02/0_days_without_kyverno.webp" alt="Featured image of post Kyverno killed my API Server. Again." /&gt;&lt;h2 id="the-revenge-strikes-back"&gt;The revenge strikes back
&lt;/h2&gt;&lt;p&gt;You may have read &lt;a class="link" href="https://blog.zwindler.fr/en/2023/11/30/kubernetes-error-etcdserver-mvcc-database-space-exceeded/" &gt;my previous article about etcd from a bit over 2 years ago&lt;/a&gt;. If you remember it, what crashed was etcd, but the real culprit was Kyverno. I love Kyverno. It&amp;rsquo;s really a piece of software I&amp;rsquo;m very fond of, mainly because it&amp;rsquo;s incredibly powerful. I even wrote &lt;a class="link" href="https://blog.zwindler.fr/2022/08/01/vos-politiques-de-conformite-sur-kubernetes-avec-kyverno/" &gt;an introductory article&lt;/a&gt; and &lt;a class="link" href="https://blog.zwindler.fr/2022/09/05/vos-politiques-de-conformite-sur-kubernetes-avec-kyverno-part2/" &gt;a second one that goes deeper&lt;/a&gt; on the topic (both in French, though).&lt;/p&gt;
&lt;p&gt;But the sheer number of incidents and weird side effects it causes&amp;hellip; Mamma mia! This isn&amp;rsquo;t even the first incident I&amp;rsquo;ve had with Kyverno this year (yes, we&amp;rsquo;re in February), but since this one is entertaining, I&amp;rsquo;m sharing it with you.&lt;/p&gt;
&lt;p&gt;During a routine maintenance operation to upgrade a Kubernetes cluster to version &lt;strong&gt;1.34&lt;/strong&gt; (from 1.32), we ended up facing the dreaded scenario for any kube admin: a completely unreachable API Server after restarting the Control Plane nodes.&lt;/p&gt;
&lt;p&gt;What initially looked like a typical network error turned out to be a subtle &lt;strong&gt;deadlock&lt;/strong&gt; between new native Kubernetes networking features and our dear Kyverno 😘.&lt;/p&gt;
&lt;p&gt;Spoiler: it wasn&amp;rsquo;t a network issue. It&amp;rsquo;s never a network issue. Well, sometimes it is. But not this time.&lt;/p&gt;
&lt;h2 id="the-upgrade-that-starts-well"&gt;The upgrade that starts well
&lt;/h2&gt;&lt;p&gt;Alright, a Kubernetes upgrade has become pretty routine at this point. We do it regularly, we have our procedures, we&amp;rsquo;re pros (I swear). We jump from 1.32 to 1.34 in a single commit, skipping the hop through 1.33.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;YOLO.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In the environment I&amp;rsquo;m talking about, everything is managed as code, from machine provisioning all the way to Talos deployment, including MachineConfigs (the CustomResources used to modify&amp;hellip; well, the machine).&lt;/p&gt;
&lt;p&gt;For more details, see &lt;a class="link" href="https://docs.siderolabs.com/talos/v1.12/reference/configuration/v1alpha1/config#machineconfig" target="_blank" rel="noopener"
&gt;the Talos documentation on Machine Configs&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The first cluster we test has only one control plane (don&amp;rsquo;t ask me why, it probably wouldn&amp;rsquo;t have changed anything). Talos restarts the API Server with the new version and then&amp;hellip; nothing.&lt;/p&gt;
&lt;p&gt;The &amp;ldquo;weird&amp;rdquo; API Server logs (technical term) speak for themselves:&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-log" data-lang="log"&gt;I0224 15:32:50.979280 1 default_servicecidr_controller.go:166] Creating default ServiceCIDR with CIDRs: [10.1.0.0/20]
W0224 15:32:50.984784 1 dispatcher.go:225] rejected by webhook &amp;#34;validate.kyverno.svc-fail&amp;#34;:
admission webhook &amp;#34;validate.kyverno.svc-fail&amp;#34; denied the request:
Get &amp;#34;https://10.1.0.1:443/api&amp;#34;: dial tcp 10.1.0.1:443: connect: operation not permitted
I0224 15:32:50.985342 1 event.go:389] &amp;#34;Event occurred&amp;#34; kind=&amp;#34;ServiceCIDR&amp;#34;
apiVersion=&amp;#34;networking.k8s.io/v1&amp;#34; type=&amp;#34;Warning&amp;#34;
reason=&amp;#34;KubernetesDefaultServiceCIDRError&amp;#34;
message=&amp;#34;The default ServiceCIDR can not be created&amp;#34;
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;😬😬😬&lt;/p&gt;
&lt;h2 id="the-root-cause-a-magnificent-vicious-circle"&gt;The root cause: a magnificent vicious circle
&lt;/h2&gt;&lt;p&gt;After investigation, we discovered that the incident was the result of a collision between a Kubernetes core evolution and our Kyverno configuration. A textbook deadlock case.&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s break down the mechanism:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The new &lt;code&gt;ServiceCIDR&lt;/code&gt; Kind&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;In recent versions (v1.33+), Kubernetes migrates service IP range management to dedicated objects named &lt;code&gt;ServiceCIDR&lt;/code&gt;. On the first boot after the upgrade, the API Server automatically tries to create the default object (e.g., &lt;code&gt;10.1.0.0/20&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;For the curious, &lt;a class="link" href="https://github.com/kubernetes/enhancements/issues/1880" target="_blank" rel="noopener"
&gt;KEP-1880&lt;/a&gt; and the &lt;a class="link" href="https://kubernetes.io/docs/reference/kubernetes-api/cluster-resources/service-cidr-v1/" target="_blank" rel="noopener"
&gt;official ServiceCIDR documentation&lt;/a&gt; detail this evolution.&lt;/p&gt;
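&lt;p&gt;For reference, here is a minimal sketch of what the default object should look like once it exists, assuming the CIDR from the logs above and the &lt;code&gt;kubernetes&lt;/code&gt; name that the API Server gives to the default &lt;code&gt;ServiceCIDR&lt;/code&gt;:&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;# Sketch of the default ServiceCIDR (CIDR taken from the logs above)
apiVersion: networking.k8s.io/v1
kind: ServiceCIDR
metadata:
  name: kubernetes
spec:
  cidrs:
    - 10.1.0.0/20
&lt;/code&gt;&lt;/pre&gt;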
&lt;p&gt;It&amp;rsquo;s new, it&amp;rsquo;s clean, it&amp;rsquo;s well designed. Except that&amp;hellip;&lt;/p&gt;
&lt;ol start="2"&gt;
&lt;li&gt;Interception by the Kyverno Webhook&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Kyverno, configured with &lt;code&gt;failurePolicy: Fail&lt;/code&gt; (because we&amp;rsquo;re serious people who don&amp;rsquo;t let just anything through in prod), is set up to intercept resource creations to validate them, and &lt;strong&gt;fail the call if Kyverno doesn&amp;rsquo;t respond&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Including the &lt;code&gt;ServiceCIDR&lt;/code&gt; freshly created by the API Server itself.&lt;/p&gt;
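&lt;p&gt;To make this concrete, here is a simplified sketch of the kind of webhook entry involved. The configuration and webhook names come from the logs and commands in this article; the service name, namespace, path and wildcard rules are assumptions for illustration (Kyverno generates the real object from your policies, so yours will differ):&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;# Simplified sketch, not a dump of our actual configuration
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: kyverno-resource-validating-webhook-cfg
webhooks:
  - name: validate.kyverno.svc-fail
    # Reject the request if the webhook cannot be reached
    failurePolicy: Fail
    clientConfig:
      service:
        name: kyverno-svc      # assumed service name
        namespace: kyverno     # assumed namespace
        path: /validate/fail   # assumed path
    rules:
      # Assumed wildcard rule: this is what catches brand new kinds like ServiceCIDR
      - apiGroups: [&amp;#34;*&amp;#34;]
        apiVersions: [&amp;#34;*&amp;#34;]
        operations: [&amp;#34;CREATE&amp;#34;, &amp;#34;UPDATE&amp;#34;]
        resources: [&amp;#34;*/*&amp;#34;]
    sideEffects: NoneOnDryRun
    admissionReviewVersions: [&amp;#34;v1&amp;#34;]
&lt;/code&gt;&lt;/pre&gt;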
&lt;ol start="3"&gt;
&lt;li&gt;Deadlock&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;And this is where it gets beautiful:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The API Server pauses the &lt;code&gt;ServiceCIDR&lt;/code&gt; creation waiting for Kyverno&amp;rsquo;s &amp;ldquo;OK&amp;rdquo;&lt;/li&gt;
&lt;li&gt;To contact the Kyverno service, the API Server needs to route the request through the Kubernetes service IP (here &lt;code&gt;10.1.0.1&lt;/code&gt;, the first address of the service range)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;But&lt;/strong&gt; the network layer (service routing) can&amp;rsquo;t initialize until the &lt;code&gt;ServiceCIDR&lt;/code&gt; object is validated and created&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It&amp;rsquo;s the chicken and the egg, &amp;ldquo;I locked my keys inside the car&amp;rdquo; edition.&lt;/p&gt;
&lt;p&gt;PTSD. Yes, that actually happened to me. In the desert. With no cell service.&lt;/p&gt;
&lt;ol start="4"&gt;
&lt;li&gt;Profit.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The API Server times out or returns a &lt;code&gt;connect: operation not permitted&lt;/code&gt; error when trying to reach the webhook, blocking its own initialization. CrashLoopBackOff on the API Server. :D&lt;/p&gt;
&lt;h2 id="breaking-out-of-the-deadlock"&gt;Breaking out of the deadlock
&lt;/h2&gt;&lt;p&gt;To escape this deadlock, you need to temporarily bypass the admission layer. Easy, right?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The &amp;ldquo;usual&amp;rdquo; workaround: useless&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We&amp;rsquo;re used to deadlocks with Kyverno at this point. Normally, since &lt;code&gt;kube-system&lt;/code&gt; is ignored, you can simply connect with a break-glass kubeconfig (we normally use OIDC) that has the &lt;code&gt;cluster-admin&lt;/code&gt; cluster role and delete the Kyverno validating webhooks:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;kubectl delete validatingwebhookconfiguration kyverno-resource-validating-webhook-cfg
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;strong&gt;Except here&lt;/strong&gt;, the API Server won&amp;rsquo;t even start. My &lt;code&gt;kubectl&lt;/code&gt; isn&amp;rsquo;t going to work, obviously!&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The real workaround: disable webhooks at boot&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The solution we chose was to modify the API Server configuration to temporarily disable validating webhooks at startup. My esteemed colleague Maxime hot-edited the machine config (using break-glass &lt;code&gt;talosctl&lt;/code&gt; access) &lt;a class="link" href="https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#how-do-i-turn-off-an-admission-controller" target="_blank" rel="noopener"
&gt;to add the following flag&lt;/a&gt; directly in the API server&amp;rsquo;s &lt;code&gt;extraArgs&lt;/code&gt;:&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code&gt;--disable-admission-plugins=ValidatingAdmissionWebhook
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;For those unfamiliar with admission control in Kubernetes, just know that there&amp;rsquo;s a list of &amp;ldquo;default&amp;rdquo; plugins but everything can be toggled off. I might do a deep dive on Kubernetes admission control someday, it&amp;rsquo;s fascinating ;).&lt;/p&gt;
&lt;p&gt;With this flag, the API Server can finally create its &lt;code&gt;ServiceCIDR&lt;/code&gt; objects without asking anyone for permission (completely bypassing all validation mechanisms that Kyverno or similar tools &lt;em&gt;enforce&lt;/em&gt;), the network initializes, Kyverno starts, and then you can remove the flag and restart cleanly.&lt;/p&gt;
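&lt;p&gt;In Talos machine config terms, that flag would look roughly like the fragment below. Treat it as a sketch rather than the exact patch we applied: it assumes the standard &lt;code&gt;cluster.apiServer.extraArgs&lt;/code&gt; field, where keys are written without the leading dashes.&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;# Temporary patch (sketch): remove it once the cluster is healthy again
cluster:
  apiServer:
    extraArgs:
      # Ends up on the kube-apiserver command line as
      # --disable-admission-plugins=ValidatingAdmissionWebhook
      disable-admission-plugins: ValidatingAdmissionWebhook
&lt;/code&gt;&lt;/pre&gt;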
&lt;p&gt;&lt;strong&gt;The &amp;ldquo;funny&amp;rdquo; option we didn&amp;rsquo;t try&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Personally, I thought it would be hilarious to go directly into the etcd database and delete the webhook key causing the issue (also through &lt;code&gt;talosctl&lt;/code&gt;):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c1"&gt;# Example via etcdctl&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;etcdctl del /registry/admissionregistration.k8s.io/validatingwebhookconfigurations/kyverno-resource-validating-webhook-cfg
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;My colleagues were less enthusiastic: &amp;ldquo;Yeah but you know, if we break etcd it&amp;rsquo;s gonna be painful&amp;rdquo;. We played it safe with the flag. I&amp;rsquo;m deeply disappointed we didn&amp;rsquo;t try 😂.&lt;/p&gt;
&lt;h2 id="the-permanent-fix-matchconditions"&gt;The permanent fix: MatchConditions
&lt;/h2&gt;&lt;p&gt;OK, now that the cluster is back up, how do we make sure this doesn&amp;rsquo;t happen again on the next upgrade?&lt;/p&gt;
&lt;p&gt;The clean solution is to use &lt;code&gt;matchConditions&lt;/code&gt; (introduced as alpha in Kubernetes 1.27, stable since 1.30) on the &lt;code&gt;ValidatingWebhookConfiguration&lt;/code&gt;. This allows you to exclude critical network bootstrap resources &lt;strong&gt;before&lt;/strong&gt; the request even attempts to leave the API Server toward the Kyverno pod.&lt;/p&gt;
&lt;p&gt;See &lt;a class="link" href="https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#matching-requests-matchconditions" target="_blank" rel="noopener"
&gt;the official documentation on matchConditions&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We were already using this option to cut down the sometimes excessive traffic to Kyverno (if you manage Kyverno, you know what I&amp;rsquo;m talking about) on a number of events, which would otherwise overwhelm the API server or Kyverno, in CPU or RAM, depending on the case. We just had to add exclusions for the new types:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="c"&gt;# Exclude network bootstrap resources to prevent the deadlock&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="nt"&gt;matchConditions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;exclude-ServiceCIDR&amp;#39;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;expression&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;!(request.kind.kind == &amp;#34;ServiceCIDR&amp;#34;)&amp;#39;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;- &lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;exclude-IPAddress&amp;#39;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;expression&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;!(request.kind.kind == &amp;#34;IPAddress&amp;#34;)&amp;#39;&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;With this, when the API Server creates a &lt;code&gt;ServiceCIDR&lt;/code&gt; at boot, the request no longer goes through the Kyverno webhook. No circular dependency, no deadlock, everyone&amp;rsquo;s happy.&lt;/p&gt;
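&lt;p&gt;A stricter variant, as an untested sketch: matching on the API group and resource name instead of the kind alone avoids also excluding a same-named kind coming from some other API group. The condition name here is made up, and the expression only relies on the &lt;code&gt;request.resource&lt;/code&gt; fields of the admission request:&lt;/p&gt;
&lt;pre tabindex="0"&gt;&lt;code class="language-yaml" data-lang="yaml"&gt;# Sketch: call the webhook unless the request targets the networking bootstrap resources
matchConditions:
  - name: &amp;#39;exclude-network-bootstrap&amp;#39;
    expression: &amp;#39;request.resource.group != &amp;#34;networking.k8s.io&amp;#34; || !(request.resource.resource in [&amp;#34;servicecidrs&amp;#34;, &amp;#34;ipaddresses&amp;#34;])&amp;#39;
&lt;/code&gt;&lt;/pre&gt;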
&lt;h2 id="conclusion"&gt;Conclusion
&lt;/h2&gt;&lt;p&gt;As the current French president would say about something that was painfully predictable:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;who could have predicted this?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;OK fine, all we had to do was read the Kubernetes 1.33 release notes. That said, we have a staging cluster, that&amp;rsquo;s what it&amp;rsquo;s for. We broke staging, no big deal.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://blog.zwindler.fr/2026/02/bigdeal.avif"
loading="lazy"
&gt;&lt;/p&gt;
&lt;p&gt;Context: this is the &amp;ldquo;Big deal&amp;rdquo; TV Game Mascot&lt;/p&gt;
&lt;p&gt;Maybe we&amp;rsquo;ll actually read them next time?&lt;/p&gt;</description></item></channel></rss>