Kubernetes on Zwindler's Reflection

Kubernetes 1.34 - Pod-level resources - simplifying resource management when you have lots of containers

Thu, 13 Nov 2025 06:00:00 +0200

Introduction

Today, I’m going to talk about a new Kubernetes 1.34 feature that went beta: Pod-level resources.

And to illustrate this new feature, I’ll use the example of a pod for which I want to have Guaranteed QoS.

For a while, I was convinced (but I don’t know why… because the doc is clear 🤔) that to get Guaranteed QoS, all containers in the pod had to have limits=requests with the same values across all containers (which is false). And that’s super annoying, because sometimes you have a big main app and a tiny sidecar, and it doesn’t make sense to give them the same values.

If we reread the doc, what matters for Guaranteed QoS (official doc) is that:

Each container must have cpu.limit = cpu.request
Each container must have memory.limit = memory.request

But the values can be different between containers!

To convince yourself, just test this manifest:

apiVersion: v1
kind: Pod
metadata:
  name: ctrlevel-demo1
  namespace: pod-resources-example
spec:
  containers:
  - name: ctrlevel-demo1-ctr-1
    image: nginx
    resources:
      limits:
        cpu: "0.8"
        memory: "100Mi"
      requests:
        cpu: "0.8"
        memory: "100Mi"
  - name: ctrlevel-demo1-ctr-2
    image: fedora
    resources:
      limits:
        cpu: "0.2"
        memory: "200Mi"
      requests:
        cpu: "0.2"
        memory: "200Mi"
    command:
    - sleep
    - inf

And guess what? QoS Class: Guaranteed.

The first container has 0.8 CPU / 100Mi, the second has 0.2 CPU / 200Mi, and it’s no problem as long as each container individually respects limit=request for cpu and ram.
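To make the rule concrete, here is a minimal Python sketch of the container-level classification. It is a simplified model, not the kubelet's actual code: the function name `qos_class` is made up for illustration, quantities are compared as raw strings (so "0.5" and "500m" would wrongly count as different), and pod-level resources are ignored.

```python
def qos_class(containers):
    """Classify a pod from its containers' resources (simplified model)."""
    any_resources = False
    all_equal = True
    for ctr in containers:
        resources = ctr.get("resources", {})
        requests = resources.get("requests", {})
        limits = resources.get("limits", {})
        if requests or limits:
            any_resources = True
        for name in ("cpu", "memory"):
            limit = limits.get(name)
            # A missing request defaults to the limit when a limit is set.
            request = requests.get(name, limit)
            if limit is None or request != limit:
                all_equal = False
    if not any_resources:
        return "BestEffort"
    return "Guaranteed" if all_equal else "Burstable"


# Same values as the ctrlevel-demo1 manifest: different per container,
# but limit == request inside each container.
demo = [
    {"name": "ctrlevel-demo1-ctr-1",
     "resources": {"limits": {"cpu": "0.8", "memory": "100Mi"},
                   "requests": {"cpu": "0.8", "memory": "100Mi"}}},
    {"name": "ctrlevel-demo1-ctr-2",
     "resources": {"limits": {"cpu": "0.2", "memory": "200Mi"},
                   "requests": {"cpu": "0.2", "memory": "200Mi"}}},
]
print(qos_class(demo))  # Guaranteed
```

Dropping the resources of one container in `demo` makes the sketch return Burstable, and dropping them everywhere gives BestEffort.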

Well. Now that I’ve publicly admitted my mistake (shame shame shame), we’re still going to talk about Pod-level resources, because this feature remains useful.

So, what are Pod-level resources good for?

In some cases, containers (init, sidecar, …) can be very lightweight consumers. Some are also injected on the fly (service mesh, auto instrumentation) in a generalized way. Explicitly specifying limits / requests for these “small” ancillary containers can be tedious, time-consuming, difficult to tune if there are many…

Typical example: we inject a sidecar that just exposes Prometheus metrics. It consumes 10m of CPU and 20Mi of RAM. You could write:

containers:
- name: my-app
  resources:
    limits:
      cpu: "1"
      memory: "100Mi"
    requests:
      cpu: "1"
      memory: "100Mi"
- name: metrics-exporter
  resources:
    limits:
      cpu: "10m"
      memory: "20Mi"
    requests:
      cpu: "10m"
      memory: "20Mi"

But frankly, it’s a pain. And if tomorrow the exporter needs a bit more memory in some cases but not all, you have to modify the manifest to increase it everywhere, or handle the exception…

With Pod-level resources, you can do:

#file podlevel-demo1.yaml
apiVersion: v1
kind: Pod
metadata:
  name: podlevel-demo1
  namespace: pod-resources-example
spec:
  resources:
    limits:
      cpu: "1"
      memory: 100Mi
    requests:
      cpu: "1"
      memory: 100Mi
  initContainers:
  - name: sidecar-test
    image: busybox:latest
    command: ["sh", "-c", "while true; do sleep 3600; done"]
    restartPolicy: Always
  containers:
  - name: podlevel-demo1-ctr
    image: vish/stress
    args:
    - -cpus
    - "2"

Here:

The resources are declared at the Pod spec level
The containers themselves DO NOT have resource declarations
The sidecar (declared as an initContainer with restartPolicy: Always, see my previous article on sidecars) shares the Pod’s resources with the container.

In this example, it’s one container and one sidecar, but it could very well have been any other mix of containers, classic init containers, and sidecars.

Let’s test it!

kubectl create namespace pod-resources-example
kubectl apply -f podlevel-demo1.yaml

Now, let’s verify that we got Guaranteed QoS with this one-liner from hell 😈:

kubectl get pod podlevel-demo1 -n pod-resources-example -o jsonpath=$'Pod-level resources:\n Requests: {.spec.resources.requests}\n Limits: {.spec.resources.limits}\n\nContainer podlevel-demo1-ctr:\n Requests: {.spec.containers[0].resources.requests}\n Limits: {.spec.containers[0].resources.limits}\n\nSidecar sidecar-test:\n Requests: {.spec.initContainers[0].resources.requests}\n Limits: {.spec.initContainers[0].resources.limits}\n\nQoS Class: {.status.qosClass}\n'

Result:

Pod-level resources:
Requests: {"cpu":"1","memory":"100Mi"}
Limits: {"cpu":"1","memory":"100Mi"}
Container podlevel-demo1-ctr:
Requests:
Limits:
Sidecar sidecar-test:
Requests:
Limits:
QoS Class: Guaranteed

The individual containers DO NOT have declared resources, but the Pod as a whole has resources. And we still get Guaranteed QoS!

Everything works as expected.
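The result above suggests that, when pod-level resources are set, they are what decides the QoS class. Here is a small Python sketch of that reading. It is an assumption based on the observed output, not the real kubelet logic; the names `pod_qos` and `_equal_requests_and_limits` are hypothetical, and quantities are compared as raw strings.

```python
def _equal_requests_and_limits(resources):
    """True when cpu and memory both have a limit and request == limit."""
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})
    return all(
        limits.get(name) is not None
        and requests.get(name, limits.get(name)) == limits.get(name)
        for name in ("cpu", "memory")
    )


def pod_qos(spec):
    """Simplified QoS classification that looks at pod-level resources first."""
    pod_resources = spec.get("resources")
    if pod_resources:
        # Assumption: pod-level resources alone decide the class.
        if _equal_requests_and_limits(pod_resources):
            return "Guaranteed"
        return "Burstable"
    containers = spec.get("initContainers", []) + spec.get("containers", [])
    per_ctr = [c.get("resources", {}) for c in containers]
    if not any(r.get("requests") or r.get("limits") for r in per_ctr):
        return "BestEffort"
    if all(_equal_requests_and_limits(r) for r in per_ctr):
        return "Guaranteed"
    return "Burstable"


# Same shape as podlevel-demo1: resources on the pod, none on the containers.
spec = {
    "resources": {"limits": {"cpu": "1", "memory": "100Mi"},
                  "requests": {"cpu": "1", "memory": "100Mi"}},
    "initContainers": [{"name": "sidecar-test"}],
    "containers": [{"name": "podlevel-demo1-ctr"}],
}
print(pod_qos(spec))  # Guaranteed
```

Without the pod-level `resources` key, the same spec falls back to the per-container rule and comes out as BestEffort.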

You can also mix pod level and container level

Beyond this example, know that it’s possible to mix container level (classic) with pod level.

Here’s an example inspired by the official documentation:

#file podlevel-demo2.yaml
apiVersion: v1
kind: Pod
metadata:
  name: podlevel-demo2
  namespace: pod-resources-example
spec:
  resources:
    limits:
      cpu: "1"
      memory: "200Mi"
    requests:
      cpu: "1"
      memory: "200Mi"
  containers:
  - name: podlevel-demo2-ctr-1
    image: nginx
    resources:
      limits:
        cpu: "0.5"
        memory: "100Mi"
      requests:
        cpu: "0.5"
        memory: "100Mi"
  - name: podlevel-demo2-ctr-2
    image: fedora
    command:
    - sleep
    - inf

Here, the podlevel-demo2-ctr-1 container specifies its own resources, we also specify resources for the entire pod, and podlevel-demo2-ctr-2 specifies none at all.

For the record, this example pod is also Guaranteed, despite the absence of limits/requests on podlevel-demo2-ctr-2:

kubectl get pod podlevel-demo2 -n pod-resources-example -o jsonpath=$'Pod-level resources:\n Requests: {.spec.resources.requests}\n Limits: {.spec.resources.limits}\n\nContainer podlevel-demo2-ctr-1:\n Requests: {.spec.containers[0].resources.requests}\n Limits: {.spec.containers[0].resources.limits}\n\nContainer podlevel-demo2-ctr-2:\n Requests: {.spec.containers[1].resources.requests}\n Limits: {.spec.containers[1].resources.limits}\n\nQoS Class: {.status.qosClass}\n'

Result:

Pod-level resources:
Requests: {"cpu":"1","memory":"200Mi"}
Limits: {"cpu":"1","memory":"200Mi"}
Container podlevel-demo2-ctr-1:
Requests: {"cpu":"500m","memory":"100Mi"}
Limits: {"cpu":"500m","memory":"100Mi"}
Container podlevel-demo2-ctr-2:
Requests:
Limits:
QoS Class: Guaranteed
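One way to read this mixed example is as a budget: the pod declares 1 CPU / 200Mi, podlevel-demo2-ctr-1 reserves 0.5 CPU / 100Mi of it, and the rest is what remains for the container without declarations. The Python sketch below does that arithmetic. The helper names are hypothetical, only a few quantity suffixes are handled, and how the kubelet actually enforces the shares is not modelled.

```python
def cpu_millicores(quantity):
    """Parse a CPU quantity such as "0.5", "1" or "500m" into millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


_MEMORY_UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def memory_bytes(quantity):
    """Parse a memory quantity such as "100Mi" into bytes (few suffixes only)."""
    for suffix, factor in _MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[:-len(suffix)]) * factor
    return int(quantity)


def headroom(pod_limits, containers):
    """Pod-level budget minus what the containers explicitly request."""
    cpu = cpu_millicores(pod_limits["cpu"])
    mem = memory_bytes(pod_limits["memory"])
    for ctr in containers:
        requests = ctr.get("resources", {}).get("requests", {})
        cpu -= cpu_millicores(requests.get("cpu", "0"))
        mem -= memory_bytes(requests.get("memory", "0"))
    return cpu, mem


containers = [
    {"name": "podlevel-demo2-ctr-1",
     "resources": {"requests": {"cpu": "0.5", "memory": "100Mi"}}},
    {"name": "podlevel-demo2-ctr-2"},
]
cpu, mem = headroom({"cpu": "1", "memory": "200Mi"}, containers)
print(f"{cpu}m CPU, {mem // 1024**2}Mi")  # 500m CPU, 100Mi
```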

Conclusion

Pod-level resources are a feature that simplifies life in certain use cases, especially when using lightweight sidecars or containers injected on the fly.

Is it revolutionary? No. Will it change your life? Probably not. But it can always be useful.

We learn something new every day. Even (especially?) when we mess up. 😌


Cilium’s new policy log field: our use case

Mon, 03 Nov 2025 12:00:00 +0200

TL;DR

Cilium 1.18 added a log field to CiliumNetworkPolicies to tag flows with custom labels. Great for filtering out expected blocked traffic from your monitoring dashboards!

But there’s a catch, unrelated to this feature, that made this irrelevant in our use case: you can’t use it with egressDeny + toFQDNs.

AND there is a bug that makes the “log” field only visible on “allowed” traffic.

Here’s why we ran into this wall and what we learned.

The problem: monitoring all the things (but not too much)

Like any good ops team should, we monitor our Kubernetes cluster network flows using Hubble. We (mostly my colleague Nicolas Nativel) push all AUDIT and DROPPED flows to a dashboard so we can quickly spot when something’s blocked and decide:

Is this legitimate? → Open the flow
Is this suspicious? → Sound the alarm 🚨

This works pretty well… until you start explicitly blocking things that you know should be blocked.

In our case, we wanted to prevent a third-party application from phoning home with its “telemetry” (yeah, let’s call it that 😏). We’re talking about calls to external tracking domains.

The issue? If we just block these flows, they’ll show up as DROPPED in Hubble, trigger our monitoring, and we’ll end up with alerts for something we intentionally blocked.

That’s noise we don’t want.

Enter Cilium 1.18’s policy log field

Good news! Cilium 1.18 introduced exactly what we needed: the ability to add custom log fields to your network policies.

Check out the official announcement.

The idea is simple: you add a log field to your CiliumNetworkPolicy:

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: my-policy
spec:
  endpointSelector:
    matchLabels:
      app: my-app
  egress:
  - toFQDNs:
    - matchName: "example.com"
  log:
    value: "my-custom-log-tag"

Then, when you observe flows in Hubble, you can filter them out using CEL (Common Expression Language):

hubble observe \
 --verdict AUDIT \
 --not \
 --cel-expression "(_flow.policy_log.endsWith('my-custom-log-tag'))" \
 --print-raw-filters

Output:

allowlist:
- '{"verdict":["AUDIT"]}'
denylist:
- '{"experimental":{"cel_expression":["(_flow.policy_log.endsWith(''my-custom-log-tag''))"]}}'

Perfect! This is exactly what we need. We can now tag our “expected blocks” and exclude them from our monitoring.
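The filtering logic itself is simple. Here is a minimal Python sketch of it, applied to simplified flow records. The dictionaries below are hypothetical: their field names mirror the CEL expression, but they are not the real Hubble JSON schema, and the function name `unexpected_flows` is made up.

```python
def unexpected_flows(flows, expected_tag, verdicts=("AUDIT", "DROPPED")):
    """Keep blocked flows, except those tagged as expected by a policy log."""
    result = []
    for flow in flows:
        if flow.get("verdict") not in verdicts:
            continue  # only blocked/audited flows reach the dashboard
        if flow.get("policy_log", "").endswith(expected_tag):
            continue  # intentionally blocked: not worth an alert
        result.append(flow)
    return result


# Hypothetical sample records, for illustration only.
sample = [
    {"verdict": "AUDIT", "policy_log": "my-custom-log-tag",
     "destination": "example.com"},
    {"verdict": "AUDIT", "policy_log": "",
     "destination": "unknown.example.net"},
    {"verdict": "FORWARDED", "policy_log": "",
     "destination": "api.example.org"},
]
for flow in unexpected_flows(sample, "my-custom-log-tag"):
    print(flow["destination"])  # unknown.example.net
```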

The plan: block telemetry elegantly

Armed with this new feature, we crafted our strategy:

Use egressDeny to explicitly block telemetry domains
Add a custom log field: app-explicit-traffic-blocked
Configure Hubble to filter out flows with this tag
Profit! 🎉

Here’s what we tried:

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: app-external-block-policy
  namespace: my-namespace
spec:
  endpointSelector:
    matchLabels:
      app.kubernetes.io/name: my-app
  # note: egressDeny takes precedence over egress rules
  # https://docs.cilium.io/en/stable/security/policy/language/#deny-policies
  egressDeny:
  # Block all external traffic and log it with an arbitrary log field
  # This is used to prevent the app from sending telemetry data externally
  # without triggering an AUDIT/DROPPED alert
  # feature added in cilium 1.18.0 https://github.com/cilium/cilium/pull/39902
  - toFQDNs:
    - matchPattern: "*.telemetry.example.com"
    toPorts:
    - ports:
      - port: "443"
        protocol: TCP
      - port: "80"
        protocol: TCP
  log:
    value: "app-explicit-traffic-blocked"

This should work, right? We’re using egressDeny (which takes precedence over other allow rules, for legitimate calls, which is good!), and we’re tagging it with our custom log.

Reality check: you can’t have nice things

And then… patatra (as we say in French 🇫🇷).

While reading the Cilium documentation on deny policies, we stumbled upon this little gem:

Deny policies do not support:

policy enforcement at L7, i.e., specifically denying an URL

toFQDNs, i.e., specifically denying traffic to a specific domain name.

Wait, what?

You cannot use toFQDNs with egressDeny. Our entire plan just collapsed 😱.

Why this is a problem

The issue is the precedence model in Cilium:

egressDeny rules take precedence over egress rules (by design, and that’s good!)
But if we use egressDeny without toFQDNs, we have to block by IP or CIDR
These telemetry services probably use dynamic IPs for their endpoints (good luck maintaining a list…)
If we block all 80/443 traffic in egressDeny, we can’t make exceptions for legitimate traffic in egress rules because… deny takes precedence over allow!

We’re stuck between a rock and a hard place:

Use egress with toFQDNs → works, but we can’t deny, only allow other traffic
Use egressDeny with IPs → we’ll be playing whack-a-mole with rotating IP ranges
Use egressDeny to block all 80/443 → we block everything, including legitimate traffic
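
The third dead end can be illustrated with a toy model of the precedence rule in Python. This is not Cilium's policy engine: the rule format (a CIDR plus a set of ports) and the function names are invented for the example, and the IP ranges are documentation ranges.

```python
import ipaddress


def matches(rule, ip, port):
    """True when the destination falls inside the rule's CIDR and ports."""
    in_cidr = ipaddress.ip_address(ip) in ipaddress.ip_network(rule["cidr"])
    return in_cidr and port in rule["ports"]


def verdict(ip, port, allow_rules, deny_rules):
    """Deny rules are checked first, so an allow can never override them."""
    if any(matches(r, ip, port) for r in deny_rules):
        return "DROPPED"
    if any(matches(r, ip, port) for r in allow_rules):
        return "FORWARDED"
    return "DROPPED"  # default deny once a policy selects the endpoint


allow_legit = [{"cidr": "198.51.100.0/24", "ports": {443}}]

# Blanket deny on 80/443: the legitimate destination is dropped too.
deny_all_web = [{"cidr": "0.0.0.0/0", "ports": {80, 443}}]
print(verdict("198.51.100.10", 443, allow_legit, deny_all_web))  # DROPPED

# Narrow deny on a known telemetry range: legitimate traffic goes through.
deny_telemetry = [{"cidr": "203.0.113.0/24", "ports": {443}}]
print(verdict("198.51.100.10", 443, allow_legit, deny_telemetry))  # FORWARDED
print(verdict("203.0.113.5", 443, allow_legit, deny_telemetry))  # DROPPED
```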

Potential workarounds

While waiting for Cilium to support toFQDNs in egressDeny policies, here are some alternative approaches you might consider:

Find a way to disable telemetry in the app directly

That’s the best option but sadly not always on the table.

DNS-based blocking

Bend the DNS server to return NXDOMAIN for telemetry domains, like a personal Pi-hole server would do with ads. The application will fail to resolve the domain and won’t send data.

Use IP-based egressDeny (with maintenance overhead)

Resolve the telemetry FQDNs to their current IP ranges and block them with egressDeny:

egressDeny:
- toCIDRSet:
  - cidr: 203.0.113.0/24  # Example telemetry IP range
  toPorts:
  - ports:
    - port: "443"

If the list doesn’t evolve too often, this is a good option.
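To reduce the maintenance overhead, the rule could be regenerated from DNS on a schedule. Below is a rough Python sketch of that idea, under several assumptions: the function names are hypothetical, only IPv4 is handled, each address becomes a /32 entry, and wildcard patterns such as "*.telemetry.example.com" cannot be resolved, so you need concrete hostnames. The example runs with a stub resolver so that it does not depend on the network.

```python
import socket


def resolve_ipv4(hostname):
    """Return the sorted IPv4 addresses currently published for hostname."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return sorted({info[4][0] for info in infos})


def egress_deny_rule(hostnames, ports=("443",), resolve=resolve_ipv4):
    """Build one egressDeny entry (as a dict) from resolved addresses."""
    addresses = sorted({ip for host in hostnames for ip in resolve(host)})
    return {
        "toCIDRSet": [{"cidr": f"{ip}/32"} for ip in addresses],
        "toPorts": [{"ports": [{"port": p, "protocol": "TCP"} for p in ports]}],
    }


# Stub resolver with made-up hostnames and documentation IPs.
fake_dns = {
    "a.telemetry.example.com": ["203.0.113.7"],
    "b.telemetry.example.com": ["203.0.113.9", "203.0.113.7"],
}
rule = egress_deny_rule(fake_dns.keys(), resolve=lambda host: fake_dns[host])
print([entry["cidr"] for entry in rule["toCIDRSet"]])
# ['203.0.113.7/32', '203.0.113.9/32']
```

The resulting dict can then be dumped to YAML with the library of your choice and placed under egressDeny.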

Ok, but let’s assume there is no legitimate traffic. Can we use the feature to add a log on dropped traffic?

Sadly no, not right now.

There is a bug in this new Cilium feature that only logs the policy_log field on “allowed” flows, not on audit/dropped flows.

Policy log does not work for DROPPED/AUDIT flow

When defining a CiliumNetworkPolicy with the spec.log field configured, I expect the relevant hubble flows to have the policy_log field. It works for allowed flow.

But for denied/audited flow resulting from the rule (implicit or explicit), policy_log is never available.

Note: I observe the same issue with --print-policy-names option of hubble, the k8s:io.cilium.k8s.policy.derived-from label is not set for denied flows (but correctly set for allowed flows).

There is also a related issue, “[Hubble CLI] --print-policy-names flag does not do anything”, opened by someone else.

Since two issues are open and maintainers have started to acknowledge the problem, we can hope this will be fixed, though.

Conclusion

In the end, we didn’t use this new Cilium feature for our use case, but being able to add details to flows (and filter on them) is always nice.

A shout-out to my colleague Nicolas Nativel, who did most of the work around CiliumNetworkPolicies, including the dashboards, exploratory work on this feature, and took the time to create the issue on the Cilium repository.


93 ways to deploy Kubernetes: I've cataloged (almost) all existing methods

Sun, 02 Nov 2025 18:00:00 +0200

A slightly crazy documentary project

When I started writing my book “Kubernetes: 50 solutions for development workstations and production clusters”, I quickly realized a problem: there are an infinite number of ways to deploy Kubernetes.

Well, not literally infinite, but still… many. Too many?

To structure my book and choose which solutions I would cover, I did what any good nerd would do: I created a spreadsheet.

A very large spreadsheet.

A Google Sheet that currently lists 93 different methods for deploying Kubernetes. Since I don’t like to “waste”, I’m sharing it with you today under CC BY-SA 4.0 (Attribution - Share Alike):

93 ways to deploy Kubernetes

What does this spreadsheet contain?

The spreadsheet is structured with several columns to help you navigate this jungle:

Product name (and publisher when interesting)
Product URL: I tried to restrict to open source products (or public managed services) although there are a few exceptions
Solution type (I’ll come back to this)
“Based on”: this is quite funny… many projects are layers on top of kubeadm, k3s or k0s. Not all of them say it openly, and I realized it by trying them or digging in the code

And maybe a bit less interesting for some of you (maybe it will disappear)

Do I talk about it in my book?
Do I talk about it on my blog?

The different tool categories

To structure my thinking first, and then my book, I tried to classify these methods into categories.

Some are quite obvious (a managed offering, you can immediately see what it’s about), others, a bit more personal (and therefore debatable).

Kubernetes on desktop (Local Development)

Tools for developing locally on your machine. We’re talking about Minikube, kind, Docker Desktop and the like.

Infrastructure as Code (IaC)

Tools that allow you to describe Kubernetes deployment via code (opentofu, crossplane, pulumi…)

Kubernetes in Kubernetes

Because why make it simple when you can make it… recursive? 🤯. It’s currently limited to vCluster and k3k.

Specialized OSes

Operating systems designed specifically to run Kubernetes. I’m obviously thinking of Talos Linux, but not only ;-P.

Managed Kubernetes (turnkey cloud offerings)

No need to draw a picture, we immediately think of the EKS / AKS / GKE triplet, but also French solutions (OVHcloud Managed Kubernetes, soon Clever Cloud 😏)

Cluster management platforms

Here, it’s a somewhat separate category, which will allow us to manage many Kubernetes clusters and even generate new clusters managed by clusters… Often good Rube Goldberg machines like Gardener or worse Kubermatic Kubernetes Platform. We still have some slightly more fun things like Kamaji.

Automation tools for self-hosted

Solutions that automate deployment on your own machines, like kubeadm, k3s and k0s (the triplet, basis of about 50% of other market solutions).

The revelation: everyone copies from their neighbor

While filling in this spreadsheet, I discovered something funny: a large majority of tools don’t reinvent the wheel.

Many projects are actually layers or wrappers around three basic solutions I just mentioned (kubeadm, k3s and k0s). And sometimes, we also have layers on slightly more confidential solutions.

The information isn’t always available, I sometimes discovered it in a blog post, or even by digging into the solution’s internals.

This column is the interesting one: it shows that behind the apparent diversity of these deployment solutions, there’s actually one big standard and a few variations.

I hope to manage to find more similar clues to fill this column even more :).

A living document

This spreadsheet is not static. The number of tools evolves regularly (I discover new ones almost every week).

If you know a method that’s not listed, don’t hesitate to comment on social networks.

Note: I remind you that I primarily target open source solutions or public managed services (I just removed Mirantis Kubernetes Engine for this reason).

So, what do I still have to test?

I’ve already teased on reputable social networks, in the list of things I haven’t tested yet but that could motivate me, there are:

zeropod: scale-to-zero with container checkpointing

Fri, 20 Jun 2025 19:00:00 +0200

What is zeropod?

I kept the intro paragraph from the project documentation as is because I find it perfect. It says everything you need to know about the tool, neither too much nor too little.

Zeropod is a Kubernetes runtime (more specifically a containerd shim) that automatically checkpoints containers to disk after a certain amount of time of the last TCP connection.

While in scaled down state, it will listen on the same port the application inside the container was listening on and will restore the container on the first incoming connection.

Depending on the memory size of the checkpointed program this happens in tens to a few hundred milliseconds, virtually unnoticable to the user.

As all the memory contents are stored to disk during checkpointing, all state of the application is restored.

It adjusts resource requests in scaled down state in-place if the cluster supports it.

To prevent huge resource usage spikes when draining a node, scaled down pods can be migrated between nodes without needing to start up.

TL;DR: it will freeze your app if it doesn’t receive TCP calls, and restore it when a call arrives.
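
As a mental model only, the behaviour can be sketched as a tiny state machine in Python. This is not how zeropod is implemented (the real thing is a containerd shim that checkpoints the process to disk); the class and method names are invented, and time is passed in explicitly to keep the example deterministic.

```python
class IdleCheckpointer:
    """Toy model: freeze after an idle period, thaw on the next connection."""

    def __init__(self, scaledown_duration):
        self.scaledown_duration = scaledown_duration
        self.last_connection = 0.0
        self.running = True

    def tick(self, now):
        """Checkpoint the container once it has been idle long enough."""
        idle = now - self.last_connection
        if self.running and idle >= self.scaledown_duration:
            self.running = False
        return self.running

    def connect(self, now):
        """Handle an incoming TCP connection; True if a restore was needed."""
        restored = not self.running
        self.running = True
        self.last_connection = now
        return restored


ctr = IdleCheckpointer(scaledown_duration=10)
print(ctr.tick(5))      # True: still running, idle for 5s only
print(ctr.tick(10))     # False: idle for 10s, checkpointed
print(ctr.connect(12))  # True: this connection triggered a restore
print(ctr.connect(16))  # False: already running, nothing to restore
```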

If you want to understand in more detail HOW it really works, I invite you to read the “How it works” section of the official project documentation, which is quite clear:

https://github.com/ctrox/zeropod?tab=readme-ov-file#how-it-works

Prerequisites

Let’s not waste time, let’s dive into the experiment. As prerequisites, I needed:

an Ubuntu server with vanilla k3s (flannel + traefik, single node). If you don’t know how to install it, you can always check out my article on the subject.
cert-manager. Not necessarily required but I like having valid HTTPS certificates.

# Install cert-manager CRDs and namespace
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.15.0/cert-manager.yaml

# Wait for cert-manager to be ready
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=cert-manager -n cert-manager --timeout=60s
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=cainjector -n cert-manager --timeout=60s
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=webhook -n cert-manager --timeout=60s

ClusterIssuer configuration:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: your.email@example.org
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
    - http01:
        ingress:
          class: traefik

Also optional, for easier access, I modified the traefik service ports to 30080 and 30443 since I don’t have LoadBalancer Service support on this cluster:

kubectl get svc -A
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
cert-manager cert-manager ClusterIP 10.43.212.125 <none> 9402/TCP 21h
cert-manager cert-manager-webhook ClusterIP 10.43.134.29 <none> 443/TCP 21h
default kubernetes ClusterIP 10.43.0.1 <none> 443/TCP 21h
kube-system kube-dns ClusterIP 10.43.0.10 <none> 53/UDP,53/TCP,9153/TCP 21h
kube-system metrics-server ClusterIP 10.43.83.225 <none> 443/TCP 21h
kube-system traefik LoadBalancer 10.43.208.56 192.168.1.242 80:30080/TCP,443:30443/TCP 21h

Installing zeropod

Once we have our functional cluster with all the prerequisites, we can install zeropod. The first step “just” involves applying the following kustomize manifest, which will create a customized DaemonSet with the right paths so it can hook into / patch containerd.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: zeropod-node
  namespace: zeropod-system
spec:
  template:
    spec:
      volumes:
      - name: containerd-etc
        hostPath:
          path: /var/lib/rancher/k3s/agent/etc/containerd/
      - name: containerd-run
        hostPath:
          path: /run/k3s/containerd/
      - name: zeropod-opt
        hostPath:
          path: /var/lib/rancher/k3s/agent/containerd

kubectl apply -k https://github.com/ctrox/zeropod/config/k3s

Then we’ll label the Node and verify the controller pod is working:

kubectl label node zeropod zeropod.ctrox.dev/node=true
kubectl -n zeropod-system wait --for=condition=Ready pod -l app.kubernetes.io/name=zeropod-node

The documentation indicates that you need to restart the Node in the case of k3s because everything is packaged together in k3s (probably the same for k0s). This is probably not necessary for most “normal” distributions.

NAMESPACE NAME READY STATUS RESTARTS AGE
...
zeropod-system zeropod-node-qntzh 1/1 Running 1 (21h ago) 21h

Interesting point: zeropod will add its own runtimeClass (I’ll let you check out the Kubernetes documentation, it’s worth a look if you’re not familiar):

kubectl get runtimeclass
NAME HANDLER AGE
crun crun 21h
[...]
zeropod zeropod 21h

Deploying a WordPress application

What’s the best use case for Kubernetes?

Hosting a personal blog with WordPress and autoscaling, of course!! Everyone knows that.

Beyond the joke, the idea was to test a stateful application, preferably with a database, to see how far we can push the tool. Because one of the limitations of scale-to-zero tools in Kubernetes is precisely that they work great for stateless workloads (or FaaS), but it’s more complicated when you have state.

First thing to know: beyond the label we put on the node, it’s necessary to add 2 additional configuration points to our applications that we want to scale to zero.

The zeropod annotations; no need to explain what they do: we give it the port number, the container, and the duration after which, if I have no connection, I scale down:

annotations:
  zeropod.ctrox.dev/ports-map: "wordpress=80"
  zeropod.ctrox.dev/container-names: wordpress
  zeropod.ctrox.dev/scaledown-duration: 10s

The runtimeClass that must be defined:

runtimeClassName: zeropod

Application manifests

Here’s roughly what it could look like. It could be cleaner (Helm charts), but I went quick and dirty, which is enough for this PoC:

WordPress Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: php
spec:
  selector:
    matchLabels:
      app: php
  template:
    metadata:
      labels:
        app: php
      annotations:
        zeropod.ctrox.dev/ports-map: "wordpress=80"
        zeropod.ctrox.dev/container-names: wordpress
        zeropod.ctrox.dev/scaledown-duration: 10s
    spec:
      runtimeClassName: zeropod
      initContainers:
      - command:
        - sh
        - -c
        - |
          until mysql -h mysql -u root -p${MYSQL_ROOT_PASSWORD} -e "SELECT 1"; do
            echo "Waiting for MySQL to be ready..."
            sleep 5
          done
          echo "MySQL is ready!"
          mysql -h mysql -u root -p${MYSQL_ROOT_PASSWORD} -e "CREATE DATABASE IF NOT EXISTS wordpress;"
          mysql -h mysql -u root -p${MYSQL_ROOT_PASSWORD} -e "GRANT ALL PRIVILEGES ON wordpress.* TO 'root'@'%';"
        env:
        - name: MYSQL_ROOT_PASSWORD
          value: verySecurePassword
        image: mysql
        imagePullPolicy: IfNotPresent
        name: wait-for-mysql
      containers:
      - env:
        - name: WORDPRESS_DB_HOST
          value: mysql
        - name: WORDPRESS_DB_USER
          value: root
        - name: WORDPRESS_DB_PASSWORD
          value: verySecurePassword
        - name: WORDPRESS_DB_NAME
          value: wordpress
        image: wordpress:latest
        imagePullPolicy: Always
        name: wordpress
        ports:
        - containerPort: 80
          protocol: TCP

MySQL StatefulSet:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql
spec:
  selector:
    matchLabels:
      app: mysql
  serviceName: "mysql"
  replicas: 1
  template:
    metadata:
      labels:
        app: mysql
    spec:
      containers:
      - image: mysql
        name: mysql
        ports:
        - containerPort: 3306
        env:
        - name: MYSQL_ROOT_PASSWORD
          value: verySecurePassword
        volumeMounts:
        - name: data
          mountPath: /var/lib/mysql
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 5Gi

Services and Ingress:

apiVersion: v1
kind: Service
metadata:
  name: php
  labels:
    app: php
spec:
  ports:
  - port: 8080
    name: http
    targetPort: 80
  selector:
    app: php
---
apiVersion: v1
kind: Service
metadata:
  name: mysql
  labels:
    app: mysql
spec:
  ports:
  - port: 3306
    name: mysql
  clusterIP: None
  selector:
    app: mysql
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: zeropod-ingress
  namespace: default
  annotations:
    traefik.ingress.kubernetes.io/router.entrypoints: web,websecure
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  rules:
  - host: zeropod.example.org
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: php
            port:
              number: 8080
  tls:
  - hosts:
    - zeropod.example.org
    secretName: zeropod-tls

Observing the behavior

Once deployed, let’s quickly check the state of pods and services:

kubectl get pods
NAME READY STATUS RESTARTS AGE
mysql-0 1/1 Running 0 17h
php-dc7cb9cff-29hzb 1/1 Running 0 17h

kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
[...]
mysql ClusterIP None <none> 3306/TCP 21h
php ClusterIP 10.43.165.131 <none> 8080/TCP 21h

Detecting the absence of traffic

Shortly after deploying Apache PHP, zeropod notices that there hasn’t been a connection for a while and pauses the container.

The funny thing is that from Kubernetes’ point of view, nothing happened! The php-dc7cb9cff-29hzb pod still exists and is present in the Node’s “Non-terminated Pods” list:

kubectl get pods php-dc7cb9cff-29hzb
NAME READY STATUS RESTARTS AGE
php-dc7cb9cff-29hzb 1/1 Running 0 17h

kubectl describe nodes
[...]
Non-terminated Pods: (11 in total)
 Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
 --------- ---- ------------ ---------- --------------- ------------- ---
[...]
 default mysql-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 17h
 default php-dc7cb9cff-29hzb 0 (0%) 0 (0%) 0 (0%) 0 (0%) 17h
[...]

However, it no longer appears in the “top pods” metrics collected by metrics-server:

kubectl top pods
NAME CPU(cores) MEMORY(bytes)
mysql-0 4m 460Mi

And if we search for the apache2 process on the node, we won’t find it:

sudo ps -ef | grep apache2

Under the hood, zeropod logs

In the zeropod logs, we first notice the detection of a container eligible for scale to zero:

{"time":"2025-06-20T17:31:46.222502407Z","level":"INFO","msg":"subscribing to status events","sock":"/run/zeropod/s/5858a327ae5a70c2e12b5dad1e8320a4670ff11152476ad49605ebca5327f7d6.sock"}
{"time":"2025-06-20T17:31:47.572065537Z","level":"INFO","msg":"status event","component":"podlabeller","container":"wordpress","pod":"php-dc7cb9cff-29hzb","namespace":"default","phase":1}
{"time":"2025-06-20T17:31:47.5737556Z","level":"INFO","msg":"attaching redirector for sandbox","pid":64980,"links":["eth0","lo"]}

And after 10 seconds, since there hasn’t been a connection, zeropod shuts it down (“phase”:0):

{"time":"2025-06-20T17:31:57.932464147Z","level":"INFO","msg":"status event","component":"podlabeller","container":"wordpress","pod":"php-dc7cb9cff-29hzb","namespace":"default","phase":0}

Performance tests

When the process is checkpointed, we’ll see that curl still takes a bit of time to be served. Nothing dramatic, but still:

time curl https://zeropod.example.org:30443 -I
HTTP/2 200
content-type: text/html; charset=UTF-8
date: Fri, 20 Jun 2025 17:41:29 GMT
link: <https://zeropod.example.org:30443/wp-json/>; rel="https://api.w.org/"
server: Apache/2.4.62 (Debian)
x-powered-by: PHP/8.2.28

real 0m0.454s
user 0m0.052s
sys 0m0.010s

However, for all subsequent connections, the response times are decent for an empty (and unoptimized) WordPress:

time curl https://zeropod.example.org:30443 -I
HTTP/2 200
content-type: text/html; charset=UTF-8
date: Fri, 20 Jun 2025 17:41:42 GMT
link: <https://zeropod.example.org:30443/wp-json/>; rel="https://api.w.org/"
server: Apache/2.4.62 (Debian)
x-powered-by: PHP/8.2.28

real 0m0.088s
user 0m0.053s
sys 0m0.008s

So, it works!

The first connection takes a bit longer, while the eBPF program that “listens” to traffic while the container is down turns it back on and hands over control (here +350-400ms on a small, cheap, and quite loaded VM).
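The overhead can be read directly from the two `real` timings above. A trivial Python sketch of the arithmetic (the function name is made up, and a single pair of measurements is obviously not a benchmark):

```python
def restore_overhead_ms(cold_seconds, warm_seconds):
    """Extra latency of the first request, in milliseconds."""
    return round((cold_seconds - warm_seconds) * 1000)


# "real" values from the two curl runs: checkpointed, then already restored.
print(restore_overhead_ms(0.454, 0.088))  # 366
```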

Once the container is restored, we can actually see the apache2 processes reappear with a ps on the Node:

ps -ef |grep apache2
root 67038 64955 1 18:56 ? 00:00:00 apache2 -DFOREGROUND
www-data 67055 67038 0 18:56 ? 00:00:00 apache2 -DFOREGROUND
www-data 67056 67038 0 18:56 ? 00:00:00 apache2 -DFOREGROUND
www-data 67057 67038 0 18:56 ? 00:00:00 apache2 -DFOREGROUND
www-data 67058 67038 0 18:56 ? 00:00:00 apache2 -DFOREGROUND
www-data 67059 67038 0 18:56 ? 00:00:00 apache2 -DFOREGROUND
www-data 67060 67038 0 18:56 ? 00:00:00 apache2 -DFOREGROUND

Going further: testing with MySQL

Well, on the other hand, scaling an http server to 0 is fun, but it’s not revolutionary. There are already scale-to-zero solutions on the market in Kubernetes, especially for FaaS or stateless workloads. But what if we push the experiment all the way?

We rarely want to scale a database to 0 in real life. Stopping it and then restarting it can take time, and potentially cause errors in our apps if the scaling is done poorly.

However, with zeropod, the principle is a bit different, since we’re not really going to stop the process, just freeze it.

For science (don’t do this in prod), I therefore added zeropod to the MySQL database too!

As with wordpress, we add the annotations and runtimeClass:

template:
  metadata:
    labels:
      app: mysql
    annotations:
      zeropod.ctrox.dev/ports-map: "mysql=3306"
      zeropod.ctrox.dev/container-names: mysql
      zeropod.ctrox.dev/scaledown-duration: 10s
  spec:
    runtimeClassName: zeropod
    containers:
    # ...

After a few seconds, zeropod moves the mysql database to phase “0”:

{"time":"2025-06-20T18:06:14.92372877Z","level":"INFO","msg":"status event","component":"podlabeller","container":"mysql","pod":"mysql-0","namespace":"default","phase":1}
{"time":"2025-06-20T18:06:14.925316023Z","level":"INFO","msg":"attaching redirector for sandbox","pid":69570,"links":["eth0","lo"]}
{"time":"2025-06-20T18:06:25.766097339Z","level":"INFO","msg":"status event","component":"podlabeller","container":"mysql","pod":"mysql-0","namespace":"default","phase":0}

kubectl top pods no longer reports any pods (and returns an error…):

kubectl top pods
error: Metrics not available for pod default/php-dc7cb9cff-29hzb, age: 35m59.026573861s

But the pods remain visible and “Running” from Kubernetes’ point of view:

kubectl get pods
NAME READY STATUS RESTARTS AGE
mysql-0 1/1 Running 0 2m30s
php-dc7cb9cff-29hzb 1/1 Running 0 36m

Final test: cascading wake-up

The final test: will an HTTP call to php wake it up, and will the resulting connection to the mysql database wake that one up too?

Drum roll

time curl https://zeropod.example.org:30443 -I
HTTP/2 200
content-type: text/html; charset=UTF-8
date: Fri, 20 Jun 2025 18:09:24 GMT
link: <https://zeropod.example.org:30443/wp-json/>; rel="https://api.w.org/"
server: Apache/2.4.62 (Debian)
x-powered-by: PHP/8.2.28

real 0m0.978s
user 0m0.053s
sys 0m0.008s

Victory!!

Limitations

Beyond this somewhat silly example (who hasn’t wanted to host a wordpress on Kubernetes with scale to zero?), we realize that the technology “works” but remains a bit flaky.

I had several cases where scale to zero happened on the php pod, while I was running a loop with “while true; do curl” (maybe related to the ingress -> service -> pod chain?). And the checkpointing time is still visible on my test VM (400 ms per container, that’s not nothing).

One point that is not addressed in the project documentation is that it’s almost impossible to have proper liveness / readiness probes when you use zeropod.

If you put a liveness probe on a web service, each probe hits the TCP port that zeropod listens on and restores the app, so it never stays checkpointed. Same thing with a readiness probe.

And if you plan to get around this with a liveness / readiness probe that doesn’t trigger an HTTP call on the port monitored by zeropod, you’ll end up with an app seen by Kubernetes as (respectively) KO or “Not ready”, since the container won’t be available because it’s checkpointed.

There is an open issue on the subject: github.com/ctrox/zeropod/issues/34

Technically, zeropod is very fun and quite clever, I’m quite impressed.

But I don’t really see in what world we would want to have containers in prod without liveness / readiness probes, so I’m quite skeptical about using this technology as is, except for very non-critical examples. This limitation seems too big to me.

Recompile Mimir’s "MetaMonitoring" Grafana Dashboards for Kubernetes

Thu, 12 Dec 2024 18:00:00 +0200

Context

When working on observability, there is one tool that always comes to mind first: Grafana.

Grafana is an open source visualization tool developed by Grafana Labs, and I’m sure you all know it (I have also written about it in French quite a few times). Beyond Grafana itself, they develop a lot of other useful tools in the observability landscape, to the point that you can in theory build your whole o11y stack with Grafana Labs tools only.

To address Prometheus’ lack of long-term storage and high-availability features (I have NEVER understood why the Prometheus team refuses to work on this), Grafana Labs forked Cortex a few years back and renamed it Mimir.

I won’t cover the installation of Mimir here; there are plenty of tutorials on the Internet, as well as official documentation.

Instead, I’ll talk about an issue that I have with the official Mimir helm chart, and more precisely with the built-in Grafana dashboards that come along with it.

Dashboards, you say?

Mimir is shipped with a lot of useful Grafana dashboards to help ensure that the components are running fine.

These dashboards are compatible with the various deployment modes of Mimir. In Kubernetes, if you use the mimir-distributed official helm chart this can be enabled by a simple value:

metaMonitoring:
  dashboards:
    enabled: true

But, by default, all dashboards installed using the metaMonitoring value in the mimir helm chart are JSON manifests precompiled from jsonnet mixins.

For example, here is the precompiled version of the “Mimir / Overview dashboard”:

github.com/grafana/mimir/blob/2640b8f72127548e9e3da281a763476b03fb4aae/operations/mimir-mixin-compiled/dashboards/mimir-overview.json

By design, you can’t change things like the prefix of the mimir pod names, which makes these precompiled dashboards useless in a helm-like environment where the release name (mimir-) prefixes every pod name.

kubectl -n monitoring get pods
NAME READY STATUS RESTARTS AGE
mimir-alertmanager-0 1/1 Running 0 24h
mimir-alertmanager-1 1/1 Running 0 24h
mimir-compactor-0 1/1 Running 0 24h
mimir-distributor-5d668b479f-ksltr 1/1 Running 1 (24h ago) 6d1h
...

In this case, all dashboards will be broken, all showing “no data” in Grafana because data will be incorrectly filtered. For example, the “Write requests / sec” panel in “Mimir / Overview dashboard”, has a label job=~"($namespace)/((distributor..., but our pod is mimir-distributor, not distributor:

sum by (status) (
label_replace(label_replace(rate(cortex_request_duration_seconds_count{cluster=~"$cluster", job=~"($namespace)/((distributor.*|cortex|mimir|mimir-write.*))", route=~"/distributor.Distributor/Push|/httpgrpc.*|api_(v1|prom)_push|otlp_v1_metrics"}[$__rate_interval]),
"status", "${1}xx", "status_code", "([0-9]).."),
"status", "${1}", "status_code", "([a-zA-Z]+)"))

The solution is to disable the metaMonitoring flag from the chart, and build / ship the dashboards separately.

Procedure

Get the Mimir sources:

git clone https://github.com/grafana/mimir.git

Fortunately, the jsonnet/mixin files include a job_prefix variable that will help us fix this:

sed -i.bak "s/job_prefix: '(\$namespace)\/',/job_prefix: '(\$namespace)\/mimir-',/" operations/mimir-mixin/config.libsonnet
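
To see what this prefix changes, here is a small Python sketch of the job matcher before and after the sed. It rests on a few assumptions of mine: that Grafana resolves $namespace to "monitoring", that the job label looks like "monitoring/mimir-distributor", and that the mixin simply concatenates job_prefix in front of the component alternation. PromQL regex matchers are fully anchored, which re.fullmatch reproduces.

```python
import re


def job_matches(pattern: str, job: str) -> bool:
    """Emulate a PromQL =~ matcher, which must match the whole label value."""
    return re.fullmatch(pattern, job) is not None


# Assumed values: $namespace resolved to "monitoring", and a job label
# shaped like "<namespace>/<release>-<component>".
job = "monitoring/mimir-distributor"
components = "((distributor.*|cortex|mimir|mimir-write.*))"

default_pattern = "(monitoring)/" + components        # job_prefix: '($namespace)/'
patched_pattern = "(monitoring)/mimir-" + components  # job_prefix: '($namespace)/mimir-'

print(job_matches(default_pattern, job))  # False
print(job_matches(patched_pattern, job))  # True
```

With the default prefix, the only alternative that gets close is "mimir", and it leaves "-distributor" unmatched, hence the “no data” panels.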

Rebuild the dashboards

make build-mixin

podman image inspect grafana/mimir-build-image:pr9491-80f5778956 >/dev/null 2>&1 || podman pull grafana/mimir-build-image:pr9491-80f5778956
podman tag grafana/mimir-build-image:pr9491-80f5778956 grafana/mimir-build-image:latest
[...]
make: Leaving directory '/go/src/github.com/grafana/mimir'
 10,10 real 0,02 user 0,01 sys

Note: If you don’t have docker on your machine (I use podman), the make command will fail because it can’t find docker and the docker binary is hardcoded in the make commands. Modify the Makefile to replace docker by podman.

The json files in operations/mimir-mixin-compiled/dashboards are now built with the correct pod names.

Create a grafana-dashboards helm chart (called yourDashboardsChart here).

helm create yourDashboardsChart

In this chart, create a src/dashboards/mimir directory (for json dashboard sources) alongside the classic templates directory containing the actual go-templated YAML manifests. We will create the gotemplate helm files just after:

cp operations/mimir-mixin-compiled/dashboards/* ../yourDashboardsChart/src/dashboards/mimir

Now, for each json file generated by jsonnet, we are going to create a helm gotemplated yaml file, which in turn will create a ConfigMap for each dashboard in our Kubernetes cluster. They will look like this:

---
# Source: mimir-distributed/templates/metamonitoring/grafana-dashboards.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: mimir-alertmanager-dashboard
  namespace: '{{ $.Release.Namespace }}'
  labels:
    grafana_dashboard: "1"
  annotations:
    k8s-sidecar-target-directory: /tmp/dashboards/Mimir Dashboards
data:
  mimir-alertmanager.json: |-
    {{ $.Files.Get "src/dashboards/mimir/mimir-alertmanager.json" | fromJson | toJson }}

To speed up the process, you can reuse the helm template and a few bash commands to generate all the helm gotemplate files for you:

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

mkdir -p mimir

helm -n monitoring template mimir grafana/mimir-distributed --set metaMonitoring.dashboards.enabled=true > helm-output.yaml

# Count the number of document separators
doc_count=$(grep -c '^---$' helm-output.yaml)

# Split the YAML file into separate files for each document
csplit -f mimir/helm-output- helm-output.yaml '/^---$/' "{$((doc_count - 2))}" >/dev/null

# Some trimming/cleaning: keep only the dashboard ConfigMaps
for file in mimir/helm-output-*; do
  if grep -q 'kind: ConfigMap' "$file" && grep -q 'dashboard' "$file"; then
    name=$(yq eval '.metadata.name' "$file")
    # Drop the labels added by the mimir-distributed chart
    yq eval -i 'del(.metadata.labels."helm.sh/chart", .metadata.labels."app.kubernetes.io/name", .metadata.labels."app.kubernetes.io/instance", .metadata.labels."app.kubernetes.io/version", .metadata.labels."app.kubernetes.io/managed-by")' "$file"
    yq eval -i '.metadata.namespace = "{{ $.Release.Namespace }}"' "$file"
    # Replace the inlined JSON with a reference to our recompiled dashboard
    yq eval -i '.data |= with_entries(.value = "{{ $.Files.Get \"src/dashboards/mimir/" + .key + "\" | fromJson | toJson }}")' "$file"
    mv "$file" "mimir/${name}.yaml"
  else
    rm "$file"
  fi
done

mkdir -p ../yourDashboardsChart/templates/mimir
mv mimir/* ../yourDashboardsChart/templates/mimir

Now, you should have all the files to re-generate the grafana dashboard in your Kubernetes cluster, with the correct prefix.

Enjoy!

Kubernetes resource optimization with horizontal pod autoscaling via custom metrics and Prometheus Adapter

Fri, 11 Oct 2024 10:00:00 +0200

Introduction

Note: this article was originally co-written with my former colleague Gaby Fulchic, aka Weeking, and posted on my previous employer’s corporate blog. This is a more personal / less corporate version, which is also available in French.

If you’ve been living in a cave for the last 10 years and have never heard of Kubernetes, well… I invite you to check out my other articles on the subject (in French) xD.

In this article, I wanted to dig into a feature that’s been around for a while, but that I’ve rarely needed: HorizontalPodAutoscalers, particularly through the use of custom metrics.

Ready to scale? Let’s go!

What the heck is horizontal pod autoscaling (HPA)???

The HorizontalPodAutoscaler is a Kubernetes feature. It lets you define target values for given metrics on a group of Pods, which Kubernetes then tries to reach by adjusting the number of replicas. The most basic use of this feature is, you guessed it, to “scale” Pods based on basic metrics, for example CPU consumption.

Like everything in Kubernetes, it’s an API (currently autoscaling/v2). The simplest way to interact with it is to create a YAML manifest file where you describe the desired state of your application based on load.

By default, only simple metrics, CPU and memory consumption (those collected by metrics-server) are available to specify scaling rules.

A simple example could look like this:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
  namespace: mynamespace
spec:
  maxReplicas: 6
  metrics:
  - resource:
      name: cpu
      target:
        averageUtilization: 50
        type: Utilization
    type: Resource
  minReplicas: 2
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp

Note: HPA specifications can be incredibly more complex and powerful, as the API has been significantly enriched over the years. Don’t hesitate to read the official documentation ;-).

Based on the previous example, once the manifest is applied, Kubernetes will try to maintain an average CPU utilization around 50% across all “myapp” pods and will add replicas if the average CPU consumption exceeds this threshold. As soon as CPU consumption drops below the target, Kubernetes reduces the number of replicas, down to the minimum number if necessary.
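
To make this more concrete, here is a rough Python sketch of the replica calculation described in the Kubernetes HPA documentation: desired = ceil(current replicas × current value / target value), with no change when the ratio is within a tolerance (10% by default). The function name and parameters are mine, and the sketch deliberately ignores stabilization windows, pods that aren’t ready and missing metrics.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas, max_replicas, tolerance=0.1):
    """Simplified HPA calculation (hypothetical helper, not the real controller)."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current replica count.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Always stay between minReplicas and maxReplicas.
    return max(min_replicas, min(max_replicas, desired))


# Values from the manifest above: target 50% CPU, between 2 and 6 replicas.
print(desired_replicas(2, 90, 50, 2, 6))   # 4
print(desired_replicas(4, 20, 50, 2, 6))   # 2
print(desired_replicas(2, 52, 50, 2, 6))   # 2
print(desired_replicas(3, 300, 50, 2, 6))  # 6
```

The third call shows the tolerance at work: 52% is close enough to 50% that nothing happens. The last one shows the clamp to maxReplicas.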

Well, that’s the theory. But from experience, using HPA this way has limitations:

Modern applications often have complex performance characteristics that are imperfectly described by CPU and RAM usage alone. For example, an application may be limited by input/output (I/O). Other factors like request latency, business metrics, or indicators on external dependencies (the number of messages in a queue, for example) can provide a better basis for scaling decisions.
CPU usage can also be very high during the “boot” of a new pod, which can lead to more scaling than necessary (we talk about boot storms on more traditional infrastructures, I find the term appropriate).

The HorizontalPodAutoscaler reacts to metrics retrieved at a given moment, which means there can be a lag between the metric peak and the scaling response. This can lead to temporary performance degradation. Finding metrics that allow anticipating the need for scaling rather than reacting after degradation is therefore the objective to keep in mind to improve the reliability of our apps.

To address these limitations, Kubernetes allows the use of custom metrics, offering greater flexibility and better control over application scaling behavior. This is where tools like Prometheus and Prometheus Adapter come in: they make better-suited, more effective autoscaling strategies possible.

Prometheus and metrics via `/-/metrics`

Like Kubernetes, Prometheus is another major project under the CNCF umbrella. It’s a metrics collection tool with a time series database (TSDB) optimized for storing infrastructure metrics, and a query language that makes deep analyses of these metrics both easy and powerful. Again, I’ve already written several articles on the subject (in French).

Generally, we classify monitoring tools into two broad categories: those that receive metrics from clients that “push” them, and those that periodically “pull” metrics from the applications themselves. Prometheus uses the “pull” strategy (most of the time) and collects our metrics at a regular interval (30 seconds in our case; Prometheus’ built-in default is 1 minute).

This means you don’t need to install an “agent” on your applications BUT you must give Prometheus a list of “targets” that expose HTTP endpoints (your applications) serving metrics in a specific format, by convention on the path /metrics (our application serves them on /-/metrics):

$ kubectl -n mynamespace port-forward myapp-5584c5c8f8-gbsw8 3000
Forwarding from 127.0.0.1:3000 -> 3000
Forwarding from [::1]:3000 -> 3000

# in another terminal
$ curl localhost:3000/-/metrics/ 2> /dev/null | head
# HELP http_request_duration_seconds duration histogram of http responses labeled with: status_code method path
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.0002",status_code="200",method="GET",path="/-/health"} 0
http_request_duration_seconds_bucket{le="0.0005",status_code="200",method="GET",path="/-/health"} 364
[...]
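
If you’ve never looked at this text format up close, here is a minimal, stdlib-only Python sketch of what an application has to produce. The metric name demo_requests_total, the helper names and the handler are made up for the illustration; in a real application you would rather use an official Prometheus client library.

```python
from http.server import BaseHTTPRequestHandler


def escape_label_value(value: str) -> str:
    """Escape backslashes, double quotes and newlines in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name, help_text, samples):
    """Render one counter; samples is a list of (labels_dict, value) pairs."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(
            f'{key}="{escape_label_value(val)}"' for key, val in labels.items()
        )
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Hypothetical handler serving the metrics on /-/metrics."""

    def do_GET(self):
        if self.path.rstrip("/") != "/-/metrics":
            self.send_error(404)
            return
        body = render_counter(
            "demo_requests_total", "Total requests",
            [({"method": "GET", "path": "/-/health"}, 364)],
        ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

To try it locally, something like HTTPServer(("127.0.0.1", 3000), MetricsHandler).serve_forever() (from http.server) is enough, and the curl above should then return the rendered lines.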

We can then query Prometheus to get these metrics by specifying the metric name and adding labels to specify which subset we’re interested in:

http_request_duration_seconds_bucket{kubernetes_namespace="mynamespace", kubernetes_pod_name="myapp-5584c5c8f8-gbsw8"}

I’ll assume you have a Prometheus available on your cluster for the rest of the article (otherwise, check out the Prometheus Operator).

Thanks to Prometheus, we now have a multitude of metrics to choose from to predict whether our applications should be scaled proactively. However, the problem is that we can’t tell the HPA to directly monitor these metrics, because the HPA isn’t directly compatible with the Prometheus query language PromQL.

Prometheus Adapter to the rescue

So we need another tool that will retrieve metrics from Prometheus and provide them to Kubernetes. You’ve guessed which software it is now: Prometheus Adapter.

We’ll install it from a Helm chart hosted on the prometheus-community repository:

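$ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
$ helm repo update

$ helm show values prometheus-community/prometheus-adapter > values.yaml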

$ helm install -n monitoring prometheus-adapter prometheus-community/prometheus-adapter -f values.yaml

$ kubectl -n monitoring get deployments.apps
NAME READY UP-TO-DATE AVAILABLE AGE
metrics-server 2/2 2 2 1d
prom-operator-kube-state-metrics 1/1 1 1 1d
prom-operator-operator 1/1 1 1 1d
prometheus-adapter 1/1 1 1 1d
prom-operator-query 3/3 3 3 1d

In this example, you can see that I’ve already deployed metrics-server and Prometheus using Prometheus Operator, and that Prometheus Adapter is running.

By default, Prometheus Adapter will be deployed with certain custom metrics that we can use “out of the box” to scale our applications more precisely. But in this article, we’re going to show you how to create ✨ your own ✨ metrics.

Configuring Prometheus Adapter to expose custom metrics via the API Server

Initially, the Prometheus Adapter configuration contains no rules. This means no custom metrics are exposed via the API Server at the start, and HPA can’t use custom metrics.

Prometheus Adapter works in this order:

Discover metrics by contacting Prometheus
Associate them with Kubernetes resources (namespace, pod, etc.)
Check how to expose them (if necessary, it can rename metrics)
Check how to query Prometheus to get actual values (e.g., compute a “rate”).

First, we need to validate the custom metrics API on our cluster. The resources list will be empty, but this proves that the custom.metrics.k8s.io/v1beta1 API is accessible.

$ kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq
{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "custom.metrics.k8s.io/v1beta1",
  "resources": []
}

Several steps are then needed for Prometheus Adapter to collect and provide metrics to the Kubernetes API server. All Prometheus Adapter configuration can be adjusted via the Helm chart’s values.yaml file.

The first thing to configure here is “where” Prometheus Adapter can contact Prometheus. If Prometheus Adapter is running on the same cluster as the Prometheus stack, you can use an internal (to Kubernetes) DNS record as below (see Kubernetes DNS documentation for services and pods). Otherwise, you can specify an IP address (or DNS name) and port number.

# values.yaml
prometheus:
  url: http://prom-operator-query.monitoring.svc
  port: 9090
  path: ""

To verify that Prometheus Adapter can properly contact Prometheus, you just need to check the pod logs (using kubectl logs pod/prometheus-adapter-abcdefgh-ijklm or any other means at your disposal to read pod logs).

Once this part is operational, we need to add some rules to our Prometheus Adapter.

In this example, I chose to use a metric called ELU (for “Event Loop Utilization”) collected from a Node.js server. It measures how much time the Node.js event loop is busy processing events versus being idle, and it’s more representative of server load than simple CPU percentage.

Rules allow us to specify what to query in Prometheus. We can define which labels to import and, if necessary, replace them to match Kubernetes resource names. Here are the most useful values to specify:

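seriesQuery: the Prometheus series selector used to discover which series the rule applies to, possibly filtered on labels
resources: maps time series labels to Kubernetes resources
name: exposes time series under names different from the originals (an empty “as” keeps the matched name)
metricsQuery: the query template used to ask Prometheus for actual values, for example a rate (<<.GroupBy>> means “group by Pod” by default)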

# values.yaml
rules:
  default: false
  custom:
  - seriesQuery: 'elu_utilization{kubernetes_namespace!="",kubernetes_pod_name!=""}'
    resources:
      overrides:
        kubernetes_namespace: {resource: "namespace"}
        kubernetes_pod_name: {resource: "pod"}
    name:
      matches: ^elu_utilization$
      as: ""
    metricsQuery: sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)
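
The metricsQuery template is easier to grasp once expanded. The Python sketch below only mimics the substitution with plain string replacement (the adapter itself uses Go templates), and the namespace and the second pod name are invented for the example.

```python
def expand_metrics_query(template, series, label_matchers, group_by):
    """Naive stand-in for the adapter's templating: fill the three placeholders."""
    return (
        template
        .replace("<<.Series>>", series)
        .replace("<<.LabelMatchers>>", ",".join(label_matchers))
        .replace("<<.GroupBy>>", ",".join(group_by))
    )


template = "sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)"

# Assumed request: elu_utilization for two pods of the "mynamespace" namespace.
query = expand_metrics_query(
    template,
    series="elu_utilization",
    label_matchers=[
        'kubernetes_namespace="mynamespace"',
        'kubernetes_pod_name=~"myapp-5584c5c8f8-gbsw8|myapp-5584c5c8f8-x2k9p"',
    ],
    group_by=["kubernetes_pod_name"],
)
print(query)
```

This prints sum(elu_utilization{kubernetes_namespace="mynamespace",kubernetes_pod_name=~"myapp-5584c5c8f8-gbsw8|myapp-5584c5c8f8-x2k9p"}) by (kubernetes_pod_name), which is the kind of PromQL query Prometheus ends up receiving: one value per pod, matched through the labels declared in resources.overrides.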

You should now be able to get some custom metrics with actual values by accessing the API server. To test, we’ll use kubectl and the --raw parameter, which gives us more control over requests sent to the API server.

Here are some example commands you can run to manually verify that metrics are correctly exposed via the API Server:

# list custom metrics discovery
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq

# list custom metric values for each pod
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/<namespace_name>/pods/*/elu_utilization" | jq

Warning: scaling will be heavily dependent on your metrics scraping interval and your Prometheus Adapter discovery interval. The official documentation insists that you might encounter problems if you set a value too low.

“You’ll need to also make sure your metrics relist interval is at least your Prometheus scrape interval. If it’s less than that, you’ll see metrics periodically appear and disappear from the adapter.”
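
As a quick sanity check, this constraint fits in a few lines of Python. The helper names are mine, the parser only understands simple durations such as "30s" or "1m", and I’m assuming the adapter’s interval is the one you pass through its --metrics-relist-interval flag.

```python
import re

# Seconds per supported unit (simple durations only, e.g. "30s", "1m").
_UNITS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600}


def parse_duration(text: str) -> float:
    """Convert a simple duration string into seconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h)", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def relist_interval_ok(relist_interval: str, scrape_interval: str) -> bool:
    """True when the adapter relists no more often than Prometheus scrapes."""
    return parse_duration(relist_interval) >= parse_duration(scrape_interval)


print(relist_interval_ok("1m", "30s"))   # True
print(relist_interval_ok("15s", "30s"))  # False
```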

How will this work?

So far, we’ve introduced several components that interact with each other.

But how will all this work under the hood? Well, nothing better than a diagram to explain things like this:

Prometheus scrapes metrics exposed by our application
Prometheus Adapter queries the Prometheus server to collect the specific metrics we defined in its configuration
The HorizontalPodAutoscaler controller (the controller that manages HPAs) periodically queries the API server to check whether the ELU metric is within acceptable limits…
… which in turn will ask Prometheus Adapter.

Let’s now create our first HorizontalPodAutoscaler!

Using a HorizontalPodAutoscaler resource with custom metrics

At the beginning of this post, we introduced the HorizontalPodAutoscaler API. The resource itself isn’t hard to use. Basically, HPA takes a target deployment to scale, a minimum number of replicas, a maximum number of replicas, and the metrics to use. For the metrics part, we’ll now use our article’s custom metric configured with Prometheus Adapter:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: <myhpa>
  namespace: <mynamespace>
spec:
  maxReplicas: 6
  metrics:
  - pods:
      metric:
        name: elu_utilization
      target:
        averageValue: 500m
        type: AverageValue
    type: Pods
  minReplicas: 2
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: <deployment.apps_name>

When you configure an HPA resource, you only define the metric name. But how can HPA determine the right metric from the right pods since several applications can expose this same metric? To understand this, we can examine Prometheus Adapter logs.

I0618 12:05:29.149095 1 httplog.go:132] "HTTP" verb="GET" URI="/apis/custom.metrics.k8s.io/v1beta1/n

As you can see, a labelSelector is added to the request. Since the HPA’s scaleTargetRef points to the Deployment, the HPA controller reuses that Deployment’s label selector as the labelSelector value. This is what lets it target the metrics of one specific Deployment. And these labels exist because, when Prometheus scrapes the pods, its Kubernetes pod discovery adds them to the metrics.

If you want to use a custom labelSelector in the query, add the metrics.pods.metric.selector field to the HPA resource.
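
To reproduce this kind of request by hand, the Python sketch below builds the path you can give to kubectl get --raw. It assumes a Deployment whose selector is app=myapp in the mynamespace namespace (both invented), and the exact encoding used by the HPA controller may differ slightly from what I generate here.

```python
from urllib.parse import quote


def custom_metric_path(namespace, metric, match_labels):
    """Build a custom metrics API path filtered by a label selector."""
    selector = ",".join(f"{key}={value}" for key, value in sorted(match_labels.items()))
    return (
        f"/apis/custom.metrics.k8s.io/v1beta1/namespaces/{namespace}"
        f"/pods/*/{metric}?labelSelector={quote(selector, safe='')}"
    )


# Assumed selector of the target Deployment: app=myapp
print(custom_metric_path("mynamespace", "elu_utilization", {"app": "myapp"}))
```

The result, /apis/custom.metrics.k8s.io/v1beta1/namespaces/mynamespace/pods/*/elu_utilization?labelSelector=app%3Dmyapp, can be piped through jq like the commands shown earlier.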

So we have the custom metrics API, we’ve configured Prometheus Adapter to discover and expose certain metrics, and we’ve created our first HPA resource. It’s now time to test the deployment under load and observe the behavior.

For this, we’ll present you with a tool named Vegeta (it’s over 9000!)

Vegeta is a versatile HTTP load testing tool built out of a need to drill HTTP services with a constant request rate.

We’ll use Vegeta to generate load on our application while monitoring the application’s pods and HPA status (with 3 terminals open in parallel):

kubectl get hpa/<myhpa> -w -n <mynamespace>
kubectl get po -l app=<myapp> -w -n <mynamespace>
echo "GET <app_endpoint_http>" | vegeta attack -duration=60s | vegeta report

Note: In case your application can support a significant load and default parameters don’t trigger scaling, you can modify some parameters in the Vegeta command. We recommend using the workers and rate options:

-workers: initial number of workers (default 10)
-rate: number of requests per time unit [0 = infinite] (default 50/1s)

When load increases, your custom metric value will also increase, which should in turn trigger deployment scaling once thresholds are reached.

Using Prometheus Adapter in production

When Prometheus Adapter becomes a central component of your architecture, its tuning and monitoring become essential.

If this component is down, your HPA won’t be able to react anymore. There will be two potential impacts: you’ll use too many resources for current traffic, or conversely, not have enough to handle traffic. In any case, your workloads aren’t immediately affected; they maintain the last number of replicas calculated by HPA before the outage.

To prevent this from becoming a SPOF, make sure to run more than one replica of Prometheus Adapter. I also advise adding a PodDisruptionBudget to avoid issues during cluster maintenance.

Conclusion

Kubernetes’ built-in horizontal pod autoscaling is a standard mechanism that can potentially help your apps efficiently handle variable loads. Personally, I find the classic HPA, which uses CPU and memory metrics, too limited. But with the integration of custom metrics with Prometheus Adapter, we can make scaling decisions more precise and relevant.

While installing Prometheus Adapter is simple, its configuration is, I find, a bit counter-intuitive, even complex, without effectively handling the most advanced scenarios.

That’s why I think that, if you don’t have a requirement to stick with the Kubernetes standard, you should take a look (or wait for my next article?) at KEDA (Kubernetes Event-Driven Autoscaling), another open-source project that extends HPA capabilities by supporting various event sources and scaling triggers.

Happy scaling!

RKE, Talos Linux, ... : MountVolume.NewMounter initialization failed for volume : path does not exist

Thu, 26 Sep 2024 20:00:00 +0200

It all started…

Without revealing too much about my new job, I work on a Kubernetes platform that we manage ourselves with Talos Linux (an immutable Kubernetes OS that’s pretty cool!).

The platform I’m working on is still quite young, and on my development environment, I have what I need for S3 buckets but no PVC with “block” type storage. I could have quickly set up Rook.

But what can I say… Out of professional conscientiousness (aka being too lazy to type), and to save myself 4 and a half seconds of Google/ChatGPT queries, I ask a colleague whether he has a little piece of YAML to quickly set up a small PVC using a hostPath.

cf kubernetes.io/docs/concepts/storage/storage-classes/#local

It starts badly

First of all, if you try to run this manifest, it won’t work. Indeed, what the official documentation doesn’t say (not on the storage classes page anyway) is that with the Local “provisioner” kubernetes.io/no-provisioner, a local PV is rejected if you don’t specify a node affinity (nodeAffinity).

% kubectl apply -f toto.yaml
storageclass.storage.k8s.io/for-science-sc created
The PersistentVolume "for-science-pv" is invalid: spec.nodeAffinity: Required value: Local volume requires node affinity

This is easily corrected:

---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: for-science-sc
provisioner: kubernetes.io/no-provisioner
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: for-science-pv
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 10Gi
  local:
    path: /opt/tempDir
  storageClassName: for-science-sc
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - worker-1

From there, it works (on paper). Once the manifests are created, you should have a storage class and a PV, ready to be used. We can even go one step further and create the PVC and bind it to the PV.

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: for-science-pvc
  namespace: default
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: for-science-sc

End of article?

kubectl apply -f toto.yaml
storageclass.storage.k8s.io/for-science-sc unchanged
persistentvolume/for-science-pv created
persistentvolumeclaim/for-science-pvc created

No, of course not…

Run, pod! Ruuuuuun!

So let’s launch a small debug Pod to see if we can mount our PVC in a container:

---
apiVersion: v1
kind: Pod
metadata:
  name: ubuntu-pvc
spec:
  containers:
  - name: ubuntu
    image: ubuntu:latest
    command: ["/bin/bash", "-c", "sleep 3600"]
    volumeMounts:
    - name: storage
      mountPath: /toto
  volumes:
  - name: storage
    persistentVolumeClaim:
      claimName: for-science-pvc
  restartPolicy: Never

Here, there’s no need to give the Pod an affinity to get it scheduled on worker-1 (where the PV is located). The Kubernetes scheduler is smart enough to figure it out on its own, from the PV’s nodeAffinity.

% kubectl apply -f toto2.yaml
pod/ubuntu-pvc created

And there, disaster strikes. The Pod stays stuck in ContainerCreating!!

% kubectl get pods
NAME READY STATUS RESTARTS AGE
ubuntu 0/1 ContainerCreating 0 52s

% kubectl get events
LAST SEEN TYPE REASON OBJECT MESSAGE
4m6s Warning ProvisioningFailed persistentvolumeclaim/for-science-pvc storageclass.storage.k8s.io "for-science-pv" not found
100s Normal WaitForFirstConsumer persistentvolumeclaim/for-science-pvc waiting for first consumer to be created before binding
88s Normal Scheduled pod/ubuntu-pvc Successfully assigned default/ubuntu to worker-1
25s Warning FailedMount pod/ubuntu-pvc MountVolume.NewMounter initialization failed for volume "for-science-pv" : path "/opt/tempDir" does not exist

Hmm… ok, path "/opt/tempDir" does not exist, that’s relatively explicit as an error. I forgot to create the folder on the host in question.

I’ll create a new Pod that will mount the host’s /opt in /host/opt, and create the folder manually:

---
apiVersion: v1
kind: Pod
metadata:
  name: ubuntu-hostpath
  namespace: somespecialnamespace
spec:
  containers:
  - name: ubuntu
    image: ubuntu:latest
    command: ["/bin/bash", "-c", "sleep 3600"]
    volumeMounts:
    - mountPath: /host/opt
      name: host
    securityContext:
      privileged: true
  volumes:
  - name: host
    hostPath:
      path: /opt
      type: Directory
  restartPolicy: Never
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - worker-1

Note: Talos’s PodSecurity prevents me from creating the Pod (hostPath & privilege) if I don’t create it in a namespace with exceptions. But let’s say we fixed it.

% kubectl exec -it -n somespecialnamespace pods/ubuntu-hostpath -- /bin/bash
root@ubuntu-hostpath:/# mkdir /host/opt/tempDir/
[control] + d
exit
command terminated with exit code 1

Is it good now?

Directory: created. Mission accomplished, right? Right?

Well no!! The Pod is still stuck.

Why on earth can I see this folder in my ubuntu-hostpath Pod, but not in ubuntu-pvc, which mounts the same folder, but via a PVC???

The reason is… Because it’s a PVC.

Ok, alright. It’s not JUST that. It’s because on one side, I go through a hostpath and on the other through a PVC.

The most astute among you will surely have guessed that there’s a connection with the title of my article. The problem is the same on RKE, Talos Linux and surely other similar OSes.

These OSes have the particularity of being “Container OS”. The kubernetes components aren’t “installed” properly speaking in the OS, but launched in containers. The OS is actually just a big empty shell with just enough to launch containers. This limits the attack surface and makes them rather lightweight.

And there’s a fundamental difference between how “a Pod with hostPath” and “a pod mounting a PVC” are managed. In the case of a PVC provisioned with kubernetes.io/no-provisioner, the kubelet needs to be able to mount the folder before the Pod.

You see where I’m going? In a container OS, the kubelet is a container. It doesn’t see anything other than its own container directory tree (not the host’s /, just the contents of the big ZIP we downloaded from dockerhub with docker pull), and possibly the host folders that were explicitly mounted to it.

In my case, for the kubelet container, this /opt/tempDir folder doesn’t exist, simply because in its container directory tree, it wasn’t mounted (or created).

The solution

Actually, as often, you just had to read the documentation. Talos Linux gives 2 methods to solve this problem:

explicitly mount the folders we need in the kubelet container
go through a third-party provider that will create PVCs from folders on the host, without going through the kubelet.
www.talos.dev/v1.8/kubernetes-guides/configuration/local-storage/

I tested both, they work perfectly. But beyond the solution, it’s more the journey that I found interesting to share with you :)

Some links on the subject

Kubernetes 1.29 - sidecar containers - what are they good for?

Fri, 19 Jul 2024 12:00:00 +0200

Introduction

A friend of mine needs to run a periodic job. This job runs as code in a container (on a GKE cluster) and needs to query a Cloud SQL database.

For this kind of use case, Google Cloud provides an image to deploy alongside your application (sidecar pattern) that acts as a proxy so the application can connect to the Cloud SQL instance on Google Cloud.

cloud.google.com/sql/docs/mysql/connect-kubernetes-engine

Unfortunately, this image sometimes takes a while to start, and its absence causes my friend’s application to crash. We end up with a race condition and he asked me if I had any ideas to solve this problem.

Among other suggestions (which I detail in the very last paragraph), I suggested he try a brand new feature that went beta in Kubernetes 1.29 and that I hadn’t tested myself yet: sidecar containers!

kubernetes.io/docs/concepts/workloads/pods/sidecar-containers/

Fun fact: while searching for documentation on the Cloud SQL sidecar, I stumbled upon this article from someone who has the exact same problem as my friend.

hwchiu.medium.com/exploring-kubernetes-1-28-sidecar-container-support-ed1a39ac7fe0

For once, so you can properly understand the problem and its resolution, I’ll walk you through this feature discovery with a demo.

All the code and instructions are available on the GitHub repository github.com/zwindler/sidecar-container-example.

The idea is as follows: we’ll simulate my buddy’s problem with two Docker images created for the occasion:

zwindler/slow-sidecar: a basic helloworld in V lang (vhelloworld) that sleeps for 5 seconds before listening on port 8081.
zwindler/sidecar-user: a bash script that curls the sidecar and exits with code 1 if the curl fails.

Prerequisites

The feature was introduced in Kubernetes 1.28 as an alpha feature. If you’re using this version and want to test it, you need to explicitly enable the SidecarContainers feature gate.

Starting with Kubernetes 1.29, this feature moved to beta and should be enabled by default on your cluster.
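If you’re not sure where your cluster stands, here is a minimal sketch that applies the rule above to a version string. The function name sidecar_gate_status and its output messages are my own invention for illustration, not something kubectl provides.

```shell
#!/usr/bin/env bash
# Hypothetical helper: tells whether the SidecarContainers feature gate
# is expected to be on by default for a given Kubernetes version string.
sidecar_gate_status() {
  local version="${1#v}"          # strip a leading "v" (v1.29.3 -> 1.29.3)
  local major="${version%%.*}"    # keep what precedes the first dot
  local rest="${version#*.}"      # drop the major part
  local minor="${rest%%.*}"       # keep what precedes the next dot
  minor="${minor%%[!0-9]*}"       # drop non-numeric suffixes such as "+"

  if (( major > 1 || minor >= 29 )); then
    echo "enabled by default"
  elif (( minor == 28 )); then
    echo "alpha: enable the SidecarContainers feature gate"
  else
    echo "not available"
  fi
}

sidecar_gate_status "v1.29.3"
sidecar_gate_status "v1.28.11"
sidecar_gate_status "v1.27.4"
```

On a real cluster, you would feed it the server version, for example the serverVersion.gitVersion field returned by kubectl version -o json.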

Without sidecar containers

First, let’s try to deploy the CronJob naively on a cluster:

$ cat 1-cronjob-without-sidecar-container.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: sidecar-cronjob
spec:
  schedule: "* * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: sidecar-user
            image: zwindler/sidecar-user
          - name: slow-sidecar
            image: zwindler/slow-sidecar
            ports:
            - containerPort: 8081
          restartPolicy: Never

$ kubectl apply -f 1-cronjob-without-sidecar-container.yaml

This should fail because the “slow sidecar” container won’t be ready when the “sidecar user” container tries to curl.

$ kubectl get pods
NAME                             READY   STATUS   RESTARTS   AGE
sidecar-cronjob-28689938-5n5x9   1/2     Error    0          9s

$ kubectl describe pods sidecar-cronjob-28689938-5n5x9
[...]
Containers:
  slow-sidecar:
[...]
    State:          Running
      Started:      Fri, 19 Jul 2024 15:38:03 +0200
    Ready:          True
[...]
  sidecar-user:
[...]
    State:          Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Fri, 19 Jul 2024 15:38:05 +0200
      Finished:     Fri, 19 Jul 2024 15:38:05 +0200
    Ready:          False
    Restart Count:  0
[...]

slow-sidecar is running fine but our sidecar-user request failed because the sidecar was too slow to start.

Quick cleanup before we try again:

kubectl delete cronjob sidecar-cronjob

Using an init container isn’t an option either because the init container will never terminate (that’s not its purpose) and the “sidecar user” container will wait forever for its turn. If you want to try, just convert slow-sidecar to an initContainer.

 apiVersion: batch/v1
 kind: CronJob
 metadata:
   name: sidecar-cronjob
 spec:
   schedule: "* * * * *"
   jobTemplate:
     spec:
       template:
         spec:
           containers:
           - name: sidecar-user
             image: zwindler/sidecar-user
+          initContainers:
           - name: slow-sidecar
             image: zwindler/slow-sidecar
             ports:
             - containerPort: 8081
           restartPolicy: Never

And run it:

$ kubectl apply -f 2-cronjob-with-init-container.yaml

$ kubectl get pods
NAME                             READY   STATUS     RESTARTS   AGE
sidecar-cronjob-28689955-lzbnf   0/1     Init:0/1   0          27s

And we’re stuck at this step until the end of tiiiiiime.

With sidecar containers

To avoid this type of race condition, let’s update the manifest by converting slow-sidecar to an initContainer BUT ALSO adding restartPolicy: Always in the slow-sidecar container declaration.

This trick is the way to tell Kubernetes to start this container as an initContainer but NOT to wait for it to finish (which it will never do since it’s a web server listening on 8081 until the end of time) before starting the main application.

 apiVersion: batch/v1
 kind: CronJob
 metadata:
   name: sidecar-cronjob
 spec:
   schedule: "* * * * *"
   jobTemplate:
     spec:
       template:
         spec:
           containers:
           - name: sidecar-user
             image: zwindler/sidecar-user
+          initContainers:
           - name: slow-sidecar
             image: zwindler/slow-sidecar
+            restartPolicy: Always
             ports:
             - containerPort: 8081
           restartPolicy: Never

Note: This is the official way to declare a sidecar container in Kubernetes. I haven’t read the KEP yet so I can’t say why the development team didn’t introduce a new keyword sidecarContainers in the Pod spec schema and reused the existing initContainers instead.

$ kubectl apply -f 3-cronjob-with-sidecar-container.yaml

This time, the init container should start and ONLY THEN, the application:

$ kubectl get pods -w
NAME                             READY   STATUS            RESTARTS   AGE
sidecar-cronjob-28689958-zrmhh   0/2     Pending           0          0s
sidecar-cronjob-28689958-zrmhh   0/2     Pending           0          0s
sidecar-cronjob-28689958-zrmhh   0/2     Init:0/1          0          0s
sidecar-cronjob-28689958-zrmhh   1/2     PodInitializing   0          2s
sidecar-cronjob-28689958-zrmhh   1/2     Error             0          3s

We can see it’s better (sidecar-user starts in a second phase) but in this particular example, it still fails…

With sidecar containers AND a startupProbe

By default, the kubelet considers the sidecar container started as soon as the process in the container is running. Once the other initContainers have all finished (or if there are none), it moves on to the main phase and starts the regular containers.

Unfortunately, in our case, the sidecar container is very slow (sleep 5), so the fact that the process is running is not an indication of the sidecar’s state…

We need to add a startupProbe so Kubernetes knows WHEN to move past the init phase and start the main phase.

The official Kubernetes documentation describes this behavior as follows: “After a sidecar-style init container is running (the kubelet has set the started status for that init container to true), the kubelet then starts the next init container from the ordered .spec.initContainers list. That status either becomes true because there is a process running in the container and no startup probe defined, or as a result of its startupProbe succeeding.”

 apiVersion: batch/v1
 kind: CronJob
 metadata:
   name: sidecar-cronjob
 spec:
   schedule: "* * * * *"
   jobTemplate:
     spec:
       template:
         spec:
           containers:
           - name: sidecar-user
             image: zwindler/sidecar-user
           initContainers:
           - name: slow-sidecar
             image: zwindler/slow-sidecar
             restartPolicy: Always
             ports:
             - containerPort: 8081
+            startupProbe:
+              httpGet:
+                path: /
+                port: 8081
+              initialDelaySeconds: 5
+              periodSeconds: 1
+              failureThreshold: 5
           restartPolicy: Never
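
To get a feel for the probe values in the manifest above, here is a rough back-of-the-envelope calculation of how long the sidecar is allowed to take before the kubelet gives up on it. The startup_budget function is a hypothetical helper of mine, and the formula is an approximation that ignores probe timeouts and timing jitter.

```shell
#!/usr/bin/env bash
# Rough upper bound (in seconds) on how long a container may take to pass
# its startupProbe: the initial delay, plus one period per allowed failure.
# Approximation only: probe timeouts and kubelet jitter are ignored.
startup_budget() {
  local initial_delay="$1" period="$2" failure_threshold="$3"
  echo $(( initial_delay + period * failure_threshold ))
}

# Values taken from the startupProbe in the manifest:
# initialDelaySeconds=5, periodSeconds=1, failureThreshold=5
startup_budget 5 1 5
```

With these values the sidecar gets roughly 10 seconds, which leaves a comfortable margin over the 5-second sleep in slow-sidecar. For a real sidecar with a less predictable startup time, you would probably raise failureThreshold rather than initialDelaySeconds, so that a fast start isn’t penalized.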

One last time:

$ kubectl apply -f 4-cronjob-with-sidecar-container-and-startup-probe.yaml && kubectl get pods -w
cronjob.batch/sidecar-cronjob created
NAME                             READY   STATUS            RESTARTS   AGE
sidecar-cronjob-28689977-lt77c   0/2     Pending           0          0s
sidecar-cronjob-28689977-lt77c   0/2     Pending           0          0s
sidecar-cronjob-28689977-lt77c   0/2     Init:0/1          0          0s
sidecar-cronjob-28689977-lt77c   0/2     Init:0/1          0          1s
sidecar-cronjob-28689977-lt77c   0/2     PodInitializing   0          6s
sidecar-cronjob-28689977-lt77c   1/2     PodInitializing   0          6s
sidecar-cronjob-28689977-lt77c   1/2     Completed         0          7s

Hooray!

Bonus: if you don’t have sidecarContainers enabled

If you’re still on Kubernetes 1.28 and can’t enable alpha feature gates (or worse, on an older version where the feature doesn’t exist at all), you’ll need to find another method.

Unfortunately, the solution will likely involve modifying your main application’s code or its Docker image. You can:

add a retry policy in the sidecar-user application
add a script in the sidecar-user application that waits a bit (sleep) before trying to contact the sidecar

The first is a good practice when dealing with microservices and you should consider it anyway to handle temporary database connection issues.

The second is a band-aid on a wooden leg. I strongly advise against it: the sidecar’s startup time can vary, so no fixed delay is reliably long enough, and padding the application’s startup with extra delay also hurts when you need to handle incidents and bugs in production (potentially inducing other problems).
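
To make the first option more concrete, here is a minimal sketch of a retry wrapper in bash, since sidecar-user is a bash script. This is not the code from my repository: the retry function, its arguments and the URL in the usage comment are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
# Returns 0 on success, 1 once all attempts have failed.
retry() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  until "$@"; do
    if (( attempt >= max_attempts )); then
      echo "giving up after ${attempt} attempts" >&2
      return 1
    fi
    echo "attempt ${attempt} failed, retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$(( attempt + 1 ))
  done
}

# Hypothetical usage in the job's entrypoint (the URL is an assumption):
# retry 10 1 curl --silent --fail http://localhost:8081/
```

Unlike a blind sleep, the job continues as soon as the sidecar answers and still fails cleanly if it never does. A fixed delay keeps the sketch short. In production you would more likely want an increasing delay between attempts.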
