
Cilium's new policy log field: our use case

Written by zwindler

For French readers: for once, this article is in English, because I think it may interest people beyond my usual French-speaking readers. I hope you won’t hold it against me 😉.

TL;DR

Cilium 1.18 added a log field to CiliumNetworkPolicies to tag flows with custom labels. Great for filtering out expected blocked traffic from your monitoring dashboards!

But there’s a catch, unrelated to the feature itself, that made it irrelevant for our use case: egressDeny rules don’t support toFQDNs. Here’s why we ran into this wall and what we learned.

The problem: monitoring all the things (but not too much)

Like any good ops team should, we monitor our Kubernetes cluster network flows using Hubble. We (mostly my colleague Nicolas Nativel) push all AUDIT and DROPPED flows to a dashboard so we can quickly spot when something’s blocked and decide:

  • Is this legitimate? → Allow the flow in our policies
  • Is this suspicious? → Sound the alarm 🚨

This works pretty well… until you start explicitly blocking things that you know should be blocked.

In our case, we wanted to prevent a third-party application from phoning home with its “telemetry” (yeah, let’s call it that 😏). We’re talking about calls to external tracking domains.

The issue? If we just block these flows, they’ll show up as DROPPED in Hubble, trigger our monitoring, and we’ll end up with alerts for something we intentionally blocked.

That’s noise we don’t want.

Enter Cilium 1.18’s policy log field

Good news! Cilium 1.18 introduced exactly what we needed: the ability to add custom log fields to your network policies.

Check out the official announcement.

The idea is simple: you add a log field to your CiliumNetworkPolicy:

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: my-policy
spec:
  endpointSelector:
    matchLabels:
      app: my-app
  egress:
    - toFQDNs:
        - matchName: "example.com"
  log:
    value: "my-custom-log-tag"

Then, when you observe flows in Hubble, you can filter them out using CEL (Common Expression Language):

hubble observe \
  --verdict AUDIT \
  --not \
  --cel-expression "(_flow.policy_log.endsWith('my-custom-log-tag'))" \
  --print-raw-filters

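With --print-raw-filters, hubble prints the resulting allowlist and denylist instead of querying flows, so we can check that --not moved the CEL expression to the denylist: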

allowlist:
    - '{"verdict":["AUDIT"]}'
denylist:
    - '{"experimental":{"cel_expression":["(_flow.policy_log.endsWith(''my-custom-log-tag''))"]}}'

Perfect! This is exactly what we need. We can now tag our “expected blocks” and exclude them from our monitoring.

The plan: block telemetry elegantly

Armed with this new feature, we crafted our strategy:

  1. Use egressDeny to explicitly block telemetry domains
  2. Add a custom log field: app-explicit-traffic-blocked
  3. Configure Hubble to filter out flows with this tag
  4. Profit! 🎉
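Step 3 depends on how flows reach the dashboard, which the post doesn’t detail. Assuming they are shipped by Hubble’s static flow exporter configured through the Cilium Helm chart, a sketch could reuse the raw filters printed earlier, with the tag from step 2. The hubble.export.static keys are taken from the Cilium Helm chart as we understand it; check them against your chart version.

```yaml
# Helm values sketch (assumption: Hubble's static exporter feeds the dashboard)
hubble:
  enabled: true
  export:
    static:
      enabled: true
      # Keep the verdicts we monitor...
      allowList:
        - '{"verdict":["AUDIT","DROPPED"]}'
      # ...but exclude flows tagged as expected blocks by our policy
      denyList:
        - '{"experimental":{"cel_expression":["(_flow.policy_log.endsWith(''app-explicit-traffic-blocked''))"]}}'
```

The single-quoted YAML strings double the inner quotes (''), exactly like the --print-raw-filters output, so the JSON filters can be pasted as-is.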

Here’s what we tried:

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: app-external-block-policy
  namespace: my-namespace
spec:
  endpointSelector:
    matchLabels:
      app.kubernetes.io/name: my-app
  # note: egressDeny takes precedence over egress rules
  # https://docs.cilium.io/en/stable/security/policy/language/#deny-policies
  egressDeny:
    # Block the telemetry domains and tag the flows with an arbitrary log field,
    # so the app can't send telemetry data externally
    # without triggering an AUDIT/DROPPED alert
    # feature added in cilium 1.18.0 https://github.com/cilium/cilium/pull/39902
    - toFQDNs:
        - matchPattern: "*.telemetry.example.com"
      toPorts:
        - ports:
          - port: "443"
            protocol: TCP
          - port: "80" 
            protocol: TCP
  log:
    value: "app-explicit-traffic-blocked"

This should work, right? We’re using egressDeny, which takes precedence over the allow rules we keep for legitimate calls (which is what we want!), and we’re tagging the flows with our custom log field.

Reality check: you can’t have nice things

And then… patatra (as we say in French 🇫🇷).

While reading the Cilium documentation on deny policies, we stumbled upon this little gem:

Deny policies do not support:

  • policy enforcement at L7, i.e., specifically denying an URL
  • toFQDNs, i.e., specifically denying traffic to a specific domain name.

Wait, what?

You cannot use toFQDNs with egressDeny. Our entire plan just collapsed 😱.

Why this is a problem

The issue is the precedence model in Cilium:

  • egressDeny rules take precedence over egress rules (by design, and that’s good!)
  • But if we use egressDeny without toFQDNs, we have to block by IP or CIDR
  • These telemetry services probably use dynamic IPs for their endpoints (good luck maintaining a list…)
  • If we block all 80/443 traffic in egressDeny, we can’t make exceptions for legitimate traffic in egress rules because… deny takes precedence over allow!
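To make the last point concrete, here is a hypothetical policy (the policy name and api.legit.example.com are made up) that tries to carve an allow exception out of a blanket egressDeny. Following the precedence rule from the Cilium deny policies documentation, traffic to the allowed FQDN is still dropped. The DNS rule is the usual prerequisite for toFQDNs, since Cilium learns the IPs behind a name through its DNS proxy.

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: deny-wins-example   # hypothetical name
  namespace: my-namespace
spec:
  endpointSelector:
    matchLabels:
      app.kubernetes.io/name: my-app
  egress:
    # Let DNS queries go through Cilium's DNS proxy (needed for toFQDNs)
    - toEndpoints:
        - matchLabels:
            k8s:io.kubernetes.pod.namespace: kube-system
            k8s-app: kube-dns
      toPorts:
        - ports:
            - port: "53"
              protocol: ANY
          rules:
            dns:
              - matchPattern: "*"
    # Intended exception: a legitimate API (hypothetical domain)
    - toFQDNs:
        - matchName: "api.legit.example.com"
      toPorts:
        - ports:
            - port: "443"
              protocol: TCP
  egressDeny:
    # Deny all HTTPS to the outside world...
    - toEntities:
        - world
      toPorts:
        - ports:
            - port: "443"
              protocol: TCP
    # ...which also drops api.legit.example.com: deny wins over allow
```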

We’re stuck between a rock and a hard place:

  • Use egress with toFQDNs → works, but it can only allow traffic, not deny it
  • Use egressDeny with IPs → we’ll be playing whack-a-mole with rotating IP ranges
  • Use egressDeny to block all 80/443 → we block everything, including legitimate traffic

Potential workarounds

While waiting for Cilium to support toFQDNs in egressDeny policies, here are some alternative approaches you might consider:

Find a way to disable telemetry in the app directly

That’s the best option but sadly not always on the table.

DNS-based blocking

Configure the cluster’s DNS server to return NXDOMAIN for telemetry domains, the way a Pi-hole blocks ad domains on a home network. The application will fail to resolve the domain and won’t send any data.
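Assuming the cluster runs CoreDNS with its configuration in the kube-system/coredns ConfigMap (common, but managed offerings may handle this differently), a minimal sketch is an extra server block using the CoreDNS template plugin to answer NXDOMAIN. telemetry.example.com is the placeholder domain from the policy earlier. Only the added block is shown: merge it into your existing Corefile rather than applying the fragment as-is.

```yaml
# Fragment of the kube-system/coredns ConfigMap (assumed name and namespace).
# Only the added server block is shown; the existing .:53 block stays unchanged.
data:
  Corefile: |
    # (existing .:53 server block, unchanged)

    # Queries for telemetry.example.com and its subdomains match this block,
    # the most specific zone, and get an NXDOMAIN answer.
    telemetry.example.com:53 {
        template ANY ANY {
            rcode NXDOMAIN
        }
    }
```

CoreDNS routes each query to the server block with the most specific matching zone, so the rest of the cluster’s resolution is untouched. Unless the reload plugin is enabled, CoreDNS needs a restart to pick up the change.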

Use IP-based egressDeny (with maintenance overhead)

Resolve the telemetry FQDNs to their current IP ranges and block them with egressDeny:

egressDeny:
  - toCIDRSet:
      - cidr: 203.0.113.0/24  # Example telemetry IP range
    toPorts:
      - ports:
        - port: "443"

If the IP ranges don’t change too often, this is a good option.
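Keeping that list up to date can be scripted. A minimal sketch, assuming you know the concrete hostnames the app calls (a wildcard such as *.telemetry.example.com can’t be resolved): the hypothetical to_cidrset function reads addresses from stdin, for example dig +short output, keeps only IPv4 addresses, and prints an egressDeny rule item with one /32 entry per address.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: IPv4 addresses on stdin -> egressDeny rule item (YAML).
# Lines that are not IPv4 addresses (e.g. CNAME targets from dig) are skipped.
to_cidrset() {
  local ips
  # Keep IPv4 addresses only, deduplicated; "|| true" tolerates zero matches
  ips="$(grep -E '^[0-9]{1,3}(\.[0-9]{1,3}){3}$' | sort -u || true)"
  if [ -z "$ips" ]; then
    echo "no IPv4 address found on stdin" >&2
    return 1
  fi
  echo "- toCIDRSet:"
  while read -r ip; do
    echo "    - cidr: ${ip}/32"
  done <<< "$ips"
  echo "  toPorts:"
  echo "    - ports:"
  echo '        - port: "443"'
  echo "          protocol: TCP"
}

# Usage (hypothetical hostname):
#   dig +short A cdn.telemetry.example.com | to_cidrset
```

Since the IPs may rotate, such a script would have to run regularly, which is exactly the maintenance overhead mentioned above.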

OK, but let’s assume there is no legitimate traffic. Can we use the feature to tag dropped traffic?

Sadly no, not right now.

There is a bug in this new Cilium feature: the policy_log field is only populated on “allowed” flows, not on audited/dropped flows.

When defining a CiliumNetworkPolicy with the spec.log field configured, I expect the relevant hubble flows to have the policy_log field. It works for allowed flows.

But for denied/audited flows resulting from the rule (implicit or explicit), policy_log is never available.

Note: I observe the same issue with hubble’s --print-policy-names option: the k8s:io.cilium.k8s.policy.derived-from label is not set for denied flows (but is correctly set for allowed flows).

Since two issues have been opened and maintainers have started acknowledging the problem, we can hope it will be fixed, though.

Conclusion

In the end, we didn’t use this new Cilium feature for our use case, but adding details to flows (and being able to filter on them) is always nice.

A shout-out to my colleague Nicolas Nativel, who did most of the work around CiliumNetworkPolicies (including the dashboards and the exploratory work on this feature) and took the time to open the issue on the Cilium repository.


Licensed under CC BY-SA 4.0



All content belonging to Denis Germain (aka zwindler) on this blog, including texts, code, images, diagrams, and conference talk materials, is distributed under the CC BY-SA 4.0 license.

Other content (blog theme, fonts, company logos, guest articles…) remains subject to its own license or, failing that, to copyright. More information in the Legal Notice.
