Introduction
One of the problems I encountered while managing Kubernetes clusters shared across multiple teams was that the manifests being deployed were not always perfectly controlled.
In some cases, manifests were copy/pasted from other teams, sometimes causing small surprises (no label indicating the team responsible for the app, incorrect Ingress information…). One of the solutions we found was to maintain an up-to-date, as-generic-as-possible Helm Chart, with only values for developers to fill in. This Chart is jointly maintained by Ops and tech leads.
However, maintaining this template is complex, and some applications diverge too much from the general framework to comply, even by forking and modifying parts of it. And some issues like verifying images or registries running on the platform were not addressed by this solution.
Open Policy Agent / Gatekeeper

Rather than trying at all costs to unify the way applications are deployed in the cluster (and having to manage exceptions), another approach is to add a tool that would verify, at deployment time, that what’s being deployed follows the company’s best practices in terms of security and configuration.
As you may have guessed, the tool for this is Open Policy Agent (or OPA). It’s an open-source generic policy engine (CNCF project at the Incubating stage).
Since it’s a generic engine, there’s also another project, Gatekeeper, which handles the interaction between OPA and Kubernetes.
Prerequisites
To use Gatekeeper, you should have a minimum Kubernetes version of 1.14, which adds webhook timeouts.
If you’re on an earlier version, things get complicated. There’s a bug, fixed in 1.14, that could crash your OPA/Gatekeeper setup. For this reason, if you’re not on 1.14 (and can’t update your platform), you’ll need to skip Gatekeeper (using K8s Policy Controller for instance). That’s a bit trickier and outside the scope of this tutorial.
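If you want to check where you stand before installing, the version comparison can be scripted. A minimal sketch, assuming the `1.14.8` value is a stand-in for whatever `kubectl version` reports on your cluster:

```shell
# Stand-in value; on a real cluster you would extract this from `kubectl version`
server_version="1.14.8"
required="1.14.0"

# sort -V orders version strings; if the required version sorts first,
# the server is at or above it
if [ "$(printf '%s\n%s\n' "$required" "$server_version" | sort -V | head -n1)" = "$required" ]; then
  echo "OK: $server_version >= $required, Gatekeeper is supported"
else
  echo "KO: $server_version < $required, consider kube-mgmt + OPA instead"
fi
# prints "OK: 1.14.8 >= 1.14.0, Gatekeeper is supported"
```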

(Diagram: the “V2” architecture, without Gatekeeper, using kube-mgmt and OPA)
Of course, you also need admin access on the cluster. The Gatekeeper docs suggest a simple way to check that you have it: try to grant yourself cluster-admin rights.

# wait for it…
kubectl create clusterrolebinding cluster-admin-binding \
  --clusterrole cluster-admin \
  --user [your user]
Quick tip: if you don’t know the name of the access account you’re currently using, you can run this command:
kubectl config view --template='{{ range .contexts }}{{ if eq .name "'$(kubectl config current-context)'" }}Current user: {{ .context.user }}{{ end }}{{ end }}'
Current user: clusterUser_zwindlerk8s_rg_zwindlerk8s
Installation
I set up a small AKS cluster in version 1.14 for the occasion and deployed the prepackaged version of Gatekeeper.
kubectl apply -f https://raw.githubusercontent.com/open-policy-agent/gatekeeper/master/deploy/gatekeeper.yaml
Tada! And just like that, Gatekeeper is installed on your cluster. Note: The official documentation provides steps to build your own image.
What’s in my cluster?
Taking a closer look at what the manifest does, here’s what you deploy when running the above command:
- a Namespace: gatekeeper-system
- a ServiceAccount: gatekeeper-admin, associated with a Role and a ClusterRole (gatekeeper-manager-role) via, respectively, a RoleBinding and a ClusterRoleBinding (gatekeeper-manager-rolebinding)
- two Deployments: gatekeeper-controller-manager and gatekeeper-audit
- a Service: gatekeeper-webhook-service
- a Secret: gatekeeper-webhook-server-cert, containing a certificate
Nothing exceptional so far. Where it gets interesting is:
- two CRDs (Custom Resource Definitions): configs.config.gatekeeper.sh and constrainttemplates.templates.gatekeeper.sh
- a ValidatingWebhookConfiguration: gatekeeper-validating-webhook-configuration
Validating Webhook
Let’s start with the validating webhook. The first time I heard about validating webhooks was during the talk 101 Ways to “Break and Recover” Kubernetes Cluster at Kubecon 2018.
The problems these Oath (formerly Yahoo) employees were facing were identical to mine. Some teams, when pulling (a.k.a. copy/pasting) manifests from other teams, forgot to change the Ingress URLs. This resulted in user requests being distributed randomly between 2 completely different services.
The solution proposed at that conference was to use Kubernetes’ Validating/Mutating Admission Webhooks (roughly stable since 1.11).
Back in 2018, when digging into the topic, I found very little documentation about it beyond the official docs (I’ve found more since). Another issue: Admission webhooks require developing your own Controller, which wasn’t very accessible (see links at the end of the article).
Fortunately, Gatekeeper and OPA will handle that for us.
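To give an idea of what Gatekeeper registers on our behalf, here is roughly what a ValidatingWebhookConfiguration looks like. This is a simplified, hypothetical sketch (field values invented for illustration), not the exact manifest Gatekeeper ships:

```yaml
# Hypothetical, simplified sketch of a validating webhook registration
apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingWebhookConfiguration
metadata:
  name: example-validating-webhook
webhooks:
  - name: validation.example.sh
    clientConfig:
      service:
        # The Service the API server will call for each admission request
        name: gatekeeper-webhook-service
        namespace: gatekeeper-system
      caBundle: "<base64-encoded CA certificate>"
    rules:
      # Which operations and resources get sent to the webhook
      - apiGroups: ["*"]
        apiVersions: ["*"]
        operations: ["CREATE", "UPDATE"]
        resources: ["*"]
    failurePolicy: Ignore
    # The 1.14+ timeout mentioned in the prerequisites
    timeoutSeconds: 3
```

The API server sends every matching admission request to that Service, and the webhook answers allow or deny. Writing the controller behind that Service is exactly the part Gatekeeper spares us.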
Custom Resource Definition
For those unfamiliar with CRDs, they are simply extensions of the Kubernetes API. The big advantage of CRDs is that they provide a way for third-party vendors to add their own logic inside Kubernetes.
I already talked about CRDs in my article on Rook which allows automated management of Ceph clusters (and more!) via the CephCluster CRD. All common administration tasks are integrated into Kubernetes (even though Kube has no native knowledge of them), managed by a Controller included in Rook, and configured via CRDs.
Gatekeeper will add three kinds of objects to our Kubernetes cluster:
- a Config object
- a ConstraintTemplate object
- a Constraint object
Config
The principle of Gatekeeper, as shown in the following diagram, is that it hooks into your Kubernetes API server and “synchronizes” events for new object creation, looking for objects that don’t comply with your compliance policies.

Source: kubernetes.io/blog
By default, if you don’t configure it, Gatekeeper doesn’t listen to events on any Kubernetes objects… so it will do… absolutely nothing!
Let’s fix that right away by telling Gatekeeper to listen to all events on Ingresses:
cat gatekeeper/demo/basic/sync.yaml
apiVersion: config.gatekeeper.sh/v1alpha1
kind: Config
metadata:
  name: config
  namespace: "gatekeeper-system"
spec:
  sync:
    syncOnly:
      - group: "extensions"
        version: "v1beta1"
        kind: "Ingress"
      - group: "networking.k8s.io"
        version: "v1beta1"
        kind: "Ingress"
ConstraintTemplate
Before you can define a constraint, you must first define a ConstraintTemplate, which describes both the Rego that enforces the constraint and the schema of the constraint.
The idea here is that before you start creating constraints on the cluster, you need to create constraint templates. Think of these templates as the object that links the function (the OPA rego code) on one side with parameters on the other.
This extra step allows us to potentially reuse the same template for several different constraints.
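To illustrate that reuse, the required-labels sample mentioned just below declares a parameters schema in its CRD section; each Constraint then supplies its own list of labels. This is a simplified sketch based on the upstream k8srequiredlabels sample, from memory:

```yaml
# Sketch of how a ConstraintTemplate declares parameters
# (simplified from the upstream k8srequiredlabels sample)
apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        # Each Constraint fills this schema in under spec.parameters
        openAPIV3Schema:
          properties:
            labels:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels

        violation[{"msg": msg}] {
          provided := {label | input.review.object.metadata.labels[label]}
          required := {label | label := input.parameters.labels[_]}
          missing := required - provided
          count(missing) > 0
          msg := sprintf("you must provide labels: %v", [missing])
        }
```

Two different Constraints could then reuse this single template, one requiring a team label on Namespaces and another requiring an owner label on Deployments, just by changing spec.parameters and the match section.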
Your first template
As a first template, we could use one of the samples provided on the OPA/Gatekeeper GitHub, such as the simpler one that enforces the presence of a label on a given Kube object type (k8srequiredlabels_template.yaml).
However, since the beginning of this article, I’ve been talking about preventing teams from using the same FQDN in Ingresses for different services. Because if that were to happen, as a reminder, we’d end up with some kind of janky load balancer redirecting half of user requests to one application and the other half to another for the same URL.
And it just so happens that this template will help us ensure that!
apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
  name: k8suniqueingresshost
spec:
  crd:
    spec:
      names:
        kind: K8sUniqueIngressHost
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8suniqueingresshost

        identical(obj, review) {
          obj.metadata.namespace == review.object.metadata.namespace
          obj.metadata.name == review.object.metadata.name
        }

        violation[{"msg": msg}] {
          input.review.kind.kind == "Ingress"
          re_match("^(extensions|networking.k8s.io)$", input.review.kind.group)
          host := input.review.object.spec.rules[_].host
          other := data.inventory.namespace[ns][otherapiversion]["Ingress"][name]
          re_match("^(extensions|networking.k8s.io)/.+$", otherapiversion)
          other.spec.rules[_].host == host
          not identical(other, input.review)
          msg := sprintf("ingress host conflicts with an existing ingress <%v>", [host])
        }
At first glance, it’s pretty dense. But then again, I didn’t pick the simplest example either, and if you re-watch the video Kubecon 2019 | Intro: Open Policy Agent - Rita Zhang, Microsoft & Max Smythe, Google, it’s actually fairly easy to understand.
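As an aside, the two re_match calls are what scope the rule to Ingresses from both API groups. The pattern behaves like an ordinary extended regex, which you can convince yourself of in plain shell (sample values only, nothing Gatekeeper-specific):

```shell
# The template's group regex, tried against a few sample apiGroup values
pattern='^(extensions|networking.k8s.io)$'
for group in extensions networking.k8s.io apps; do
  if echo "$group" | grep -Eq "$pattern"; then
    echo "$group: matches"
  else
    echo "$group: no match"
  fi
done
# prints:
# extensions: matches
# networking.k8s.io: matches
# apps: no match
```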
kubectl apply -f https://raw.githubusercontent.com/open-policy-agent/gatekeeper/master/demo/agilebank/dryrun/k8suniqueingresshost_template.yaml
constrainttemplate.templates.gatekeeper.sh/k8suniqueingresshost created
kubectl get constrainttemplate
NAME                   AGE
k8suniqueingresshost   23m
And a Constraint
Now that we have our template, we can simply instantiate it with the variables we’re interested in.
Sticking with the demo example, we ensure that across the entire cluster, no URL is used twice for two different Ingresses by instantiating this constraint:
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sUniqueIngressHost
metadata:
  name: unique-ingress-host
spec:
  enforcementAction: dryrun
  match:
    kinds:
      - apiGroups: ["extensions", "networking.k8s.io"]
        kinds: ["Ingress"]
I’m going to slightly modify this Constraint because, as you can see, it defines an enforcementAction: dryrun, which has no immediate effect (it only logs an error for future audit). It’s a great feature for getting up to speed with OPA, but for my demo it’s less fun…
So I apply the file and change the enforcementAction:
kubectl apply -f https://raw.githubusercontent.com/open-policy-agent/gatekeeper/master/demo/agilebank/dryrun/unique-ingress-host.yaml
kubectl patch K8sUniqueIngressHost.constraints.gatekeeper.sh unique-ingress-host -p '{"spec":{"enforcementAction":"deny"}}' --type=merge
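After the patch, the constraint should look roughly like this (assuming the upstream sample hasn’t changed in the meantime):

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sUniqueIngressHost
metadata:
  name: unique-ingress-host
spec:
  enforcementAction: deny    # was dryrun before the patch
  match:
    kinds:
      - apiGroups: ["extensions", "networking.k8s.io"]
        kinds: ["Ingress"]
```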
Let’s try it out
At this point, we should have everything. Gatekeeper listens to the API server for all events on Ingress objects (via the Config). We have a ConstraintTemplate with rego code that checks we don’t already have an identical URL in existing Ingresses, and a Constraint that defines what to do (deny, after our patch) and on which objects (Ingresses).
Let’s run the demo :D
# Create the namespace that will contain the demo objects
kubectl apply -f https://raw.githubusercontent.com/open-policy-agent/gatekeeper/master/demo/agilebank/bad_resources/namespace.yaml
namespace/production created
# Create an Ingress with a URL that doesn't exist yet
kubectl apply -f https://raw.githubusercontent.com/open-policy-agent/gatekeeper/master/demo/agilebank/dryrun/existing_resources/example.yaml
ingress.extensions/ingress-host created
# Create a second Ingress with the SAME URL
kubectl apply -f https://raw.githubusercontent.com/open-policy-agent/gatekeeper/master/demo/agilebank/dryrun/bad_resource/duplicate_ing.yaml
And bam!
[denied by unique-ingress-host] ingress host conflicts with an existing ingress <example-host.example.org>

Conclusion
As I explained in the introduction, adding compliance policies and constraints to a Kubernetes cluster was not trivial.
Without going as far as saying it’s become easy, OPA and Gatekeeper make things simpler (no need to develop and then host your own microservice for each Admission/Validation Webhook). You still need to learn a new language (rego) to start doing cool things, but even with the default templates, there’s already plenty to work with.
Another really interesting point with OPA is the ability to log and audit all compliance errors using dryrun mode instead of deny, without blocking.
kubectl get K8sUniqueIngressHost.constraints.gatekeeper.sh unique-ingress-host -o yaml
[...]
  - enforcementAction: dryrun
    kind: Ingress
    message: ingress host conflicts with an existing ingress <example-host.example.org>
    name: ingress-host2
    namespace: default
Resources
- Kubernetes.io | OPA Gatekeeper: Policy and Governance for Kubernetes
- Kubecon 2019 | Intro: Open Policy Agent - Rita Zhang, Microsoft & Max Smythe, Google
- A Guide to Kubernetes Admission Controllers
- Blog container-solutions | Some Admission Webhook Basics
- IBM Cloud | Diving into Kubernetes MutatingAdmissionWebhook
