diff --git a/CHANGELOG.md b/CHANGELOG.md index 1c96dcdd2..e08dd1a49 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -6,6 +6,7 @@ - **Bumped CloudNative-PG dependency from chart 0.27.0 (app 1.28.0) to 0.28.1 (app 1.29.1)** to pick up the fix for [CVE-2026-44477 / GHSA-423p-g724-fr39](https://github.com/cloudnative-pg/cloudnative-pg/security/advisories/GHSA-423p-g724-fr39): a privilege-escalation vulnerability in the CNPG metrics exporter that could allow a low-privilege PostgreSQL user to escalate to superuser and execute arbitrary commands in the database pod. Operators upgrading via `helm upgrade` will get the patched CNPG operator automatically. ### Major Features +- **io_uring opt-in feature gate**: A new `IOUring` feature gate (`spec.featureGates.IOUring: true`) enables PostgreSQL 18 asynchronous I/O (`io_method=io_uring`). Because the container runtime's default seccomp profile strips the `io_uring_setup/enter/register` syscalls, the operator also relaxes the postgres container seccomp profile — pointing the pods at a hardened Localhost profile that re-allows only those three syscalls — when the gate is on, so no external Kyverno policy is needed. It is **opt-in** (default off) since io_uring relaxes the sandbox. The Localhost profile path is operator-level config via the Helm value `operator.ioUring.seccompProfile` (default `profiles/documentdb-iouring.json`). See [io_uring documentation](docs/operator-public-documentation/io-uring.md) and the [feature playground](documentdb-playground/io-uring-feature/). - **Gateway OTLP metrics in the per-pod sidecar**: when `spec.monitoring.enabled=true`, the OTel Collector sidecar now exposes an OTLP/gRPC receiver on `127.0.0.1:4317` and the documentdb-gateway is configured (via `OTEL_EXPORTER_OTLP_ENDPOINT` and `OTEL_METRICS_ENABLED`) to push its `db_client_*` metrics there. 
The sidecar's existing prometheus exporter re-exports them alongside the existing `documentdb.postgres.up` sqlquery output, with per-pod attribution added by the collector's resource processor. No new CRD fields; this turns on automatically wherever monitoring was already enabled. - **Two-Phase Extension Upgrade**: New `spec.schemaVersion` field separates binary upgrades (`spec.documentDBVersion`) from irreversible schema migrations (`ALTER EXTENSION UPDATE`). The default behavior gives you a rollback-safe window — update the binary first, validate, then finalize the schema. Set `schemaVersion: "auto"` for single-step upgrades in development environments. See the [upgrade guide](docs/operator-public-documentation/preview/operations/upgrades.md) for details. @@ -15,6 +16,7 @@ - **Removed `Disabled` TLS gateway mode**: The `spec.tls.gateway.mode: Disabled` option has been removed to eliminate the security risk of plaintext Mongo wire protocol traffic. Previously, `Disabled` mode served connections in plaintext, contradicting the `Disabled` tab in `tls.md` which described the mode as a self-signed bootstrap. Empty or unset mode now defaults to `SelfSigned`, and the controller fails closed (also defaulting to `SelfSigned`) if a legacy `Disabled` value is encountered on a stored object. Users with `mode: Disabled` should remove this setting or explicitly set `mode: SelfSigned` — the gateway will automatically use a cert-manager generated self-signed certificate. See [issue #356](https://github.com/documentdb/documentdb-kubernetes-operator/issues/356) for details. ### Playground & Examples +- **io_uring feature playground**: New `documentdb-playground/io-uring-feature/` demonstrates the operator-native `IOUring` opt-in, modeled on the upstream cnpg-playground seccomp approach — a kind cluster that `extraMount`s the curated Localhost seccomp profile, a DaemonSet installer for real clusters, the DocumentDB CR with `spec.featureGates.IOUring: true`, and verification steps. 
No Kyverno policy is required. - **Container metrics reference collector**: The telemetry playground now includes a reference OpenTelemetry Collector DaemonSet under `documentdb-playground/telemetry/container-metrics/` for clusters that do not already collect kubelet-backed container metrics. It scrapes each node's local kubelet for container, pod, and node CPU/memory/network/filesystem metrics and exposes them via Prometheus. The production operator chart does not install this platform-level collector; tenant DocumentDB clusters do not receive kubelet privileges. ### Testing infrastructure diff --git a/docs/operator-public-documentation/io-uring.md b/docs/operator-public-documentation/io-uring.md new file mode 100644 index 000000000..60d75ded4 --- /dev/null +++ b/docs/operator-public-documentation/io-uring.md @@ -0,0 +1,174 @@ +--- +title: io_uring async I/O feature gate +description: Enable PostgreSQL 18 asynchronous I/O (io_method=io_uring) in DocumentDB through the IOUring feature gate, including the seccomp trade-offs, operator configuration, prerequisites, verification, and troubleshooting. +tags: + - configuration + - feature-gates + - performance + - security + - io_uring +--- + +# io_uring async I/O feature gate + +The `IOUring` feature gate enables PostgreSQL 18's asynchronous I/O backend (`io_method=io_uring`) for a DocumentDB cluster. Because io_uring requires relaxing the container's seccomp sandbox, the feature is **opt-in** and disabled by default. + +## Overview + +PostgreSQL 18 introduces a pluggable asynchronous I/O subsystem. On Linux, the `io_uring` backend submits read I/O through the kernel's [io_uring](https://en.wikipedia.org/wiki/Io_uring) interface, which overlaps storage latency instead of blocking on each read. + +DocumentDB doesn't turn this on for you automatically, for one reason: **security**. 
io_uring has been a recurring kernel-exploit surface, so the container runtime's `RuntimeDefault` seccomp profile strips the `io_uring_setup`, `io_uring_enter`, and `io_uring_register` syscalls. CloudNative-PG (CNPG) runs the PostgreSQL pods with `seccompProfile=RuntimeDefault`, so without intervention PostgreSQL crashes at startup with: + +```text +FATAL: could not setup io_uring queue: Operation not permitted +``` + +Enabling io_uring therefore means relaxing seccomp — a security trade-off that the Kubernetes cluster operator must consciously accept. DocumentDB makes that choice explicit through the `IOUring` feature gate rather than enabling it silently. + +!!! note + The `IOUring` gate controls one DocumentDB cluster. The seccomp *profile* (which Localhost profile the operator points the pods at) is configured once on the operator and applies to every DocumentDB cluster it manages. See [Seccomp configuration](#seccomp-configuration). + +## What enabling the gate does + +When you set `spec.featureGates.IOUring: true`, the operator does two things natively — **no external Kyverno policy or admission webhook is required**: + +1. **Sets `io_method=io_uring`** as a protected PostgreSQL parameter. This value can't be overridden through `spec.postgres.parameters`. (See [PostgreSQL parameter tuning](postgresql-tuning.md) for how protected parameters work.) +2. **Relaxes the PostgreSQL container's seccomp profile** so the three io_uring syscalls are allowed. The operator points the CNPG cluster's pod security context at a Localhost seccomp profile (see [Seccomp configuration](#seccomp-configuration)). + +When the gate is disabled (the default), the operator changes nothing and CNPG keeps its hardened `RuntimeDefault` profile. 
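As a rough illustration of the two changes listed above, the generated CNPG `Cluster` might end up carrying fields along these lines. This is a sketch under assumptions, not the operator's literal output: it assumes the operator writes to CNPG's `spec.postgresql.parameters` and `spec.seccompProfile` fields, and the resource name `my-documentdb` is hypothetical.

```yaml
# Hypothetical excerpt of a generated CNPG Cluster when IOUring is enabled.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: my-documentdb            # assumed name, for illustration only
spec:
  postgresql:
    parameters:
      io_method: io_uring        # step 1: protected parameter set by the operator
  seccompProfile:                # step 2: Localhost profile instead of RuntimeDefault
    type: Localhost
    localhostProfile: profiles/documentdb-iouring.json
```

With the gate off, neither field would be set by the operator and CNPG's `RuntimeDefault` profile stays in place.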
+
+## How to enable
+
+Add the feature gate to your DocumentDB custom resource:
+
+```yaml title="documentdb.yaml"
+apiVersion: documentdb.io/preview
+kind: DocumentDB
+metadata:
+  name: my-documentdb
+spec:
+  nodeCount: 1
+  instancesPerNode: 1
+  resource:
+    storage:
+      pvcSize: "50Gi"
+  featureGates:
+    IOUring: true  # (1)!
+```
+
+1. Opt in to PostgreSQL 18 `io_method=io_uring`. The operator also relaxes the PostgreSQL container seccomp profile using the operator-level Localhost profile.
+
+For the full field reference, see [DocumentDBSpec](preview/api-reference.md#documentdbspec) in the API Reference.
+
+## Seccomp configuration
+
+Which Localhost seccomp profile the operator points the pods at is **operator-level configuration**, set through an environment variable on the operator deployment. The same profile applies to **all** DocumentDB clusters managed by that operator.
+
+| Environment variable | Values | Default | Description |
+|----------------------|--------|---------|-------------|
+| `DOCUMENTDB_IOURING_SECCOMP_PROFILE` | profile path | `profiles/documentdb-iouring.json` | Localhost profile path, relative to the node's kubelet seccomp root (`/var/lib/kubelet/seccomp`). |
+
+With the bundled Helm chart, set this through a first-class value (preferred):
+
+```bash
+helm upgrade --install documentdb-operator <chart> -n documentdb-operator \
+  --set operator.ioUring.seccompProfile=profiles/documentdb-iouring.json
+```
+
+Leaving the value empty keeps the operator's built-in default
+(`profiles/documentdb-iouring.json`). For an already-installed operator you can patch
+the manager container env directly instead:
+
+```yaml title="operator-deployment.yaml (excerpt)"
+spec:
+  template:
+    spec:
+      containers:
+        - name: documentdb-operator
+          env:
+            - name: DOCUMENTDB_IOURING_SECCOMP_PROFILE
+              value: "profiles/documentdb-iouring.json"
+```
+
+The operator points the PostgreSQL pods at a **Localhost** seccomp profile that re-allows only the three io_uring syscalls on top of the runtime default. This keeps the rest of the sandbox intact.
+
+The referenced profile JSON (the upstream `RuntimeDefault` profile **plus** `io_uring_setup`, `io_uring_enter`, and `io_uring_register`) **must be pre-installed on every node that runs PostgreSQL pods**, at the path resolved under `/var/lib/kubelet/seccomp`. If the profile is missing on a node, the pod scheduled there fails to start.
+
+The hands-on [io_uring feature playground](https://github.com/documentdb/documentdb-kubernetes-operator/tree/main/documentdb-playground/io-uring-feature) provides the curated profile plus a kind `extraMount` and a DaemonSet installer that distribute it to every node.
+
+!!! warning "Security trade-off"
+    Relaxing seccomp, even with the hardened Localhost profile that re-allows only the three io_uring syscalls, widens the kernel attack surface. io_uring has been a recurring kernel-exploit vector, so this is a trade-off you accept as the cluster operator. That is why the gate is opt-in and disabled by default.
+
+## Prerequisites
+
+- **PostgreSQL 18 image.** `io_method=io_uring` exists only in PostgreSQL 18 and later. Make sure the cluster runs a PG18 image.
+- **A node kernel with io_uring enabled.** The nodes must run a kernel that exposes io_uring with `io_uring_disabled=0`. Modern AKS, EKS, and GKE node images qualify.
+- **The seccomp profile installed on nodes.** The Localhost profile referenced by `DOCUMENTDB_IOURING_SECCOMP_PROFILE` must exist on every node that runs PostgreSQL pods. The [io_uring feature playground](https://github.com/documentdb/documentdb-kubernetes-operator/tree/main/documentdb-playground/io-uring-feature) automates this.
+
+## Verification
+
+After enabling the gate and waiting for the rolling restart to finish, confirm io_uring is active.
+
+1. **Check that the PostgreSQL pods are running** (not CrashLooping):
+
+    ```bash
+    kubectl get pods -n <namespace> -l documentdb.io/cluster=<cluster-name>
+    ```
+
+2. **Confirm `io_method` is `io_uring`** by connecting to PostgreSQL:
+
+    ```bash
+    kubectl exec -it <pod-name> -n <namespace> -c postgres -- \
+      psql -U postgres -c "SHOW io_method;"
+    ```
+
+    Expected output:
+
+    ```text
+     io_method
+    -----------
+     io_uring
+    (1 row)
+    ```
+
+3. **Inspect the pod's seccomp profile** to confirm the operator relaxed it:
+
+    ```bash
+    kubectl get pod <pod-name> -n <namespace> \
+      -o jsonpath='{.spec.securityContext.seccompProfile}'
+    ```
+
+    The expected value is `{"type":"Localhost","localhostProfile":"profiles/documentdb-iouring.json"}`.
+
+4. **Confirm reads are flowing through the I/O path** with `pg_stat_io`:
+
+    ```bash
+    kubectl exec -it <pod-name> -n <namespace> -c postgres -- \
+      psql -U postgres -c "SELECT backend_type, object, context, reads FROM pg_stat_io WHERE reads > 0;"
+    ```
+
+## Performance
+
+io_uring's measured benefit is primarily **tail-latency stability on I/O-bound scans**, not raw throughput. On Azure Premium SSD at low concurrency, `io_method=io_uring` delivers lower, more predictable p95/p99 latency on heavy range scans and reduces in-engine read-wait time, while point lookups and aggregate throughput are largely unchanged.
+
+## Troubleshooting
+
+### PostgreSQL CrashLoops with "could not setup io_uring queue"
+
+If a PostgreSQL pod restarts repeatedly and its logs show:
+
+```text
+FATAL: could not setup io_uring queue: Operation not permitted
+```
+
+the seccomp profile wasn't relaxed for that pod. Check, in order:
+
+- **Profile not installed on the node.** The profile JSON is missing on the node where the pod is scheduled. Install it on every node that runs PostgreSQL pods; the [io_uring feature playground](https://github.com/documentdb/documentdb-kubernetes-operator/tree/main/documentdb-playground/io-uring-feature) DaemonSet handles this.
+- **Wrong profile path.** `DOCUMENTDB_IOURING_SECCOMP_PROFILE` doesn't match the actual file path under `/var/lib/kubelet/seccomp` on the node. Align the env var with the installed file.
+- **Operator not restarted.** The profile env var changed but the operator deployment wasn't rolled, so new clusters still reference the old path. Verify the pod's seccomp profile with the [verification](#verification) command above.
+
+## Related
+
+- [PostgreSQL parameter tuning](postgresql-tuning.md): how protected parameters such as `io_method` are managed
+- [API Reference: DocumentDBSpec](preview/api-reference.md#documentdbspec): the `featureGates` field
+- [io_uring feature playground](https://github.com/documentdb/documentdb-kubernetes-operator/tree/main/documentdb-playground/io-uring-feature): kind `extraMount`, DaemonSet installer, and curated profile
diff --git a/documentdb-playground/io-uring-feature/README.md b/documentdb-playground/io-uring-feature/README.md
new file mode 100644
index 000000000..590a00636
--- /dev/null
+++ b/documentdb-playground/io-uring-feature/README.md
@@ -0,0 +1,153 @@
+# IOUring Feature Gate Playground
+
+Demonstrate the DocumentDB operator's native `IOUring` feature gate. This is the
+supported opt-in path for PostgreSQL 18 `io_method = io_uring` in DocumentDB.
+
+`io_uring` can improve heavy read-I/O paths, but it is also a recurring Linux
+kernel exploit surface. Container runtimes therefore remove the
+`io_uring_setup`, `io_uring_enter`, and `io_uring_register` syscalls from
+`RuntimeDefault` seccomp profiles. The feature is opt-in so clusters keep the
+hardened default unless an operator admin deliberately enables it.
+
+When `spec.featureGates.IOUring: true` is set on a `DocumentDB` resource, the
+operator does two things:
+
+1. Sets PostgreSQL `io_method=io_uring` on the generated CNPG `Cluster`.
+2. Relaxes the postgres pod seccomp profile by pointing it at a **Localhost**
+   profile that re-allows only the three io_uring syscalls. The profile path is
+   operator-level config (`DOCUMENTDB_IOURING_SECCOMP_PROFILE`, default
+   `profiles/documentdb-iouring.json`) and the profile must be installed on the
+   nodes.
+
+No Kyverno mutation policy is needed here; the DocumentDB operator owns both the
+PostgreSQL parameter and the seccomp wiring.
+
+```mermaid
+flowchart LR
+    DB[DocumentDB CR<br/>featureGates.IOUring=true] --> OP[DocumentDB operator]
+    OP -->|io_method=io_uring| CNPG[CNPG Cluster]
+    OP -->|seccompProfile<br/>Localhost| CNPG
+    CNPG --> PG[PostgreSQL 18 pod]
+    PG --> PVC[(PVC / storage)]
+```
+
+## Prerequisites
+
+- DocumentDB operator version that includes the native `IOUring` feature gate.
+- PostgreSQL 18 image (`ghcr.io/cloudnative-pg/postgresql:18-minimal-trixie` in
+  `manifests/documentdb-iouring.yaml`).
+- Linux nodes with `io_uring_disabled=0`:
+  ```bash
+  kubectl debug node/<node-name> -it --image=busybox:1.36 -- chroot /host cat /proc/sys/kernel/io_uring_disabled
+  ```
+- The Localhost profile installed on every node that can run postgres pods at:
+  `/var/lib/kubelet/seccomp/profiles/documentdb-iouring.json`.
+
+The operator's built-in default already points at
+`profiles/documentdb-iouring.json`, so once that profile is on the nodes you only
+need to enable the gate; no operator env config is required.
+
+## Quick start: local kind cluster
+
+Run from this directory so the kind `extraMounts` path resolves to `./seccomp`:
+
+```bash
+cd documentdb-playground/io-uring-feature
+
+kind create cluster --config kind/kind-cluster.yaml
+
+# Install cert-manager + the DocumentDB operator per the project docs. No io_uring
+# env config is needed: localhost is the only mode and the default profile path
+# already matches the mounted profile.
+
+kubectl apply -f manifests/documentdb-iouring.yaml
+```
+
+The kind config mounts `./seccomp/documentdb-iouring.json` into each node at the
+operator's default Localhost profile path, so the DaemonSet installer is not
+needed for kind.
+
+## Quick start: real cluster
+
+```bash
+cd documentdb-playground/io-uring-feature
+
+# Install the profile on every node.
+kubectl apply -k seccomp/
+kubectl rollout status ds/documentdb-iouring-seccomp-installer -n kube-system --timeout=180s
+
+# Install cert-manager + the DocumentDB operator per the project docs. The default
+# profile path (profiles/documentdb-iouring.json) matches the installed profile.
+
+kubectl apply -f manifests/documentdb-iouring.yaml
+```
+
+To use a different profile path, set it via the first-class Helm value
+`--set operator.ioUring.seccompProfile=<path>` on install, or patch an
+already-installed operator:
+
+```bash
+kubectl patch deployment documentdb-operator -n documentdb-operator \
+  --type strategic --patch-file operator-values/seccomp-profile-patch.yaml
+kubectl rollout status deployment/documentdb-operator -n documentdb-operator
+```
+
+## Verification
+
+Wait for the generated CNPG cluster and primary pod:
+
+```bash
+kubectl get documentdb -n iouring-demo iouring-demo
+kubectl get cluster.postgresql.cnpg.io -n iouring-demo iouring-demo
+kubectl get pods -n iouring-demo -l cnpg.io/cluster=iouring-demo
+
+POD=$(kubectl get pod -n iouring-demo \
+  -l cnpg.io/cluster=iouring-demo,cnpg.io/instanceRole=primary \
+  -o jsonpath='{.items[0].metadata.name}')
+```
+
+Confirm PostgreSQL is using `io_uring`:
+
+```bash
+kubectl exec -n iouring-demo "$POD" -c postgres -- \
+  psql -U postgres -tAc 'SHOW io_method;'
+# expected: io_uring
+```
+
+Confirm seccomp was set by the operator:
+
+```bash
+kubectl get cluster.postgresql.cnpg.io iouring-demo -n iouring-demo \
+  -o jsonpath='{.spec.seccompProfile}{"\n"}'
+kubectl get pod -n iouring-demo "$POD" \
+  -o jsonpath='{.spec.securityContext.seccompProfile}{"\n"}'
+# expected: {"type":"Localhost","localhostProfile":"profiles/documentdb-iouring.json"}
+```
+
+Confirm postgres is not crashlooping and `pg_stat_io` reads are visible:
+
+```bash
+kubectl get pod -n iouring-demo "$POD" \
+  -o jsonpath='{range .status.containerStatuses[*]}{.name}{" restarts="}{.restartCount}{" ready="}{.ready}{"\n"}{end}'
+
+kubectl exec -n iouring-demo "$POD" -c postgres -- psql -U postgres -c \
+  "SELECT backend_type, object, context, reads, read_time FROM pg_stat_io WHERE reads > 0 ORDER BY reads DESC LIMIT 10;"
+```
+
+## Troubleshooting
+
+| Symptom | Likely cause | Fix |
+|---|---|---|
+| `FATAL: could not setup io_uring queue: Operation not permitted` | RuntimeDefault still blocks the io_uring syscalls | Ensure `spec.featureGates.IOUring: true` and that the Localhost profile is installed on the node; recreate/restart postgres pods. |
+| Same crash after installing the profile | Profile missing on some nodes, or installed under a different path | Verify `/var/lib/kubelet/seccomp/profiles/documentdb-iouring.json` exists on every node, or set `DOCUMENTDB_IOURING_SECCOMP_PROFILE` to the installed relative path. |
+| `SHOW io_method;` is not `io_uring` | Feature gate not applied or old operator version | Check `kubectl get documentdb -n iouring-demo iouring-demo -o yaml` and operator logs. |
+
+## File reference
+
+| Path | Purpose |
+|---|---|
+| `manifests/documentdb-iouring.yaml` | Namespace, demo credentials Secret, and `DocumentDB` CR with `featureGates.IOUring: true`. |
+| `seccomp/documentdb-iouring.json` | Curated RuntimeDefault-equivalent profile plus `io_uring_*` syscalls. |
+| `seccomp/deploy-seccomp-daemonset.yaml` + `seccomp/kustomization.yaml` | Installs the profile to real cluster nodes. |
+| `kind/kind-cluster.yaml` | Local kind cluster config that mounts `./seccomp` into kubelet's Localhost profile directory. |
+| `operator-values/seccomp-profile-patch.yaml` | Optional deployment patch to override `DOCUMENTDB_IOURING_SECCOMP_PROFILE`. |
diff --git a/documentdb-playground/io-uring-feature/kind/kind-cluster.yaml b/documentdb-playground/io-uring-feature/kind/kind-cluster.yaml
new file mode 100644
index 000000000..3c30c10be
--- /dev/null
+++ b/documentdb-playground/io-uring-feature/kind/kind-cluster.yaml
@@ -0,0 +1,25 @@
+# Run from documentdb-playground/io-uring-feature:
+#   kind create cluster --config kind/kind-cluster.yaml
+#
+# Mounts ./seccomp/documentdb-iouring.json into each kind node at the kubelet
+# Localhost seccomp profile path used by the operator default:
+#   /var/lib/kubelet/seccomp/profiles/documentdb-iouring.json
+apiVersion: kind.x-k8s.io/v1alpha4
+kind: Cluster
+name: documentdb-iouring
+nodes:
+  - role: control-plane
+    extraMounts:
+      - hostPath: ./seccomp
+        containerPath: /var/lib/kubelet/seccomp/profiles
+        readOnly: true
+  - role: worker
+    extraMounts:
+      - hostPath: ./seccomp
+        containerPath: /var/lib/kubelet/seccomp/profiles
+        readOnly: true
+  - role: worker
+    extraMounts:
+      - hostPath: ./seccomp
+        containerPath: /var/lib/kubelet/seccomp/profiles
+        readOnly: true
diff --git a/documentdb-playground/io-uring-feature/manifests/documentdb-iouring.yaml b/documentdb-playground/io-uring-feature/manifests/documentdb-iouring.yaml
new file mode 100644
index 000000000..675214d49
--- /dev/null
+++ b/documentdb-playground/io-uring-feature/manifests/documentdb-iouring.yaml
@@ -0,0 +1,50 @@
+apiVersion: v1
+kind: Namespace
+metadata:
+  name: iouring-demo
+  labels:
+    app.kubernetes.io/part-of: io-uring-feature
+---
+# Throwaway gateway credentials for the IOUring feature playground.
+# Replace before using outside a local demo cluster.
+apiVersion: v1 +kind: Secret +metadata: + name: iouring-demo-credentials + namespace: iouring-demo + labels: + app.kubernetes.io/part-of: io-uring-feature +type: Opaque +stringData: + username: demo + password: "ChangeMe!ReplaceBeforeUsing" +--- +apiVersion: documentdb.io/preview +kind: DocumentDB +metadata: + name: iouring-demo + namespace: iouring-demo + labels: + app.kubernetes.io/part-of: io-uring-feature +spec: + nodeCount: 1 + instancesPerNode: 1 + documentDbCredentialSecret: iouring-demo-credentials + image: + postgres: ghcr.io/cloudnative-pg/postgresql:18-minimal-trixie + resource: + storage: + pvcSize: "50Gi" + # storageClass: managed-csi-premium + persistentVolumeReclaimPolicy: Delete + memory: "4Gi" + cpu: "2" + postgres: + parameters: + track_io_timing: "on" + featureGates: + IOUring: true + exposeViaService: + serviceType: ClusterIP + plugins: + sidecarInjectorName: cnpg-i-sidecar-injector.documentdb.io diff --git a/documentdb-playground/io-uring-feature/operator-values/seccomp-profile-patch.yaml b/documentdb-playground/io-uring-feature/operator-values/seccomp-profile-patch.yaml new file mode 100644 index 000000000..139e217a0 --- /dev/null +++ b/documentdb-playground/io-uring-feature/operator-values/seccomp-profile-patch.yaml @@ -0,0 +1,15 @@ +# OPTIONAL strategic-merge patch for an ALREADY-INSTALLED operator that needs a +# CUSTOM Localhost seccomp profile path. The operator already defaults to +# profiles/documentdb-iouring.json, so this is only needed to override that path. 
+# For fresh installs prefer the first-class Helm value: +# --set operator.ioUring.seccompProfile=profiles/documentdb-iouring.json +# Apply with: +# kubectl patch deployment documentdb-operator -n documentdb-operator --type strategic --patch-file operator-values/seccomp-profile-patch.yaml +spec: + template: + spec: + containers: + - name: documentdb-operator + env: + - name: DOCUMENTDB_IOURING_SECCOMP_PROFILE + value: profiles/documentdb-iouring.json diff --git a/documentdb-playground/io-uring-feature/seccomp/deploy-seccomp-daemonset.yaml b/documentdb-playground/io-uring-feature/seccomp/deploy-seccomp-daemonset.yaml new file mode 100644 index 000000000..661f10693 --- /dev/null +++ b/documentdb-playground/io-uring-feature/seccomp/deploy-seccomp-daemonset.yaml @@ -0,0 +1,59 @@ +# DaemonSet that installs the IOUring Localhost seccomp profile onto every node: +# /var/lib/kubelet/seccomp/profiles/documentdb-iouring.json +# +# The DocumentDB operator's localhost mode references this default profile as: +# localhostProfile: profiles/documentdb-iouring.json +apiVersion: apps/v1 +kind: DaemonSet +metadata: + name: documentdb-iouring-seccomp-installer + namespace: kube-system + labels: + app.kubernetes.io/name: documentdb-iouring-seccomp-installer + app.kubernetes.io/part-of: io-uring-feature +spec: + selector: + matchLabels: + app.kubernetes.io/name: documentdb-iouring-seccomp-installer + template: + metadata: + labels: + app.kubernetes.io/name: documentdb-iouring-seccomp-installer + app.kubernetes.io/part-of: io-uring-feature + spec: + tolerations: + - operator: Exists + initContainers: + - name: install-profile + image: busybox:1.36 + command: ["/bin/sh", "-c"] + args: + - | + set -eu + install -D -m 0644 /profile/documentdb-iouring.json /host-seccomp/profiles/documentdb-iouring.json + echo "installed profiles/documentdb-iouring.json on $(hostname)" + volumeMounts: + - name: profile + mountPath: /profile + readOnly: true + - name: host-seccomp + mountPath: /host-seccomp + 
containers: + - name: pause + image: busybox:1.36 + command: ["/bin/sh", "-c", "trap 'exit 0' TERM; while true; do sleep 3600 & wait $!; done"] + resources: + requests: + cpu: 10m + memory: 16Mi + limits: + cpu: 50m + memory: 32Mi + volumes: + - name: profile + configMap: + name: documentdb-iouring-seccomp-profile + - name: host-seccomp + hostPath: + path: /var/lib/kubelet/seccomp + type: DirectoryOrCreate diff --git a/documentdb-playground/io-uring-feature/seccomp/documentdb-iouring.json b/documentdb-playground/io-uring-feature/seccomp/documentdb-iouring.json new file mode 100644 index 000000000..fd76a62cc --- /dev/null +++ b/documentdb-playground/io-uring-feature/seccomp/documentdb-iouring.json @@ -0,0 +1,798 @@ +{ + "archMap": [ + { + "architecture": "SCMP_ARCH_X86_64", + "subArchitectures": [ + "SCMP_ARCH_X86", + "SCMP_ARCH_X32" + ] + }, + { + "architecture": "SCMP_ARCH_AARCH64", + "subArchitectures": [ + "SCMP_ARCH_ARM" + ] + }, + { + "architecture": "SCMP_ARCH_MIPS64", + "subArchitectures": [ + "SCMP_ARCH_MIPS", + "SCMP_ARCH_MIPS64N32" + ] + }, + { + "architecture": "SCMP_ARCH_MIPS64N32", + "subArchitectures": [ + "SCMP_ARCH_MIPS", + "SCMP_ARCH_MIPS64" + ] + }, + { + "architecture": "SCMP_ARCH_MIPSEL64", + "subArchitectures": [ + "SCMP_ARCH_MIPSEL", + "SCMP_ARCH_MIPSEL64N32" + ] + }, + { + "architecture": "SCMP_ARCH_MIPSEL64N32", + "subArchitectures": [ + "SCMP_ARCH_MIPSEL", + "SCMP_ARCH_MIPSEL64" + ] + }, + { + "architecture": "SCMP_ARCH_S390X", + "subArchitectures": [ + "SCMP_ARCH_S390" + ] + } + ], + "defaultAction": "SCMP_ACT_ERRNO", + "syscalls": [ + { + "action": "SCMP_ACT_ALLOW", + "args": [], + "comment": "", + "excludes": {}, + "includes": {}, + "names": [ + "accept", + "accept4", + "access", + "adjtimex", + "alarm", + "bind", + "brk", + "capget", + "capset", + "chdir", + "chmod", + "chown", + "chown32", + "clock_getres", + "clock_gettime", + "clock_nanosleep", + "close", + "connect", + "copy_file_range", + "creat", + "dup", + "dup2", + "dup3", + 
"epoll_create", + "epoll_create1", + "epoll_ctl", + "epoll_ctl_old", + "epoll_pwait", + "epoll_wait", + "epoll_wait_old", + "eventfd", + "eventfd2", + "execve", + "execveat", + "exit", + "exit_group", + "faccessat", + "fadvise64", + "fadvise64_64", + "fallocate", + "fanotify_mark", + "fchdir", + "fchmod", + "fchmodat", + "fchown", + "fchown32", + "fchownat", + "fcntl", + "fcntl64", + "fdatasync", + "fgetxattr", + "flistxattr", + "flock", + "fork", + "fremovexattr", + "fsetxattr", + "fstat", + "fstat64", + "fstatat64", + "fstatfs", + "fstatfs64", + "fsync", + "ftruncate", + "ftruncate64", + "futex", + "futimesat", + "getcpu", + "getcwd", + "getdents", + "getdents64", + "getegid", + "getegid32", + "geteuid", + "geteuid32", + "getgid", + "getgid32", + "getgroups", + "getgroups32", + "getitimer", + "getpeername", + "getpgid", + "getpgrp", + "getpid", + "getppid", + "getpriority", + "getrandom", + "getresgid", + "getresgid32", + "getresuid", + "getresuid32", + "getrlimit", + "get_robust_list", + "getrusage", + "getsid", + "getsockname", + "getsockopt", + "get_thread_area", + "gettid", + "gettimeofday", + "getuid", + "getuid32", + "getxattr", + "inotify_add_watch", + "inotify_init", + "inotify_init1", + "inotify_rm_watch", + "io_cancel", + "ioctl", + "io_destroy", + "io_getevents", + "io_pgetevents", + "ioprio_get", + "ioprio_set", + "io_setup", + "io_submit", + "io_uring_enter", + "io_uring_register", + "io_uring_setup", + "ipc", + "kill", + "lchown", + "lchown32", + "lgetxattr", + "link", + "linkat", + "listen", + "listxattr", + "llistxattr", + "_llseek", + "lremovexattr", + "lseek", + "lsetxattr", + "lstat", + "lstat64", + "madvise", + "memfd_create", + "mincore", + "mkdir", + "mkdirat", + "mknod", + "mknodat", + "mlock", + "mlock2", + "mlockall", + "mmap", + "mmap2", + "mprotect", + "mq_getsetattr", + "mq_notify", + "mq_open", + "mq_timedreceive", + "mq_timedsend", + "mq_unlink", + "mremap", + "msgctl", + "msgget", + "msgrcv", + "msgsnd", + "msync", + "munlock", + 
"munlockall", + "munmap", + "nanosleep", + "newfstatat", + "_newselect", + "open", + "openat", + "pause", + "pipe", + "pipe2", + "poll", + "ppoll", + "prctl", + "pread64", + "preadv", + "preadv2", + "prlimit64", + "pselect6", + "pwrite64", + "pwritev", + "pwritev2", + "read", + "readahead", + "readlink", + "readlinkat", + "readv", + "recv", + "recvfrom", + "recvmmsg", + "recvmsg", + "remap_file_pages", + "removexattr", + "rename", + "renameat", + "renameat2", + "restart_syscall", + "rmdir", + "rt_sigaction", + "rt_sigpending", + "rt_sigprocmask", + "rt_sigqueueinfo", + "rt_sigreturn", + "rt_sigsuspend", + "rt_sigtimedwait", + "rt_tgsigqueueinfo", + "sched_getaffinity", + "sched_getattr", + "sched_getparam", + "sched_get_priority_max", + "sched_get_priority_min", + "sched_getscheduler", + "sched_rr_get_interval", + "sched_setaffinity", + "sched_setattr", + "sched_setparam", + "sched_setscheduler", + "sched_yield", + "seccomp", + "select", + "semctl", + "semget", + "semop", + "semtimedop", + "send", + "sendfile", + "sendfile64", + "sendmmsg", + "sendmsg", + "sendto", + "setfsgid", + "setfsgid32", + "setfsuid", + "setfsuid32", + "setgid", + "setgid32", + "setgroups", + "setgroups32", + "setitimer", + "setpgid", + "setpriority", + "setregid", + "setregid32", + "setresgid", + "setresgid32", + "setresuid", + "setresuid32", + "setreuid", + "setreuid32", + "setrlimit", + "set_robust_list", + "setsid", + "setsockopt", + "set_thread_area", + "set_tid_address", + "setuid", + "setuid32", + "setxattr", + "shmat", + "shmctl", + "shmdt", + "shmget", + "shutdown", + "sigaltstack", + "signalfd", + "signalfd4", + "sigprocmask", + "sigreturn", + "socket", + "socketcall", + "socketpair", + "splice", + "stat", + "stat64", + "statfs", + "statfs64", + "statx", + "symlink", + "symlinkat", + "sync", + "sync_file_range", + "syncfs", + "sysinfo", + "tee", + "tgkill", + "time", + "timer_create", + "timer_delete", + "timerfd_create", + "timerfd_gettime", + "timerfd_settime", + 
"timer_getoverrun", + "timer_gettime", + "timer_settime", + "times", + "tkill", + "truncate", + "truncate64", + "ugetrlimit", + "umask", + "uname", + "unlink", + "unlinkat", + "utime", + "utimensat", + "utimes", + "vfork", + "vmsplice", + "wait4", + "waitid", + "waitpid", + "write", + "writev" + ] + }, + { + "action": "SCMP_ACT_ALLOW", + "args": null, + "comment": "", + "excludes": {}, + "includes": { + "minKernel": "4.8" + }, + "names": [ + "ptrace" + ] + }, + { + "action": "SCMP_ACT_ALLOW", + "args": [ + { + "index": 0, + "op": "SCMP_CMP_EQ", + "value": 0, + "valueTwo": 0 + } + ], + "comment": "", + "excludes": {}, + "includes": {}, + "names": [ + "personality" + ] + }, + { + "action": "SCMP_ACT_ALLOW", + "args": [ + { + "index": 0, + "op": "SCMP_CMP_EQ", + "value": 8, + "valueTwo": 0 + } + ], + "comment": "", + "excludes": {}, + "includes": {}, + "names": [ + "personality" + ] + }, + { + "action": "SCMP_ACT_ALLOW", + "args": [ + { + "index": 0, + "op": "SCMP_CMP_EQ", + "value": 131072, + "valueTwo": 0 + } + ], + "comment": "", + "excludes": {}, + "includes": {}, + "names": [ + "personality" + ] + }, + { + "action": "SCMP_ACT_ALLOW", + "args": [ + { + "index": 0, + "op": "SCMP_CMP_EQ", + "value": 131080, + "valueTwo": 0 + } + ], + "comment": "", + "excludes": {}, + "includes": {}, + "names": [ + "personality" + ] + }, + { + "action": "SCMP_ACT_ALLOW", + "args": [ + { + "index": 0, + "op": "SCMP_CMP_EQ", + "value": 4294967295, + "valueTwo": 0 + } + ], + "comment": "", + "excludes": {}, + "includes": {}, + "names": [ + "personality" + ] + }, + { + "action": "SCMP_ACT_ALLOW", + "args": [], + "comment": "", + "excludes": {}, + "includes": { + "arches": [ + "ppc64le" + ] + }, + "names": [ + "sync_file_range2" + ] + }, + { + "action": "SCMP_ACT_ALLOW", + "args": [], + "comment": "", + "excludes": {}, + "includes": { + "arches": [ + "arm", + "arm64" + ] + }, + "names": [ + "arm_fadvise64_64", + "arm_sync_file_range", + "sync_file_range2", + "breakpoint", + "cacheflush", 
+ "set_tls" + ] + }, + { + "action": "SCMP_ACT_ALLOW", + "args": [], + "comment": "", + "excludes": {}, + "includes": { + "arches": [ + "amd64", + "x32" + ] + }, + "names": [ + "arch_prctl" + ] + }, + { + "action": "SCMP_ACT_ALLOW", + "args": [], + "comment": "", + "excludes": {}, + "includes": { + "arches": [ + "amd64", + "x32", + "x86" + ] + }, + "names": [ + "modify_ldt" + ] + }, + { + "action": "SCMP_ACT_ALLOW", + "args": [], + "comment": "", + "excludes": {}, + "includes": { + "arches": [ + "s390", + "s390x" + ] + }, + "names": [ + "s390_pci_mmio_read", + "s390_pci_mmio_write", + "s390_runtime_instr" + ] + }, + { + "action": "SCMP_ACT_ALLOW", + "args": [], + "comment": "", + "excludes": {}, + "includes": { + "caps": [ + "CAP_DAC_READ_SEARCH" + ] + }, + "names": [ + "open_by_handle_at" + ] + }, + { + "action": "SCMP_ACT_ALLOW", + "args": [], + "comment": "", + "excludes": {}, + "includes": { + "caps": [ + "CAP_SYS_ADMIN" + ] + }, + "names": [ + "bpf", + "clone", + "fanotify_init", + "lookup_dcookie", + "mount", + "name_to_handle_at", + "perf_event_open", + "quotactl", + "setdomainname", + "sethostname", + "setns", + "syslog", + "umount", + "umount2", + "unshare" + ] + }, + { + "action": "SCMP_ACT_ALLOW", + "args": [ + { + "index": 0, + "op": "SCMP_CMP_MASKED_EQ", + "value": 2114060288, + "valueTwo": 0 + } + ], + "comment": "", + "excludes": { + "arches": [ + "s390", + "s390x" + ], + "caps": [ + "CAP_SYS_ADMIN" + ] + }, + "includes": {}, + "names": [ + "clone" + ] + }, + { + "action": "SCMP_ACT_ALLOW", + "args": [ + { + "index": 1, + "op": "SCMP_CMP_MASKED_EQ", + "value": 2114060288, + "valueTwo": 0 + } + ], + "comment": "s390 parameter ordering for clone is different", + "excludes": { + "caps": [ + "CAP_SYS_ADMIN" + ] + }, + "includes": { + "arches": [ + "s390", + "s390x" + ] + }, + "names": [ + "clone" + ] + }, + { + "action": "SCMP_ACT_ALLOW", + "args": [], + "comment": "", + "excludes": {}, + "includes": { + "caps": [ + "CAP_SYS_BOOT" + ] + }, + "names": [ + 
"reboot" + ] + }, + { + "action": "SCMP_ACT_ALLOW", + "args": [], + "comment": "", + "excludes": {}, + "includes": { + "caps": [ + "CAP_SYS_CHROOT" + ] + }, + "names": [ + "chroot" + ] + }, + { + "action": "SCMP_ACT_ALLOW", + "args": [], + "comment": "", + "excludes": {}, + "includes": { + "caps": [ + "CAP_SYS_MODULE" + ] + }, + "names": [ + "delete_module", + "init_module", + "finit_module", + "query_module" + ] + }, + { + "action": "SCMP_ACT_ALLOW", + "args": [], + "comment": "", + "excludes": {}, + "includes": { + "caps": [ + "CAP_SYS_PACCT" + ] + }, + "names": [ + "acct" + ] + }, + { + "action": "SCMP_ACT_ALLOW", + "args": [], + "comment": "", + "excludes": {}, + "includes": { + "caps": [ + "CAP_SYS_PTRACE" + ] + }, + "names": [ + "kcmp", + "process_vm_readv", + "process_vm_writev", + "ptrace" + ] + }, + { + "action": "SCMP_ACT_ALLOW", + "args": [], + "comment": "", + "excludes": {}, + "includes": { + "caps": [ + "CAP_SYS_RAWIO" + ] + }, + "names": [ + "iopl", + "ioperm" + ] + }, + { + "action": "SCMP_ACT_ALLOW", + "args": [], + "comment": "", + "excludes": {}, + "includes": { + "caps": [ + "CAP_SYS_TIME" + ] + }, + "names": [ + "settimeofday", + "stime", + "clock_settime" + ] + }, + { + "action": "SCMP_ACT_ALLOW", + "args": [], + "comment": "", + "excludes": {}, + "includes": { + "caps": [ + "CAP_SYS_TTY_CONFIG" + ] + }, + "names": [ + "vhangup" + ] + }, + { + "action": "SCMP_ACT_ALLOW", + "args": [], + "comment": "", + "excludes": {}, + "includes": { + "caps": [ + "CAP_SYS_NICE" + ] + }, + "names": [ + "get_mempolicy", + "mbind", + "set_mempolicy" + ] + }, + { + "action": "SCMP_ACT_ALLOW", + "args": [], + "comment": "", + "excludes": {}, + "includes": { + "caps": [ + "CAP_SYSLOG" + ] + }, + "names": [ + "syslog" + ] + } + ] +} \ No newline at end of file diff --git a/documentdb-playground/io-uring-feature/seccomp/kustomization.yaml b/documentdb-playground/io-uring-feature/seccomp/kustomization.yaml new file mode 100644 index 000000000..ff84cea50 --- /dev/null 
+++ b/documentdb-playground/io-uring-feature/seccomp/kustomization.yaml @@ -0,0 +1,21 @@ +# Apply with: kubectl apply -k seccomp/ +# +# Generates the curated IOUring seccomp profile ConfigMap and deploys the +# installer DaemonSet that writes it to each node's kubelet seccomp dir. +apiVersion: kustomize.config.k8s.io/v1beta1 +kind: Kustomization + +namespace: kube-system + +configMapGenerator: + - name: documentdb-iouring-seccomp-profile + files: + - documentdb-iouring.json=documentdb-iouring.json + +generatorOptions: + disableNameSuffixHash: true + labels: + app.kubernetes.io/part-of: io-uring-feature + +resources: + - deploy-seccomp-daemonset.yaml diff --git a/mkdocs.yml b/mkdocs.yml index 22deae9bc..4d964bdef 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -47,6 +47,7 @@ nav: - Networking: preview/configuration/networking.md - TLS: preview/configuration/tls.md - PostgreSQL Tuning: postgresql-tuning.md + - io_uring Async I/O: io-uring.md - Operations: - Upgrades: preview/operations/upgrades.md - Failover: preview/operations/failover.md diff --git a/operator/documentdb-helm-chart/crds/documentdb.io_dbs.yaml b/operator/documentdb-helm-chart/crds/documentdb.io_dbs.yaml index 425d33b25..672830030 100644 --- a/operator/documentdb-helm-chart/crds/documentdb.io_dbs.yaml +++ b/operator/documentdb-helm-chart/crds/documentdb.io_dbs.yaml @@ -1190,8 +1190,9 @@ spec: 3. 
Add a default entry in the featureGateDefaults map in documentdb_types.go type: object x-kubernetes-validations: - - message: 'unsupported feature gate key; allowed keys: ChangeStreams' - rule: self.all(key, key in ['ChangeStreams']) + - message: 'unsupported feature gate key; allowed keys: ChangeStreams, + IOUring' + rule: self.all(key, key in ['ChangeStreams', 'IOUring']) image: description: |- Image groups container image settings for the DocumentDB stack diff --git a/operator/documentdb-helm-chart/templates/09_documentdb_operator.yaml b/operator/documentdb-helm-chart/templates/09_documentdb_operator.yaml index 80422be25..6bb900f32 100644 --- a/operator/documentdb-helm-chart/templates/09_documentdb_operator.yaml +++ b/operator/documentdb-helm-chart/templates/09_documentdb_operator.yaml @@ -105,6 +105,10 @@ spec: - name: DOCUMENTDB_IMAGE_PULL_POLICY value: "{{ .Values.documentDbImagePullPolicy }}" {{- end }} + {{- if .Values.operator.ioUring.seccompProfile }} + - name: DOCUMENTDB_IOURING_SECCOMP_PROFILE + value: "{{ .Values.operator.ioUring.seccompProfile }}" + {{- end }} volumes: - name: webhook-cert secret: diff --git a/operator/documentdb-helm-chart/tests/09_operator_deployment_test.yaml b/operator/documentdb-helm-chart/tests/09_operator_deployment_test.yaml index 8e175b977..fb78474c9 100644 --- a/operator/documentdb-helm-chart/tests/09_operator_deployment_test.yaml +++ b/operator/documentdb-helm-chart/tests/09_operator_deployment_test.yaml @@ -127,6 +127,26 @@ tests: name: GATEWAY_PORT value: "10260" + - it: should set DOCUMENTDB_IOURING_SECCOMP_PROFILE when configured + set: + operator: + ioUring: + seccompProfile: "profiles/custom.json" + asserts: + - contains: + path: spec.template.spec.containers[0].env + content: + name: DOCUMENTDB_IOURING_SECCOMP_PROFILE + value: "profiles/custom.json" + + - it: should omit io_uring seccomp env var by default + asserts: + - notContains: + path: spec.template.spec.containers[0].env + content: + name: 
DOCUMENTDB_IOURING_SECCOMP_PROFILE + any: true + # ------------------------------------------------------------------- # Service account # ------------------------------------------------------------------- diff --git a/operator/documentdb-helm-chart/values.yaml b/operator/documentdb-helm-chart/values.yaml index 27128577f..106092a5c 100644 --- a/operator/documentdb-helm-chart/values.yaml +++ b/operator/documentdb-helm-chart/values.yaml @@ -107,6 +107,18 @@ operator: affinity: {} topologySpreadConstraints: [] priorityClassName: "" + # io_uring (PostgreSQL 18 asynchronous I/O) opt-in support. Enabling the + # IOUring feature gate on a DocumentDB resource makes the operator relax the + # postgres container seccomp profile so the io_uring syscalls are allowed. + # This operator-level setting controls the Localhost seccomp profile used for + # every DocumentDB managed by this operator. Leave empty to use the operator's + # built-in default (profiles/documentdb-iouring.json). + # See docs/operator-public-documentation/io-uring.md. + ioUring: + # seccompProfile: Localhost profile path relative to /var/lib/kubelet/seccomp. + # The profile must be installed on every node that runs postgres pods. + # Empty string keeps the operator default (profiles/documentdb-iouring.json). + seccompProfile: "" sidecarInjector: # See operator.resources comment — requests-only by convention. diff --git a/operator/src/api/preview/documentdb_funcs.go b/operator/src/api/preview/documentdb_funcs.go index 0221fdf5f..a91249562 100644 --- a/operator/src/api/preview/documentdb_funcs.go +++ b/operator/src/api/preview/documentdb_funcs.go @@ -8,6 +8,7 @@ package preview // in a future version, simply change its value here — no CRD schema change is needed. var featureGateDefaults = map[string]bool{ FeatureGateChangeStreams: false, + FeatureGateIOUring: false, } // IsFeatureGateEnabled checks whether a named feature gate is enabled for the given DocumentDB instance. 
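Reviewer note: the `featureGateDefaults` entry added above keeps `IOUring` off unless a DocumentDB resource sets it explicitly. The body of `IsFeatureGateEnabled` is not part of this diff, so the following is only a minimal sketch of the lookup behavior that the accompanying unit tests describe (explicit value wins, otherwise the default, unknown gates are false). The names `gateDefaults` and `isGateEnabled` are hypothetical stand-ins, not the operator's identifiers.

```go
package main

import "fmt"

// gateDefaults is a hypothetical standalone copy of the idea behind
// featureGateDefaults: each known gate has a default that applies when
// the resource does not set the gate itself.
var gateDefaults = map[string]bool{
	"ChangeStreams": false,
	"IOUring":       false,
}

// isGateEnabled is a hypothetical stand-in for IsFeatureGateEnabled.
// An explicit entry in the resource's map wins; otherwise the default
// is used. Indexing a nil map is safe in Go and reports "not present".
func isGateEnabled(gates map[string]bool, name string) bool {
	if v, ok := gates[name]; ok {
		return v
	}
	// A gate missing from gateDefaults yields the zero value, false.
	return gateDefaults[name]
}

func main() {
	fmt.Println(isGateEnabled(nil, "IOUring"))
	fmt.Println(isGateEnabled(map[string]bool{"IOUring": true}, "IOUring"))
	fmt.Println(isGateEnabled(map[string]bool{"IOUring": false}, "IOUring"))
	fmt.Println(isGateEnabled(nil, "UnknownFeature"))
}
```

Keeping the default in a map rather than in the CRD schema matches the comment in the hunk: a default can be flipped later without a schema change.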
diff --git a/operator/src/api/preview/documentdb_funcs_test.go b/operator/src/api/preview/documentdb_funcs_test.go index cda5c1c29..9fa8f13f2 100644 --- a/operator/src/api/preview/documentdb_funcs_test.go +++ b/operator/src/api/preview/documentdb_funcs_test.go @@ -19,6 +19,10 @@ var _ = Describe("IsFeatureGateEnabled", func() { It("returns the default value (false) for ChangeStreams", func() { Expect(IsFeatureGateEnabled(documentdb, FeatureGateChangeStreams)).To(BeFalse()) }) + + It("returns the default value (false) for IOUring", func() { + Expect(IsFeatureGateEnabled(documentdb, FeatureGateIOUring)).To(BeFalse()) + }) }) Context("when featureGates is an empty map", func() { @@ -29,6 +33,10 @@ var _ = Describe("IsFeatureGateEnabled", func() { It("returns the default value (false) for ChangeStreams", func() { Expect(IsFeatureGateEnabled(documentdb, FeatureGateChangeStreams)).To(BeFalse()) }) + + It("returns the default value (false) for IOUring", func() { + Expect(IsFeatureGateEnabled(documentdb, FeatureGateIOUring)).To(BeFalse()) + }) }) Context("when ChangeStreams is explicitly enabled", func() { @@ -55,6 +63,30 @@ var _ = Describe("IsFeatureGateEnabled", func() { }) }) + Context("when IOUring is explicitly enabled", func() { + BeforeEach(func() { + documentdb.Spec.FeatureGates = map[string]bool{ + FeatureGateIOUring: true, + } + }) + + It("returns true", func() { + Expect(IsFeatureGateEnabled(documentdb, FeatureGateIOUring)).To(BeTrue()) + }) + }) + + Context("when IOUring is explicitly disabled", func() { + BeforeEach(func() { + documentdb.Spec.FeatureGates = map[string]bool{ + FeatureGateIOUring: false, + } + }) + + It("returns false", func() { + Expect(IsFeatureGateEnabled(documentdb, FeatureGateIOUring)).To(BeFalse()) + }) + }) + Context("when an unknown feature gate is queried", func() { It("returns false when featureGates is nil", func() { Expect(IsFeatureGateEnabled(documentdb, "UnknownFeature")).To(BeFalse()) diff --git 
a/operator/src/api/preview/documentdb_types.go b/operator/src/api/preview/documentdb_types.go index ec69f59e7..b0cae72f1 100644 --- a/operator/src/api/preview/documentdb_types.go +++ b/operator/src/api/preview/documentdb_types.go @@ -13,6 +13,13 @@ import ( const ( // FeatureGateChangeStreams enables change stream support by setting wal_level=logical. FeatureGateChangeStreams = "ChangeStreams" + + // FeatureGateIOUring enables PostgreSQL 18 asynchronous I/O via io_method=io_uring + // and relaxes the postgres container seccomp profile so the io_uring_setup/enter/register + // syscalls (stripped from the container runtime's default profile) are allowed. + // Opt-in only: io_uring has been a recurring kernel-exploit surface, so it is disabled + // by default. See docs/operator-public-documentation/io-uring.md. + FeatureGateIOUring = "IOUring" ) // DocumentDBSpec defines the desired state of DocumentDB. @@ -103,7 +110,7 @@ type DocumentDBSpec struct { // 3. Add a default entry in the featureGateDefaults map in documentdb_types.go // // +optional - // +kubebuilder:validation:XValidation:rule="self.all(key, key in ['ChangeStreams'])",message="unsupported feature gate key; allowed keys: ChangeStreams" + // +kubebuilder:validation:XValidation:rule="self.all(key, key in ['ChangeStreams', 'IOUring'])",message="unsupported feature gate key; allowed keys: ChangeStreams, IOUring" FeatureGates map[string]bool `json:"featureGates,omitempty"` // SchemaVersion controls the desired schema version for the DocumentDB extension. diff --git a/operator/src/config/crd/bases/documentdb.io_dbs.yaml b/operator/src/config/crd/bases/documentdb.io_dbs.yaml index 425d33b25..672830030 100644 --- a/operator/src/config/crd/bases/documentdb.io_dbs.yaml +++ b/operator/src/config/crd/bases/documentdb.io_dbs.yaml @@ -1190,8 +1190,9 @@ spec: 3. 
Add a default entry in the featureGateDefaults map in documentdb_types.go type: object x-kubernetes-validations: - - message: 'unsupported feature gate key; allowed keys: ChangeStreams' - rule: self.all(key, key in ['ChangeStreams']) + - message: 'unsupported feature gate key; allowed keys: ChangeStreams, + IOUring' + rule: self.all(key, key in ['ChangeStreams', 'IOUring']) image: description: |- Image groups container image settings for the DocumentDB stack diff --git a/operator/src/internal/cnpg/cnpg_cluster.go b/operator/src/internal/cnpg/cnpg_cluster.go index bc8923993..591c5410b 100644 --- a/operator/src/internal/cnpg/cnpg_cluster.go +++ b/operator/src/internal/cnpg/cnpg_cluster.go @@ -127,6 +127,7 @@ func GetCnpgClusterSpec(req ctrl.Request, documentdb *dbpreview.DocumentDB, docu } spec.MaxStopDelay = getMaxStopDelayOrDefault(documentdb) applyPostgresProcessIdentity(&spec, documentdb) + applyIOUringSeccomp(&spec, documentdb) return spec }(), @@ -324,6 +325,29 @@ func applyPostgresProcessIdentity(spec *cnpgv1.ClusterSpec, documentdb *dbprevie } } +// applyIOUringSeccomp relaxes the postgres container seccomp profile when the +// IOUring feature gate is enabled. CNPG runs the postgres pods with +// seccompProfile=RuntimeDefault, but the container runtime strips the +// io_uring_{setup,enter,register} syscalls from that profile, so io_method=io_uring +// would otherwise crash with "could not setup io_uring queue: Operation not permitted". +// +// The operator references a Localhost seccomp profile that re-allows only the three +// io_uring syscalls. The profile path is operator-level configuration (the same +// decision applies to every DocumentDB on the cluster) and must be installed on every +// node that runs postgres pods (see the io-uring feature playground). +// +// No-op when the gate is disabled, so CNPG keeps its RuntimeDefault. 
+func applyIOUringSeccomp(spec *cnpgv1.ClusterSpec, documentdb *dbpreview.DocumentDB) { + if !dbpreview.IsFeatureGateEnabled(documentdb, dbpreview.FeatureGateIOUring) { + return + } + profile := cmp.Or(os.Getenv(util.IOURING_SECCOMP_PROFILE_ENV), util.DEFAULT_IOURING_SECCOMP_PROFILE) + spec.SeccompProfile = &corev1.SeccompProfile{ + Type: corev1.SeccompProfileTypeLocalhost, + LocalhostProfile: pointer.String(profile), + } +} + // buildPostgresConfiguration returns the cnpgv1.PostgresConfiguration block // for the cluster. // diff --git a/operator/src/internal/cnpg/cnpg_cluster_test.go b/operator/src/internal/cnpg/cnpg_cluster_test.go index 880786ca6..8b7c7ad19 100644 --- a/operator/src/internal/cnpg/cnpg_cluster_test.go +++ b/operator/src/internal/cnpg/cnpg_cluster_test.go @@ -521,6 +521,66 @@ var _ = Describe("GetCnpgClusterSpec", func() { }) }) + Context("IOUring seccomp profile", func() { + var req ctrl.Request + + BeforeEach(func() { + req = ctrl.Request{} + req.Name = "test-cluster" + req.Namespace = "default" + }) + + createDocumentDB := func(featureGateEnabled bool) *dbpreview.DocumentDB { + documentdb := &dbpreview.DocumentDB{ + Spec: dbpreview.DocumentDBSpec{ + InstancesPerNode: 1, + Resource: dbpreview.Resource{ + Storage: dbpreview.StorageConfiguration{ + PvcSize: "10Gi", + }, + }, + }, + } + if featureGateEnabled { + documentdb.Spec.FeatureGates = map[string]bool{ + dbpreview.FeatureGateIOUring: true, + } + } + return documentdb + } + + It("does not set seccomp profile or io_method when IOUring is disabled", func() { + cluster := GetCnpgClusterSpec(req, createDocumentDB(false), "test-image:latest", "test-sa", "", true, log) + + Expect(cluster.Spec.SeccompProfile).To(BeNil()) + Expect(cluster.Spec.PostgresConfiguration.Parameters).NotTo(HaveKey("io_method")) + }) + + It("uses the default Localhost seccomp profile when IOUring is enabled and env is unset", func() { + GinkgoT().Setenv(util.IOURING_SECCOMP_PROFILE_ENV, "") + + cluster := 
GetCnpgClusterSpec(req, createDocumentDB(true), "test-image:latest", "test-sa", "", true, log) + + Expect(cluster.Spec.SeccompProfile).ToNot(BeNil()) + Expect(cluster.Spec.SeccompProfile.Type).To(Equal(corev1.SeccompProfileTypeLocalhost)) + Expect(cluster.Spec.SeccompProfile.LocalhostProfile).ToNot(BeNil()) + Expect(*cluster.Spec.SeccompProfile.LocalhostProfile).To(Equal(util.DEFAULT_IOURING_SECCOMP_PROFILE)) + Expect(cluster.Spec.PostgresConfiguration.Parameters).To(HaveKeyWithValue("io_method", "io_uring")) + }) + + It("uses the custom Localhost seccomp profile when configured", func() { + GinkgoT().Setenv(util.IOURING_SECCOMP_PROFILE_ENV, "profiles/custom-iouring.json") + + cluster := GetCnpgClusterSpec(req, createDocumentDB(true), "test-image:latest", "test-sa", "", true, log) + + Expect(cluster.Spec.SeccompProfile).ToNot(BeNil()) + Expect(cluster.Spec.SeccompProfile.Type).To(Equal(corev1.SeccompProfileTypeLocalhost)) + Expect(cluster.Spec.SeccompProfile.LocalhostProfile).ToNot(BeNil()) + Expect(*cluster.Spec.SeccompProfile.LocalhostProfile).To(Equal("profiles/custom-iouring.json")) + Expect(cluster.Spec.PostgresConfiguration.Parameters).To(HaveKeyWithValue("io_method", "io_uring")) + }) + }) + It("always includes default PostgreSQL parameters", func() { req := ctrl.Request{} req.Name = "test-cluster" diff --git a/operator/src/internal/cnpg/pg_defaults.go b/operator/src/internal/cnpg/pg_defaults.go index 0566d9e69..54d8eb51e 100644 --- a/operator/src/internal/cnpg/pg_defaults.go +++ b/operator/src/internal/cnpg/pg_defaults.go @@ -90,6 +90,9 @@ func ProtectedParameters(documentdb *dbpreview.DocumentDB) map[string]string { if dbpreview.IsFeatureGateEnabled(documentdb, dbpreview.FeatureGateChangeStreams) { params["wal_level"] = "logical" } + if dbpreview.IsFeatureGateEnabled(documentdb, dbpreview.FeatureGateIOUring) { + params["io_method"] = "io_uring" + } return params } diff --git a/operator/src/internal/cnpg/pg_defaults_test.go 
b/operator/src/internal/cnpg/pg_defaults_test.go index a2ae5d135..f9cdeea5d 100644 --- a/operator/src/internal/cnpg/pg_defaults_test.go +++ b/operator/src/internal/cnpg/pg_defaults_test.go @@ -196,6 +196,10 @@ var _ = Describe("ProtectedParameters", func() { It("does not contain wal_level", func() { Expect(result).NotTo(HaveKey("wal_level")) }) + + It("does not contain io_method", func() { + Expect(result).NotTo(HaveKey("io_method")) + }) }) Context("with ChangeStreams enabled", func() { @@ -220,6 +224,29 @@ var _ = Describe("ProtectedParameters", func() { Expect(result["cron.database_name"]).To(Equal("postgres")) }) }) + + Context("with IOUring enabled", func() { + var result map[string]string + + BeforeEach(func() { + documentdb := &dbpreview.DocumentDB{ + Spec: dbpreview.DocumentDBSpec{ + FeatureGates: map[string]bool{ + dbpreview.FeatureGateIOUring: true, + }, + }, + } + result = ProtectedParameters(documentdb) + }) + + It("sets io_method to io_uring", func() { + Expect(result["io_method"]).To(Equal("io_uring")) + }) + + It("still contains other protected params", func() { + Expect(result["cron.database_name"]).To(Equal("postgres")) + }) + }) }) var _ = Describe("MergeParameters", func() { diff --git a/operator/src/internal/utils/constants.go b/operator/src/internal/utils/constants.go index e47d66385..264834463 100644 --- a/operator/src/internal/utils/constants.go +++ b/operator/src/internal/utils/constants.go @@ -17,6 +17,17 @@ const ( // DocumentDB extension image pull policy environment variable DOCUMENTDB_IMAGE_PULL_POLICY_ENV = "DOCUMENTDB_IMAGE_PULL_POLICY" + // IOURING_SECCOMP_PROFILE_ENV overrides the Localhost seccomp profile path + // applied to the postgres pods when the IOUring feature gate is enabled. The + // path is relative to the node's kubelet seccomp root (/var/lib/kubelet/seccomp). 
+ IOURING_SECCOMP_PROFILE_ENV = "DOCUMENTDB_IOURING_SECCOMP_PROFILE" + + // DEFAULT_IOURING_SECCOMP_PROFILE is the default Localhost profile path for + // the IOUring feature gate. It must be installed on every node that runs + // postgres pods (see the io-uring feature playground) and is the upstream + // RuntimeDefault profile plus the io_uring_{setup,enter,register} syscalls. + DEFAULT_IOURING_SECCOMP_PROFILE = "profiles/documentdb-iouring.json" + // Image repositories for deb-based images (must match build_images.yml naming) DOCUMENTDB_EXTENSION_IMAGE_REPO = "ghcr.io/documentdb/documentdb-kubernetes-operator/documentdb" GATEWAY_IMAGE_REPO = "ghcr.io/documentdb/documentdb-kubernetes-operator/gateway" diff --git a/test/e2e/labels.go b/test/e2e/labels.go index bbf21a25f..bc9441d11 100644 --- a/test/e2e/labels.go +++ b/test/e2e/labels.go @@ -13,17 +13,17 @@ import "github.com/onsi/ginkgo/v2" // Keep these in sync with the design document. const ( // Area labels — one per test area (tests//). - LifecycleLabel = "lifecycle" - ScaleLabel = "scale" - DataLabel = "data" - PerformanceLabel = "performance" - BackupLabel = "backup" - RecoveryLabel = "recovery" - TLSLabel = "tls" - FeatureLabel = "feature-gates" - ExposureLabel = "exposure" - StatusLabel = "status" - UpgradeLabel = "upgrade" + LifecycleLabel = "lifecycle" + ScaleLabel = "scale" + DataLabel = "data" + PerformanceLabel = "performance" + BackupLabel = "backup" + RecoveryLabel = "recovery" + TLSLabel = "tls" + FeatureLabel = "feature-gates" + ExposureLabel = "exposure" + StatusLabel = "status" + UpgradeLabel = "upgrade" ClusterReplicationLabel = "cluster-replication" // Cross-cutting selectors. @@ -43,6 +43,13 @@ const ( // plus a resize-capable CSI driver). Environments that lack this // capability should filter with `--label-filter='!needs-csi-resize'`. 
NeedsCSIResizeLabel = "needs-csi-resize" + // NeedsIOUringLabel marks specs that require io_uring to actually work + // on the cluster nodes: an io_uring-capable kernel (io_uring_disabled=0) + // plus the documentdb-iouring Localhost seccomp profile installed under + // the kubelet seccomp root. Environments that lack this capability should + // filter with `--label-filter='!needs-iouring'`; the specs additionally + // self-skip unless E2E_IOURING=1 (see tests/feature_gates/iouring_test.go). + NeedsIOUringLabel = "needs-iouring" ) // Level labels expose the depth tier of a spec to Ginkgo's label filter. diff --git a/test/e2e/manifests/mixins/feature_iouring.yaml.template b/test/e2e/manifests/mixins/feature_iouring.yaml.template new file mode 100644 index 000000000..42f7ed5b7 --- /dev/null +++ b/test/e2e/manifests/mixins/feature_iouring.yaml.template @@ -0,0 +1,8 @@ +apiVersion: documentdb.io/preview +kind: DocumentDB +metadata: + name: ${NAME} + namespace: ${NAMESPACE} +spec: + featureGates: + IOUring: true diff --git a/test/e2e/tests/feature_gates/iouring_test.go b/test/e2e/tests/feature_gates/iouring_test.go new file mode 100644 index 000000000..eb211f821 --- /dev/null +++ b/test/e2e/tests/feature_gates/iouring_test.go @@ -0,0 +1,111 @@ +package feature_gates + +import ( + "context" + "fmt" + "os" + "time" + + . "github.com/onsi/ginkgo/v2" //nolint:revive + . "github.com/onsi/gomega" //nolint:revive + + cnpgv1 "github.com/cloudnative-pg/cloudnative-pg/api/v1" + corev1 "k8s.io/api/core/v1" + "sigs.k8s.io/controller-runtime/pkg/client" + + previewv1 "github.com/documentdb/documentdb-operator/api/preview" + "github.com/documentdb/documentdb-operator/test/e2e" + mongohelper "github.com/documentdb/documentdb-operator/test/e2e/pkg/e2eutils/mongo" + sharedmongo "github.com/documentdb/documentdb-operator/test/shared/mongo" +) + +// backingCluster fetches the CNPG Cluster that backs the given +// DocumentDB. 
The Cluster name equals the DocumentDB name for single- +// cluster deployments (see the lifecycle deploy spec). +func backingCluster(ctx context.Context, c client.Client, dd *previewv1.DocumentDB) (*cnpgv1.Cluster, error) { + cluster := &cnpgv1.Cluster{} + if err := c.Get(ctx, client.ObjectKey{Namespace: dd.Namespace, Name: dd.Name}, cluster); err != nil { + return nil, fmt.Errorf("get CNPG Cluster %s/%s: %w", dd.Namespace, dd.Name, err) + } + return cluster, nil +} + +// DocumentDB feature-gates / io_uring. +// +// The operator translates `spec.featureGates.IOUring=true` into two +// changes on the underlying CNPG Cluster (see operator/src/internal/ +// cnpg/cnpg_cluster.go and pg_defaults.go): +// 1. postgresql.parameters["io_method"] = "io_uring", enabling +// PostgreSQL 18 asynchronous I/O; and +// 2. a Localhost seccomp profile on the postgres container that +// re-allows the io_uring_setup/enter/register syscalls which +// CNPG's default RuntimeDefault profile strips. +// +// Without (2), postgres FATALs at startup ("could not setup io_uring +// queue: Operation not permitted") and the cluster never reaches +// Healthy. So the fact that setupFreshCluster's WaitHealthy returns IS +// the end-to-end proof that the seccomp relaxation works on the target +// kernel — there is no need (and no harness helper) to exec `SHOW +// io_method` inside the pod. +// +// This spec is gated behind the needs-iouring capability label AND a +// runtime E2E_IOURING=1 opt-in, because a default kind/CI node lacks +// both the io_uring-capable kernel and the documentdb-iouring Localhost +// seccomp profile; running it there would crashloop postgres rather +// than skip cleanly. The disabled-gate translation is already covered +// by the operator unit tests, so this spec focuses on the high-value +// enabled-and-healthy path that only an end-to-end environment can +// exercise. 
+var _ = Describe("DocumentDB feature-gates — io_uring", + Label(e2e.FeatureLabel, e2e.NeedsIOUringLabel), e2e.MediumLevelLabel, + func() { + BeforeEach(func() { + e2e.SkipUnlessLevel(e2e.Medium) + if os.Getenv("E2E_IOURING") != "1" { + Skip("io_uring spec requires E2E_IOURING=1 and a cluster whose nodes carry " + + "the documentdb-iouring Localhost seccomp profile on an io_uring-capable kernel") + } + }) + + It("sets io_method=io_uring with a relaxed seccomp profile and stays healthy", func() { + env := e2e.SuiteEnv() + Expect(env).ToNot(BeNil(), "SuiteEnv must be initialized") + c := env.Client + + ctx, cancel := context.WithTimeout(context.Background(), 12*time.Minute) + DeferCleanup(cancel) + + // Reaching past setupFreshCluster means the cluster + // became Healthy with io_uring enabled — i.e. postgres + // started under the relaxed seccomp profile without the + // "Operation not permitted" crashloop. + dd, cleanup := setupFreshCluster(ctx, c, "ft-iouring", []string{"feature_iouring"}, nil) + DeferCleanup(cleanup) + + cluster, err := backingCluster(ctx, c, dd) + Expect(err).ToNot(HaveOccurred()) + + Expect(cluster.Spec.PostgresConfiguration.Parameters).To( + HaveKeyWithValue("io_method", "io_uring"), + "IOUring gate must set io_method=io_uring on the CNPG Cluster") + + Expect(cluster.Spec.SeccompProfile).ToNot(BeNil(), + "IOUring gate must set a Localhost seccomp profile on the CNPG Cluster") + Expect(cluster.Spec.SeccompProfile.Type).To(Equal(corev1.SeccompProfileTypeLocalhost), + "IOUring seccomp profile must be Localhost, not RuntimeDefault/Unconfined") + Expect(cluster.Spec.SeccompProfile.LocalhostProfile).ToNot(BeNil(), + "Localhost seccomp profile must reference a profile path") + Expect(*cluster.Spec.SeccompProfile.LocalhostProfile).To( + HaveSuffix("documentdb-iouring.json"), + "Localhost profile should point at the documentdb-iouring profile") + + // Data-plane smoke: prove the gateway still answers on + // the wire with io_uring active, so "Healthy" 
cannot mask + // a postgres that is up but unable to serve queries. + h, err := mongohelper.NewFromDocumentDB(ctx, env, dd.Namespace, dd.Name) + Expect(err).ToNot(HaveOccurred(), "connect mongo to io_uring DocumentDB") + DeferCleanup(func(ctx SpecContext) { _ = h.Close(ctx) }) + Expect(sharedmongo.Ping(ctx, h.Client())).To(Succeed(), + "ping io_uring DocumentDB gateway") + }) + })