How to evenly distribute pods across a topology without using pod anti-affinity?

Article ID: 298697


Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

How can pods be distributed evenly across a topology, such as availability zones, while overcoming the limitations of pod anti-affinity?

Environment

Product Version: 1.11

Resolution

Defining a pod anti-affinity rule is a way of increasing your application's availability. However, there are some limitations, such as:

  • Uneven distribution of pods after the topology is exhausted. For example, if the topology key is zone, pod anti-affinity rules no longer apply once the pod replica count exceeds the zone count.
  • Uneven distribution of pods during a scale-down or after a rolling update.
Below we will explore the limitations of pod anti-affinity so that we can better understand the problem that topology spread constraints are trying to solve.

For the examples below, I have a 5-node cluster spread across 3 availability zones (AZs). 4 of the nodes have a custom label os=windows. The windows label is arbitrary and is used only to demonstrate that topology spread constraints work on a filtered set of nodes. Lastly, I made sure that 2 nodes are in az2.

For example:
$ kubectl get nodes -o wide --show-labels | grep  windows | awk -F' |=' '{print $1 " " $17 " " $48}'
3ebcdcd1-d647-4814-a14d-66ca3bd85313 172.42.129.8 az2
41300042-c376-42e9-b7bb-d40f2330f18e 172.42.129.7 az2
92db06e7-1258-417f-9bd7-e8644059aa7b 172.42.129.6 az1
9cd0d589-f959-4117-8c01-a1b0eb5d931e 172.42.129.9 az3


The limitations of pod anti-affinity

The first example demonstrates that pods cannot be scheduled once the topology is exhausted when pod anti-affinity uses requiredDuringSchedulingIgnoredDuringExecution. Notice there are 6 replicas. Because there are only 3 availability zones, 3 of the 6 pods cannot be scheduled and remain Pending:
$ cat /tmp/az-spread-test-anti.yaml 
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: az-spread-test-anti
  name: az-spread-test-anti
spec:
  replicas: 6
  selector:
    matchLabels:
      app: az-spread-test-anti
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: az-spread-test-anti
        version: v6
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: os
                operator: In
                values:
                - windows
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - az-spread-test-anti
            topologyKey: topology.kubernetes.io/zone
      containers:
      - image: nginx:1.17
        name: nginx

$ kubectl apply -f /tmp/az-spread-test-anti.yaml
deployment.apps/az-spread-test-anti created

$ kubectl get po --selector=app=az-spread-test-anti
NAME                                   READY   STATUS    RESTARTS   AGE
az-spread-test-anti-75d998d6b9-4pwhx   0/1     Pending   0          27s
az-spread-test-anti-75d998d6b9-97tlw   1/1     Running   0          27s
az-spread-test-anti-75d998d6b9-knzln   0/1     Pending   0          27s
az-spread-test-anti-75d998d6b9-nwdcc   0/1     Pending   0          27s
az-spread-test-anti-75d998d6b9-vr5tx   1/1     Running   0          27s
az-spread-test-anti-75d998d6b9-xwhd2   1/1     Running   0          27s
The next example demonstrates the uneven distribution of pods after the topology is exhausted when pod anti-affinity uses preferredDuringSchedulingIgnoredDuringExecution. This time there are 21 replicas. Given there are only 3 availability zones, the pod anti-affinity rules are no longer obeyed after the 3rd pod is scheduled. Notice that the pods are not evenly distributed across availability zones:
$ cat /tmp/az-spread-test-anti-preferred.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: az-spread-test-anti
  name: az-spread-test-anti
spec:
  replicas: 21
  selector:
    matchLabels:
      app: az-spread-test-anti
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: az-spread-test-anti
        version: v6
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: os
                operator: In
                values:
                - windows
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - az-spread-test-anti
              topologyKey: topology.kubernetes.io/zone
      containers:
      - image: nginx:1.17
        name: nginx

$ kubectl get po --selector=app=az-spread-test-anti
NAME                                   READY   STATUS    RESTARTS   AGE
az-spread-test-anti-8587c9c667-2cd7z   1/1     Running   0          18s
az-spread-test-anti-8587c9c667-52swt   1/1     Running   0          18s
az-spread-test-anti-8587c9c667-56j2j   1/1     Running   0          18s
az-spread-test-anti-8587c9c667-87j5z   1/1     Running   0          18s
az-spread-test-anti-8587c9c667-8dxfr   1/1     Running   0          18s
az-spread-test-anti-8587c9c667-8fzqt   1/1     Running   0          18s
az-spread-test-anti-8587c9c667-bkccs   1/1     Running   0          18s
az-spread-test-anti-8587c9c667-cb4zv   1/1     Running   0          18s
az-spread-test-anti-8587c9c667-cs44w   1/1     Running   0          18s
az-spread-test-anti-8587c9c667-f988x   1/1     Running   0          18s
az-spread-test-anti-8587c9c667-fglxp   1/1     Running   0          18s
az-spread-test-anti-8587c9c667-fxgdd   1/1     Running   0          18s
az-spread-test-anti-8587c9c667-h45tl   1/1     Running   0          18s
az-spread-test-anti-8587c9c667-h9c84   1/1     Running   0          18s
az-spread-test-anti-8587c9c667-l67n8   1/1     Running   0          18s
az-spread-test-anti-8587c9c667-s984k   1/1     Running   0          18s
az-spread-test-anti-8587c9c667-vf7f2   1/1     Running   0          18s
az-spread-test-anti-8587c9c667-whxrj   1/1     Running   0          18s
az-spread-test-anti-8587c9c667-x22xg   1/1     Running   0          18s
az-spread-test-anti-8587c9c667-xhrbh   1/1     Running   0          18s
az-spread-test-anti-8587c9c667-zqrvr   1/1     Running   0          18s

$ for node in $(kubectl get po -o wide | grep -v NODE | awk '{print $7}'); do kubectl get node $node --show-labels | grep -v NAME | cut -d'=' -f14; done | sort | uniq -c
     11 az1
     14 az2
     11 az3
The next example demonstrates the behavior observed when the same deployment is scaled down to 15. Notice again that the pods are not evenly distributed:
$ kubectl scale deployment/az-spread-test-anti --replicas=15
deployment.apps/az-spread-test-anti scaled

$ for node in $(kubectl get po -o wide | grep -v NODE | awk '{print $7}'); do kubectl get node $node --show-labels | grep -v NAME | cut -d'=' -f14; done | sort | uniq -c
      8 az1
     14 az2
      8 az3


Using topology spread constraints to overcome the limitations of pod anti-affinity

The Kubernetes documentation states: "You can use topology spread constraints to control how Pods are spread across your cluster among failure-domains such as regions, zones, nodes, and other user-defined topology domains. This can help to achieve high availability as well as efficient resource utilization."

Take note that when using a rolling update, the new pods can end up in imbalanced topologies that violate the specified maxSkew. This is due to interference from the old Terminating pods in the scheduler's topology size calculation. At scheduling time the constraints are technically satisfied; however, once the old pods are removed, the topology size difference can be greater than maxSkew.
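The skew the scheduler evaluates is simply the difference between the fullest and emptiest zone. The sketch below computes it for two illustrative snapshots (the counts are assumptions, not captured from a cluster): the scheduler's mid-rollout view, where old Terminating pods pad the counts so the constraint looks satisfied, and the new pods alone, which are left skewed once the old pods exit.

```shell
# skew = (pods in the fullest zone) - (pods in the emptiest zone)
skew() {
  max=$1; min=$1
  for n in "$@"; do
    if [ "$n" -gt "$max" ]; then max=$n; fi
    if [ "$n" -lt "$min" ]; then min=$n; fi
  done
  echo $((max - min))
}

skew 5 6 5   # scheduler's view mid-rollout (old + new pods): prints 1
skew 4 1 5   # new pods alone after old pods terminate: prints 4, well over maxSkew 1
```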

The first example demonstrates the use of topology spread constraints on initial deployment. Notice that the 21 pods are evenly distributed across the availability zones:
$ cat /tmp/az-spread-test.yaml 
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: az-spread-test
  name: az-spread-test
spec:
  replicas: 21
  selector:
    matchLabels:
      app: az-spread-test
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: az-spread-test
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: os
                operator: In
                values:
                - windows
      containers:
      - image: nginx
        name: nginx
      topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            app: az-spread-test
        maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway

$ kubectl apply -f /tmp/az-spread-test.yaml --record
deployment.apps/az-spread-test created

$ for node in $(kubectl get po -o wide | grep -v NODE | awk '{print $7}'); do kubectl get node $node --show-labels | grep -v NAME | cut -d'=' -f14; done | sort | uniq -c
      7 az1
      7 az2
      7 az3
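Note that the manifest above uses whenUnsatisfiable: ScheduleAnyway, which treats the spread as a soft preference: the scheduler favors balance but will still place pods when it cannot honor the constraint. If you would rather leave pods Pending than exceed maxSkew (a hard behavior comparable to required anti-affinity, but with a skew tolerance), the same constraint can be declared with DoNotSchedule. A sketch, using the same fields as the manifest above:

      topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            app: az-spread-test
        maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule   # hard constraint: pods stay Pending rather than exceed maxSkew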
The next example demonstrates how a rolling update without the additional version label results in an uneven distribution of pods across availability zones:
$ kubectl set image deployment/az-spread-test nginx=nginx:1.17 --record
deployment.apps/az-spread-test image updated

$ kubectl rollout status deployment/az-spread-test
Waiting for deployment "az-spread-test" rollout to finish: 20 of 21 updated replicas are available...
deployment "az-spread-test" successfully rolled out

$ for node in $(kubectl get po -o wide | grep -v NODE | awk '{print $7}'); do kubectl get node $node --show-labels | grep -v NAME | cut -d'=' -f14; done | sort | uniq -c
      8 az1
      4 az2
      9 az3
To get around the imbalance during rolling updates and scale-downs, the update has to be done by modifying the deployment YAML, adding a version label (to both the pod template and the constraint's labelSelector) and incrementing it on each apply.

For the next example I've added an additional version label (which is arbitrary). Notice that the 21 pods are evenly distributed across the availability zones:
$ cat /tmp/az-spread-test.yaml 
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: az-spread-test
  name: az-spread-test
spec:
  replicas: 21
  selector:
    matchLabels:
      app: az-spread-test
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: az-spread-test
        version: v1
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: os
                operator: In
                values:
                - windows
      containers:
      - image: nginx
        name: nginx
      topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            app: az-spread-test
            version: v1
        maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway

$ kubectl apply -f /tmp/az-spread-test.yaml --record
deployment.apps/az-spread-test created

$ for node in $(kubectl get po -o wide | grep -v NODE | awk '{print $7}'); do kubectl get node $node --show-labels | grep -v NAME | cut -d'=' -f14; done | sort | uniq -c
      7 az1
      7 az2
      7 az3
For the next example I increment the version label (which is arbitrary) to v2 and scale down to 15. Notice that the 15 pods are evenly distributed across the availability zones:
$ cat /tmp/az-spread-test.yaml 
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: az-spread-test
  name: az-spread-test
spec:
  replicas: 15
  selector:
    matchLabels:
      app: az-spread-test
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: az-spread-test
        version: v2
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: os
                operator: In
                values:
                - windows
      containers:
      - image: nginx:1.17
        name: nginx
      topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            app: az-spread-test
            version: v2
        maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway

$ kubectl apply -f /tmp/az-spread-test.yaml --record
deployment.apps/az-spread-test configured

$ for node in $(kubectl get po -o wide | grep -v NODE | awk '{print $7}'); do kubectl get node $node --show-labels | grep -v NAME | cut -d'=' -f14; done | sort | uniq -c
      5 az1
      5 az2
      5 az3
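On Kubernetes versions newer than the one this article targets, the manual version-label workaround can be replaced by the matchLabelKeys field of topologySpreadConstraints (alpha in 1.25, beta in 1.27), which tells the scheduler to additionally group pods by the value of a named label. Using the Deployment-managed pod-template-hash label makes each rollout's pods count separately, achieving the same effect automatically. A sketch, assuming a cluster version that supports the field:

      topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            app: az-spread-test
        matchLabelKeys:
        - pod-template-hash   # set automatically by the Deployment controller on each ReplicaSet's pods
        maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway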


Summary

Pod anti-affinity only works until the topology is exhausted, that is, until there are more replicas (pods) than there are zones or other topology domains. Pod topology spread constraints can be used to get past these limitations.

References

https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/
https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/