Defining a pod anti-affinity rule is a way of increasing your application’s availability. However, there are some limitations, which the examples below demonstrate.

The nodes used in these examples are filtered with the label os=windows. The windows label is arbitrary and is only used to demonstrate that topology spread constraints work on a filtered set of nodes. Lastly, I made sure that 2 nodes are in az2.

$ kubectl get nodes -o wide --show-labels | grep windows | awk -F' |=' '{print $1 " " $17 " " $48}'
3ebcdcd1-d647-4814-a14d-66ca3bd85313 172.42.129.8 az2
41300042-c376-42e9-b7bb-d40f2330f18e 172.42.129.7 az2
92db06e7-1258-417f-9bd7-e8644059aa7b 172.42.129.6 az1
9cd0d589-f959-4117-8c01-a1b0eb5d931e 172.42.129.9 az3
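The os=windows label was applied to the nodes ahead of time. A minimal sketch of how such a label can be added (the node name is a placeholder for one of the names listed above; the topology.kubernetes.io/zone label is normally set automatically by the cloud provider or kubelet):

# Add the arbitrary filtering label that the nodeAffinity rules below select on.
$ kubectl label node <node-name> os=windows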
The first example demonstrates one such limitation: with a required pod anti-affinity rule (requiredDuringSchedulingIgnoredDuringExecution) on the zone topology key and 6 replicas, only one pod can be scheduled per availability zone, so 3 of the 6 pods remain Pending.

$ cat /tmp/az-spread-test-anti.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: az-spread-test-anti
  name: az-spread-test-anti
spec:
  replicas: 6
  selector:
    matchLabels:
      app: az-spread-test-anti
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: az-spread-test-anti
        version: v6
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: os
                operator: In
                values:
                - windows
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - az-spread-test-anti
            topologyKey: topology.kubernetes.io/zone
      containers:
      - image: nginx:1.17
        name: nginx

$ kubectl apply -f /tmp/az-spread-test-anti.yaml
deployment.apps/az-spread-test-anti created

$ kubectl get po --selector=app=az-spread-test-anti
NAME                                   READY   STATUS    RESTARTS   AGE
az-spread-test-anti-75d998d6b9-4pwhx   0/1     Pending   0          27s
az-spread-test-anti-75d998d6b9-97tlw   1/1     Running   0          27s
az-spread-test-anti-75d998d6b9-knzln   0/1     Pending   0          27s
az-spread-test-anti-75d998d6b9-nwdcc   0/1     Pending   0          27s
az-spread-test-anti-75d998d6b9-vr5tx   1/1     Running   0          27s
az-spread-test-anti-75d998d6b9-xwhd2   1/1     Running   0          27s
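To confirm why the remaining pods stay Pending, the scheduler's events can be inspected. A minimal sketch using one of the pod names from the listing above (the exact event wording varies by Kubernetes version):

# The Events section at the end of the output should contain a FailedScheduling
# message pointing at the unsatisfied pod anti-affinity rule.
$ kubectl describe po az-spread-test-anti-75d998d6b9-4pwhx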
The next example demonstrates uneven distribution of pods after the topology is exhausted when the pod anti-affinity rule uses preferredDuringSchedulingIgnoredDuringExecution. Notice there are 21 replicas. Given there are only 3 availability zones, the pod anti-affinity rules are no longer obeyed after the 3rd pod is scheduled, and the pods are not evenly distributed across availability zones:

$ cat /tmp/az-spread-test-anti-preferred.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: az-spread-test-anti
  name: az-spread-test-anti
spec:
  replicas: 21
  selector:
    matchLabels:
      app: az-spread-test-anti
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: az-spread-test-anti
        version: v6
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: os
                operator: In
                values:
                - windows
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - az-spread-test-anti
              topologyKey: topology.kubernetes.io/zone
      containers:
      - image: nginx:1.17
        name: nginx

$ kubectl get po --selector=app=az-spread-test-anti
NAME                                   READY   STATUS    RESTARTS   AGE
az-spread-test-anti-8587c9c667-2cd7z   1/1     Running   0          18s
az-spread-test-anti-8587c9c667-52swt   1/1     Running   0          18s
az-spread-test-anti-8587c9c667-56j2j   1/1     Running   0          18s
az-spread-test-anti-8587c9c667-87j5z   1/1     Running   0          18s
az-spread-test-anti-8587c9c667-8dxfr   1/1     Running   0          18s
az-spread-test-anti-8587c9c667-8fzqt   1/1     Running   0          18s
az-spread-test-anti-8587c9c667-bkccs   1/1     Running   0          18s
az-spread-test-anti-8587c9c667-cb4zv   1/1     Running   0          18s
az-spread-test-anti-8587c9c667-cs44w   1/1     Running   0          18s
az-spread-test-anti-8587c9c667-f988x   1/1     Running   0          18s
az-spread-test-anti-8587c9c667-fglxp   1/1     Running   0          18s
az-spread-test-anti-8587c9c667-fxgdd   1/1     Running   0          18s
az-spread-test-anti-8587c9c667-h45tl   1/1     Running   0          18s
az-spread-test-anti-8587c9c667-h9c84   1/1     Running   0          18s
az-spread-test-anti-8587c9c667-l67n8   1/1     Running   0          18s
az-spread-test-anti-8587c9c667-s984k   1/1     Running   0          18s
az-spread-test-anti-8587c9c667-vf7f2   1/1     Running   0          18s
az-spread-test-anti-8587c9c667-whxrj   1/1     Running   0          18s
az-spread-test-anti-8587c9c667-x22xg   1/1     Running   0          18s
az-spread-test-anti-8587c9c667-xhrbh   1/1     Running   0          18s
az-spread-test-anti-8587c9c667-zqrvr   1/1     Running   0          18s

$ for node in $(kubectl get po -o wide | grep -v NODE | awk '{print $7}'); do kubectl get node $node --show-labels | grep -v NAME | cut -d'=' -f14; done | sort | uniq -c
     11 az1
     14 az2
     11 az3

The next example demonstrates the behavior observed when the same deployment is scaled down to 15. Notice again that the pods are not evenly distributed:
$ kubectl scale deployment/az-spread-test-anti --replicas=15
deployment.apps/az-spread-test-anti scaled

$ for node in $(kubectl get po -o wide | grep -v NODE | awk '{print $7}'); do kubectl get node $node --show-labels | grep -v NAME | cut -d'=' -f14; done | sort | uniq -c
      8 az1
     14 az2
      8 az3
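As an aside, the cut -d'=' -f14 in the loop above depends on the exact position of the zone label in each node's label list. A slightly more robust way to tally pods per zone is to read the topology.kubernetes.io/zone label directly; a minimal sketch, assuming that label is present on every node (as it is in this cluster):

# For each pod's node, print the value of its zone label, then count occurrences.
# The dots inside the label key are escaped for kubectl's jsonpath syntax.
$ for node in $(kubectl get po -o wide --no-headers | awk '{print $7}'); do kubectl get node "$node" -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}'; done | sort | uniq -c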
Topology spread constraints give finer control over the distribution through maxSkew, the maximum allowed difference in the number of matching pods between availability zones. They come with their own caveat: after a rolling update or a scale-down, the actual distribution can end up violating maxSkew. This is due to interference from the old Terminating pods in the scheduler's topology size calculation. At scheduling time, the constraints are technically satisfied; however, once the old pods are removed, the topology size difference can be greater than maxSkew.

The next example deploys 21 replicas with a topology spread constraint of maxSkew: 1 on the zone topology key, resulting in an even distribution across the 3 availability zones:

$ cat /tmp/az-spread-test.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: az-spread-test
  name: az-spread-test
spec:
  replicas: 21
  selector:
    matchLabels:
      app: az-spread-test
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: az-spread-test
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: os
                operator: In
                values:
                - windows
      containers:
      - image: nginx
        name: nginx
      topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            app: az-spread-test
        maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway

$ kubectl apply -f /tmp/az-spread-test.yaml --record
deployment.apps/az-spread-test created

$ for node in $(kubectl get po -o wide | grep -v NODE | awk '{print $7}'); do kubectl get node $node --show-labels | grep -v NAME | cut -d'=' -f14; done | sort | uniq -c
      7 az1
      7 az2
      7 az3
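For reference, whenUnsatisfiable also accepts DoNotSchedule, which turns the constraint into a hard requirement: instead of placing a pod that would exceed maxSkew, the scheduler leaves it Pending. That stricter variant is not used in these examples; a minimal sketch of what it would look like:

# Same constraint as above, but pods that cannot be placed within maxSkew
# stay Pending rather than being scheduled anyway.
topologySpreadConstraints:
- labelSelector:
    matchLabels:
      app: az-spread-test
  maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule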
The next example demonstrates how a rolling update without an additional version label results in an uneven distribution of pods across availability zones:

$ kubectl set image deployment/az-spread-test nginx=nginx:1.17 --record
deployment.apps/az-spread-test image updated

$ kubectl rollout status deployment/az-spread-test
Waiting for deployment "az-spread-test" rollout to finish: 20 of 21 updated replicas are available...
deployment "az-spread-test" successfully rolled out

$ for node in $(kubectl get po -o wide | grep -v NODE | awk '{print $7}'); do kubectl get node $node --show-labels | grep -v NAME | cut -d'=' -f14; done | sort | uniq -c
      8 az1
      4 az2
      9 az3

Even though maxSkew is 1, the resulting skew is 9 - 4 = 5: each replacement pod satisfied the constraint when it was scheduled because the old Terminating pods still counted toward the per-zone totals. To get around the imbalance during rolling updates and scale-downs, the update has to be done by modifying the deployment YAML, adding a version label to both the pod template and the constraint's labelSelector and incrementing it on each apply.
Starting from a fresh deployment with version: v1 added to both the pod template labels and the topology spread constraint's labelSelector, the 21 replicas are again evenly distributed:

$ cat /tmp/az-spread-test.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: az-spread-test
  name: az-spread-test
spec:
  replicas: 21
  selector:
    matchLabels:
      app: az-spread-test
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: az-spread-test
        version: v1
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: os
                operator: In
                values:
                - windows
      containers:
      - image: nginx
        name: nginx
      topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            app: az-spread-test
            version: v1
        maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway

$ kubectl apply -f /tmp/az-spread-test.yaml --record
deployment.apps/az-spread-test created

$ for node in $(kubectl get po -o wide | grep -v NODE | awk '{print $7}'); do kubectl get node $node --show-labels | grep -v NAME | cut -d'=' -f14; done | sort | uniq -c
      7 az1
      7 az2
      7 az3

For the next example I increment the version label (which is arbitrary) to v2 and scale down to 15. Notice that the 15 pods are evenly distributed across the availability zones:
$ cat /tmp/az-spread-test.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: az-spread-test
  name: az-spread-test
spec:
  replicas: 15
  selector:
    matchLabels:
      app: az-spread-test
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: az-spread-test
        version: v2
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: os
                operator: In
                values:
                - windows
      containers:
      - image: nginx:1.17
        name: nginx
      topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            app: az-spread-test
            version: v2
        maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway

$ kubectl apply -f /tmp/az-spread-test.yaml --record
deployment.apps/az-spread-test configured

$ for node in $(kubectl get po -o wide | grep -v NODE | awk '{print $7}'); do kubectl get node $node --show-labels | grep -v NAME | cut -d'=' -f14; done | sort | uniq -c
      5 az1
      5 az2
      5 az3
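Since the version value itself is arbitrary, the bump can be scripted instead of edited by hand. A minimal sketch, assuming the manifest lives at /tmp/az-spread-test.yaml, the label currently reads version: v2 in both places, and GNU sed is available:

# Bump the version label in the pod template and in the spread-constraint
# selector in one pass, then re-apply the deployment.
$ sed -i 's/version: v2/version: v3/g' /tmp/az-spread-test.yaml
$ kubectl apply -f /tmp/az-spread-test.yaml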