Defining a pod anti-affinity rule is a way of increasing your application’s availability. However, it has some limitations: with requiredDuringSchedulingIgnoredDuringExecution the number of schedulable replicas is capped by the number of topology domains, and with preferredDuringSchedulingIgnoredDuringExecution the spread becomes uneven once every domain already holds a matching pod. The examples below demonstrate both. To keep the examples contained, the deployments use a node affinity rule that restricts scheduling to nodes labeled os=windows. The windows label is arbitrary and is only used to demonstrate that pod anti-affinity and topology spread constraints work on a filtered set of nodes. Lastly, I made sure that 2 nodes are in az2.
$ kubectl get nodes -o wide --show-labels | grep windows | awk -F' |=' '{print $1 " " $17 " " $48}'
3ebcdcd1-d647-4814-a14d-66ca3bd85313 172.42.129.8 az2
41300042-c376-42e9-b7bb-d40f2330f18e 172.42.129.7 az2
92db06e7-1258-417f-9bd7-e8644059aa7b 172.42.129.6 az1
9cd0d589-f959-4117-8c01-a1b0eb5d931e 172.42.129.9 az3
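On a cloud-managed cluster the zone label is normally set by the provider; the os=windows label, and zone labels on a test cluster, can be added by hand. A minimal sketch, assuming the zone label key is the standard topology.kubernetes.io/zone and using a placeholder node name:
$ kubectl label node <node-name> os=windows
$ kubectl label node <node-name> topology.kubernetes.io/zone=az2 --overwrite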
$ cat /tmp/az-spread-test-anti.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: az-spread-test-anti
  name: az-spread-test-anti
spec:
  replicas: 6
  selector:
    matchLabels:
      app: az-spread-test-anti
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: az-spread-test-anti
        version: v6
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: os
                operator: In
                values:
                - windows
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - az-spread-test-anti
            topologyKey: topology.kubernetes.io/zone
      containers:
      - image: nginx:1.17
        name: nginx
$ kubectl apply -f /tmp/az-spread-test-anti.yaml
deployment.apps/az-spread-test-anti created
$ kubectl get po --selector=app=az-spread-test-anti
NAME READY STATUS RESTARTS AGE
az-spread-test-anti-75d998d6b9-4pwhx 0/1 Pending 0 27s
az-spread-test-anti-75d998d6b9-97tlw 1/1 Running 0 27s
az-spread-test-anti-75d998d6b9-knzln 0/1 Pending 0 27s
az-spread-test-anti-75d998d6b9-nwdcc 0/1 Pending 0 27s
az-spread-test-anti-75d998d6b9-vr5tx 1/1 Running 0 27s
az-spread-test-anti-75d998d6b9-xwhd2 1/1 Running 0 27s
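With required pod anti-affinity on the zone topology key, at most one matching pod can be scheduled per zone, so only 3 of the 6 replicas are Running and the rest stay Pending. To confirm why, describe one of the Pending pods (name taken from the listing above); the Events section should show a FailedScheduling event referencing the pod affinity/anti-affinity rules:
$ kubectl describe po az-spread-test-anti-75d998d6b9-4pwhx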
The next example demonstrates the uneven distribution of pods after the topology is exhausted when the pod anti-affinity rule uses preferredDuringSchedulingIgnoredDuringExecution instead. This time there are 21 replicas. Since there are only 3 availability zones, the anti-affinity preference can no longer be honored after the 3rd pod is scheduled, and the remaining pods are not evenly distributed across the availability zones:
$ cat /tmp/az-spread-test-anti-preferred.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: az-spread-test-anti
  name: az-spread-test-anti
spec:
  replicas: 21
  selector:
    matchLabels:
      app: az-spread-test-anti
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: az-spread-test-anti
        version: v6
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: os
                operator: In
                values:
                - windows
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - az-spread-test-anti
              topologyKey: topology.kubernetes.io/zone
      containers:
      - image: nginx:1.17
        name: nginx
$ kubectl get po --selector=app=az-spread-test-anti
NAME READY STATUS RESTARTS AGE
az-spread-test-anti-8587c9c667-2cd7z 1/1 Running 0 18s
az-spread-test-anti-8587c9c667-52swt 1/1 Running 0 18s
az-spread-test-anti-8587c9c667-56j2j 1/1 Running 0 18s
az-spread-test-anti-8587c9c667-87j5z 1/1 Running 0 18s
az-spread-test-anti-8587c9c667-8dxfr 1/1 Running 0 18s
az-spread-test-anti-8587c9c667-8fzqt 1/1 Running 0 18s
az-spread-test-anti-8587c9c667-bkccs 1/1 Running 0 18s
az-spread-test-anti-8587c9c667-cb4zv 1/1 Running 0 18s
az-spread-test-anti-8587c9c667-cs44w 1/1 Running 0 18s
az-spread-test-anti-8587c9c667-f988x 1/1 Running 0 18s
az-spread-test-anti-8587c9c667-fglxp 1/1 Running 0 18s
az-spread-test-anti-8587c9c667-fxgdd 1/1 Running 0 18s
az-spread-test-anti-8587c9c667-h45tl 1/1 Running 0 18s
az-spread-test-anti-8587c9c667-h9c84 1/1 Running 0 18s
az-spread-test-anti-8587c9c667-l67n8 1/1 Running 0 18s
az-spread-test-anti-8587c9c667-s984k 1/1 Running 0 18s
az-spread-test-anti-8587c9c667-vf7f2 1/1 Running 0 18s
az-spread-test-anti-8587c9c667-whxrj 1/1 Running 0 18s
az-spread-test-anti-8587c9c667-x22xg 1/1 Running 0 18s
az-spread-test-anti-8587c9c667-xhrbh 1/1 Running 0 18s
az-spread-test-anti-8587c9c667-zqrvr 1/1 Running 0 18s
$ for node in $(kubectl get po -o wide | grep -v NODE | awk '{print $7}'); do kubectl get node $node --show-labels | grep -v NAME | cut -d'=' -f14; done | sort | uniq -c
11 az1
14 az2
11 az3
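The cut -d'=' -f14 in the pipeline above depends on the exact position of the zone label in the --show-labels output. A less position-dependent variant, assuming the nodes carry the standard topology.kubernetes.io/zone label, uses the -L column flag instead (and restricts the count to this deployment's pods via --selector):
$ for node in $(kubectl get po --selector=app=az-spread-test-anti -o wide --no-headers | awk '{print $7}'); do kubectl get node $node -L topology.kubernetes.io/zone --no-headers | awk '{print $6}'; done | sort | uniq -c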
The next example shows what happens when the same deployment is scaled down to 15 replicas. Notice again that the pods are not evenly distributed:
$ kubectl scale deployment/az-spread-test-anti --replicas=15
deployment.apps/az-spread-test-anti scaled
$ for node in $(kubectl get po -o wide | grep -v NODE | awk '{print $7}'); do kubectl get node $node --show-labels | grep -v NAME | cut -d'=' -f14; done | sort | uniq -c
8 az1
14 az2
8 az3
Topology spread constraints give finer control by bounding the imbalance with maxSkew, but during rolling updates and scale-downs the resulting distribution can still end up with a skew greater than maxSkew. This is due to interference from the old Terminating pods in the scheduler's topology size calculation: at scheduling time the constraints are technically satisfied, but once the old pods are removed, the topology size difference can be greater than maxSkew.
$ cat /tmp/az-spread-test.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: az-spread-test
  name: az-spread-test
spec:
  replicas: 21
  selector:
    matchLabels:
      app: az-spread-test
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: az-spread-test
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: os
                operator: In
                values:
                - windows
      containers:
      - image: nginx
        name: nginx
      topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            app: az-spread-test
        maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
$ kubectl apply -f /tmp/az-spread-test.yaml --record
deployment.apps/az-spread-test created
$ for node in $(kubectl get po -o wide | grep -v NODE | awk '{print $7}'); do kubectl get node $node --show-labels | grep -v NAME | cut -d'=' -f14; done | sort | uniq -c
7 az1
7 az2
7 az3
The next example demonstrates how a rolling update without the additional version label (introduced further below) results in an uneven distribution of pods across availability zones:
$ kubectl set image deployment/az-spread-test nginx=nginx:1.17 --record
deployment.apps/az-spread-test image updated
$ kubectl rollout status deployment/az-spread-test
Waiting for deployment "az-spread-test" rollout to finish: 20 of 21 updated replicas are available...
deployment "az-spread-test" successfully rolled out
$ for node in $(kubectl get po -o wide | grep -v NODE | awk '{print $7}'); do kubectl get node $node --show-labels | grep -v NAME | cut -d'=' -f14; done | sort | uniq -c
8 az1
4 az2
9 az3
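At this point the rollout has completed, so all of these pods come from the new ReplicaSet; the skew was baked in at scheduling time rather than caused by leftover old pods. This can be confirmed by checking that the old ReplicaSet has been scaled down to zero:
$ kubectl get rs --selector=app=az-spread-test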
To get around the imbalance during rolling updates and scale-downs, the update has to be done by modifying the deployment YAML: add a version label to both the pod template and the topology spread constraint's labelSelector, and increment it on each apply so that only the new pods are counted against the constraint (a small automation sketch is included at the end of this section).
$ cat /tmp/az-spread-test.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: az-spread-test
  name: az-spread-test
spec:
  replicas: 21
  selector:
    matchLabels:
      app: az-spread-test
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: az-spread-test
        version: v1
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: os
                operator: In
                values:
                - windows
      containers:
      - image: nginx
        name: nginx
      topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            app: az-spread-test
            version: v1
        maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
$ kubectl apply -f /tmp/az-spread-test.yaml --record
deployment.apps/az-spread-test created
$ for node in $(kubectl get po -o wide | grep -v NODE | awk '{print $7}'); do kubectl get node $node --show-labels | grep -v NAME | cut -d'=' -f14; done | sort | uniq -c
7 az1
7 az2
7 az3
For the next example I increment the version label (which is arbitrary) to v2 and scale down to 15. Notice that the 15 pods are evenly distributed across the availability zones:
$ cat /tmp/az-spread-test.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: az-spread-test
  name: az-spread-test
spec:
  replicas: 15
  selector:
    matchLabels:
      app: az-spread-test
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: az-spread-test
        version: v2
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: os
                operator: In
                values:
                - windows
      containers:
      - image: nginx:1.17
        name: nginx
      topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            app: az-spread-test
            version: v2
        maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
$ kubectl apply -f /tmp/az-spread-test.yaml --record
deployment.apps/az-spread-test configured
$ for node in $(kubectl get po -o wide | grep -v NODE | awk '{print $7}'); do kubectl get node $node --show-labels | grep -v NAME | cut -d'=' -f14; done | sort | uniq -c
5 az1
5 az2
5 az3
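Incrementing the version label by hand before every apply is easy to forget, so the bump can be scripted. A minimal sketch, assuming mikefarah's yq v4 is installed; NEW_VERSION is a hypothetical variable holding the next value, and the label is updated in both places before applying:
$ NEW_VERSION=v3
$ yq -i ".spec.template.metadata.labels.version = \"$NEW_VERSION\"" /tmp/az-spread-test.yaml
$ yq -i ".spec.template.spec.topologySpreadConstraints[0].labelSelector.matchLabels.version = \"$NEW_VERSION\"" /tmp/az-spread-test.yaml
$ kubectl apply -f /tmp/az-spread-test.yaml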