TKGI update-cluster doesn't link compute-profile to cluster successfully due to duplicate tag keys

Article ID: 298701

Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

A TKGI cluster is first updated with multiple tags, some of which share the same key name. For example, cluster c2 below has two tags with the same key "tier" but different values.
$ tkgi cluster c2

Upgrade is available to PKS Version: 1.13.10-build.10

PKS Version:              1.12.8-build.9
Name:                     c2
K8s Version:              1.21.14
Plan Name:                small
UUID:                     7ac95839-88ed-4121-a110-7696aefecbb7
Last Action:              UPDATE
Last Action State:        succeeded
Last Action Description:  Instance update completed
Kubernetes Master Host:   c2.example.com
Kubernetes Master Port:   8443
Worker Nodes:             2
Kubernetes Master IP(s):  x.x.x.x
Network Profile Name:     
Kubernetes Profile Name:  
Compute Profile Name:     
Tags:                     env:prod,external:no,pci-dss:no,shared:yes,tier:2,tier:3
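For reference, duplicate keys like these can be applied through the --tags flag of the tkgi update-cluster (or create-cluster) command. The command below is only a hypothetical illustration of how cluster c2 could have ended up with two "tier" tags; the exact command originally used is not shown in this article.
# Hypothetical example only: passing the key "tier" twice creates two tags with the same key.
$ tkgi update-cluster c2 --tags "env:prod,external:no,pci-dss:no,shared:yes,tier:2,tier:3"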
When tkgi update-cluster is then run to associate a compute profile with the cluster, the command may return an error (this depends on the tkgi CLI version being used; older versions such as 1.9 do not return an error).
$ tkgi update-cluster c2 --compute-profile myworker --node-pool-instances "worker-small:1,worker-medium:1"

Update summary for cluster c2:
Compute Profile Name: myworker
Node Pool Instances: worker-small:1,worker-medium:1
Are you sure you want to continue? (y/n): y
Error: An error occurred in the PKS API when processing
"Duplicate key" exception is also seen in pks-api.log file. 
2023-09-26 01:47:42.110 ERROR 6490 --- [nio-9021-exec-2] o.a.c.c.C.[.[.[/].[dispatcherServlet]    : Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed; nested exception is java.lang.IllegalStateException: Duplicate key tier (attempted merging values io.pivotal.pks.cluster.data.ClusterTagEntity@7ee42af9 and io.pivotal.pks.cluster.data.ClusterTagEntity@7ee42afa)] with root cause

java.lang.IllegalStateException: Duplicate key tier (attempted merging values io.pivotal.pks.cluster.data.ClusterTagEntity@7ee42af9 and io.pivotal.pks.cluster.data.ClusterTagEntity@7ee42afa)
        at java.base/java.util.stream.Collectors.duplicateKeyException(Collectors.java:133) ~[na:na]
        at java.base/java.util.stream.Collectors.lambda$uniqKeysMapAccumulator$1(Collectors.java:180) ~[na:na]
        at java.base/java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169) ~[na:na]
......
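The exception can be located in the TKGI API log on the TKGI control plane VM. The command below assumes the default log location of the pks-api job, which may differ in your environment.
# Assumed default location of the pks-api log on the TKGI API VM.
$ grep -A 3 "Duplicate key" /var/vcap/sys/log/pks-api/pks-api.log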
The compute profile is not linked to the cluster at all, as shown by the tkgi cluster command.
$ tkgi cluster c2

Upgrade is available to PKS Version: 1.13.10-build.10

PKS Version:              1.12.8-build.9
Name:                     c2
K8s Version:              1.21.14
Plan Name:                small
......
Compute Profile Name:     
Tags:                     env:prod,external:no,pci-dss:no,shared:yes,tier:2,tier:3
However, the cluster is updated with the desired worker instance groups as set by the --node-pool-instances flag of the tkgi update-cluster command.
$ bosh -d service-instance_7ac95839-88ed-4121-a110-7696aefecbb7 is
Using environment '10.225.57.65' as client 'ops_manager'

Task 654. Done

Deployment 'service-instance_7ac95839-88ed-4121-a110-7696aefecbb7'

Instance                                                   Process State  AZ   IPs           Deployment
master/a929240a-41d3-4cf0-97cb-9d5cf91c5a17                running        az1  x.x.x.x  service-instance_7ac95839-88ed-4121-a110-7696aefecbb7
worker-worker-medium/bf2f3678-b0fe-4b03-9836-b2c8a1c1b702  running        az1  y.y.y.y  service-instance_7ac95839-88ed-4121-a110-7696aefecbb7
worker-worker-small/dc1c97a4-de33-48e8-8c14-46d9ec60ba36   running        az1  z.z.z.z  service-instance_7ac95839-88ed-4121-a110-7696aefecbb7

Because the compute profile is missing from the cluster, when the upgrade-cluster command is run later, TKGI brings the cluster back to the "worker" instance group defined by the plan used when the cluster was initially created. Depending on the discrepancy between the worker instance group settings in the plan and in the compute profile, some running worker instances might be deleted before BOSH starts migrating the cluster back to the "worker" instance group. For example, the user might see the following output from the upgrade-cluster command.
Task 665

Task 665 | 03:19:42 | Deprecation: Global 'properties' are deprecated. Please define 'properties' at the job level.
Task 665 | 03:19:44 | Preparing deployment: Preparing deployment
Task 665 | 03:20:07 | Warning: DNS address not available for the link provider instance: pivotal-container-service/6c6092f3-66ca-47ef-9cf2-587a20b07f54
Task 665 | 03:20:07 | Warning: DNS address not available for the link provider instance: pivotal-container-service/6c6092f3-66ca-47ef-9cf2-587a20b07f54
Task 665 | 03:20:07 | Warning: DNS address not available for the link provider instance: pivotal-container-service/6c6092f3-66ca-47ef-9cf2-587a20b07f54
Task 665 | 03:20:07 | Warning: DNS address not available for the link provider instance: pivotal-container-service/6c6092f3-66ca-47ef-9cf2-587a20b07f54
Task 665 | 03:20:17 | Preparing deployment: Preparing deployment (00:00:33)
Task 665 | 03:20:17 | Preparing deployment: Rendering templates (00:00:11)
Task 665 | 03:20:29 | Preparing package compilation: Finding packages to compile (00:00:00)
Task 665 | 03:20:31 | Deleting unneeded instances worker-worker-small: worker-worker-small/27f74d8d-7e6b-4a75-9599-f6b5ff3227cf (1) (00:00:52)
......
If many running worker instances are deleted first, the applications on the cluster might be severely affected or go down completely.
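Before running tkgi upgrade-cluster on a potentially affected cluster, one way to spot this condition is to confirm whether the compute profile field is still empty, for example:
# If this field is empty even though a compute profile was applied, upgrade-cluster
# may revert the node pools to the plan's "worker" instance group.
$ tkgi cluster c2 | grep "Compute Profile Name"
Compute Profile Name: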

Environment

Product Version: 1.13

Resolution

The product team has acknowledged this as a defect and will prevent users from setting tags with duplicate key names in the future. The fix will be included in TKGI 1.18.0+, TKGI 1.17.2+, TKGI 1.16.5+, and TKGI 1.15.8+.
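To check which TKGI version a cluster is currently running, and therefore whether it already includes the fix, the PKS Version field of the tkgi cluster output can be used, for example:
# The PKS Version field reflects the TKGI version the cluster is running.
$ tkgi cluster c2 | grep "PKS Version"
PKS Version:              1.12.8-build.9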

If you already have a TKGI cluster experiencing the issue described in this article, please contact Tanzu support for help.