"Duplicate vm extension name" error when metrics_server runs on Director VM in Tanzu Kubernetes Grid Integrated Edition

Article ID: 298692

Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

The metrics_server job on the Director VM is in a crash-loop state with the error "Duplicate vm extension name" when there is more than one active cloud config for pivotal-container-service-<UUID>.

It appears that when the Tanzu Kubernetes Grid Integrated Edition (TKGI) tile is uninstalled, its cloud config is left behind and remains active. Upon re-installation of the TKGI tile, there are then multiple active cloud configs for pivotal-container-service-<UUID>, which causes the metrics_server to fall into a crash-loop. On the Director VM, monit summary output shows the metrics_server in the "Does not exist" state.
$ monit summary
The Monit daemon 5.2.5 uptime: 11d 0h 26m

Process 'nats'                      running
Process 'postgres'                  running
Process 'director'                  running
Process 'worker_1'                  running
Process 'worker_2'                  running
Process 'worker_3'                  running
Process 'worker_4'                  running
Process 'worker_5'                  running
Process 'director_scheduler'        running
Process 'metrics_server'            Does not exist
Process 'director_sync_dns'         running
Process 'director_nginx'            running
Process 'health_monitor'            running
Process 'uaa'                       running
Process 'credhub'                   running
Process 'system-metrics-agent'      running
Process 'system-metrics-server'     running
Process 'blobstore_nginx'           running
System 'system_localhost'           running
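
The process state can also be checked non-interactively with the BOSH-packaged monit binary. The path below is the standard location on BOSH-managed VMs; this is a minimal check, assuming a root shell on the Director VM:

$ /var/vcap/bosh/bin/monit summary | grep metrics_server
Process 'metrics_server'            Does not exist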

The /var/vcap/sys/log/director/metrics_server.stderr.log file shows a stack trace like the following. Note that the VM extension name in the error message varies between IaaS platforms: in this sample it is "disk_enable_uuid", but on AWS, for example, it may be "iam_instance_profile_master".
bosh/0:/var/vcap/sys/log/director# tail -f metrics_server.stderr.log
    from /var/vcap/packages/director/bin/bosh-director-metrics-server:29:in `load'
    from /var/vcap/packages/director/bin/bosh-director-metrics-server:29:in `<main>'
/var/vcap/data/packages/director/9ab6cf0d054129da2585c3d01c752015589a85c7/gem_home/ruby/2.6.0/gems/bosh-director-0.0.0/lib/bosh/director/deployment_plan/cloud_manifest_parser.rb:120:in `parse_vm_extensions': Duplicate vm extension name 'disk_enable_uuid' (Bosh::Director::DeploymentDuplicateVmExtensionName)
    from /var/vcap/data/packages/director/9ab6cf0d054129da2585c3d01c752015589a85c7/gem_home/ruby/2.6.0/gems/bosh-director-0.0.0/lib/bosh/director/deployment_plan/cloud_manifest_parser.rb:16:in `parse'
    from /var/vcap/data/packages/director/9ab6cf0d054129da2585c3d01c752015589a85c7/gem_home/ruby/2.6.0/gems/bosh-director-0.0.0/lib/bosh/director/metrics_collector.rb:131:in `populate_network_metrics'
    from /var/vcap/data/packages/director/9ab6cf0d054129da2585c3d01c752015589a85c7/gem_home/ruby/2.6.0/gems/bosh-director-0.0.0/lib/bosh/director/metrics_collector.rb:105:in `populate_metrics'
    from /var/vcap/data/packages/director/9ab6cf0d054129da2585c3d01c752015589a85c7/gem_home/ruby/2.6.0/gems/bosh-director-0.0.0/lib/bosh/director/metrics_collector.rb:51:in `start'
    from /var/vcap/data/packages/director/9ab6cf0d054129da2585c3d01c752015589a85c7/gem_home/ruby/2.6.0/gems/bosh-director-0.0.0/bin/bosh-director-metrics-server:26:in `<top (required)>'
    from /var/vcap/packages/director/bin/bosh-director-metrics-server:29:in `load'
    from /var/vcap/packages/director/bin/bosh-director-metrics-server:29:in `<main>'
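
To confirm that a Director is hitting this particular failure, the stderr log can be searched for the error class. A minimal check, assuming the standard log path shown above (the extension name in the matching line will vary by IaaS):

$ grep "Duplicate vm extension name" /var/vcap/sys/log/director/metrics_server.stderr.log | tail -1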


The output from the command "bosh configs" shows more than one active cloud config for pivotal-container-service-<UUID>.

$ bosh configs
Using environment '10.0.0.11' as client 'ops_manager'

ID   Type     Name                                                   Team                                            Created At
43*  cloud    default                                                -                                               2021-08-24 19:52:31 UTC
44*  cloud    pivotal-container-service-7195e4047fc4a0624176         pivotal-container-service-7195e4047fc4a0624176  2021-08-24 20:20:29 UTC
34*  cloud    pivotal-container-service-93677816facf7bfa6caf         pivotal-container-service-93677816facf7bfa6caf  2021-07-13 16:51:37 UTC
46*  cloud    service-instance_6d91f398-e24b-4dbf-adcc-123981b6e19b  pivotal-container-service-7195e4047fc4a0624176  2021-08-26 14:38:21 UTC
5*   cpi      default                                                -                                               2021-07-13 16:02:51 UTC
3*   runtime  director_runtime                                       -                                               2021-07-13 16:02:51 UTC
1*   runtime  ops_manager_dns_runtime                                -                                               2021-07-13 16:02:50 UTC
2*   runtime  ops_manager_system_metrics_runtime                     -                                               2021-07-13 16:02:51 UTC
45*  runtime  p-compliance-scanner-8829c4bd1fce99029ccb-oscap        -                                               2021-08-26 14:14:16 UTC

(*) Currently active
Only showing active configs. To see older versions use the --recent=10 option.

9 configs

Succeeded
$
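
The listing can be narrowed to cloud configs only, which makes the stale entry easier to spot. A minimal variant, assuming a bosh CLI version that supports the --type filter on the configs command:

$ bosh configs --type=cloud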


The following command shows the current deployment name of the TKGI tile:

$ bosh deployments --column=name | grep pivotal-container-service
pivotal-container-service-7195e4047fc4a0624176
$


Compare the deployment name with the cloud config names. In this example, the cloud config with ID 44 is the one currently in use because its name matches the current deployment. The cloud config with ID 34 does not relate to any current deployment and was most likely left behind by a previous TKGI installation.
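
The comparison can also be scripted. The following is a rough sketch, assuming the bosh CLI is targeted and logged in to this Director and that standard GNU utilities (sort, comm) are available; any name it prints is an active pivotal-container-service cloud config with no matching deployment and is therefore a candidate for removal:

# Active pivotal-container-service cloud config names
$ bosh configs --type=cloud --column=name | grep '^pivotal-container-service' | sort > /tmp/cloud-configs.txt
# Current pivotal-container-service deployment names
$ bosh deployments --column=name | grep '^pivotal-container-service' | sort > /tmp/deployments.txt
# Names present only in the first file have no matching deployment
$ comm -23 /tmp/cloud-configs.txt /tmp/deployments.txt
pivotal-container-service-93677816facf7bfa6caf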


Environment

Product Version: 1.11

Resolution

To work around this issue, the cloud config that does not relate to any current deployment has to be removed manually.

1. Identify the deployment name of the TKGI tile.
$ bosh deployments --column=name | grep pivotal-container-service
pivotal-container-service-7195e4047fc4a0624176
$

2. Identify the ID of the cloud config that does not relate to the current deployment of the TKGI tile. In the example below, the cloud config with ID 34 does not relate to the current TKGI deployment and has to be removed manually.
$ bosh configs
Using environment '10.0.0.11' as client 'ops_manager'

ID   Type     Name                                                   Team                                            Created At
43*  cloud    default                                                -                                               2021-08-24 19:52:31 UTC
44*  cloud    pivotal-container-service-7195e4047fc4a0624176         pivotal-container-service-7195e4047fc4a0624176  2021-08-24 20:20:29 UTC
34*  cloud    pivotal-container-service-93677816facf7bfa6caf         pivotal-container-service-93677816facf7bfa6caf  2021-07-13 16:51:37 UTC
46*  cloud    service-instance_6d91f398-e24b-4dbf-adcc-123981b6e19b  pivotal-container-service-7195e4047fc4a0624176  2021-08-26 14:38:21 UTC
5*   cpi      default                                                -                                               2021-07-13 16:02:51 UTC
3*   runtime  director_runtime                                       -                                               2021-07-13 16:02:51 UTC
1*   runtime  ops_manager_dns_runtime                                -                                               2021-07-13 16:02:50 UTC
2*   runtime  ops_manager_system_metrics_runtime                     -                                               2021-07-13 16:02:51 UTC
45*  runtime  p-compliance-scanner-8829c4bd1fce99029ccb-oscap        -                                               2021-08-26 14:14:16 UTC

(*) Currently active
Only showing active configs. To see older versions use the --recent=10 option.

9 configs

Succeeded
$

3. Remove the cloud config that has been identified in the previous step.
$ bosh delete-config 34
Using environment '10.0.0.11' as client 'ops_manager'

Continue? [yN]: y

Succeeded
$
$ bosh configs
Using environment '10.0.0.11' as client 'ops_manager'

ID   Type     Name                                                   Team                                            Created At
43*  cloud    default                                                -                                               2021-08-24 19:52:31 UTC
44*  cloud    pivotal-container-service-7195e4047fc4a0624176         pivotal-container-service-7195e4047fc4a0624176  2021-08-24 20:20:29 UTC
46*  cloud    service-instance_6d91f398-e24b-4dbf-adcc-123981b6e19b  pivotal-container-service-7195e4047fc4a0624176  2021-08-26 14:38:21 UTC
5*   cpi      default                                                -                                               2021-07-13 16:02:51 UTC
3*   runtime  director_runtime                                       -                                               2021-07-13 16:02:51 UTC
1*   runtime  ops_manager_dns_runtime                                -                                               2021-07-13 16:02:50 UTC
2*   runtime  ops_manager_system_metrics_runtime                     -                                               2021-07-13 16:02:51 UTC
45*  runtime  p-compliance-scanner-8829c4bd1fce99029ccb-oscap        -                                               2021-08-26 14:14:16 UTC

(*) Currently active
Only showing active configs. To see older versions use the --recent=10 option.

8 configs

Succeeded
$
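
If working by config ID is inconvenient, the same stale config can be removed by type and name instead. This is an equivalent alternative, assuming the stale name identified in step 2; double-check the name first, since deleting the cloud config of the current deployment would break it:

$ bosh delete-config --type=cloud --name=pivotal-container-service-93677816facf7bfa6caf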

Once the stale cloud config has been removed, the metrics_server is expected to stop crash-looping and return to a running state.
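
To verify, the process state can be re-checked on the Director VM after giving monit a minute or two to restart the job; this assumes a root shell on the Director VM, as in the earlier check, and the line below is the expected output once the job is healthy:

$ /var/vcap/bosh/bin/monit summary | grep metrics_server
Process 'metrics_server'            running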