Symptoms:
When a Supervisor cluster and a guest cluster are created through NSX Application Platform (NAPP), guest cluster creation hangs in the "Creating" state with the VMOP error "400 Bad Request:" and the CLS error "An error occurred: future must be done".
Environment Information:
Errors:
The specific problem signature shows up as follows:
1. TKC shows “Creating” in vSphere WCP UI
2. vSphere UI -> WCP NS -> k8s events show:
The machinehealthcheck objects for the worker and CP nodes report this reconcile error in the vSphere UI:
“error creating client and cache for remote cluster: error creating dynamic rest mapper for remote cluster”
3. The vSphere UI shows a WCPMachine reconcile error and a VM create failure message for the CP node:
“vm is not yet created vmware-system-capw-controller-manager/WCPMachine/”
4. You may also see errors when deploying from the content library:
“deploy from content library failed for image "xxxxx": 400 Bad Request: {"type":"com.vmware.vapi.std.errors.invalid_argument","value":{"error_type":"INVALID_ARGUMENT","messages":[{"args":["future must be done"],"default_message":"An error occurred: future must be done","id":"com.vmware.vdcs.util.unhandled_error"}]}}”
5. VMOP logs on all SV CP VMs show "400 Bad Request", "future must be done" errors:
./master-vm-848127.tgz_extracted/wcp-agent-422e39967b1144172b78c4120e8e2b20-2023-03-23--20.49-80574/var/log/pods/vmware-system-vmop_vmware-system-vmop-controller-manager-7c556d5f9f-2lvv5_da00a708-de95-43ca-b65e-b42e37f8f772/manager/3.log:2023-03-23T20:49:35.074339517Z stderr F E0323 20:49:35.074232 1 virtualmachine_controller.go:263] VirtualMachine "msg"="Failed to reconcile VirtualMachine" "error"="deploy from content library failed for image \"ob-18900476-photon-3-k8s-v1.21.6---vmware.1-tkg.1.b3d708a\": 400 Bad Request: {\"type\":\"com.vmware.vapi.std.errors.invalid_argument\",\"value\":{\"error_type\":\"INVALID_ARGUMENT\",\"messages\":[{\"args\":[\"future must be done\"],\"default_message\":\"An error occurred: future must be done\",\"id\":\"com.vmware.vdcs.util.unhandled_error\"}]}}" "name"="nsx-01/napp-cluster-01-control-plane-zcwjw"
6. CLS shows "vim.fault.NoPermission":
2023-03-23T21:06:19.965Z | ERROR | 4104bad3-9fb5-4e96-99cc-a1bbf17d56e8-da-40 | cls-simple-activity-20 | EnsureTaskRegisteredActivity | Cannot change state for ManagedObjectReference: type = Task, value = task-2674878, serverGuid = 5d8c93ad-dbda-42e9-a24c-0cb7fcc7aef2 from queued to running. Runtime error reported for task.setState (vim.fault.NoPermission) { faultCause = null, faultMessage = null, object = ManagedObjectReference: type = ResourcePool, value = resgroup-848141, serverGuid = 5d8c93ad-dbda-42e9-a24c-0cb7fcc7aef2, privilegeId = Task.Update, missingPrivileges = null }. retrying...
7. CLS then shows:
2023-03-23T21:00:08.350Z | DEBUG | b1e2dc39-880e-45e3-8f6a-6ae44d4b1a21-1b | cls-simple-activity-9 | ApiMethodSkeleton | Method implementation threw a VMODL2 error com.vmware.vapi.std.errors.InvalidArgument: InvalidArgument (com.vmware.vapi.std.errors.invalid_argument) => { messages = [LocalizableMessage (com.vmware.vapi.std.localizable_message) => { id = com.vmware.vdcs.util.unhandled_error, defaultMessage = An error occurred: future must be done, args = [future must be done], params = <null>, localized = <null> }], data = <null>, errorType = INVALID_ARGUMENT } at sun.reflect.GeneratedConstructorAccessor423.newInstance(Unknown Source) ~[?:?] <SNIP>
8. The NSX APP appliance shows these errors:
E0323 19:46:24.246564 1 vmprovider.go:214] vsphere "msg"="Clone VirtualMachine failed" "error"="deploy from content library failed for image \"ob-18900476-photon-3-k8s-v1.21.6---vmware.1-tkg.1.b3d708a\": 400 Bad Request: {\"type\":\"com.vmware.vapi.std.errors.invalid_argument\",\"value\":{\"error_type\":\"INVALID_ARGUMENT\",\"messages\":[{\"args\":[\"future must be done\"],\"default_message\":\"An error occurred: future must be done\",\"id\":\"com.vmware.vdcs.util.unhandled_error\"}]}}" "vmName"="nsx-01/napp-cluster-01-control-plane-zcwjw"
E0323 19:46:24.246588 1 virtualmachine_controller.go:748] VirtualMachine "msg"="Provider failed to create VirtualMachine" "error"="deploy from content library failed for image \"ob-18900476-photon-3-k8s-v1.21.6---vmware.1-tkg.1.b3d708a\": 400 Bad Request: {\"type\":\"com.vmware.vapi.std.errors.invalid_argument\",\"value\":{\"error_type\":\"INVALID_ARGUMENT\",\"messages\":[{\"args\":[\"future must be done\"],\"default_message\":\"An error occurred: future must be done\",\"id\":\"com.vmware.vdcs.util.unhandled_error\"}]}}" "name"="nsx-01/napp-cluster-01-control-plane-zcwjw"
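The signature above spans several logs, so a quick scan of an extracted support bundle can help confirm the match. The following is a minimal sketch, not a VMware tool. It assumes the log text is available as an iterable of lines, and the names scan_lines and matches_known_issue are hypothetical. The search strings are taken from the excerpts in this article.

```python
import re
from collections import Counter

# Signature strings copied from the log excerpts in this article.
SIGNATURES = {
    "future_must_be_done": re.compile(r"future must be done"),
    "bad_request_400": re.compile(r"400 Bad Request"),
    "no_permission": re.compile(r"vim\.fault\.NoPermission"),
}

# Extracts the privilege named in a CLS NoPermission fault, e.g. "Task.Update".
PRIVILEGE_RE = re.compile(r"privilegeId = ([\w.]+)")


def scan_lines(lines):
    """Count signature hits and collect privilege IDs from NoPermission faults."""
    counts = Counter()
    privileges = set()
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
        if SIGNATURES["no_permission"].search(line):
            match = PRIVILEGE_RE.search(line)
            if match:
                privileges.add(match.group(1))
    return counts, privileges


def matches_known_issue(counts):
    """True when both the 'future must be done' error and the permission fault appear."""
    return counts["future_must_be_done"] > 0 and counts["no_permission"] > 0
```

For example, passing an open log file (opened with errors="replace" to tolerate odd bytes) to scan_lines returns the hit counts and the set of privilege IDs found. A match only suggests that the logs resemble this article's signature; it does not confirm the cause.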
VMware vSphere 7.0 with Tanzu
This issue is caused by incorrect group membership of the vpxd-extension user(s).
The vpxd-extension user(s) may need to be removed from the Administrators group.
Example of what these users might look like:
['vpxd-extension-########-####-####-####-############', 'vpxd-extension-########-####-####-####-############']
These vpxd-extension-#### user(s) should instead be added to the ServiceProviderUsers group.
Contact VMware Support for assistance with running the fixAdministratorsGroup.py script, which corrects the group membership of the vpxd-extension users.
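The intended membership change can be expressed as a small planning check. This is an illustrative sketch only. It is not the fixAdministratorsGroup.py script and it changes nothing on vCenter. It assumes the current members of both groups have already been collected into plain lists by some other means, and the function name plan_membership_changes and the sample user names are hypothetical.

```python
import re

# Matches the naming shown above: "vpxd-extension-" followed by a UUID-like suffix.
VPXD_EXTENSION_RE = re.compile(r"^vpxd-extension-[0-9a-fA-F-]+$")


def plan_membership_changes(administrators, service_provider_users):
    """Return (remove_from_administrators, add_to_service_provider_users).

    Both arguments are lists of user names that the caller gathered beforehand.
    """
    # Every vpxd-extension user found in Administrators is a removal candidate.
    to_remove = sorted(u for u in administrators if VPXD_EXTENSION_RE.match(u))
    already_present = set(service_provider_users)
    # Only add users that are not already in ServiceProviderUsers.
    to_add = [u for u in to_remove if u not in already_present]
    return to_remove, to_add


if __name__ == "__main__":
    # Hypothetical sample data for illustration.
    admins = [
        "Administrator",
        "vpxd-extension-0a1b2c3d-0000-0000-0000-000000000001",
        "vpxd-extension-0a1b2c3d-0000-0000-0000-000000000002",
    ]
    sp_users = ["vpxd-extension-0a1b2c3d-0000-0000-0000-000000000002"]
    remove, add = plan_membership_changes(admins, sp_users)
    print("Remove from Administrators:", remove)
    print("Add to ServiceProviderUsers:", add)
```

The output is a review list only. The actual correction should still be performed with VMware Support, as described above.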