Unable to expand VxRail cluster | Adding a host Task stuck at Fetching subtask info

Article ID: 314645


Products

VMware Cloud Foundation

Issue/Introduction

The purpose of this KB is to work around the issue described below so that a VxRail host can be added successfully to the cluster in SDDC Manager.


Symptoms:

When adding a VxRail host to expand a WLD cluster in SDDC Manager:
 

  • The task is stuck at "Fetching subtask info"
  • Errors are shown when the workflow executes
  • The task never completes in the UI and is reported as failed in the SDDC Manager database:
domainmanager=# select * from task where id ='a4ff8635-####-####-####-##########6c';

id                      | a4ff8635-####-####-####-##########6c
resource_id             | 93303e5d-####-####-####-##########62
resource_type           | ESX_HOST
state                   | COMPLETED_WITH_FAILURE
description             | Adding new host(s) to vxrail cluster
errors                  | [{"messageBundle":"com.vmware.evo.sddc.common.core.error.messages","errorCode":"VCF_ERROR_INTERNAL_SERVER_ERROR","arguments":[],
"message":"A problem has occurred on the server. Please retry or contact the service provider and provide the reference token.","cause":
[{"type":"com.vmware.evo.sddc.common.services.error.SddcManagerServicesIsException","message":"Error in getting workflow options for addition of host to cluster. Check logs"},
{"type":"com.vmware.evo.sddc.common.vxrail.error.VxRailManagerException","message":"Unable to fetch details for port groups managed by VxRail Manager vxrm.gsslabs.com"}],"referenceToken":"ONPAQ3"}]
timestamp               | 1674144689474
completion_timestamp    |
localizable_description | null
 
  • domainmanager.log shows the specific VxRail API that is timing out:
2023-01-19T20:52:50.118+0000 DEBUG [vcf_dm,63bb90392797462f,03ea] [c.v.v.secure.http.HttpClientService,dm-exec-5] 
Making request: GET https://vxrm.gsslabs.com:443/rest/vxm/v1/system/cluster-portgroups/esx07.gsslabs.com
...
...
2023-01-19T20:52:51.695+0000 ERROR [vcf_dm,2f2578c538a84ba1,559a] [c.v.v.v.h.w.VxRailHostWorkflowInitiator,dm-exec-6] Failed to start workflow for add host task a4ff8635-####-####-####-##########6c
com.vmware.evo.sddc.common.services.error.SddcManagerServicesIsException: Error in getting workflow options for addition of host to cluster. Check logs
    at com.vmware.evo.sddc.common.services.adapters.workflow.options.WorkflowOptionsAdapterImpl.getWorkflowOptionsForAddHostToVxRailCluster(WorkflowOptionsAdapterImpl.java:269)
    at com.vmware.vxrail.vcf.hostmanager.workflows.VxRailHostWorkflowInitiator.startWorkFlow(VxRailHostWorkflowInitiator.java:151)
    at com.vmware.vxrail.vcf.hostmanager.workflows.VxRailHostWorkflowInitiator$$FastClassBySpringCGLIB$$13eaaa4f.invoke(<generated>)
...
...
Caused by: com.vmware.evo.sddc.common.vxrail.error.VxRailManagerException: Unable to fetch details for port groups managed by VxRail Manager vxrm.gsslabs.com
    at com.vmware.evo.sddc.common.vxrail.VxRailManagerService.getVxRailSystemTrafficPortGroups(VxRailManagerService.java:1213)
    at com.vmware.evo.sddc.common.vxrail.VxRailManagerService.getVxRailSystemTrafficPortGroups(VxRailManagerService.java:1277)
...
...
Caused by: java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
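To check whether a given log contains the same failure signature, a small filter like the one below may help. This is only a sketch: the function name find_portgroup_timeout is hypothetical, and the search strings are taken from the log excerpt above.

```shell
#!/usr/bin/env bash
# Sketch: scan a domainmanager log for the failure signature shown above.
# find_portgroup_timeout is a hypothetical helper, not a product command.

# find_portgroup_timeout <log_file>
# Prints every matching line. Returns 0 when at least one line matches
# and non-zero otherwise (grep's own exit status).
find_portgroup_timeout() {
  local log_file="$1"
  grep -E 'cluster-portgroups|Unable to fetch details for port groups|SocketTimeoutException: Read timed out' "$log_file"
}
```

Example usage, assuming the log is at /var/log/vmware/vcf/domainmanager/domainmanager.log (this path is an assumption and is not stated in this article):

find_portgroup_timeout /var/log/vmware/vcf/domainmanager/domainmanager.log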

 

 



Cause

This issue occurs when the VxRail Manager takes longer than 1.5 minutes (90 seconds) to respond to the API call that returns the cluster-portgroups.

The API in question (as reported in the logs above) is:

curl -k -X GET --user '<sso_user>:<sso_password>' https://<VxRail_Manager>/rest/vxm/v1/system/cluster-portgroups/<VxRail_Host_Name>


For example:

curl -k -X GET --user '<sso_user>:$ecretPa55' https://vxrm.gsslabs.com:443/rest/vxm/v1/system/cluster-portgroups/esx07.gsslabs.com


The timeout value configured in the domainmanager service is 1.5 minutes. So if the API takes longer than that to respond, the task fails with the errors reported above.
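To confirm that the API call is the bottleneck, it can help to measure how long the request takes and compare that against the 90000 ms default. The following is a minimal sketch: both function names are hypothetical helpers, and it relies only on curl's standard -w '%{time_total}' output.

```shell
#!/usr/bin/env bash
# Sketch: measure the cluster-portgroups call and compare it to a timeout.
# exceeds_timeout and time_portgroup_call are hypothetical helper names.

# exceeds_timeout <elapsed_seconds> <timeout_ms>
# Returns 0 (true) when the elapsed time is greater than the timeout.
exceeds_timeout() {
  awk -v s="$1" -v ms="$2" 'BEGIN { exit !(s * 1000 > ms) }'
}

# time_portgroup_call <vxrail_manager> <host_name> <sso_user>
# Prints the total request time in seconds as reported by curl.
# Only the user name is passed, so curl prompts for the password.
time_portgroup_call() {
  curl -k -s -o /dev/null -w '%{time_total}\n' --user "$3" \
    "https://$1/rest/vxm/v1/system/cluster-portgroups/$2"
}
```

Example usage with the host names from the logs above:

elapsed=$(time_portgroup_call vxrm.gsslabs.com esx07.gsslabs.com '<sso_user>')
exceeds_timeout "$elapsed" 90000 && echo "Response is slower than the default timeout"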

Resolution

To resolve the issue, determine why the VxRail Manager takes so long to respond to the GET API call that returns the cluster-portgroups, and address that root cause.

On the SDDC Manager, the issue can be worked around temporarily by increasing the timeout value for the domainmanager service. The steps are provided below.


Workaround:

0. Take a snapshot of the SDDC Manager VM.

1. SSH to the SDDC Manager as the vcf user, then switch to root with su.

2. Edit the file: /etc/vmware/vcf/domainmanager/application-prod.properties

vi /etc/vmware/vcf/domainmanager/application-prod.properties


3. Add the following entry to set the timeout value to 300000 ms (i.e., 5 minutes).
Note: The default value is 90000 ms (i.e., 1.5 minutes).

http.client.timeout.milis=300000


4. Save the file and quit: press ESC, then type :wq!


5. Restart the domainmanager service using the command:

systemctl restart domainmanager


6. Wait for the service to come up
 

7. Re-try adding the VxRail host to the cluster.
Reference Document: 
Add the VxRail Hosts to the Cluster in VMware Cloud Foundation

This time the task should progress, and the SDDC Manager UI should show the status of the task along with its sub-tasks and additional details.
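As an alternative to editing the file by hand in steps 2 to 4, the property can be set with a small script. This is a minimal sketch, not an official procedure: set_property is a hypothetical helper, it assumes GNU sed (for sed -i), and it should be tried on a copy of the file first.

```shell
#!/usr/bin/env bash
# Sketch: set a key=value entry in a properties file, keeping a backup.
# set_property is a hypothetical helper; GNU sed is assumed for "sed -i".

# set_property <file> <key> <value>
set_property() {
  local file="$1" key="$2" value="$3"
  local key_re="${key//./\\.}"      # escape dots so they match literally
  cp -p "$file" "$file.bak"         # keep a backup next to the original
  if grep -q "^${key_re}=" "$file"; then
    # Key already present: replace the existing line in place.
    sed -i "s|^${key_re}=.*|${key}=${value}|" "$file"
  else
    # Key absent: append a new entry at the end of the file.
    printf '%s=%s\n' "$key" "$value" >> "$file"
  fi
}
```

Example usage, followed by the restart from step 5 and a status check for step 6:

set_property /etc/vmware/vcf/domainmanager/application-prod.properties http.client.timeout.milis 300000
systemctl restart domainmanager
systemctl is-active domainmanager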


Additional Information

Impact/Risks:
MINIMAL: The workaround increases the timeout value for the domainmanager service. Because this changes a configuration on the SDDC Manager, taking a snapshot of the SDDC Manager VM beforehand is recommended.