The Artemis cluster is the internal mechanism that facilitates cell-to-cell communication. On occasion, this mechanism degrades and cell-to-cell communication falters along with it. When that happens, task processing can take considerably longer, because the cell that handles a particular task cannot relay an update of the task's completion while the internal communication mechanism is non-functional.
Symptoms:
VCD environments gradually degrade and task processing becomes slower and slower
Commonplace operations (power on/off a VM, modify the configuration of a VM, instantiate a new VM/vApp, etc.) take exceptionally long to complete
Tasks that normally only take seconds or a few minutes are taking 10 minutes or more to complete
This issue can usually be identified in the cell-runtime.log file: compare the expected Artemis cluster topology against the real Artemis cluster topology
In the screenshot below, you'll see the expected Artemis cluster topology highlighted in yellow (as well as the IP of the missing cell), and the real Artemis cluster topology highlighted in red; the disparity between the expected value and real value indicates that the Artemis cluster has degraded and is missing a member, thus inducing slow task processing
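One way to pull candidate entries out of the log is a simple case-insensitive search. The sketch below assumes the log is at /opt/vmware/vcloud-director/logs/cell-runtime.log and that the relevant lines contain the word "topology". Both the default path and the search term are assumptions, since the exact log wording is not reproduced in this article, and the helper name find_topology_lines is purely illustrative.

```shell
#!/bin/sh
# Sketch only: LOG default and PATTERN are assumptions, adjust to your environment.
LOG="${LOG:-/opt/vmware/vcloud-director/logs/cell-runtime.log}"
PATTERN="${PATTERN:-topology}"

# Print the last 20 lines of the given file that match PATTERN (case-insensitive).
find_topology_lines() {
    grep -i -- "$PATTERN" "$1" | tail -n 20
}

if [ -r "$LOG" ]; then
    find_topology_lines "$LOG"
else
    echo "cannot read $LOG" >&2
fi
```

The matching lines still need to be read by hand to compare the expected member list against the real one; the search only narrows down where to look.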
Environment
VMware Cloud Director for Service Provider 10.x
VMware Cloud Director 10.x
Cause
Versions 10.3.X and 10.4.X are susceptible to degradation of the mechanisms that facilitate task processing. The Artemis cluster topology is known to lose participating members and thus induce slowness in the environment
Resolution
This issue has persisted on versions 10.3.X and 10.4.X
Workaround:
This issue can be temporarily bypassed by performing a rolling reboot of all cells, or alternatively, by running the vmware-vcd service exclusively on the primary cell
For versions 10.4.X, the following configuration changes can be implemented to reduce the impact of this issue
Please note: All cell-management-tool commands listed below only need to be executed on the primary cell
Set the connectionTTL to 90s. The default is 60s: /opt/vmware/vcloud-director/bin/cell-management-tool manage-config -n "jms.cluster.connectionTTL" -v "90000"
Set the clientFailureCheckPeriod to 45s: /opt/vmware/vcloud-director/bin/cell-management-tool manage-config -n "jms.cluster.clientFailureCheckPeriod" -v "45000"
Set the Task Poller retrieval interval to 60s - this polls vCenter for task updates: /opt/vmware/vcloud-director/bin/cell-management-tool manage-config -n vc-task-completions-retrieval-timer-interval-sec -v 60
Set the Activity Poller retrieval interval to 60s - this polls data from the activity table for completion of activities: /opt/vmware/vcloud-director/bin/cell-management-tool manage-config -n vcloud.activities.activityRelayPollingIntervalMs -v 60000
Set the VCD inventory timeout to 600s: /opt/vmware/vcloud-director/bin/cell-management-tool manage-config -n InventoryWait -v 600000
Set the Event Processor duration to 120s: /opt/vmware/vcloud-director/bin/cell-management-tool manage-config -n event.processor.running.duration.millisec -v 120000
Perform a shutdown and restart of the vmware-vcd services on ALL cells in the environment: service vmware-vcd restart
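The configuration steps above can be collected into one small script to run on the primary cell. This is a minimal sketch, not an official tool: the property names and values are copied from the steps above, while the DRY_RUN switch and the apply_setting and apply_all helpers are illustrative additions. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: apply the settings listed in this article on the primary cell.
# DRY_RUN=1 (the default here) prints the commands instead of executing them.
CMT="${CMT:-/opt/vmware/vcloud-director/bin/cell-management-tool}"
DRY_RUN="${DRY_RUN:-1}"

# $1 = property name, $2 = value
apply_setting() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$CMT manage-config -n \"$1\" -v \"$2\""
    else
        "$CMT" manage-config -n "$1" -v "$2"
    fi
}

apply_all() {
    apply_setting "jms.cluster.connectionTTL" "90000"
    apply_setting "jms.cluster.clientFailureCheckPeriod" "45000"
    apply_setting "vc-task-completions-retrieval-timer-interval-sec" "60"
    apply_setting "vcloud.activities.activityRelayPollingIntervalMs" "60000"
    apply_setting "InventoryWait" "600000"
    apply_setting "event.processor.running.duration.millisec" "120000"
}

apply_all
```

Review the printed commands first, then re-run with DRY_RUN=0 to apply them. The vmware-vcd service restart on all cells is still a separate, manual step.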
Additional Information
To check whether a value is already in place, run the same command with the "-l" option instead of "-v <value>".
Example:
# /opt/vmware/vcloud-director/bin/cell-management-tool manage-config -n "jms.cluster.connectionTTL" -l
Property "jms.cluster.connectionTTL" has value "90000"
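The same check can be repeated for every property in one pass. The sketch below assumes the "-l" output always follows the form shown in the example (Property "<name>" has value "<value>"); how the tool reports an unset property is not covered in this article, so anything that does not parse is simply reported as differing. The helper names parse_value, check_setting and check_all are illustrative.

```shell
#!/bin/sh
# Sketch: compare current values against the ones recommended in this article.
CMT="${CMT:-/opt/vmware/vcloud-director/bin/cell-management-tool}"

# Read "-l" output on stdin and print only the quoted value, if any.
parse_value() {
    sed -n 's/.*has value "\(.*\)".*/\1/p'
}

# $1 = property name, $2 = expected value
check_setting() {
    actual=$("$CMT" manage-config -n "$1" -l 2>/dev/null | parse_value)
    if [ "$actual" = "$2" ]; then
        echo "OK       $1 = $actual"
    else
        echo "DIFFERS  $1 = '$actual' (expected $2)"
    fi
}

check_all() {
    check_setting "jms.cluster.connectionTTL" "90000"
    check_setting "jms.cluster.clientFailureCheckPeriod" "45000"
    check_setting "vc-task-completions-retrieval-timer-interval-sec" "60"
    check_setting "vcloud.activities.activityRelayPollingIntervalMs" "60000"
    check_setting "InventoryWait" "600000"
    check_setting "event.processor.running.duration.millisec" "120000"
}

# Only query the tool when it is actually present on this machine.
if [ -x "$CMT" ]; then
    check_all
else
    echo "cell-management-tool not found at $CMT" >&2
fi
```

A line marked DIFFERS does not necessarily mean something is wrong; it only means the current value is not the one suggested in the workaround.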
Impact/Risks:
Slow task processing can result in issues with failover add-ons, as well as general discontent with the quality of the VCD experience