This article describes how to clean up broken EAM agencies without interrupting VXLAN / data plane traffic.
After using the approach below, you are expected to re-install NSX through the web UI to recreate the EAM agency for the cluster. The completion time for NSX host preparation can be estimated as follows:
host_preparation_complete_time = number_of_hosts / 5 * 20 secs
EAM processes at most 5 hosts at a time, and each time a VIB is already present you have to wait 20 seconds.
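As a worked example of the estimate above, the following minimal sketch applies the formula literally with shell integer arithmetic. The function name `estimate_host_prep_secs` and the 30-host cluster are hypothetical, and the result is only a rough estimate.

```shell
# Estimate NSX host preparation time using the article's formula:
#   number_of_hosts / 5 * 20 seconds
# The function name is hypothetical and used for illustration only.
estimate_host_prep_secs() {
  hosts="$1"
  # Multiply before dividing so integer arithmetic does not truncate early.
  echo $(( hosts * 20 / 5 ))
}

estimate_host_prep_secs 30   # hypothetical 30-host cluster, prints 120
```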
If the goal is to remove the NSX installation from a broken cluster, manually remove the NSX VIBs on the affected hosts from the command line. For example:
# esxcli software vib remove -n esx-nsxv
This article describes a general approach to deleting an EAM agency that has been identified as the cause of an NSX host preparation failure. For the known issues related to host preparation, see NSX Host Preparation fails with an EAM error: "Host is no longer in vCenter inventory" (52550).
After this approach is used, the cluster is shown as "uninstalled" in the vSphere UI. However, the hosts within the cluster still have the NSX VIB and VTEP, so VXLAN traffic should keep working. You are expected to re-install NSX afterwards to recreate the EAM agency for the cluster.
VMware NSX Data Center for vSphere
As described in NSX Host Preparation fails with an EAM error: "Host is no longer in vCenter inventory" (52550), this known issue causes host preparation to fail when hosts are moved in and out of an NSX-prepared cluster while EAM is down.
During host preparation, EAM creates an agency for the corresponding cluster and an agent for each host. If the host preparation failure is caused by an agent (that is, a host), identify the agency that owns the agent and then use this approach to delete the agency. The benefit of this approach is that it also covers the scenario where multiple agents in the same agency have issues. The agency can be identified from eam.log.
For example, run:
# grep -Ei "failed to load|disposing agency" eam.log
The "Disposing agency ..." entry contains the agency ID to be removed.
2018-05-22T14:14:46.961+03:00 | ERROR | eam-0 | AgencyImpl.java | 1895 | Failed to load agent: ManagedObjectReference: type = Agent, value = ########-####-####-####-6295cb48329f, serverGuid = ########-####-####-####-ac5b655d8136. Error: Host not covered by scope anymore
2018-05-22T14:14:46.962+03:00 | WARN | eam-0 | AgencyImpl.java | 1954 | Disposing agency: AgencyImpl(ID:'Agency:########-####-####-####-ed26e3cd7041:########-####-####-####-ac5b655d8136') due to failed load up.
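The lookup above can be sketched as a small helper that prints only the ID string from each "Disposing agency" entry. This is a minimal sketch that assumes the log format shown in the sample lines. The function name `disposed_agency_ids` is hypothetical, and which part of the printed string serves as the agency ID in later steps should be confirmed against your own environment.

```shell
# Print the quoted ID string of every agency that EAM reports as disposed.
# Assumes the "Disposing agency: AgencyImpl(ID:'...')" format shown above.
disposed_agency_ids() {
  sed -n "s/.*Disposing agency: AgencyImpl(ID:'\([^']*\)').*/\1/p" "$1"
}

# Demo against a one-line sample copied from the article (IDs are masked).
sample_log="$(mktemp)"
cat > "$sample_log" <<'EOF'
2018-05-22T14:14:46.962+03:00 | WARN | eam-0 | AgencyImpl.java | 1954 | Disposing agency: AgencyImpl(ID:'Agency:########-####-####-####-ed26e3cd7041:########-####-####-####-ac5b655d8136') due to failed load up.
EOF
disposed_agency_ids "$sample_log"
rm -f "$sample_log"
```

On a real system the function would be pointed at eam.log instead of the temporary sample file.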
Details for step 5:
# select objectid, name from domain_object where objectid='<deploymentunit_ID>';
# curl -i -k -H "X-NSX-Username:admin" -H "Content-Type:application/xml" -X DELETE 'http://<localhost>:7441/api/2.0/si/deploy/service/<service_ID>?clusters=<cluster_ID>'
Note: Use -i with curl to get the return code.
For example:
# select objectid,name from domain_object where objectid='deploymentunit-#';
objectid | name
------------------+-------------------------
deploymentunit-# | domain-##_service-#
# curl -i -k -H "X-NSX-Username:admin" -H "Content-Type:application/xml" -X DELETE 'http://<localhost>:7441/api/2.0/si/deploy/service/service-#?clusters=domain-##'
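In the example above, the name column (domain-##_service-#) appears to combine the cluster ID and the service ID used in the DELETE URL. The following minimal sketch splits such a name and prints the URL without sending any request. It assumes the name always has the form <cluster_ID>_<service_ID>, which is inferred from the example only. The function name `build_delete_url` and the IDs passed to it are hypothetical.

```shell
# Build the DELETE URL from a deployment unit name of the assumed form
# <cluster_ID>_<service_ID>, for example domain-##_service-#.
# The URL is only printed; no request is sent.
build_delete_url() {
  name="$1"
  cluster="${name%%_*}"   # text before the first underscore
  service="${name#*_}"    # text after the first underscore
  printf 'http://<localhost>:7441/api/2.0/si/deploy/service/%s?clusters=%s\n' "$service" "$cluster"
}

build_delete_url 'domain-12_service-3'   # hypothetical IDs
```

Printing the URL first makes it possible to review the service and cluster IDs before running the curl command by hand.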
Details for step 6:
# update VPX_EXT_DATA set ext_id = 'com.vmware.vim.eam.backup' where data_key like '%<AGENCY_ID>%' or data_value like '%<AGENCY_ID>%';
Details for step 7:
# select data_key, data_value from vpx_ext_data where data_key like '%EsxAgentManager:agency%' and ext_id = 'com.vmware.vim.eam' order by data_key;
# update VPX_EXT_DATA set data_key = '########-####-####-####-cb7447a29dfe::EsxAgentManager:EsxAgentManager:agency[5]' where data_key = '########-####-####-####-cb7447a29dfe::EsxAgentManager:EsxAgentManager:agency[10]';