HCX - Unable to redeploy NE appliance post network unextension workflow failure

Article ID: 321620

Products

VMware HCX VMware Cloud on AWS

Issue/Introduction

  • This article describes a failure to redeploy the HCX Network Extension (NE) appliance after a network unextension workflow fails, and how to recover from it.
  • The HCX NE appliance may be left in an unstable state after the unextension workflow. As a result, service mesh operations for that NE appliance, such as "Resync" or "Redeploy", are no longer serviced.
    The following error appears in the HCX Cloud Manager log /common/log/admin/app.log:
2023-03-12 00:21:22.210 UTC [InterconnectService_SvcThread-4705, J:d02ceb0d, , TxId:########-####-####-####-############] ERROR c.v.v.h.s.i.ProcessServiceMesh- checkResources failed, errorCode:null. stacktrace:null, errorMessage:Interconnect Service Workflow AllocateResourcesForInterconnectAppliance failed. Error: Error while configuring appliance networks.. Cause: Could not resolve segment /infra/tier-1s/cgw/segments/hcx-ne-########-####-####-####-############ to opaque network. Failed to get realized state. Result: {"status":"failure","statusCode":404,"details":"","result":{"httpStatus":"NOT_FOUND","error_code":500090,"module_name":"Policy","error_message":"Policy object path=[\/infra\/tier-1s\/cgw\/segments\/hcx-ne-########-####-####-####-############] does not exist."}}
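
When triaging, it can help to pull the missing segment path out of such a log line. The following is a minimal sketch, assuming the line carries a "Result: " marker followed by a JSON payload shaped like the one above; the function name parse_missing_segment is hypothetical and is not part of HCX or NSX-T.

```python
import json
import re


def parse_missing_segment(log_line):
    """Return the NSX-T policy path reported as missing, or None.

    Assumes the log line ends with 'Result: <json>' as in the sample above.
    """
    marker = "Result: "
    idx = log_line.find(marker)
    if idx == -1:
        return None
    try:
        payload = json.loads(log_line[idx + len(marker):].strip())
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    result = payload.get("result")
    if not isinstance(result, dict):
        return None
    # Only treat the line as a match when NSX-T answered 404 / NOT_FOUND.
    if payload.get("statusCode") != 404 or result.get("httpStatus") != "NOT_FOUND":
        return None
    # The path is embedded in the message as: path=[/infra/.../segments/<id>]
    match = re.search(r"path=\[([^\]]+)\]", result.get("error_message", ""))
    return match.group(1) if match else None
```

Passing the line shown above to parse_missing_segment would return the "/infra/tier-1s/cgw/segments/hcx-ne-..." path, which identifies the segment that HCX still expects to find in NSX-T.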



Environment

VMware HCX

Cause

When a segment is extended through a given NE appliance, HCX Cloud Manager makes an API call to create an HCX-defined segment, such as "L2E-#####", in the cloud (destination) NSX-T Policy UI.
If the unstretch workflow fails because of a backend or infrastructure issue, NSX-T does not remove or clean up the extended segment "L2E-#####" from its Policy UI.

IMPORTANT: One potential cause of a network unextension workflow failure is high memory usage on an NE appliance when a user extends more than 5 segments per NE appliance.
This is documented in Broadcom KB 91086 and is fixed in HCX 4.6.1 and later.

If a user then manually cleans up or deletes the "L2E-#####" segment using the NSX-T Policy UI or API, the NSX management layer does not notify HCX Manager, so HCX keeps the stretched segment backing record in its backend system.
Note: NSX-T allows users to delete any segment from the Policy UI or API if no VMs are attached to that segment.

As a result, when a user runs a "Resync" or "Redeploy" operation for that service mesh (SM) or NE appliance, HCX tries to validate the extended segment backing record with NSX-T, which returns a 404 "NOT_FOUND" error because the segment has already been deleted from NSX-T.
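
The mismatch described above can be illustrated with a small sketch. It assumes the segment paths that HCX has on record are already known, and that the caller supplies a lookup function returning the HTTP status code NSX-T gives for each path. Both missing_segments and lookup_status are hypothetical names for illustration; how the lookup reaches NSX-T is left to the caller and is not an HCX or NSX-T API defined in this article.

```python
from typing import Callable, Iterable, List


def missing_segments(
    recorded_paths: Iterable[str],
    lookup_status: Callable[[str], int],
) -> List[str]:
    """Return recorded segment paths for which NSX-T answers 404.

    recorded_paths: segment paths HCX still holds backing records for (assumed input).
    lookup_status:  caller-supplied function mapping a path to an HTTP status code.
    """
    missing = []
    for path in recorded_paths:
        if lookup_status(path) == 404:
            # HCX still tracks this segment, but NSX-T no longer has it.
            missing.append(path)
    return missing


if __name__ == "__main__":
    # Simulated NSX-T inventory: only one of the two recorded segments still exists.
    existing = {"/infra/tier-1s/cgw/segments/L2E-kept"}
    fake_lookup = lambda p: 200 if p in existing else 404
    print(missing_segments(
        ["/infra/tier-1s/cgw/segments/L2E-kept",
         "/infra/tier-1s/cgw/segments/L2E-deleted"],
        fake_lookup,
    ))
```

Any path this returns corresponds to a segment that was removed from NSX-T while HCX still holds its backing record, which is the condition that makes "Resync" or "Redeploy" fail.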

RECOMMENDATIONS:

  • DO NOT delete or remove any L2E segments from the cloud NSX-T until HCX and its associated components have been undeployed from that cloud environment.

Note: L2E segments created on the cloud NSX-T as part of a failed extension or unextension workflow can be reused when re-extending the network with HCX.
Note: L2E segments are essentially the same as native NSX-T segments and can be used as regular segments to connect workload VMs.

  • Also, DO NOT delete or remove NSX-T objects that HCX extension workflows depend on, such as "HCX-L2CMacDiscoveryProfile". Removing them may require redeploying the service mesh and NE appliance.

Resolution

This is a condition that may occur in a VMware HCX environment.

If you believe you have encountered this issue, please open a support case with Broadcom Support and refer to this KB article.

For more information, see Creating and managing Broadcom support cases.

Additional Information

Alternatively, depending on the workflow or action, you can create a new service mesh for the new operation.

Another possible resolution is to unextend the networks, remove the NE service from the Service Mesh (which removes the NE appliances), and then re-add the NE service to the Service Mesh, which creates a new set of NE appliances.

Refer to the KB article "HCX - NE appliance state becomes critical due to Memory component".

Impact/Risks:

  • This issue affects the NE appliances that were used during the unextension workflow.
  • The issue surfaces only when a user removes the extended segment (L2E-#####) from the cloud NSX-T Policy UI or API.
  • NE appliances running in HA mode are also affected.
  • The existing extended datapath is not affected.
  • There is NO impact to Interconnect (IX) appliances.