SSP: Cleaning Stale NSX Service Instances after Service Deployment Removal

Article ID: 399374


Products

VMware vDefend Firewall
VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

There are multiple scenarios in which stale service instances are left behind after removing a service deployment, and they can manifest in different ways. One such error is shown below.

If you are using the Malware Prevention Service (MPS):

After deploying MPS on a cluster, check the status of the deployment under IDS/IPS & Malware Prevention → Settings → Shared → Activate Hosts & Clusters for East-West Traffic.

Clicking Deployment Status to view the overall status, as well as the per-transport-node status, shows an error similar to: "Error: The requested object : DeploymentUnitInstance/#### could not be found. Object identifiers are case sensitive. (Error code: 600)"

Querying the status of the deployment through API calls returns similar errors.
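
For reference, the sketch below shows such a status query in Python. It lists the instance runtimes via the same endpoint path that the Resolution section uses for cleanup (with GET instead of POST); the Manager address, credentials, and IDs are placeholders:

import requests

MANAGER = "<Manager-IP>"       # placeholders; substitute real values
SERVICE_ID = "<Service-ID>"
INSTANCE_ID = "<Instance-ID>"

resp = requests.get(
    f"https://{MANAGER}/api/v1/serviceinsertion/services/{SERVICE_ID}"
    f"/service-instances/{INSTANCE_ID}/instance-runtimes",
    auth=("admin", "<password>"),
    verify=False,  # lab convenience; validate the Manager certificate in production
)
print(resp.status_code)
print(resp.text)  # a stale runtime surfaces here as error code 600, as above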

Environment

SSP 5.0

Cause

Looking at the InstanceRuntime table (output of corfu_tool_runner.py --tool corfu-browser -o showTable -n nsx -t InstanceRuntime), there is an InstanceRuntime whose service ID or product version differs from the other instances.

Example:

Key:
{
  "uuid": {                              <------- Note this key for deleting the stale instance runtime.
    "left": "$$$$",
    "right": "$$$$"
  }
}

Payload:
{
  "managedResource": {
    "displayName": "some-svm"
  },
  "serviceInstanceId": {              <----- Note this key for deleting the stale service instance
    "left": "&&&&",            
    "right": "&&&&"
  },
  "deploymentUnitId": {
    "left": "###",
    "right": "###"
  },
  "deploymentInstanceId": {
    "left": "###",
    "right": "###"
  },
  "hostId": "###",
  "svmId": "###:vm-###",
  "vmExternalId": "###",
  "deploymentState": "VM_DEPLOYMENT_STATE_DEPLOYMENT_SUCCESSFUL",
  "runtimeState": "VM_RUNTIME_STATE_IN_SERVICE",
  "vmNicInfo": {
    "nicInfo": [{
      "nicMetadata": {
        "interfaceLabel": "eth",
        "interfaceType": "INTERFACE_TYPE_MGMT",
        "userConfigurable": true
      },
      "networkId": "dvportgroup-##",
      "ipAddress": {
        "ipv4": ###
      },
      "subnetMask": {
        "ipv4": ###
      },
      "gatewayAddress": {
        "ipv4": ###
      },
      "macAddress": {
        "mac": "###"
      },
      "vif": "###",
      "ipPoolId": {
        "left": "###",
        "right": "###"
      },
      "dnsServer": ["###", "###"],
      "dnsSuffix": "###",
      "ipAllocationType": "IP_ALLOCATION_TYPE_STATIC"
    }, {
      "nicMetadata": {
        "interfaceLabel": "eth",
        "interfaceIndex": 1,
        "interfaceType": "INTERFACE_TYPE_CONTROL",
        "userConfigurable": false
      },
      "macAddress": {
        "mac": "###"
      },
      "vif": "###"
    }]
  },
  "markedAsSvm": true,
  "serviceId": {
    "left": "123456789",
    "right": "987654321" <------------ This service ID seems to be old as well as it does not match the other deployments
  },
  "isMacAvailableForAllNic": true
}

Metadata:
{
  "revision": "###",
  "createTime": "###",
  "createUser": "system",
  "lastModifiedTime": "###",
  "lastModifiedUser": "system",
  "productVersion": "3.2.3.1.0" <----- indicates the SVM was deployed from an older NSX version
}
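
To spot the outlier programmatically, one approach is sketched below. It assumes the serviceId halves and productVersion of each InstanceRuntime have already been copied out of the corfu-browser output into Python dictionaries (the values here are placeholders); the record whose serviceId differs from the majority is the stale candidate.

from collections import Counter

# Hypothetical, hand-copied records: one dict per InstanceRuntime with its
# serviceId halves and productVersion taken from the corfu-browser output.
runtimes = [
    {"uuid": "aaaa/bbbb", "serviceId": ("1111", "2222"), "productVersion": "9.0.0"},
    {"uuid": "cccc/dddd", "serviceId": ("1111", "2222"), "productVersion": "9.0.0"},
    {"uuid": "$$$$/$$$$", "serviceId": ("123456789", "987654321"), "productVersion": "3.2.3.1.0"},
]

# The serviceId shared by most runtimes is assumed to be the current one.
majority = Counter(r["serviceId"] for r in runtimes).most_common(1)[0][0]

for r in runtimes:
    if r["serviceId"] != majority:
        print("possible stale runtime:", r["uuid"], r["serviceId"], r["productVersion"])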

 


Resolution

a. Un-deploy the service from the problematic cluster 

b. Use the API below to clean up the identified stale ServiceInstances / InstanceRuntimes:

POST https://<Manager-IP>/api/v1/serviceinsertion/services/<Service-ID>/service-instances/<Instance-ID>/instance-runtimes?action=delete 

Note: you may use the script below to derive the service-id and instance-id UUIDs from the 'left' and 'right' halves of the corfu key.

#!/usr/bin/env python3

# usage : thistool.py left right
# example:
# user@ubuntu2204:~/tools$ python thistool.py 5309577210414842440 13828241991281864423
# > 49af6857-6e49-4248-bfe7-c8d36e7eeee7

import sys
import uuid

def main():
    # Corfu stores a UUID as two 64-bit halves; recombine them into one UUID.
    left = int(sys.argv[1])   # most-significant 64 bits ('left')
    right = int(sys.argv[2])  # least-significant 64 bits ('right')
    print(uuid.UUID(int=(left << 64) + right))

if __name__ == "__main__":
    main()
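
If preferred, the key conversion and the cleanup call can be combined in one script. A minimal sketch, assuming the Python requests library is available; the Manager address, credentials, and left/right values are placeholders taken from the example above:

import uuid
import requests

MANAGER = "<Manager-IP>"  # placeholder

def corfu_key_to_uuid(left, right):
    # Recombine the two 64-bit corfu key halves into a UUID string.
    return str(uuid.UUID(int=(int(left) << 64) + int(right)))

service_id = corfu_key_to_uuid("123456789", "987654321")                        # 'serviceId' left/right
instance_id = corfu_key_to_uuid("5309577210414842440", "13828241991281864423")  # 'serviceInstanceId' left/right

resp = requests.post(
    f"https://{MANAGER}/api/v1/serviceinsertion/services/{service_id}"
    f"/service-instances/{instance_id}/instance-runtimes",
    params={"action": "delete"},
    auth=("admin", "<password>"),
    verify=False,  # lab convenience; validate the Manager certificate in production
)
print(resp.status_code, resp.text)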

c. Re-deploy the service in the cluster

 

If the above steps do not resolve the issue, execute the following commands from the NSX Manager's root shell:

corfu_tool_runner.py -n nsx -o showTable -t ServiceInstance  > /somelocation/ServiceInstance.txt

corfu_tool_runner.py -n nsx -o showTable -t InstanceEndpoint  > /somelocation/InstanceEndpoint.txt

corfu_tool_runner.py -n nsx -o showTable -t InstanceRuntime  > /somelocation/InstanceRuntime.txt

corfu_tool_runner.py -n nsx -o showTable -t ServiceDeployment  > /somelocation/ServiceDeployment.txt

corfu_tool_runner.py -n nsx -o showTable -t GiNodeSolutionInfo  > /somelocation/GiNodeSolutionInfo.txt
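
Optionally, the five exports can be scripted in one pass. A sketch, assuming corfu_tool_runner.py is on the PATH and the /somelocation directory exists:

import subprocess

TABLES = ["ServiceInstance", "InstanceEndpoint", "InstanceRuntime",
          "ServiceDeployment", "GiNodeSolutionInfo"]

for table in TABLES:
    # Same command as above, one table at a time, output redirected to a file.
    with open(f"/somelocation/{table}.txt", "w") as out:
        subprocess.run(
            ["corfu_tool_runner.py", "-n", "nsx", "-o", "showTable", "-t", table],
            stdout=out, check=True)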

Collect the output, along with a support bundle, and submit a support request. Since the cleanup process involves database modifications, ensure that you have up-to-date backups.