Alarm for nsx-node-agent health status

Article ID: 345805

Products

VMware NSX

Issue/Introduction

Title: Alarm for nsx-node-agent health status
Event ID: node_agents_health.node_agents_down
Added in release: 3.0.0
Alarm Description

  • Purpose: Detect and report when the connection between nsx-node-agent and hyperbus is down.
  • Impact: When a scheduler places a Pod/Container on a node where nsx-node-agent is not healthy, the Pod/Container cannot reach the Running state.

Resolution

For Kubernetes/OpenShift cluster:

  1. On the Kubernetes (K8s) Master VM, check if the connection between the nsx-node-agent container and hyperbus is down. To find the nsx-node-agent Pod name and namespace:
    • kubectl get pods --all-namespaces
  2. Open a shell in the nsx-node-agent container with kubectl exec, then start the NSX CLI inside it and check the connection status:
    • kubectl exec -it <nsx-node-agent-Pod-Name> -n <nsx-node-agent-Pod-NameSpace> -c nsx-node-agent bash
    • nsxcli
    • get node-agent-hyperbus status
  3. If the nsx-node-agent container reports a problem, inspect its logs with kubectl logs to identify and fix the error:
    • kubectl logs <nsx-node-agent-Pod-Name> -n <nsx-node-agent-Pod-NameSpace> -c nsx-node-agent
  4. Alternatively, try to restart the nsx-node-agent Pod to fix the issue.
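
The lookup in steps 1 to 3 can be sketched as a pair of small shell helpers. This is only an illustration: the function names are hypothetical, and it assumes the Pod names begin with "nsx-node-agent", which may not hold in every deployment. The helpers only print the commands from the steps above and do not execute them.

```shell
# Sketch: turn `kubectl get pods --all-namespaces` output into ready-to-run
# commands for each nsx-node-agent Pod.
# Assumption: Pod names begin with "nsx-node-agent"; adjust the pattern if
# your deployment names them differently.

# Reads the kubectl listing on stdin and prints "<namespace> <pod-name>" pairs.
find_node_agent_pods() {
  awk 'NR > 1 && $2 ~ /^nsx-node-agent/ { print $1, $2 }'
}

# Reads "<namespace> <pod-name>" pairs on stdin and prints the exec and logs
# commands from steps 2 and 3 for each Pod. Nothing is executed here.
print_node_agent_commands() {
  while read -r ns pod; do
    echo "kubectl exec -it $pod -n $ns -c nsx-node-agent bash"
    echo "kubectl logs $pod -n $ns -c nsx-node-agent"
  done
}
```

A possible invocation on the Master VM is `kubectl get pods --all-namespaces | find_node_agent_pods | print_node_agent_commands`.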

For TKGi cluster:

  1. Get the UUID of the TKGi cluster from the UI and derive the deployment name of the cluster from it:
    • The cluster name is in the form pks-<UUID of cluster>.
    • The cluster deployment name is in the form service-instance_<UUID of cluster>.
  2. Log in to Operations Manager using SSH, and then list all VMs of the cluster deployment to find the worker VMs:
    • bosh vms -d service-instance_<UUID>
  3. Worker VM names are in the form worker/<vm-id>. Log in to the worker VM:
    • bosh ssh -d service-instance_<UUID> worker/<vm-id>
  4. Check the nsx-node-agent process status with either of the following commands:
    • sudo monit status
    • sudo monit summary
  5. Alternatively, list all VMs of the deployment together with the processes running on them and their status in a single command:
    • bosh instances -d service-instance_<UUID> -p
  6. If the nsx-node-agent is not running, go to the nsx-node-agent log folder and check the logs:
    • cd /var/vcap/sys/logs/nsx-node-agent
  7. Alternatively, try to restart nsx-node-agent to fix the issue:
    • sudo monit restart nsx-node-agent
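
As a rough sketch of steps 1 and 4, the helpers below build the deployment name from a cluster UUID and extract the nsx-node-agent state from monit output. The function names are hypothetical, and the parser assumes that monit summary prints a line of the form `Process 'nsx-node-agent'  <state>`, so check the output on your worker VM before relying on it.

```shell
# Sketch: helpers for the TKGi steps above.

# Builds the BOSH deployment name from the cluster UUID (step 1).
tkgi_deployment_name() {
  echo "service-instance_$1"
}

# Reads `sudo monit summary` output on stdin and prints the state reported
# for nsx-node-agent, for example "running" or "not monitored".
# Assumption: the line has the form  Process 'nsx-node-agent'   <state>
node_agent_state() {
  sed -n "s/^Process 'nsx-node-agent'[[:space:]]*//p"
}
```

On a worker VM this could be used as `sudo monit summary | node_agent_state`. Any state other than `running` suggests checking the logs in step 6 before restarting the process in step 7.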

For TAS foundation:

  1. Log in to Operations Manager using SSH, and then run bosh vms to list all VMs of the TAS deployment:
    • The TAS deployment name is of the form cf-<deployment id>.
    • bosh vms -d cf-<deployment id>
  2. Find the VMs named diego_cell/<instance id>, on which nsx-node-agent runs as a process.
  3. Log in to each diego_cell VM:
    • bosh ssh -d cf-<deployment id> diego_cell/<instance id>
  4. Check the nsx-node-agent process status with either of the following commands:
    • sudo monit status
    • sudo monit summary
  5. Alternatively, list all VMs of the deployment together with the processes running on them and their status in a single command:
    • bosh instances -d cf-<deployment id> -p
  6. If the nsx-node-agent is not running, go to the nsx-node-agent log folder and check the logs:
    • cd /var/vcap/sys/logs/nsx-node-agent
  7. Alternatively, try to restart nsx-node-agent to fix the issue:
    • sudo monit restart nsx-node-agent
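
Because step 3 has to be repeated for every diego_cell VM, a small filter can generate the login commands from the bosh vms listing. This is a sketch with hypothetical function names. It assumes the instance name is the first column of the bosh vms output, and it only prints the commands rather than running them.

```shell
# Sketch: list diego_cell instances and print a login command for each.

# Reads `bosh vms -d cf-<deployment id>` output on stdin and prints the
# instance names that start with "diego_cell/".
# Assumption: the instance name is the first whitespace-separated column.
list_diego_cells() {
  awk '$1 ~ /^diego_cell\// { print $1 }'
}

# Reads instance names on stdin and prints the bosh ssh command from step 3
# for each one. The deployment name is passed as the first argument.
print_diego_cell_logins() {
  deployment="$1"
  while read -r instance; do
    echo "bosh ssh -d $deployment $instance"
  done
}
```

For example, `bosh vms -d cf-<deployment id> | list_diego_cells | print_diego_cell_logins cf-<deployment id>` would print one login command per diego_cell VM.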