Alarm for nsx-node-agent health status

Products

VMware NSX

Issue/Introduction

Title: Alarm for nsx-node-agent health status
Event ID: node_agents_health.node_agents_down
Added in release: 3.0.0

Alarm Description

Purpose: Detect and report when the connection between nsx-node-agent and hyperbus is down.
Impact: When a scheduler places a Pod/Container on a node where nsx-node-agent is not healthy, the Pod/Container cannot reach the running state.

Environment

VMware NSX-T Data Center

Resolution

For Kubernetes/OpenShift cluster:
1. On the K8s master VM, check whether the connection between the nsx-node-agent container and hyperbus is down.
First, find the nsx-node-agent Pod name and namespace:
- 'kubectl get pods --all-namespaces'

Then open a shell in the nsx-node-agent container, start nsxcli, and check the connection status:
- 'kubectl exec -it <nsx-node-agent-Pod-Name> -n <nsx-node-agent-Pod-NameSpace> -c nsx-node-agent bash'
- 'nsxcli'
- 'get node-agent-hyperbus status'

2. If there is an issue with the nsx-node-agent container, use the kubectl logs command to inspect the container logs and fix the error they report.
- 'kubectl logs <nsx-node-agent-Pod-Name> -n <nsx-node-agent-Pod-NameSpace> -c nsx-node-agent'

3. Alternatively, try restarting the nsx-node-agent Pod to fix the issue.
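
The Kubernetes/OpenShift steps above can be sketched as two small shell helpers. This is only an illustration: the function names are hypothetical, and it assumes that nsx-node-agent Pod names start with "nsx-node-agent" and that kubectl prints the default NAMESPACE, NAME, READY, STATUS columns. The helpers only filter text and print the commands from the steps above; they do not run anything against the cluster themselves.

```shell
#!/bin/sh
# Sketch only: helpers around the kubectl commands from the steps above.
# Assumption: nsx-node-agent Pod names start with "nsx-node-agent".

# Read 'kubectl get pods --all-namespaces' output on stdin and print
# "<namespace> <pod-name> <status>" for each nsx-node-agent Pod.
find_node_agent_pods() {
    awk '$2 ~ /^nsx-node-agent/ { print $1, $2, $4 }'
}

# Read "<namespace> <pod-name> ..." lines on stdin and print, for each Pod,
# the shell and log commands to run by hand.
print_check_commands() {
    while read -r ns pod _rest; do
        printf 'kubectl exec -it %s -n %s -c nsx-node-agent bash\n' "$pod" "$ns"
        printf 'kubectl logs %s -n %s -c nsx-node-agent\n' "$pod" "$ns"
    done
}

# Usage (requires a configured kubectl):
#   kubectl get pods --all-namespaces | find_node_agent_pods | print_check_commands
```

After opening the shell with the printed kubectl exec command, run 'nsxcli' and 'get node-agent-hyperbus status' as described in step 1.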

For TKGi cluster:
1. Get the TKGi cluster UUID from the UI and derive the deployment name of the cluster.
- The cluster name is in the form pks-<UUID of cluster>.
- The cluster deployment name is in the form service-instance_<UUID of cluster>.

2. Log in to Operations Manager using SSH, then list all VMs of the cluster deployment to find the worker VMs.
- 'bosh vms -d service-instance_<UUID>'

3. Worker VM names are in the form worker/<vm id>. Log in to a worker VM.
- 'bosh ssh -d service-instance_<UUID> worker/<vm id>'

4. Check nsx-node-agent process status.
- 'sudo monit status' or 'sudo monit summary'

5. Alternatively, list all VMs of the deployment together with the processes running on them and their status in a single command.
- 'bosh instances -d service-instance_<UUID> -p'

6. If nsx-node-agent is not running, go to the nsx-node-agent log folder and check the nsx-node-agent logs.
- 'cd /var/vcap/sys/logs/nsx-node-agent'

7. Alternatively, try restarting nsx-node-agent to fix the issue.
- 'sudo monit restart nsx-node-agent'
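
As a rough sketch of steps 1 to 3 above, the deployment name can be derived from the cluster UUID and the worker VMs filtered out of the 'bosh vms' listing. The function names are hypothetical, and the filter assumes the instance name (for example worker/<vm id>) is the first column of the 'bosh vms' output.

```shell
#!/bin/sh
# Sketch only: helpers for the TKGi steps above.

# Print the BOSH deployment name for a cluster UUID. A leading "pks-"
# (the cluster name form) is stripped if present.
tkgi_deployment_name() {
    printf 'service-instance_%s\n' "${1#pks-}"
}

# Read 'bosh vms -d <deployment>' output on stdin and print the worker
# instance names (assumption: the instance name is the first column).
list_worker_vms() {
    awk '$1 ~ /^worker\// { print $1 }'
}

# Usage (run where the bosh CLI is configured; <UUID> is a placeholder):
#   dep=$(tkgi_deployment_name "<UUID>")
#   bosh vms -d "$dep" | list_worker_vms
```

Each name printed by list_worker_vms can then be passed to 'bosh ssh -d service-instance_<UUID> worker/<vm id>' as in step 3.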

For TAS foundation:
1. Log in to Operations Manager using SSH, then list all VMs of the TAS deployment.
- 'bosh vms'
- The TAS deployment name is in the form 'cf-<deployment id>'.

2. Find the VMs named diego_cell/<instance id>, on which nsx-node-agent runs as a process.

3. Log in to each diego_cell VM.
- 'bosh ssh -d cf-<deployment id> diego_cell/<instance id>'

4. Check nsx-node-agent process status.
- 'sudo monit status' or 'sudo monit summary'

5. Alternatively, list all VMs of the deployment together with the processes running on them and their status in a single command.
- 'bosh instances -d cf-<deployment id> -p'

6. If nsx-node-agent is not running, go to the nsx-node-agent log folder and check the logs.
- 'cd /var/vcap/sys/logs/nsx-node-agent'

7. Alternatively, try restarting nsx-node-agent to fix the issue.
- 'sudo monit restart nsx-node-agent'
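
The process check in step 4 can be scripted on a diego_cell or worker VM along the following lines. This is a sketch under an assumption about the 'monit summary' output, namely that each process appears on a line of the form Process 'nsx-node-agent' followed by its state; the function names are hypothetical.

```shell
#!/bin/sh
# Sketch only: read 'sudo monit summary' output and report nsx-node-agent state.
# Assumption: summary lines look like
#   Process 'nsx-node-agent'            running

# Read monit summary output on stdin and print the state text reported
# for the nsx-node-agent process (for example "running").
node_agent_state() {
    awk -v name="'nsx-node-agent'" '
        $1 == "Process" && $2 == name {
            $1 = ""; $2 = ""      # drop the "Process" and name fields
            sub(/^ +/, "")        # trim the separators left behind
            print
        }'
}

# Succeed only if the reported state is exactly "running".
node_agent_is_running() {
    [ "$(node_agent_state)" = "running" ]
}

# Usage on the VM:
#   sudo monit summary | node_agent_is_running || echo "nsx-node-agent is not running"
```

If the check fails, the logs under /var/vcap/sys/logs/nsx-node-agent and 'sudo monit restart nsx-node-agent' from steps 6 and 7 are the next things to try.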