Title: Alarm for nsx-node-agent health status
Event ID: node_agents_health.node_agents_down
Added in release: 3.0.0
Alarm Description
Purpose: Detect and report when the connection between nsx-node-agent and hyperbus is down.
Impact: When a scheduler places a Pod/Container on a node where nsx-node-agent is not healthy, the Pod/Container cannot reach a running state.
Environment
VMware NSX-T Data Center
Resolution
For Kubernetes/OpenShift cluster:
1. On the K8s master VM, check whether the connection between the nsx-node-agent container and hyperbus is down. First, find the nsx-node-agent Pod name and namespace.
   - 'kubectl get pods --all-namespaces'
   Then invoke the following commands to check the connection status.
   - 'kubectl exec -it <nsx-node-agent-Pod-Name> -n <nsx-node-agent-Pod-NameSpace> -c nsx-node-agent bash'
   - 'nsxcli'
   - 'get node-agent-hyperbus status'
2. If there is an issue with the nsx-node-agent container, use the kubectl logs command to identify the error and fix it.
   - 'kubectl logs <nsx-node-agent-Pod-Name> -n <nsx-node-agent-Pod-NameSpace> -c nsx-node-agent'
3. Alternatively, restart the nsx-node-agent Pod to fix the issue.
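The Pod lookup and log check for a Kubernetes/OpenShift cluster can be partly scripted. The following is a minimal sketch, not an official tool. It assumes that the Pod names contain the string nsx-node-agent and that kubectl is already configured for the cluster. The function names find_node_agent_pods and show_node_agent_logs are hypothetical helpers introduced here for illustration.

```shell
# Sketch only: helper functions for locating nsx-node-agent Pods.
# Assumption: Pod names contain the string "nsx-node-agent".

# Print "<namespace> <pod-name>" for every nsx-node-agent Pod.
find_node_agent_pods() {
  kubectl get pods --all-namespaces --no-headers \
    | awk '$2 ~ /nsx-node-agent/ { print $1, $2 }'
}

# Print the last 50 log lines of the nsx-node-agent container in each Pod.
show_node_agent_logs() {
  find_node_agent_pods | while read -r ns pod; do
    echo "== ${ns}/${pod} =="
    # stdin is redirected so kubectl cannot consume the loop input
    kubectl logs "$pod" -n "$ns" -c nsx-node-agent --tail=50 </dev/null
  done
}
```

Calling find_node_agent_pods gives the values to substitute for <nsx-node-agent-Pod-NameSpace> and <nsx-node-agent-Pod-Name> in the kubectl exec command. The hyperbus status check itself is still run interactively through nsxcli.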
For TKGi cluster:
1. Get the TKGi cluster UUID from the UI, and then determine the deployment name of the cluster.
   - The cluster name is in the form pks-<UUID of cluster>.
   - The cluster deployment name is in the form service-instance_<UUID of cluster>.
2. Log in to Operations Manager using SSH, and then list all VMs of the cluster deployment to find the worker VMs.
   - 'bosh vms -d service-instance_<UUID>'
3. Worker VM names are in the form worker/<vm id>. Log in to a worker VM.
   - 'bosh ssh -d service-instance_<UUID> worker/<vm id>'
4. Check the nsx-node-agent process status.
   - 'sudo monit status' or 'sudo monit summary'
5. Alternatively, list all VMs of the deployment together with the processes running on them and their status.
   - 'bosh instances -d service-instance_<UUID> -p'
6. If nsx-node-agent is not running, go to the nsx-node-agent log folder and check the nsx-node-agent logs.
   - 'cd /var/vcap/sys/logs/nsx-node-agent'
7. Alternatively, restart nsx-node-agent to fix the issue.
   - 'sudo monit restart nsx-node-agent'
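The TKGi lookup of worker VMs can be sketched in shell as well. This is an illustration under stated assumptions, not a supported procedure. It assumes the bosh CLI is already targeted and authenticated, and that non-interactive bosh output puts the instance name in the first column. The function names deployment_name, list_workers, and check_workers are hypothetical. The sketch uses the -c option of bosh ssh to run a single command instead of opening an interactive session.

```shell
# Sketch only: find TKGi worker VMs and query monit on each of them.

# Build the BOSH deployment name from the cluster UUID.
deployment_name() {
  printf 'service-instance_%s\n' "$1"
}

# Print the worker/<vm id> instances of the cluster deployment.
# Assumption: the instance name is the first column of the output.
list_workers() {
  bosh vms -d "$(deployment_name "$1")" \
    | awk '$1 ~ /^worker\// { print $1 }'
}

# Run 'sudo monit summary' on every worker VM of the cluster.
check_workers() {
  dep="$(deployment_name "$1")"
  list_workers "$1" | while read -r vm; do
    echo "== ${vm} =="
    # stdin is redirected so bosh ssh cannot consume the loop input
    bosh ssh -d "$dep" "$vm" -c 'sudo monit summary' </dev/null
  done
}
```

For example, check_workers <UUID of cluster> prints the monit summary of each worker, which shows whether nsx-node-agent is running without logging in to each VM by hand.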
For TAS foundation:
1. Log in to Operations Manager using SSH, and then list all VMs to find the TAS deployment.
   - 'bosh vms'
   - The TAS deployment name is in the form cf-<deployment id>.
2. Find the VMs named diego_cell/<instance id>. nsx-node-agent runs as a process on these VMs.
3. Log in to each diego_cell VM.
   - 'bosh ssh -d cf-<deployment id> diego_cell/<instance id>'
4. Check the nsx-node-agent process status.
   - 'sudo monit status' or 'sudo monit summary'
5. Alternatively, list all VMs of the deployment together with the processes running on them and their status.
   - 'bosh instances -d cf-<deployment id> -p'
6. If nsx-node-agent is not running, go to the nsx-node-agent log folder and check the logs.
   - 'cd /var/vcap/sys/logs/nsx-node-agent'
7. Alternatively, restart nsx-node-agent to fix the issue.
   - 'sudo monit restart nsx-node-agent'
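On a diego_cell or worker VM, the monit check can be narrowed to the nsx-node-agent entry alone. The sketch below assumes the usual monit summary layout, in which each process appears on a line of the form Process 'name' followed by its status. The function name node_agent_state is hypothetical.

```shell
# Sketch only: extract the nsx-node-agent state from 'monit summary' output.
# Assumption: monit prints lines such as
#   Process 'nsx-node-agent'            running

# Read monit summary text on stdin and print the nsx-node-agent status,
# or "not found" if monit lists no such process.
node_agent_state() {
  awk -v name="'nsx-node-agent'" '
    $1 == "Process" && $2 == name {
      $1 = ""; $2 = ""        # drop the "Process" and name fields
      sub(/^ +/, "")          # trim the leading blanks left behind
      print
      found = 1
    }
    END { if (!found) print "not found" }
  '
}
```

On the VM, this would be used as sudo monit summary | node_agent_state. Any result other than running suggests checking the logs under /var/vcap/sys/logs/nsx-node-agent before trying a restart.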