Alarm for NCP health

Products

VMware NSX VMware NSX Networking

Issue/Introduction

Title: Alarm for NCP health
Event ID: ncp_health.ncp_plugin_down
Added in release: 3.0.0
Alarm Description

Purpose: Detect and report when the NCP plugin is down.
Impact: Containers that are already running are not affected. Changes to existing containers will fail, and new containers cannot be created.

Environment

VMware NSX-T Data Center

VMware NSX

Resolution

To find the clusters that have issues, open the Alarms page in the NSX UI. The Entity name value of this alarm instance identifies the cluster name. Alternatively, invoke the NSX API GET /api/v1/systemhealth/container-cluster/ncp/status to fetch the status of all clusters and note the name of any cluster that reports DOWN or UNKNOWN. Then, on the Inventory | Container | Clusters page of the NSX UI, find the cluster by name and click the Nodes tab, which lists all Kubernetes, OpenShift, SupervisorCluster, and TKGi/TAS cluster members.
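The API call above could be issued with curl roughly as follows. This is a minimal sketch: the helper name ncp_status_url, the host nsx-mgr.example.com, and the admin user are placeholders introduced here for illustration, and basic authentication is assumed to be enabled on the NSX Manager.

```shell
#!/bin/sh
# Hypothetical helper: build the NCP status URL for a given NSX Manager host.
ncp_status_url() {
    printf 'https://%s/api/v1/systemhealth/container-cluster/ncp/status\n' "$1"
}

# Example call (placeholders: nsx-mgr.example.com, admin).
# curl prompts for the password when -u is given only a user name.
#   curl -sS -u admin "$(ncp_status_url nsx-mgr.example.com)"
```

In the response, look for clusters whose reported status is DOWN or UNKNOWN and note their names.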

For Kubernetes/OpenShift/SupervisorCluster cluster:
1. Check NCP Pod liveness. Find the K8s leader node among the cluster members and log in to it.
Then list all Pods to check whether the NCP Pod is running and to find its name and namespace.
- 'kubectl get pods --all-namespaces'

If there is an issue with the NCP Pod, use the kubectl logs command to identify the error and fix it.
- 'kubectl logs <NCP-Pod-Name> -n <NCP-Pod-NameSpace>'

2. Check the connection between NCP and Kubernetes API server. The NSX CLI can be used inside the NCP Pod to check this connection status by invoking the following commands from the leader VM.
- 'kubectl exec -it <NCP-Pod-Name> -n <NCP-Pod-NameSpace> bash'
- 'nsxcli'
- 'get ncp-k8s-api-server status'
If there is an issue with the connection, check both the network and NCP configurations.

3. Check the connection between NCP and NSX Manager. The NSX CLI can be used inside the NCP Pod to check this connection status by invoking the following commands from the leader VM.
- 'kubectl exec -it <NCP-Pod-Name> -n <NCP-Pod-NameSpace> bash'
- 'nsxcli'
- 'get ncp-nsx status'
If there is an issue with the connection, check both the network and NCP configurations.

4. Alternatively, try restarting the NCP Pod to fix the issue.

5. Check whether the WCP cluster that is reported as UNKNOWN is stale. For details, refer to KB 375626.
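As a rough sketch of the Pod check in step 1, the helper below filters the output of 'kubectl get pods --all-namespaces' down to NCP Pods that are not in the Running state. The function name ncp_pods_not_running and the default name pattern nsx-ncp are assumptions made for illustration; pass the pattern that matches the NCP Pod name in your deployment.

```shell
#!/bin/sh
# Hypothetical helper: reads `kubectl get pods --all-namespaces` output on
# stdin and prints "<namespace> <pod> <status>" for every Pod whose name
# contains the pattern and whose STATUS column is not "Running".
# The default pattern "nsx-ncp" is an assumption, not taken from this article.
ncp_pods_not_running() {
    pattern="${1:-nsx-ncp}"
    awk -v pat="$pattern" \
        'NR > 1 && index($2, pat) && $4 != "Running" { print $1, $2, $4 }'
}

# Usage on the leader node:
#   kubectl get pods --all-namespaces | ncp_pods_not_running
```

Empty output means no matching Pod is in a state other than Running. The Pod name and namespace printed otherwise can be passed directly to the kubectl logs and kubectl exec commands in the steps above.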

For TKGi cluster:
1. Get the deployment name of the cluster.
- The cluster name is of the form pks-<UUID of cluster>
- The cluster deployment name is of the form service-instance_<UUID of cluster>

2. Log in to Operations Manager using SSH, then list all VMs of the cluster to find the master VM name.
- 'bosh vms -d service-instance_<UUID>'

3. The master VM name is of the form master/<random generated number>. Log in to the NCP master VM.
- 'bosh ssh -d service-instance_<UUID> master/<vm id>'

4. Check the NCP process status.
- 'sudo monit status' or 'sudo monit summary'

5. If NCP is not running, go to the NCP log folder and check the NCP logs.
- 'cd /var/vcap/sys/logs/ncp'

6. Alternatively, try restarting NCP to fix the issue.
- 'sudo monit restart ncp'
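The name mapping in step 1 can be sketched as a small shell helper. The function name tkgi_deployment_name is hypothetical; it only rewrites the pks- prefix to service-instance_ as described above and does not validate the UUID.

```shell
#!/bin/sh
# Hypothetical helper: derive the BOSH deployment name from a TKGi cluster
# name of the form pks-<UUID>. Fails if the name lacks the pks- prefix.
tkgi_deployment_name() {
    case "$1" in
        pks-*) printf 'service-instance_%s\n' "${1#pks-}" ;;
        *)     echo "unexpected cluster name: $1" >&2; return 1 ;;
    esac
}

# Usage on Operations Manager:
#   bosh vms -d "$(tkgi_deployment_name pks-<UUID>)"
```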

For TAS foundation:
1. Log in to Operations Manager using SSH, then list all VMs of the TAS deployment.
- 'bosh vms'
- TAS deployment name is of the form 'cf-<deployment id>'

2. Find the VM named diego_database/<instance id>, on which NCP runs as a process.

3. Log in to the diego_database VM.
- 'bosh ssh -d cf-<deployment id> diego_database/<instance id>'

4. Check the NCP process status.
- 'sudo monit status' or 'sudo monit summary'

5. If NCP is not running, go to the NCP log folder and check the NCP logs.
- 'cd /var/vcap/sys/logs/ncp'

6. Alternatively, try restarting NCP to fix the issue.
- 'sudo monit restart ncp'
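The status check in step 4 could be scripted along these lines. This is a sketch under the assumption that 'monit summary' prints one line per process in the form Process 'ncp' followed by the state; the helper name ncp_monit_state is hypothetical, and the output layout may differ between monit versions.

```shell
#!/bin/sh
# Hypothetical helper: reads `monit summary` output on stdin and prints the
# state reported for the process named ncp. Returns non-zero if no such
# line is found. Assumes lines of the form:  Process 'ncp'   <state>
ncp_monit_state() {
    awk -v q="'" '$1 == "Process" && $2 == (q "ncp" q) {
        $1 = ""; $2 = ""      # drop the Process and name fields
        sub(/^ +/, "")        # trim the separators left behind
        print; found = 1
    }
    END { exit found ? 0 : 1 }'
}

# Usage on the diego_database VM:
#   sudo monit summary | ncp_monit_state
```

Any state other than running suggests reviewing the logs in /var/vcap/sys/logs/ncp before attempting a restart.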

Additional Information

Admin Guide: https://docs.vmware.com/en/VMware-NSX-Container-Plugin/4.1/ncp-kubernetes/GUID-FB641321-319D-41DC-9D16-37D6BA0BC0DE.html