This KB explains why nsx-sha reports a false-positive alarm and how to resolve the issue.
Symptoms:
An alarm reported by an edge node appears in the NSX UI, but it is a false positive:
2024-03-11T09:59:34.787Z FATAL pool-64-thread-2 MonitoringServiceImpl 69664 MONITORING [nsx@6876
alarmId="########-####-####-####-############" alarmState="OPEN" comp="nsx-manager" entId="########-####-####-####-############" errorCode="MP701099" eventFeatureName="edge_health" eventSev="CRITICAL" eventState="On" eventType="edge_nic_link_status_down" level="FATAL"
nodeId="########-####-####-####-############" subcomp="monitoring"] Edge node NIC eth0 link is down.
Environment:
VMware NSX-T Data Center
VMware NSX-T Data Center 3.x
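Cause:
This happens because nsx-sha cannot complete the command that reads the NIC link state within its timeout (4 seconds in the log below), so the request is reported as failed even though the command eventually returns "up".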
The "Timeout req may hanging" message appears in /var/log/syslog on the edge node:
2024-03-11T09:48:23.148Z SWN-PS-3Z-PB-T0-BASE06-vmedge01.ps.krw.pb NSX 425692 - [nsx@6876 comp="nsx-edge" subcomp="nsx-sha" username="nsx-sha" level="WARNING" s2comp="fork-monitor"] Req timeout, waiting for 36.31458378955722 seconds: {'cmd': ['sudo', 'cat', '/sys/class/net/eth0/operstate'], 'input': None, 'shell': False, 'timeout': 4, 'resp_queue': <queue.Queue object at 0x71c384076e50>, 'env': None, 'type': 0, 'timestamp': 13663939.608154405, 'seq': 1518, 'timed_out': 13663975.922738194, 'timed_log': 13663975.922738194}
2024-03-11T09:49:23.534Z SWN-PS-3Z-PB-T0-BASE06-vmedge01.ps.krw.pb NSX 425692 - [nsx@6876 comp="nsx-edge" subcomp="nsx-sha" username="nsx-sha" level="WARNING" s2comp="fork-monitor"] Timeout req may hanging, waiting for 96.7007926274091 seconds: {'cmd': ['sudo', 'cat', '/sys/class/net/eth0/operstate'], 'input': None, 'shell': False, 'timeout': 4, 'resp_queue': <queue.Queue object at 0x71c384076e50>, 'env': None, 'type': 0, 'timestamp': 13663939.608154405, 'seq': 1518, 'timed_out': 13663975.922738194, 'timed_log': 13663975.922738194}
2024-03-11T09:55:25.735Z SWN-PS-3Z-PB-T0-BASE06-vmedge01.ps.krw.pb NSX 425692 - [nsx@6876 comp="nsx-edge" subcomp="nsx-sha" username="nsx-sha" level="WARNING" s2comp="fork-monitor"] Timeout req may hanging, waiting for 458.90203033946455 seconds: {'cmd': ['sudo', 'cat', '/sys/class/net/eth0/operstate'], 'input': None, 'shell': False, 'timeout': 4, 'resp_queue': <queue.Queue object at 0x71c384076e50>, 'env': None, 'type': 0, 'timestamp': 13663939.608154405, 'seq': 1518, 'timed_out': 13663975.922738194, 'timed_log': 13664338.149291515}
2024-03-11T09:55:27.639Z SWN-PS-3Z-PB-T0-BASE06-vmedge01.ps.krw.pb NSX 425692 - [nsx@6876 comp="nsx-edge" subcomp="nsx-sha" username="nsx-sha" level="WARNING" s2comp="fork-monitor"] Received resp for a timeout req, waiting for 460.8056724201888 seconds: {'cmd': ['sudo', 'cat', '/sys/class/net/eth0/operstate'], 'input': None, 'shell': False, 'timeout': 4, 'resp_queue': <queue.Queue object at 0x71c384076e50>, 'env': None, 'type': 0, 'timestamp': 13663939.608154405, 'seq': 1518, 'timed_out': 13663975.922738194, 'timed_log': 13664398.510184744}, {'seq': 1518, 'type': 0, 'executor': 0, 'timestamp': 13663939.608435009, 'execute_time': 460.79502287879586, 'output': b'up\n', 'error': 'Request timeout when waiting for response'}
2024-03-11T09:55:27.640Z SWN-PS-3Z-PB-T0-BASE06-vmedge01.ps.krw.pb NSX 425692 - [nsx@6876 comp="nsx-edge" subcomp="nsx-sha" username="nsx-sha" level="WARNING"] Failed to run command: {'cmd': ['sudo', 'cat', '/sys/class/net/eth0/operstate'], 'input': None, 'shell': False, 'timeout': 4, 'resp_queue': <queue.Queue object at 0x71c384076e50>, 'env': None, 'type': 0, 'timestamp': 13663939.608154405, 'seq': 1518, 'timed_out': 13663975.922738194, 'timed_log': 13664398.510184744} with error Request timeout when waiting for response
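To find these entries without reading the whole log by eye, the excerpt above can be matched with a small script. The following is a minimal Python sketch, not a VMware-provided tool: the names `find_timeout_events`, `scan_syslog`, and `TimeoutEvent` are hypothetical, and the pattern assumes the message wording is exactly as it appears in the sample lines above.

```python
import re
from typing import Iterable, Iterator, NamedTuple

# Assumption: fork-monitor warnings use exactly the wording seen in the
# sample syslog lines of this article; other builds may differ.
PATTERN = re.compile(
    r'^(?P<ts>\S+) .*?subcomp="nsx-sha".*?\] '
    r"(?P<msg>Req timeout|Timeout req may hanging|Received resp for a timeout req), "
    r"waiting for (?P<wait>[\d.]+) seconds: \{'cmd': \[(?P<cmd>[^\]]*)\]"
)


class TimeoutEvent(NamedTuple):
    timestamp: str         # syslog timestamp of the warning
    message: str           # which fork-monitor warning was logged
    waited_seconds: float  # how long the request had been pending
    command: str           # the command nsx-sha was waiting on


def find_timeout_events(lines: Iterable[str]) -> Iterator[TimeoutEvent]:
    """Yield one TimeoutEvent per matching fork-monitor warning line."""
    for line in lines:
        m = PATTERN.match(line)
        if not m:
            continue
        # Turn "'sudo', 'cat', '/sys/...'" into "sudo cat /sys/..."
        cmd = " ".join(part.strip(" '") for part in m.group("cmd").split(","))
        yield TimeoutEvent(
            m.group("ts"), m.group("msg"), float(m.group("wait")), cmd
        )


def scan_syslog(path: str = "/var/log/syslog") -> list:
    """Collect all timeout events from a syslog file."""
    with open(path, errors="replace") as fh:
        return list(find_timeout_events(fh))
```

Calling `scan_syslog()` on the edge node returns the matching events; a wait time far above the command's 4-second timeout, as in the excerpt above, is consistent with this issue.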
Resolution:
This issue can be resolved by restarting the nsx-sha service on the affected edge node.
# service nsx-sha restart
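As a cross-check that the alarm is a false positive, the link state can be read from the same sysfs file that nsx-sha queries in the syslog excerpt (`/sys/class/net/eth0/operstate`). The following is a minimal Python sketch under that assumption; `read_operstate` and `link_reports_up` are hypothetical helper names, and the `sysfs_root` parameter exists only so the function can be pointed at a different directory.

```python
from pathlib import Path


def read_operstate(nic: str = "eth0", sysfs_root: str = "/sys/class/net") -> str:
    """Return the kernel-reported operational state of a NIC, e.g. 'up' or 'down'."""
    return (Path(sysfs_root) / nic / "operstate").read_text().strip()


def link_reports_up(nic: str = "eth0", sysfs_root: str = "/sys/class/net") -> bool:
    """True when the kernel reports the link as up.

    If this returns True while the NSX UI shows a link-down alarm for the
    same NIC, the alarm is likely the false positive described in this KB.
    """
    return read_operstate(nic, sysfs_root) == "up"
```

If `link_reports_up("eth0")` returns `True` on the edge node while the alarm is open, the link itself is healthy and the alarm can be treated as a monitoring artifact.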
Impact/Risks:
The alarm is displayed in the NSX UI, but there is no functional impact.