NSX managers became unreachable after fixing internal certificate expiry alarms

Article ID: 412313

Updated On:

Products

VMware NSX

Issue/Introduction

  • The internal certificate expiry alarms were cleared by running the CARR script, as described in KB 369034, on all three NSX managers.
  • After the script ran, the Host and Edge Transport Nodes showed as disconnected from the managers.
  • The NSX managers were rebooted to re-establish the connection, but after the reboot they became inaccessible because the eth0 interface had lost its IP address.

Environment

4.1.0.2.0.21761691

Cause

After an NSX manager reboots, the IP address on the eth0 interface is removed because the "sw_integrity_checker" check fails. The failure can be left over from earlier failed upgrades, if any occurred.

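Check the status of networking.service in the file "/system/systemctl_--all_--no-pager_status". The "check_failure ... Disabling system network" entry confirms the issue: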

* networking.service - Raise network interfaces
     Loaded: loaded (/lib/systemd/system/networking.service; enabled; vendor preset: enabled)
     Active: active (exited) since Tue xxxx-xx-## ##:##:## UTC; xh xxmin ago
       Docs: man:interfaces(5)
    Process: 754 ExecStartPre=/bin/sh -c [ "$CONFIGURE_INTERFACES" != "no" ] && [ -n "$(ifquery --read-environment --list --exclude=lo)" ] && udevadm settle (code=exited, status=0/SUCCESS)
    Process: 757 ExecStartPre=/opt/vmware/nsx-node-api/bin/set_params.sh (code=exited, status=0/SUCCESS)
    Process: 758 ExecStartPre=/opt/vmware/sw-integrity-checker/verify_ifl (code=exited, status=0/SUCCESS)
    Process: 46488 ExecStart=/sbin/ifup -a --read-environment --exclude=eth0:1 (code=exited, status=0/SUCCESS)
    Process: 46556 ExecStartPost=/opt/vmware/sw-integrity-checker/check_failure (code=exited, status=0/SUCCESS)
   Main PID: 46488 (code=exited, status=0/SUCCESS)

<month> ## ##:##:## <NSXMgr> systemd[1]: networking.service: Found left-over process 46483 (verify_ifl) in control group while starting unit. Ignoring.
<month> ## ##:##:## <NSXMgr> systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
<month> ## ##:##:## <NSXMgr> systemd[1]: networking.service: Found left-over process 46485 (verify_ifl) in control group while starting unit. Ignoring.
<month> ## ##:##:## <NSXMgr> systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
<month> ## ##:##:## <NSXMgr> systemd[1]: networking.service: Found left-over process 46486 (openssl) in control group while starting unit. Ignoring.
<month> ## ##:##:## <NSXMgr> systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
<month> ## ##:##:## <NSXMgr> systemd[1]: networking.service: Found left-over process 46487 (cut) in control group while starting unit. Ignoring.
<month> ## ##:##:## <NSXMgr> systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
<month> ## ##:##:## <NSXMgr> check_failure[46556]: Disabling system network
<month> ## ##:##:## <NSXMgr> systemd[1]: Finished Raise network interfaces.
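
The check for the log signature above can be sketched as a small shell helper. This is a minimal sketch, not an official tool: the function name integrity_check_disabled_network is hypothetical, and it assumes the status file is available as plain text at a path you pass in.

```shell
# Hypothetical helper (not part of NSX): returns 0 if the given saved
# systemctl status file contains the check_failure entry that disables
# the system network, and non-zero otherwise.
integrity_check_disabled_network() {
    status_file="$1"
    # Match the check_failure log line shown in the excerpt above.
    grep -q 'check_failure.*Disabling system network' "$status_file"
}

# Example use (the path is the one named in this article):
#   if integrity_check_disabled_network "/system/systemctl_--all_--no-pager_status"; then
#       echo "Integrity check failure disabled the network"
#   fi
```

The helper only reports whether the signature is present; it does not change anything on the appliance.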

Resolution

This issue is fixed starting with NSX releases 4.2.4 and 9.0.2.

 

Additional Information

As a workaround, perform the following steps on each affected NSX manager, then reboot the appliance:

  • Check whether the files .sw_integrity_check_failed and .sw_integrity_check exist in the /image folder.
  • If they exist, delete them and restart the networking service with 'systemctl restart networking.service'.

If the IP address stays configured after the networking service is restarted from the console, reboot the appliance and confirm that the IP address persists after the reboot.
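
The workaround steps can be sketched in shell as follows. This is a minimal sketch, assuming root access on the appliance console; the function name remove_integrity_flags and its optional directory argument are hypothetical conveniences, while the file names, the /image folder, and the systemctl command come from the steps above. The service restart is shown as a comment so the function can be reviewed before anything is run.

```shell
# Hypothetical helper (not part of NSX): deletes the integrity-check flag
# files if present. The directory defaults to /image, as in the steps above.
# Returns 0 if at least one file was removed, non-zero if none were found.
remove_integrity_flags() {
    image_dir="${1:-/image}"
    removed=0
    for f in .sw_integrity_check_failed .sw_integrity_check; do
        if [ -e "$image_dir/$f" ]; then
            rm -f -- "$image_dir/$f"
            echo "removed $image_dir/$f"
            removed=1
        fi
    done
    [ "$removed" -eq 1 ]
}

# On the appliance console, as root:
#   remove_integrity_flags && systemctl restart networking.service
#   ip addr show eth0    # confirm that eth0 still has its IP address
```

If eth0 keeps its address after the restart, reboot the appliance as described above to confirm that the address persists.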