It is possible that during an upgrade the iptables of vCenter could be corrupted. The corruption could cause connectivity issues that require recovery of the iptables. The symptoms can mimic an erroneously created firewall rule in NSX if NSX is deployed in the environment. If NSX is not deployed, the symptoms can instead point to an external firewall or a host firewall.
vCenter
During an upgrade, the vCenter upgrade process fails to fully recover the iptables of the VCSA.
The resolution is to recover the iptables from a backup or from a known-good default iptables file.
To isolate the issue to the iptables of the VCSA, the following tests can be performed.
Ping Test:
1. Open an SSH session to the VCSA.
2. Enter the VCSA shell by entering "shell" at the prompt.
3. At the prompt ping the default gateway of the vCenter.
4. If the default gateway can be pinged successfully, attempt to ping another IP that is not in the same subnet as the VCSA.
A successful result proves that layers 1 through 3 are functional and that the issue exists at layer 4 or above, which is where firewall rules become the most likely cause.
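As an illustrative sketch, the test from the VCSA shell could look like the following; the addresses are placeholders and must be replaced with values from the environment.

    shell
    ping -c 4 <default-gateway-IP>          # reachability on the local subnet (layers 1-3)
    ping -c 4 <IP-in-a-different-subnet>    # reachability through the default gateway (routing)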
Packet Capture:
The packet capture will reveal how traffic is passing between the endpoints. The expectation is that there will be both ingress and egress network traffic.
The ARP request is the first step in a layer 2 connection. The ARP request asks "Who has IP #.#.100.100? Tell IP #.#.101.50".
To complete the exchange, an ARP reply is generated. The requester, IP #.#.101.50, will see in the reply "#.#.100.100 is at <MAC Address>".
At that point the ARP exchange has completed successfully.
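For reference, the exchange would appear roughly as follows in a tcpdump-style capture; the IPs and MAC address are the same placeholders used above.

    ARP, Request who-has #.#.100.100 tell #.#.101.50
    ARP, Reply #.#.100.100 is-at <MAC Address>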
If no ARP reply is generated, the IP in the request must be identified. Is that IP the problem host that is preventing the VCSA from being reached on its web UI?
If the ARP exchange has completed successfully, the analysis moves to the packets leaving and arriving at the source. There should be two specific IP addresses representing the two endpoints of interest.
The analysis looks at IP #.#.100.100 as the source and IP #.#.101.50 as the destination. This is unidirectional. If the traffic is expected to be bidirectional, then at some point IP #.#.101.50 should become the source and IP #.#.100.100 the destination; that would indicate a successful bidirectional connection. When bidirectional traffic is expected but the packets show the same source and destination IPs never changing places, this is a sign that the traffic is being blocked. The investigation then becomes focused on where the expected packet is stopped.
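As a hypothetical illustration of blocked traffic, a capture might show the source retransmitting toward the destination with nothing ever coming back; the addresses and ports below are placeholders only.

    #.#.100.100.52634 > #.#.101.50.443: Flags [S], length 0
    #.#.100.100.52634 > #.#.101.50.443: Flags [S], length 0
(The same SYN is retransmitted and no packet from #.#.101.50 ever appears.)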
This is an example of how the capture assists in the troubleshooting process of a connection issue.
The capture is done on the ESXi host that is supporting the VCSA virtual machine.
Find the switchport for the VCSA:
1. Open an SSH session to the ESXi host
2. Log in as root
3. At the prompt execute esxcli vm process list
4. In the output search for the VCSA and record the World ID (wid)
5. At the prompt execute esxcli network vm port list -w <wid>, where <wid> is the World ID recorded from the esxcli vm process list command.
6. Record the "Port ID" and the "Team Uplink" reported for the VCSA.
This information will be used for the subsequent packet capture command.
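A minimal sketch of steps 3 through 6 is shown below; the World ID and any values returned are hypothetical and will differ in every environment.

    esxcli vm process list                  # locate the VCSA entry and note its World ID, e.g. 123456
    esxcli network vm port list -w 123456   # note the Port ID and Team Uplink values in the output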
Capturing the VCSA Packets:
This command is executed on the same ESXi host and SSH session supporting the VCSA virtual machine.
1. Using the port information gathered execute pktcap-uw --switchport <switchport ID> --capture VnicTx,VnicRx --ng -o - | tcpdump-uw -enr -
2. Allow this to run for a few seconds and then stop it with Ctrl+C.
Observe the output for bidirectional traffic and note what traffic is flowing.
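The excerpt below is a hypothetical reconstruction of what healthy bidirectional traffic could look like in the tcpdump-uw output, using the same masked addresses as the rest of this article.

    #.#.100.100.52634 > #.#.101.50.443: Flags [S], length 0
    #.#.101.50.443 > #.#.100.100.52634: Flags [S.], length 0
    #.#.100.100.52634 > #.#.101.50.443: Flags [.], length 0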
The IP addresses have been partially masked; the illustration only shows the last two octets and the port number. If traffic is bidirectional for most packets and the VCSA default gateway is pingable, then this is evidence that layers 1 through 3 (switching and routing) are working normally. It is reasonable to suspect filtering (firewalls, IDPS, et al.) as the issue. A review of the NSX firewall rules, or of any other firewall, should be done next. Once it is confirmed that there are no firewall rules stopping the traffic, it becomes evident that the VCSA itself is the source of the issue. The VCSA is a Linux-based appliance and uses "iptables" for native firewalling. This is a guest OS issue at this point and outside of core networking or NSX. The compute SME should be engaged to assist with making any changes to the VCSA iptables.
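If a look at the VCSA's own rules is needed at this point, they can be listed from the VCSA shell; this is a read-only check and assumes root access to the shell.

    iptables -L -n -v     # list the active filter-table rules with packet counters
    iptables -S           # print the same rules in iptables-save syntax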
Restoring the VCSA iptables:
This method relies on one of two assumptions.
One, that a backup of the iptables was made prior to the event. This is not a default action of vCenter.
Two, that a copy of the iptables from another VCSA of the same version is available. This can be copied and placed in the root or /tmp directory of the problematic VCSA to be used as the restore file.
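A hedged example of staging the file from a healthy VCSA of the same version is shown below; the host name and file name are hypothetical, and copying files with scp may require the VCSA root default shell to be set to bash first.

    # On the healthy VCSA:
    iptables-save > /tmp/vcsa-iptables.fw
    # On the problematic VCSA:
    scp root@<healthy-vcsa>:/tmp/vcsa-iptables.fw /tmp/vcsa-iptables.fw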
Linux provides two commands for this purpose: iptables-save to create a backup and iptables-restore to restore from one.
Restore from a backup file:
1. Open an SSH session to the VCSA.
2. Open the shell to run CLI commands by entering "shell" at the prompt.
3. If a known backup of the iptables exists, run from the prompt: iptables-restore < <directory>/<filename.fw>
The change will take immediate effect.
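A minimal sketch, assuming the backup was staged in /tmp under the hypothetical name vcsa-iptables.fw:

    iptables-restore < /tmp/vcsa-iptables.fw
    iptables -L -n        # verify the restored rules are now active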
Create a backup file:
1. Open an SSH session to the VCSA.
2. Open the shell to run CLI commands by entering "shell" at the prompt.
3. From the prompt execute the following command: iptables-save > <directory>/<filename.fw>
This will create a backup file of the current iptables. It can be used as a recovery file for this VCSA or for another VCSA of the same version.
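For example, using the same hypothetical file name as above:

    iptables-save > /tmp/vcsa-iptables.fw
    ls -l /tmp/vcsa-iptables.fw     # confirm the backup file was written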
This method can help recover a VCSA with a corrupted iptables as long as you have access to the VCSA's shell and the ability to copy files into the problematic VCSA. There is also a method for editing the iptables directly, which is outside the scope of this article.