Logical data corruption in database virtual machines caused by SAN outage

Article ID: 430902


Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:

  • VMs running database applications (e.g., PostgreSQL) report logical data inconsistencies.
  • The underlying ESXi hosts report SCSI I/O timeouts, retries, and path failures on specific Host Bus Adapters (HBAs) in /var/log/vmkernel.log, as shown in the excerpt below.
  • Crucially, the VMW_SATP_ALUA plugin reports the path state as (UP) while the same path is simultaneously experiencing timeouts.

2026-02-08T09:28:56.835Z In(182) vmkernel: cpu135:3375713)VMW_SATP_ALUA: satp_alua_issueCommandOnPath:1005: Path "vmhba5:C0:T4:L###" (UP) command 0x12 failed with status Timeout. H:0x3 D:0x0 P:0x0 .
2026-02-08T09:28:56.835Z Wa(180) vmkwarning: cpu135:3375713)WARNING: VMW_SATP_ALUA: satp_alua_getTargetPortInfo:190: Could not get page 83 INQUIRY data for path "vmhba5:C0:T4:L###" - Timeout (195887137)
2026-02-08T09:28:57.094Z In(182) vmkernel: cpu73:2097444)ScsiDeviceIO: 4633: Cmd(...) 0x88, CmdSN 0xb7 from world 2130992 to dev "naa.62<REDACTED>8b3" failed H:0xc D:0x0 P:0x0
2026-02-08T09:29:06.630Z Wa(180) vmkwarning: cpu45:2097962)WARNING: VMW_SATP_ALUA: satp_alua_getTargetPortInfo:190: Could not get page 83 INQUIRY data for path "vmhba5:C0:T7:L###" - No connection (195887168)
2026-02-08T10:04:51.060Z In(182) vmkernel: cpu65:2097450)ScsiDeviceIO: 4670: Cmd(...) 0x28, cmdId.initiator=... CmdSN 0x53867b from world 0 to dev "naa.62<REDACTED>8b3" failed H:0x5 D:0x0 P:0x0 Cancelled from device layer. 
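
For quick triage, the short Python sketch below scans a saved copy of vmkernel.log for the contradictory signature shown above: VMW_SATP_ALUA lines in which a path is reported as (UP) yet the command fails with Timeout. This is an illustrative helper only (the local log file name is an assumption), not a supported VMware tool.

import re

# Assumed local copy of the host log; adjust the path as needed.
LOG_FILE = "vmkernel.log"

# Matches the signature above: a path listed as (UP) in the same
# VMW_SATP_ALUA line that records a command failing with Timeout.
PATTERN = re.compile(r'VMW_SATP_ALUA: .*Path "(?P<path>[^"]+)" \(UP\).*Timeout')

with open(LOG_FILE, encoding="utf-8", errors="replace") as log:
    for line in log:
        match = PATTERN.search(line)
        if match:
            print(f"UP-but-timing-out path: {match.group('path')}")
            print(f"  {line.strip()}")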

 

Environment

ESXi 8.x
ESX 9.x 

Cause

  • This issue is not caused by corruption at the VMFS, guest OS filesystem, or storage array LUN level.
  • It is a logical data mismatch caused by a physical hardware failure in the SAN fabric: specifically, a failing Supervisor (SUP) module on a Cisco SAN switch.
  • The switch enters a "limbo" state in which the physical optical link to the ESXi host remains active (UP), but the switch fabric silently drops the SCSI frames.

Resolution

The ESXi host and NMP operated as designed: ESXi cannot proactively bypass a path while the physical link remains active and the fabric returns no immediate error codes indicating that the path is down. Remediation therefore lies in the SAN fabric itself, in this case replacing the failing Supervisor (SUP) module on the Cisco SAN switch.

 

Additional Information

DB I/O lifecycle - an example workflow for better understanding.

  • Healthy I/O flow: The Write

    1. The Database Layer (PostgreSQL fsync)

       When a user executes a COMMIT command, PostgreSQL must guarantee that the data is safe. It takes the transaction data from its in-memory WAL buffers and issues a system call (usually fsync() or fdatasync()) to force the Write-Ahead Log (WAL) to permanent storage. (A minimal sketch of this pattern follows the note after this workflow.)

    2. The Guest OS & Virtual Hardware (PVSCSI)

       The Linux operating system inside the VM takes that fsync() request, translates it into block-level SCSI commands, and hands it to the virtual storage controller, in this case the VMware Paravirtual SCSI (PVSCSI) adapter.

    3. The Hypervisor Layer (ESXi & vmhba5)

       The ESXi kernel intercepts the SCSI command from the VM. The host's multipathing software (NMP) consults its path table, sees that vmhba5 is the active, healthy path, and pushes the Fibre Channel frames out of that physical host bus adapter port.

    4. The SAN Fabric Layer (Cisco SAN Switch)

       The light pulses travel over the fiber cable and enter the Cisco SAN switch. Because the XBAR (Crossbar Fabric) is healthy, the switch instantly processes the routing header and forwards the frames out of the correct port toward the storage array without dropping a single frame.

    5. The Storage Array Layer (EMC Cache)

       The EMC storage array receives the frames. Enterprise arrays like EMC do not write directly to the spinning disks or flash cells immediately. Instead, the write lands in the array's highly redundant, battery-backed write cache (NVRAM).

    The Acknowledgement

       As soon as that data hits the EMC cache, it is considered durable. Now the system races back up the stack to tell PostgreSQL the good news.

    6. The "SCSI Good Ack"

       The EMC array instantly generates a SCSI Good acknowledgement and sends it back up the wire. It travels back through the healthy Cisco SAN switch, hits vmhba5 on the ESXi host, and is passed through the hypervisor back to the PVSCSI controller.

    7. The OS Confirmation

       The Linux OS receives the hardware acknowledgement and tells the PostgreSQL process that the fsync() request is complete.

    8. Transaction Committed

       PostgreSQL officially marks the transaction record as COMMITTED in its internal state. It then sends a success message back to the end user or application that initiated the query.

Note: In a healthy state, this entire round-trip (from Step 1 to Step 8) happens in sub-millisecond times (often less than 0.5ms on modern All-Flash arrays). PostgreSQL's entire data integrity model relies on the assumption that if it receives that "SCSI Good Ack" from the OS, the data is permanently safe on disk.
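
As an illustration of the durability contract in Step 1 and Step 8, here is a minimal Python sketch of the same pattern (the file name and record content are hypothetical; PostgreSQL implements this logic internally in C):

import os

def append_and_commit(wal_path: str, record: bytes) -> None:
    # Append a record to a hypothetical WAL file, then force it to
    # permanent storage. Success is only reported after fsync() returns,
    # i.e., after the storage stack has acknowledged the write (the
    # "SCSI Good Ack" in the workflow above). If the fabric black-holes
    # the I/O, this is the call that hangs.
    fd = os.open(wal_path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o600)
    try:
        os.write(fd, record)
        os.fsync(fd)
    finally:
        os.close(fd)

append_and_commit("wal_segment.log", b"commit record for txid 42\n")
print("COMMIT acknowledged: record is durable")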

  • Impacted I/O flow

    • Fluctuating Link
      • When a Cisco MDS switch enters this specific type of "black hole" state, it becomes the worst-case scenario for any database.

      • Because the line card lasers are still emitting light, the ESXi host is completely blind to the fact that the switch's brain (the SUP) has died.

      • Here is the exact flow of what happens to the PostgreSQL database:

        • I/O Drop
          • PostgreSQL issues the fsync() command to write the Write-Ahead Log (WAL) to disk.
          • The Linux OS hands the command to the ESXi host.
          • Because the physical link to the Cisco switch still shows "UP", ESXi's Native Multipathing Plugin (NMP) sends the SCSI write frames down vmhba5.
          • The frames hit the Cisco switch and vanish into the black hole. No SCSI Good acknowledgement is ever generated.
        • Database Timeout
          • ESXi does not immediately fail over because the link is still physically up. It simply waits for an acknowledgement that will never come.
          • Inside the VM, the Linux OS waits.
          • PostgreSQL hits its internal or application-level statement_timeout (see the sketch after this list).
          • The database assumes the transaction failed, aborts the write in its memory, and rolls the transaction back.
        • Path Failover
          • Eventually, the ESXi host determines that the I/O has been stuck for too long and triggers its "Dead Path, Dead I/O" (DPDO) sequence.
          • ESXi sends a TEST_UNIT_READY (TUR) probe down the path. Because the switch is black-holing frames, the probe fails.
          • Only now does ESXi officially mark vmhba5 as DEAD and fail over to the healthy vmhba2.
        • Result
          • This delayed failover creates the exact window for logical inconsistency.
          • If PostgreSQL rolled the transaction back after its timeout, but the delayed SAN failover later allowed the trapped write to reach the EMC array via the secondary path, the storage array persists data that the database believes was aborted.
          • When this happens, PostgreSQL's built-in integrity checks will often detect the torn page or mismatched WAL sequence on the next read and can deliberately halt the database to prevent further silent corruption.
        • Summary:
          • With a faulty SUP module, the switch silently drops frames while keeping the optical link "UP", and the database suffers the timeouts and rollback risks described above.
          • As the database support team explains, this is not physical corruption but a logical mismatch between the data on storage and the database's view of that data.
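
To make the timeout-and-rollback step concrete, here is a minimal Python sketch using the psycopg2 driver (the connection string, table, and 30-second value are assumptions for illustration): when statement_timeout expires, PostgreSQL cancels the statement and the application rolls back, yet the original write may still be in flight in the fabric, which is exactly the mismatch window described above.

import psycopg2

# Assumed connection details; adjust for the actual environment.
conn = psycopg2.connect("dbname=appdb user=app host=db.example.com")

try:
    with conn.cursor() as cur:
        # Cancel any statement stuck longer than 30 seconds, e.g. one
        # whose underlying I/O is black-holed in the SAN fabric.
        cur.execute("SET statement_timeout = '30s'")
        cur.execute("INSERT INTO orders (id, total) VALUES (%s, %s)", (1, 9.99))
    conn.commit()
except psycopg2.errors.QueryCanceled:
    # statement_timeout fired: the database aborts the statement and the
    # application rolls back, while the trapped SCSI write may still reach
    # the array later via the failover path.
    conn.rollback()
finally:
    conn.close()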