REDO log corruption is reported after restoring the virtual machine

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:

The virtual machine fails to power on after it is restored from a snapshot LUN or replica LUN.
A REDO log corruption message is reported in the hostd log and in the virtual machine log.
In the /var/log/hostd.log file, you see entries similar to:

yyyy-mm-ddThh:mm:ss.xxxZ [3F040B90 verbose 'Vmsvc.vm:/vmfs/volumes/XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXXXXXX/VM_NAME/VM_NAME.vmx'] Handling message _vmx2: The redo log of VM_NAME-000001.vmdk is corrupted. If the problem persists, discard the redo log.
In the vmware.log file, you see entries similar to:

yyyy-mm-ddThh:mm:ss.xxxZ| vcpu-0| I120: Msg_Question:
yyyy-mm-ddThh:mm:ss.xxxZ| vcpu-0| I120: [msg.hbacommon.corruptredo] The redo log of VM_NAME_2-000001.vmdk is corrupted. If the problem persists, discard the redo log.
yyyy-mm-ddThh:mm:ss.xxxZ| vcpu-0| I120: ----------------------------------------
yyyy-mm-ddThh:mm:ss.xxxZ| vcpu-0| I120: MsgQuestion: msg.hbacommon.corruptredo reply=0
yyyy-mm-ddThh:mm:ss.xxxZ| vcpu-0| I120: Exiting because of failed disk operation.
yyyy-mm-ddThh:mm:ss.xxxZ| vcpu-0| W110: A core file is available in "/vmfs/volumes/XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXXXXXX/VM_NAME/vmx-zdump.000"
yyyy-mm-ddThh:mm:ss.xxxZ| vcpu-0| W110: Writing monitor corefile "/vmfs/volumes/XXXXXXXX-XXXXXXXX-XXXX-XXXXXXXXXXXX/VM_NAME/vmmcores.gz"

Note: The preceding log excerpts are only examples. Date, time, and other values such as datastore paths and VM names vary depending on your environment.
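
As a rough illustration of how these entries might be detected, the following Python sketch searches log text for the corruption message shown in the excerpts above and returns the affected delta disk names. The function name find_corrupt_redo_disks is hypothetical, and the pattern is assumed from the example message text only, so the exact wording may differ in your environment.

```python
import re

# Assumption: the message wording matches the log excerpts in this article.
CORRUPT_REDO = re.compile(r"The redo log of (?P<disk>.+?\.vmdk) is corrupted")


def find_corrupt_redo_disks(log_text):
    """Return delta disk names reported as corrupted, in first-seen order."""
    disks = []
    for match in CORRUPT_REDO.finditer(log_text):
        disk = match.group("disk")
        if disk not in disks:  # the same disk is often reported more than once
            disks.append(disk)
    return disks


if __name__ == "__main__":
    sample = (
        "vcpu-0| I120: [msg.hbacommon.corruptredo] The redo log of "
        "VM_NAME-000001.vmdk is corrupted. If the problem persists, "
        "discard the redo log.\n"
    )
    print(find_corrupt_redo_disks(sample))
```

The same function can be applied to the contents of either hostd.log or vmware.log once the file has been read into a string.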

Environment

VMware ESX Server 8.0.x
VMware ESX Server 7.0.x
VMware ESX Server 6.0.x

Cause

This issue might occur when powering on a VM with vSphere snapshots in these scenarios:

The VMFS datastore on which the VM is hosted is a replica of a different VMFS datastore.
The VM is restored from a storage-based snapshot of a VMFS datastore or an NFS share before it is powered on.

The vSphere host keeps delta disk metadata, including the delta disk header, in memory. Updates to the delta disk header are made in memory as required and are written to disk only on certain events, such as snapshot consolidation or when the delta disk is closed.

Storage snapshot operations and storage replication are transparent to ESXi hosts. If the storage snapshot used to restore a VM was taken before the delta disk header changes were flushed to disk, the delta disk metadata on the restored VM is inconsistent. Similarly, a synchronous or asynchronous replica of a VMFS filesystem might not contain all header changes, because they might not have been flushed at the moment replication of the underlying LUN was stopped.

Note: The corrupt redo log message indicates only that the in-memory delta disk metadata was not in sync with the on-disk metadata when the storage snapshot was taken or when LUN replication was stopped.
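
The timing problem described in this section can be pictured with a small Python model. This is a toy sketch only: the DeltaDisk class, the generation field, and the storage_snapshot function are hypothetical names chosen for illustration and do not reflect the actual delta disk header format or any ESXi interface.

```python
class DeltaDisk:
    """Toy model of a delta disk whose header updates stay in memory until flushed."""

    def __init__(self):
        # Hypothetical single-field header; the real header layout is not modeled.
        self.in_memory_header = {"generation": 0}
        self.on_disk_header = {"generation": 0}

    def update_header(self):
        # Header changes are applied in memory only.
        self.in_memory_header["generation"] += 1

    def flush(self):
        # Stands in for events such as consolidation or closing the delta disk.
        self.on_disk_header = dict(self.in_memory_header)


def storage_snapshot(disk):
    """The array or filer copies only what is on disk at that moment."""
    return dict(disk.on_disk_header)


disk = DeltaDisk()
disk.update_header()

stale_copy = storage_snapshot(disk)       # taken before the flush
disk.flush()
consistent_copy = storage_snapshot(disk)  # taken after the flush

print(stale_copy == disk.in_memory_header)       # header in the copy is out of date
print(consistent_copy == disk.in_memory_header)  # header in the copy is current
```

A VM restored from the first copy would see a header that does not match the state the host held in memory, which corresponds to the inconsistency described above.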

Resolution

To avoid this issue, follow these best practices:

Ensure that the virtual machines are not running on snapshots when a storage snapshot is taken.
Perform storage array or filer snapshots at times when virtual machine snapshots are less likely to be taken.
Restore virtual machines from snapshot LUNs that were taken when the virtual machines were either powered off or not running on snapshots.
Minimize the frequency of storage array or filer snapshots to reduce overlap with manual or backup-initiated VM snapshots.
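
As one possible way to apply the first practice, the following Python sketch checks a VM directory listing for delta disk descriptors before a storage snapshot is scheduled. It assumes that delta disks follow the VM_NAME-000001.vmdk naming seen in the log excerpts; the helper names are hypothetical, and the check is a heuristic rather than an authoritative test of snapshot state.

```python
import re

# Assumption: delta disk descriptors end in a dash, six digits, and ".vmdk",
# as in the VM_NAME-000001.vmdk example from the log excerpts.
DELTA_DESCRIPTOR = re.compile(r"-\d{6}\.vmdk$")


def delta_disks(filenames):
    """Return the file names that look like delta disk descriptors."""
    return sorted(name for name in filenames if DELTA_DESCRIPTOR.search(name))


def safe_for_storage_snapshot(filenames):
    """True when no delta disk descriptors are present in the listing."""
    return not delta_disks(filenames)


if __name__ == "__main__":
    listing = ["VM_NAME.vmx", "VM_NAME.vmdk", "VM_NAME-000001.vmdk"]
    print(delta_disks(listing))
    print(safe_for_storage_snapshot(listing))
```

The listing could come from os.listdir on the VM directory. A result of False suggests that the VM is running on a snapshot and that the storage snapshot should be deferred.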

Additional Information

Even when the storage system can take a snapshot of an individual VM, only crash consistency with respect to concurrent I/Os is guaranteed.
This means that all in-flight I/Os on the array or filer are allowed to complete before the storage snapshot is taken. Crash consistency does not cover:

Virtual machine quiescing (either filesystem or application consistent).
In-memory state of the delta disks in vSphere (as explained above).
