vSAN Disk Group Offline With Message "Trying to format a valid metadata block"


Article ID: 326806


Products

VMware vSAN

Issue/Introduction

This article advises that this issue can occur and directs you to contact VMware support for assistance with resolving it.


Symptoms:
A vSAN disk group is taken offline, and the vmkernel log contains messages similar to the examples below (specific dates, times, and UUIDs will differ in your environment; a sketch for extracting the affected UUIDs from a copied log follows the examples):

Example 1:
2020-05-21T15:22:38.514Z cpu1:1000341425)WARNING: PLOG: DDPCacheIOCb:686: Trying to format a valid metadata block, UUID 52fcff55-6866-a3d0-d0d5-ba4e3c1d9362, type 4, pbn 4398046515647
2020-05-21T15:22:38.514Z cpu0:1000214054)WARNING: PLOG: DDPCompleteDDPWrite:6455: Throttled: DDP write failed Invalid metadata callback [email protected]#0.0.0.1, diskgroup 5287714a-e5a0-d986-1f12-e0c960878e53 txnScopeIdx 0
2020-05-21T15:22:38.514Z cpu0:1000214054)PLOG: DDPCompleteDDPWrite:6469: Throttled: (DDPWrite): Curr: completeTask, Prev: updateHashmap, Status: Success
2020-05-21T15:22:38.514Z cpu0:1000214054)WARNING: PLOG: PLOGDDPWriteCbFn:655: DDP write failed on device 52fcff55-6866-a3d0-d0d5-ba4e3c1d9362:Invalid metadata (ssdPerm: no)elevIo 0, doDdpCommit yes
2020-05-21T15:22:38.514Z cpu1:1000213133)WARNING: PLOG: PLOGPropagateError:4232: DDP: Propagating error state from original device 52fcff55-6866-a3d0-d0d5-ba4e3c1d9362
2020-05-21T15:22:38.514Z cpu1:1000213133)WARNING: PLOG: PLOGPropagateError:4284: DDP: Propagating error state to MDs in device 5287714a-e5a0-d986-1f12-e0c960878e53
2020-05-21T15:22:38.514Z cpu1:1000213133)PLOG: PLOG_FindAndUpdateDevTelemetryStat:1058: Setting devResState : dev: mpx.vmhba0:C0:T4:L0 cState: 0 nState: 6 isLSE: 0
2020-05-21T15:22:38.514Z cpu1:1000213133)WARNING: PLOG: PLOGPropagateErrorInt:4172: Permanent error event on 52fcff55-6866-a3d0-d0d5-ba4e3c1d9362
2020-05-21T15:22:38.514Z cpu1:1000213133)PLOG: PLOG_FindAndUpdateDevTelemetryStat:1058: Setting devResState : dev: mpx.vmhba0:C0:T3:L0 cState: 7 nState: 7 isLSE: 0
2020-05-21T15:22:38.514Z cpu1:1000213133)WARNING: PLOG: PLOGPropagateErrorInt:4188: Error/unhealthy propagate event on 52934040-4111-9e8c-4d12-ad0f5635b3d6
2020-05-21T15:22:38.514Z cpu1:1000213133)PLOG: PLOG_FindAndUpdateDevTelemetryStat:1058: Setting devResState : dev: mpx.vmhba0:C0:T6:L0 cState: 7 nState: 7 isLSE: 0
2020-05-21T15:22:38.514Z cpu1:1000213133)WARNING: PLOG: PLOGPropagateErrorInt:4188: Error/unhealthy propagate event on 5287714a-e5a0-d986-1f12-e0c960878e53

Example 2:
2020-05-21T16:36:22.055Z cpu0:1000341426)WARNING: PLOG: DDPCacheIOCb:686: Trying to format a valid metadata block, UUID 528006c4-3f71-81c4-ae10-0ae7d661bba0, type 3, pbn 3298534904346
2020-05-21T16:36:22.055Z cpu1:1000214313)WARNING: PLOG: DDPCompleteDDPWrite:6455: Throttled: DDP write failed Invalid metadata callback [email protected]#0.0.0.1, diskgroup 52379c29-607b-e423-f700-dc4386d74c6a txnScopeIdx 0
2020-05-21T16:36:22.055Z cpu1:1000214313)PLOG: DDPCompleteDDPWrite:6469: Throttled: (DDPWrite): Curr: completeTask, Prev: addNewHash, Status: Success
2020-05-21T16:36:22.055Z cpu1:1000214313)WARNING: PLOG: PLOGDDPWriteCbFn:655: DDP write failed on device 528006c4-3f71-81c4-ae10-0ae7d661bba0:Invalid metadata (ssdPerm: no)elevIo 0, doDdpCommit yes
2020-05-21T16:36:22.058Z cpu0:1000214307)PLOG: PLOGElevHandleFailure:2325: Waiting till we process failure ... dev 528006c4-3f71-81c4-ae10-0ae7d661bba0
2020-05-21T16:36:22.061Z cpu0:1000213234)WARNING: PLOG: PLOGPropagateError:4232: DDP: Propagating error state from original device 528006c4-3f71-81c4-ae10-0ae7d661bba0
2020-05-21T16:36:22.061Z cpu0:1000213234)WARNING: PLOG: PLOGPropagateError:4284: DDP: Propagating error state to MDs in device 52379c29-607b-e423-f700-dc4386d74c6a
2020-05-21T16:36:22.061Z cpu0:1000213234)PLOG: PLOG_FindAndUpdateDevTelemetryStat:1058: Setting devResState : dev: mpx.vmhba0:C0:T4:L0 cState: 0 nState: 6 isLSE: 0
2020-05-21T16:36:22.061Z cpu0:1000213234)WARNING: PLOG: PLOGPropagateErrorInt:4172: Permanent error event on 528006c4-3f71-81c4-ae10-0ae7d661bba0
2020-05-21T16:36:22.061Z cpu0:1000213234)PLOG: PLOG_FindAndUpdateDevTelemetryStat:1058: Setting devResState : dev: mpx.vmhba0:C0:T3:L0 cState: 7 nState: 7 isLSE: 0
2020-05-21T16:36:22.061Z cpu0:1000213234)WARNING: PLOG: PLOGPropagateErrorInt:4188: Error/unhealthy propagate event on 52c30f7b-abfb-3bf2-2bb1-6ed690e7d4f3
2020-05-21T16:36:22.061Z cpu0:1000213234)PLOG: PLOG_FindAndUpdateDevTelemetryStat:1058: Setting devResState : dev: mpx.vmhba0:C0:T6:L0 cState: 7 nState: 7 isLSE: 0
2020-05-21T16:36:22.061Z cpu0:1000213234)WARNING: PLOG: PLOGPropagateErrorInt:4188: Error/unhealthy propagate event on 52379c29-607b-e423-f700-dc4386d74c6a
2020-05-21T16:36:25.915Z cpu0:1000214307)PLOG: PLOGRelogBase:226: RELOG: relogTask exit requested
2020-05-21T16:36:25.915Z cpu0:1000214307)PLOG: PLOGRelogExit:605: RELOG task exiting UUID 52379c29-607b-e423-f700-dc4386d74c6a Success

Example 3:
2020-05-21T16:56:00.941Z cpu1:1000341426)WARNING: PLOG: DDPCacheIOCb:686: Trying to format a valid metadata block, UUID 521d473e-2bd4-d796-b250-0587bd83fae9, type 5, pbn 5497558160057
2020-05-21T16:56:00.941Z cpu0:1000213922)WARNING: PLOG: DDPCompleteDDPWrite:6455: Throttled: DDP write failed Invalid metadata callback [email protected]#0.0.0.1, diskgroup 5247de40-f42b-a0e3-a310-b4e7a2f5cbee txnScopeIdx 0
2020-05-21T16:56:00.941Z cpu0:1000213922)PLOG: DDPCompleteDDPWrite:6469: Throttled: (DDPWrite): Curr: completeTask, Prev: readXmap, Status: Success
2020-05-21T16:56:00.941Z cpu0:1000213922)WARNING: PLOG: PLOGDDPWriteCbFn:655: DDP write failed on device 521d473e-2bd4-d796-b250-0587bd83fae9:Invalid metadata (ssdPerm: no)elevIo 0, doDdpCommit yes
2020-05-21T16:56:00.941Z cpu0:1000213916)PLOG: PLOGElevHandleFailure:2325: Waiting till we process failure ... dev 521d473e-2bd4-d796-b250-0587bd83fae9
2020-05-21T16:56:00.941Z cpu0:1000213152)WARNING: PLOG: PLOGPropagateError:4232: DDP: Propagating error state from original device 521d473e-2bd4-d796-b250-0587bd83fae9
2020-05-21T16:56:00.941Z cpu0:1000213152)WARNING: PLOG: PLOGPropagateError:4284: DDP: Propagating error state to MDs in device 5247de40-f42b-a0e3-a310-b4e7a2f5cbee
2020-05-21T16:56:00.941Z cpu0:1000213152)PLOG: PLOG_FindAndUpdateDevTelemetryStat:1058: Setting devResState : dev: mpx.vmhba0:C0:T4:L0 cState: 0 nState: 6 isLSE: 0
2020-05-21T16:56:00.943Z cpu0:1000213152)WARNING: PLOG: PLOGPropagateErrorInt:4172: Permanent error event on 521d473e-2bd4-d796-b250-0587bd83fae9
2020-05-21T16:56:00.943Z cpu0:1000213152)PLOG: PLOG_FindAndUpdateDevTelemetryStat:1058: Setting devResState : dev: mpx.vmhba0:C0:T3:L0 cState: 7 nState: 7 isLSE: 0
2020-05-21T16:56:00.943Z cpu0:1000213152)WARNING: PLOG: PLOGPropagateErrorInt:4188: Error/unhealthy propagate event on 52cd91e1-e659-8d2c-f431-4e1e923217d0
2020-05-21T16:56:00.943Z cpu0:1000213152)PLOG: PLOG_FindAndUpdateDevTelemetryStat:1058: Setting devResState : dev: mpx.vmhba0:C0:T6:L0 cState: 7 nState: 7 isLSE: 0
2020-05-21T16:56:00.944Z cpu0:1000213152)WARNING: PLOG: PLOGPropagateErrorInt:4188: Error/unhealthy propagate event on 5247de40-f42b-a0e3-a310-b4e7a2f5cbee
2020-05-21T16:56:04.066Z cpu0:1000213916)PLOG: PLOGRelogBase:226: RELOG: relogTask exit requested
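
To help confirm whether a host has logged these messages and to collect the UUIDs for a support request, the following minimal Python sketch (not part of the original article) scans a vmkernel log that has been copied off the ESXi host, for example from a support bundle. The default file name vmkernel.log and the message patterns are assumptions based on the examples above; adjust the path and patterns for your environment.

#!/usr/bin/env python3
# Minimal sketch (assumption: the vmkernel log has been copied off the host
# to a local file). It looks for the PLOG messages shown in the examples
# above and prints the device and disk group UUIDs they reference.
import re
import sys

# Patterns derived from the example log lines in this article.
FORMAT_MSG = re.compile(
    r"DDPCacheIOCb:\d+: Trying to format a valid metadata block, "
    r"UUID (?P<uuid>[0-9a-f-]+), type (?P<mtype>\d+), pbn (?P<pbn>\d+)")
PROPAGATE_MSG = re.compile(
    r"PLOGPropagateError:\d+: DDP: Propagating error state to MDs in device "
    r"(?P<diskgroup>[0-9a-f-]+)")

def scan(path):
    devices, diskgroups = set(), set()
    with open(path, "r", errors="replace") as log:
        for line in log:
            match = FORMAT_MSG.search(line)
            if match:
                devices.add(match.group("uuid"))
            match = PROPAGATE_MSG.search(line)
            if match:
                diskgroups.add(match.group("diskgroup"))
    return devices, diskgroups

if __name__ == "__main__":
    # Default file name is an assumption; pass the real path as an argument.
    path = sys.argv[1] if len(sys.argv) > 1 else "vmkernel.log"
    devices, diskgroups = scan(path)
    print("Devices logging 'Trying to format a valid metadata block':")
    for uuid in sorted(devices):
        print("  " + uuid)
    print("Disk groups receiving propagated error state:")
    for uuid in sorted(diskgroups):
        print("  " + uuid)

The UUIDs reported this way can be provided to VMware support along with the log bundle when opening a case for this issue.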

Environment

VMware vSAN 6.7.x
VMware vSAN 7.0.x

Cause

Taking the disk group offline from vSAN use is intentional behavior, introduced to avoid potential data corruption in scenarios where certain metadata blocks are bad or in an inconsistent state.

History:
Prior to the vSAN 6.7 release, vSAN would re-initialize the block as a bitmap block, discarding any previous allocations in that block and thus potentially allowing random corruption of user data at a later stage. With the vSAN 6.7 release, a PSOD (purple screen panic) was introduced to avoid this corruption potential. See KB 80703 for details on the PSOD.

Removing the disk group from use was introduced in vSAN 6.7 P05 and 7.0 Update 1 as an alternative behavior that avoids the PSOD.

Resolution

Please work with VMware and your hardware vendor to determine the underlying cause of the inconsistent metadata.

Workaround:
Please contact VMware support to work around this issue and restore the disk group to use.

Additional Information

Impact/Risks:
If a Failures to Tolerate (FTT) of 0 storage policy is in use, if data is already in a reduced-redundancy state, or if multiple such events occur before data can be resynchronized or rebuilt, this can result in data unavailability or data loss.