
Host goes into a not responding state due to storage connectivity issues. Error: Failed to reserve space for journal


Article ID: 389071


Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:

  • None of the commands executed from the ESXi host complete while the affected LUN is presented to the host
  • The host frequently goes into a not responding state

Validation Steps:

The below errors are seen in vmkernel.log.

Log path: /var/run/log/vmkernel.log


    [YYYY-MM-DDTHH:MM:SS Z] Wa(180) vmkwarning: cpu73:2099594)WARNING: Fil3: 1638: Failed to reserve volume f532 28 1  667e10d2 6a7f0159c7f4410eb45034aa      0 0 0 0 0 0 0
    [YYYY-MM-DDTHH:MM:SS Z] Wa(180) vmkwarning: cpu73:2099594)WARNING: FS3J: 1811: Failed to reserve space for journal on 667e10d2-6a7f0159-410e-###### :  Timeout
    [YYYY-MM-DDTHH:MM:SS Z] Wa(180) vmkwarning: cpu73:2099594)WARNING: Fil3: 1638: Failed to reserve volume f532 28 1 667e10d2 6a7f0159 ######### 0 0 0 0 0      0 0
    [YYYY-MM-DDTHH:MM:SS Z] Wa(180) vmkwarning: cpu62:2097342)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:235: NMP device "naa.600##################" state in doubt; requested fast path state update...                 
    [YYYY-MM-DDTHH:MM:SS Z] In(182) vmkernel: cpu2:2098461)ScsiDeviceIO: 4672: Cmd(0x45ba02754280) 0x28, CmdSN 0x56 from world 2099547 to dev naa.600#########" failed  H:0x7 D:0x0 P:0x0                     
    [YYYY-MM-DDTHH:MM:SS Z] In(182) vmkernel: cpu0:2097325)ScsiDeviceIO: 4672: Cmd(0x45ba02697880) 0x28, CmdSN 0x2f from world 2099547 to dev"naa.600#############"failed H:0x7 D:0x0 P:0x0                
    [YYYY-MM-DDTHH:MM:SS Z] In(182) vmkernel: cpu58:2098178)NMP: nmp_ResetDeviceLogThrottling:3854: last error status from device naa.600########## repeated 1 times
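
To quickly confirm whether a host is hitting these warnings, the messages above can be searched for directly in the vmkernel log, and the volume UUID from the warning can be mapped back to its datastore and backing device. A minimal sketch, assuming the default log location and using a placeholder for the UUID:

    # Search the live vmkernel log for journal reservation failures and
    # devices reported as "state in doubt".
    grep -iE "Failed to reserve space for journal|state in doubt" /var/run/log/vmkernel.log

    # Map the volume UUID from the warning to its datastore and backing device
    # (replace <volume-UUID> with the UUID reported in the log).
    esxcli storage vmfs extent list | grep <volume-UUID>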

Environment

VMware vSphere 8.x

VMware vSphere 7.x

Cause

Journal block leaks occur on a VMFS filesystem when there are storage connectivity problems while the volume is being opened or closed. When the host tries to open the volume, it fails to reserve the journal space.

The journal block is a crucial part of the transactional consistency mechanism in VMFS. When a volume is accessed or modified, VMFS uses journal blocks to maintain consistency in case of crashes or power loss. If there are issues during the opening or closing of a volume (such as storage device disconnections or delays), VMFS may fail to reserve the journal blocks necessary to ensure data consistency.

During the volume mount or dismount process, VMFS attempts to reserve journal space to ensure that metadata changes are properly recorded and can be recovered. If storage connectivity is disrupted or the system cannot access the underlying disk blocks for the journal reservation, the journal space cannot be properly allocated, leading to "leaks" or untracked space.
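
As a basic sanity check of this behaviour from the host, the mount state of the affected volume and its filesystem attributes can be queried directly; failures or long delays here usually point to the same connectivity problem rather than to the filesystem itself. A minimal sketch, assuming the affected datastore name is known (placeholder below):

    # List mounted filesystems and confirm whether the affected VMFS volume
    # is still mounted and accessible.
    esxcli storage filesystem list

    # Query the volume's filesystem attributes; errors or hangs here typically
    # reflect the underlying storage connectivity issue.
    vmkfstools -P /vmfs/volumes/<datastore-name>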

Cause Validation 

Run a VOMA check using KB 318894 to determine whether any corruption has occurred on the affected datastore.

    root@hostname:/vmfs/volumes/63997788-#####-1f3d-#######/log] voma -m vmfs -f check -d /vmfs/devices/disks/naa.600###############
    Running VMFS Checker version 2.1 in check mode               
    Initializing LVM metadata, Basic Checks will be done         
    Detected valid GPT signatures                             
    Number    Start          End                Type          
     1         2048           34359738334        vmfs          
                                                             
   Checking for filesystem activity
    Performing filesystem liveness check..-Scanning for VMFS-6 host activity (4096 bytes/HB, 1024 HBs).
         ERROR: Failed to check for heartbeating hosts on device '/vmfs/devices/disks/naa.600#############'  >>>>>>>>> Indicates the heartbeat check failed due to storage connectivity issues
     VOMA failed to check device : General Error               
                                                             
     Total Errors Found:           0
     Kindly Consult VMware Support for further assistance      
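
If the naa identifier of the device backing the affected datastore is not already known, it can be looked up before running the VOMA check above, and the device state reported by the host can be reviewed as well. A minimal sketch, with the device identifier as a placeholder:

    # Map VMFS datastores to their backing devices to find the naa identifier
    # of the affected datastore.
    esxcfg-scsidevs -m

    # Review the state the host reports for that device.
    esxcli storage core device list -d naa.600<device-id>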

Resolution

  • To isolate the issue, unmap the LUN that was failing to open the volume from all the ESXi hosts and check the host status; with the LUN unmapped, the hosts should remain stable.
  • Re-present the LUN and check whether the host becomes slow or goes into a not responding state. If it does, further checks must be performed on the storage side; the host-side path checks sketched after this list can help confirm the behaviour. Coordinate with the storage vendor to identify any potential issues with the underlying storage infrastructure, including SAN devices that might be causing connectivity disruptions. Storage controllers, network switches, or disk arrays may be contributing to intermittent connectivity or I/O performance degradation.
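
One way to confirm the behaviour from the host side while working with the storage vendor is to review the path and multipathing state of the re-presented device; dead paths or repeated "state in doubt" messages point to the SAN rather than the host. A minimal sketch, with the device identifier as a placeholder:

    # List all paths to the device; any path that is not active (for example,
    # a dead path) indicates a connectivity problem.
    esxcli storage core path list -d naa.600<device-id>

    # Review the multipathing (NMP) configuration and state for the device.
    esxcli storage nmp device list -d naa.600<device-id>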