After upgrading to GemFire 10.0.8, some nodes log the error below and do not appear in the output of gfsh> list members, even though the server process is still running.
A DiskAccessException has occurred while writing to the disk for disk store [DISK_STORE_NAME]. The cache will be closed.
org.apache.geode.cache.DiskAccessException: For Region: /__PR/_B__[REGION_NAME]_[BUCKET_ID]: Failed reading from /[BASE_DIR]/[PROFILE_NAME]/data/[CLUSTER_NAME]/servers/[HOST_NAME].[SERVER_NAME]/[DISK_STORE_NAME]/BACKUP[DISK_STORE_NAME]_[OPLOG_ID]. oplogID, [OPLOG_ID] Offset being read=9038402 Current Oplog Size=10327851 Actual File Size,10327851 IS ASYNCH MODE,false IS ASYNCH WRITER ALIVE=false, caused by java.io.IOException: Input/output error
at gemfire//org.apache.geode.internal.cache.Oplog.basicGetForCompactor(Oplog.java:5480)
at gemfire//org.apache.geode.internal.cache.Oplog.getBytesAndBitsForCompaction(Oplog.java:4143)
at gemfire//org.apache.geode.internal.cache.Oplog.compact(Oplog.java:5940)
at gemfire//org.apache.geode.internal.cache.DiskStoreImpl$OplogCompactor.compact(DiskStoreImpl.java:2920)
at gemfire//org.apache.geode.internal.cache.DiskStoreImpl$OplogCompactor.run(DiskStoreImpl.java:2980)
at gemfire//org.apache.geode.internal.cache.DiskStoreImpl$2.run(DiskStoreImpl.java:4563)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:842)
Caused by: java.io.IOException: Input/output error
at java.base/java.io.RandomAccessFile.readBytes(Native Method)
at java.base/java.io.RandomAccessFile.read(RandomAccessFile.java:405)
at java.base/java.io.RandomAccessFile.readFully(RandomAccessFile.java:469)
at gemfire//org.apache.geode.internal.cache.persistence.UninterruptibleRandomAccessFile.readFully(UninterruptibleRandomAccessFile.java:95)
at gemfire//org.apache.geode.internal.cache.persistence.UninterruptibleRandomAccessFile.readFully(UninterruptibleRandomAccessFile.java:89)
at gemfire//org.apache.geode.internal.cache.Oplog.basicGetForCompactor(Oplog.java:5457)
All GemFire 10.x.x versions
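The primary failure is a java.io.IOException: Input/output error raised while the OplogCompactor for the disk store was reading an operation log (oplog) file. The stack trace shows the exception originating in a native read (RandomAccessFile.readBytes), so the read was rejected at the operating system level rather than by GemFire itself.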
Specific Error: Failed reading from the operation log file: /[BASE_DIR]/[PROFILE_NAME]/data/[CLUSTER_NAME]/servers/[HOST_NAME].[SERVER_NAME]/[DISK_STORE_NAME]/BACKUP[DISK_STORE_NAME]_[OPLOG_ID]
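GemFire is designed to close the cache automatically when a DiskAccessException occurs, to prevent data corruption. Closing the cache led to a cascading shutdown of the member's distribution manager and membership services, which is why the node no longer appears in list members even though its process is still running.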
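When several disk stores or servers are involved, it can help to collect every oplog path named in these messages. The following is a minimal sketch, assuming the server log uses the same "Failed reading from <path>." wording as the trace above and that grep supports -o (GNU and BSD grep do). The function name list_failed_oplogs is hypothetical, not part of GemFire.

```shell
# Hypothetical helper: print the distinct oplog paths named in
# "Failed reading from <path>." messages found in a server log.
list_failed_oplogs() {
    log_file="$1"
    # -o prints only the matched text; the path ends at the first space.
    grep -o 'Failed reading from [^ ]*' "$log_file" \
        | sed -e 's/^Failed reading from //' -e 's/\.$//' \
        | sort -u
}
```

Call it with the path to the server log file. Each output line is a path exactly as it appears in the log message, with the sentence-ending period removed.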
Investigate the underlying host hardware for disk health issues or file system corruption at the directory path: /[BASE_DIR]/[PROFILE_NAME]/data/[CLUSTER_NAME]/servers/[HOST_NAME].[SERVER_NAME]/[DISK_STORE_NAME]/
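One way to confirm that the problem sits below GemFire is to read every file in the disk store directory end to end and see whether the operating system returns the same error. This is only a rough sketch, not a substitute for vendor disk or file system diagnostics. The function name check_disk_store_files is hypothetical.

```shell
# Hypothetical helper: read each regular file in a directory from start
# to finish and report any file the operating system cannot read back.
check_disk_store_files() {
    dir="$1"
    failed=0
    for f in "$dir"/*; do
        # Skip subdirectories and the unexpanded pattern of an empty directory.
        [ -f "$f" ] || continue
        if ! cat "$f" > /dev/null 2>&1; then
            echo "READ FAILED: $f"
            failed=1
        fi
    done
    # Exit status 0 means every file was read completely.
    return "$failed"
}
```

Pass the disk store directory shown above as a quoted argument. A READ FAILED line for the oplog named in the exception would point to a storage fault rather than a GemFire defect. On Linux, the kernel log (for example, dmesg) usually records the corresponding device error.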
After confirming that the file system is healthy, and before restarting the node, validate the offline disk store named in the logs by running the following command:
gfsh validate offline-disk-store --name=[DISK_STORE_NAME] --disk-dirs=/[BASE_DIR]/[PROFILE_NAME]/data/[CLUSTER_NAME]/servers/[HOST_NAME].[SERVER_NAME]/[DISK_STORE_NAME]/