When Symantec Encryption Management Server or PGP Universal Server runs in a VMware ESX Server environment, a file system error can cause the file system to be remounted as read-only, which makes the server unavailable. File systems can become read-only after a busy I/O retry or a path failover.
The section below applies only to PGP Universal Server versions 2.5.x through 2.12.0 and later running on VMware ESX Server 3.x.x.
This problem may occur due to a file system error associated with an adapter issue on the virtual machine. The cause of this error can be a combination of a Linux driver and the virtual LSI device for the virtual machine. This issue may also be related to virtual machines running on SAN or iSCSI storage.
This is a known issue with VMware and is detailed on the VMware support site.
Because the problem affects only the LSI Logic adapter in VMware ESX, it is recommended to use the BusLogic adapter to resolve the issue.
To resolve this problem, you must also restore your server data from a backup. Backups include all information necessary to restore the server to its exact condition when the backup was created, including proxy and policy settings, as well as keys and user information. It is recommended to make periodic backups of all of your servers. Each backup is a full backup.
Use the following procedure to use a BusLogic adapter for PGP Universal Server on a virtual machine.
This section applies to versions of PGP Universal Server 3.x and Symantec Encryption Management Server 3.3.x.
On PGP Universal Server 3.x and Symantec Encryption Management Server 3.3.x, the use of the LSI SCSI driver is required. The BusLogic driver should no longer be used.
Linux might still remount the file system in read-only mode if the disk backend on the ESX server is slow.
This is a known issue with VMware and is detailed on the VMware support site. Please refer to VMware KB 51306.
"VMware has identified a problem where file systems may become read-only after encountering busy I/O retry or SAN or iSCSI path failover errors.
The same behavior is expected even on a native Linux environment, where the time required for the file system to become read-only depends on the number of paths available to a particular target, the multi-path software installed on the operating system, and whether the failing I/O was to an EXT3 Journal. However, the problem is aggravated in an ESX host environment because ESX host manages multiple paths to the storage target and provides a single path to the guest operating system, which effectively reduces the number of retries done by the guest operating system."
When the issue occurs, the following or similar error messages might appear in the output of the "dmesg" command:
"INFO: task postmaster:10857 blocked for more than 120 seconds."
"Buffer I/O error on device sda2, logical block 23711793
lost page write due to I/O error on sda2
sd 0:0:0:0: timing out command, waited 1080s
sd 0:0:0:0: Unhandled error code
sd 0:0:0:0: SCSI error: return code = 0x06000008"
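The "return code" value in the last message packs several fields into one 32-bit word. The following Python sketch splits it into its bytes. It assumes the conventional Linux SCSI layout for kernels of this generation (driver byte in bits 24-31, host byte in bits 16-23, message byte in bits 8-15, status byte in bits 0-7). The function name and the small name tables are illustrative only and cover just a few common values.

```python
import re

# Assumed layout of the SCSI result word (older Linux kernels):
# driver byte | host byte | message byte | status byte
# The tables below are deliberately incomplete and for illustration only.
DRIVER_NAMES = {0x00: "DRIVER_OK", 0x06: "DRIVER_TIMEOUT", 0x08: "DRIVER_SENSE"}
HOST_NAMES = {0x00: "DID_OK"}
STATUS_NAMES = {0x00: "GOOD", 0x02: "CHECK CONDITION", 0x08: "BUSY"}

RETURN_CODE = re.compile(r"return code = 0x([0-9a-fA-F]{8})")


def decode_scsi_result(line):
    """Return the four bytes of a 'SCSI error: return code' line, or None."""
    match = RETURN_CODE.search(line)
    if match is None:
        return None
    value = int(match.group(1), 16)
    driver = (value >> 24) & 0xFF
    host = (value >> 16) & 0xFF
    message = (value >> 8) & 0xFF
    status = value & 0xFF
    return {
        "driver": (driver, DRIVER_NAMES.get(driver, "unknown")),
        "host": (host, HOST_NAMES.get(host, "unknown")),
        "message": message,
        "status": (status, STATUS_NAMES.get(status, "unknown")),
    }
```

Under these assumptions, 0x06000008 reads as a driver-level timeout combined with a BUSY status from the device, which is consistent with the "timing out command" message above.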
The following errors are shown when the file system is remounted read-only:
"EXT3-fs error (device sda2): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
mptscsih: ioc0: attempting task abort! (sc=f0d76700)"
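To check a saved copy of the kernel log for the messages quoted above, a small filter can help. The following is a minimal Python sketch, not a supported tool: the category names and the function name are made up for illustration, and the patterns are taken only from the sample messages in this article, so other kernel versions may word them differently.

```python
import re

# Patterns derived from the sample messages in this article.
# The category names on the left are hypothetical labels.
PATTERNS = {
    "hung_task": re.compile(r"blocked for more than \d+ seconds"),
    "io_error": re.compile(
        r"Buffer I/O error on device \S+|lost page write due to I/O error"
    ),
    "scsi_timeout": re.compile(r"timing out command|SCSI error: return code"),
    "journal_abort": re.compile(r"Detected aborted journal"),
    "readonly_remount": re.compile(r"Remounting filesystem read-only"),
}


def classify_dmesg(text):
    """Group matching log lines by category; categories without hits are omitted."""
    hits = {}
    for line in text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.setdefault(name, []).append(line.strip())
    return hits
```

The function takes the log as a string, for example the contents of a file produced with "dmesg > dmesg.txt". A "readonly_remount" hit indicates that the file system has already been remounted read-only.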
In this case the database goes into recovery mode and the server is no longer operational.
Connecting to the database on keys2 failed with an error:
"psql: FATAL: the database system is in recovery mode"
During high disk I/O, a very high CPU I/O wait percentage (around 50% or higher) can be seen in the output of the "top" command:
Cpu(s): 0.2%us, 0.0%sy, 0.0%ni, 49.8%id, 50.0%wa, 0.0%hi, 0.0%si, 0.0%st
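The "wa" field in this line is the I/O wait percentage. As a rough illustration, the following Python sketch extracts the fields from a "Cpu(s)" summary line in the format shown above and flags a high I/O wait. The function names are hypothetical, the 50% default threshold simply mirrors the figure mentioned in this article, and newer versions of "top" may format the line differently.

```python
import re

# Matches pairs such as "50.0%wa" in the summary line format shown above.
FIELD = re.compile(r"([\d.]+)%\s*([a-z]{2})")


def parse_cpu_line(line):
    """Return a mapping such as {'us': 0.2, 'wa': 50.0} from a 'Cpu(s)' line."""
    return {name: float(value) for value, name in FIELD.findall(line)}


def high_io_wait(line, threshold=50.0):
    """True when the I/O wait share ('wa') is at or above the threshold."""
    return parse_cpu_line(line).get("wa", 0.0) >= threshold
```

A single sample is only a snapshot. A sustained high value over several samples is a stronger indication of a slow disk backend.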
On VMware ESX server, the following error messages point to the disk not being accessible to the Linux system.
0x2 errors (this status is returned when the HBA driver is unable to issue a command to the device; it can occur due to dropped FCP frames)
0x8 errors (returned when the HBA driver aborts I/O or forces a target reset)
To solve this issue, the VMware ESX server must be adjusted to prevent disk read/write failures.
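From inside the Linux guest, the mount table shows whether a file system is currently mounted read-only. The following Python sketch parses text in the format of /proc/mounts (device, mount point, file system type, options) and lists ext3 file systems that carry the "ro" option. The function name is hypothetical and the ext3 default is an assumption based on the EXT3 messages quoted in this article.

```python
def read_only_mounts(mounts_text, fstypes=("ext3",)):
    """Return (device, mount_point) pairs mounted with the 'ro' option."""
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, fstype, options = fields[:4]
        if fstype in fstypes and "ro" in options.split(","):
            result.append((device, mount_point))
    return result
```

On the server, the input would typically be the contents of /proc/mounts, for example read_only_mounts(open("/proc/mounts").read()). An empty result means that no ext3 file system is currently mounted read-only.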