[PAM] What happens when VM Datastore runs out of space

Products

CA Privileged Access Manager (PAM)

Issue/Introduction

PAM Virtual Appliances run on a virtual platform.

What can happen if the Datastore that hosts the PAM Virtual Appliance runs out of disk space?

Environment

Release : ANY

Component : PRIVILEGED ACCESS MANAGEMENT

Cause

The outcome is unpredictable. This is a condition that should not occur and must be avoided at all costs.

This condition is not specific to PAM; it applies to any virtual appliance or OS hosted on a virtual platform.

When the Datastore runs out of space, the condition is not visible to the hosted guest OS.

The guest OS has been allocated a virtual disk, and the space it reports does not reflect the free space on the Datastore.
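
The mismatch between the two views can be illustrated with a small calculation. This is a minimal sketch, assuming thin-provisioned virtual disks whose backing files (including snapshot data) can grow on the Datastore; the node names, sizes, and the `VirtualDisk` type are hypothetical and are not taken from a real PAM deployment.

```python
from dataclasses import dataclass


@dataclass
class VirtualDisk:
    name: str
    provisioned_gb: float  # size the guest OS is told it has
    guest_used_gb: float   # space consumed inside the guest filesystem
    backing_gb: float      # space the disk files (incl. snapshots) occupy on the Datastore


def guest_free_gb(disk: VirtualDisk) -> float:
    """Free space as the guest OS would report it."""
    return disk.provisioned_gb - disk.guest_used_gb


def datastore_free_gb(capacity_gb: float, disks: list[VirtualDisk]) -> float:
    """Free space actually left on the Datastore."""
    return capacity_gb - sum(d.backing_gb for d in disks)


# Hypothetical figures for illustration only.
DISKS = [
    VirtualDisk("pam-node-1", provisioned_gb=300, guest_used_gb=200, backing_gb=250),
    VirtualDisk("pam-node-2", provisioned_gb=300, guest_used_gb=200, backing_gb=250),
]
DATASTORE_CAPACITY_GB = 500

if __name__ == "__main__":
    for d in DISKS:
        print(f"{d.name}: guest sees {guest_free_gb(d):.0f} GB free")
    print(f"Datastore: {datastore_free_gb(DATASTORE_CAPACITY_GB, DISKS):.0f} GB free")
```

With these figures each guest still reports free space while the Datastore itself has none left, which is why the condition cannot be detected from inside the appliance.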

The following are some of the known issues that have been encountered.

Note that not all nodes on that Datastore will show exactly the same symptoms.

1. Communication breakdown

* Most common.

* PAM nodes hosted on that Datastore suddenly fail to communicate with other nodes. Other nodes will also find these nodes unreachable.

2. Unable to log on to the GUI or config page

3. No SNMP trap messages are sent out

4. No emails are sent out

5. If you are able to SSH to the node, you may find the following indicators in the system logs.

[syslog]
Jan 01 21:21:08 PAMHOST pamDatabaseHeartBeat[32222]: read SNMP Config NULL
Jan 01 21:21:08 PAMHOST pamDatabaseHeartBeat[32222]: new lost member(s) found '192.###.###.##1'
Jan 01 21:21:08 PAMHOST pamDatabaseHeartBeat[32222]: update pam_instances to set leader to 0 for lost members: UPDATE pam_instances SET leader = '0' WHERE host_name IN ('192.###.###.##1') OR ip_addr IN ('192.###.###.##1')
Jan 01 21:21:08 PAMHOST pamDatabaseHeartBeat[32222]: Cannot open logfile ()
Jan 01 21:21:18 PAMHOST rc.local[392]: 01/01/20 21:21:18 - Error (110) on writing to keepalive socket for member 192.###.###.##1
Jan 01 21:21:20 PAMHOST rc.local[392]: 01/01/20 21:21:20 - Connection to 192.###.###.##1 has been broken. The gateway is currently reachable.


[messages]
Jan 01 21:56:04 PAMHOST kernel: sd 2:0:0:0: [sda] tag#20 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x00
Jan 01 21:56:04 PAMHOST kernel: sd 2:0:0:0: [sda] tag#20 CDB: opcode=0x2a 2a 00 08 cf d9 df 00 00 08 00
Jan 01 21:56:04 PAMHOST kernel: EXT4-fs warning (device loop5): ext4_end_bio:323: I/O error 10 writing to inode 229601 (offset 0 size 0 starting block 18439092)
Jan 01 21:56:04 PAMHOST kernel: sd 2:0:0:0: [sda] tag#19 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x00
Jan 01 21:56:04 PAMHOST kernel: sd 2:0:0:0: [sda] tag#19 CDB: opcode=0x2a 2a 00 08 cf da 47 00 00 08 00
Jan 01 21:56:04 PAMHOST kernel: EXT4-fs warning (device loop5): ext4_end_bio:323: I/O error 10 writing to inode 229601 (offset 0 size 0 starting block 18439105)
Jan 01 21:56:05 PAMHOST kernel: sd 2:0:0:0: [sda] tag#21 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x00
Jan 01 21:56:05 PAMHOST kernel: sd 2:0:0:0: [sda] tag#21 CDB: opcode=0x2a 2a 00 08 cf d9 f7 00 00 08 00
Jan 01 21:56:05 PAMHOST kernel: EXT4-fs warning (device loop5): ext4_end_bio:323: I/O error 10 writing to inode 229601 (offset 0 size 0 starting block 18439095)
Jan 01 21:56:05 PAMHOST kernel: sd 2:0:0:0: [sda] tag#22 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x00
Jan 01 21:56:05 PAMHOST kernel: sd 2:0:0:0: [sda] tag#22 CDB: opcode=0x2a 2a 00 08 cf da 7f 00 00 08 00
Jan 01 21:56:05 PAMHOST kernel: EXT4-fs warning (device loop5): ext4_end_bio:323: I/O error 10 writing to inode 229601 (offset 0 size 0 starting block 18439112)

6. Signs of DB corruption

If you are able to log on to the GUI/config page, you may see a partial or invalid screen.

7. Signs of file corruption

Some log files suddenly show binary data instead of text messages.

8. After a reboot, the node shows only "Booting the kernel" and no further activity

* If you see the node trying to communicate with other PAM nodes after the reboot, that is a good sign.

* Identify which nodes are not attempting to sync with other nodes and stay at "Booting the kernel".
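
The log-based symptoms in items 5 and 7 can be checked with a short script once log files have been copied off a node. This is a minimal sketch, not a PAM tool: the patterns are derived only from the sample log lines shown in item 5, and the names `INDICATORS`, `count_indicators`, and `looks_binary`, as well as the 10% threshold, are assumptions made for illustration.

```python
import re
from collections import Counter

# Patterns derived from the sample syslog/messages lines in this article.
INDICATORS = {
    "ext4_io_error": re.compile(r"EXT4-fs warning .*I/O error \d+ writing to inode"),
    "keepalive_write_error": re.compile(r"Error \(\d+\) on writing to keepalive socket"),
    "connection_broken": re.compile(r"Connection to \S+ has been broken"),
    "logfile_open_failure": re.compile(r"Cannot open logfile"),
    "lost_member": re.compile(r"new lost member\(s\) found"),
}


def count_indicators(lines):
    """Count how many lines match each known indicator pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in INDICATORS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def looks_binary(chunk: bytes, threshold: float = 0.10) -> bool:
    """Heuristic: a chunk is 'binary' if it has NUL bytes or many non-printable bytes.

    The threshold is an arbitrary assumption; logs containing non-ASCII text
    may need a higher value.
    """
    if not chunk:
        return False
    if b"\x00" in chunk:
        return True
    printable = set(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}
    odd = sum(1 for b in chunk if b not in printable)
    return odd / len(chunk) > threshold


# Lines copied from the samples in item 5.
SAMPLE_LINES = [
    "Jan 01 21:21:08 PAMHOST pamDatabaseHeartBeat[32222]: new lost member(s) found '192.###.###.##1'",
    "Jan 01 21:21:08 PAMHOST pamDatabaseHeartBeat[32222]: Cannot open logfile ()",
    "Jan 01 21:21:18 PAMHOST rc.local[392]: 01/01/20 21:21:18 - Error (110) on writing to keepalive socket for member 192.###.###.##1",
    "Jan 01 21:21:20 PAMHOST rc.local[392]: 01/01/20 21:21:20 - Connection to 192.###.###.##1 has been broken. The gateway is currently reachable.",
    "Jan 01 21:56:04 PAMHOST kernel: EXT4-fs warning (device loop5): ext4_end_bio:323: I/O error 10 writing to inode 229601 (offset 0 size 0 starting block 18439092)",
]

if __name__ == "__main__":
    for name, hits in count_indicators(SAMPLE_LINES).items():
        print(f"{name}: {hits}")
```

A match is only a hint that the node may have been affected; the same messages can appear for unrelated network or disk problems, so they should be read together with the state of the Datastore.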

Resolution

In an ideal environment the Datastore should not run out of space, and early warnings should alert the administrators in time to avoid such an incident.
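
An early warning of this kind comes down to a threshold check on Datastore usage. Virtualization platforms typically provide their own usage alarms, and those should be preferred; the following is only a minimal sketch of the idea. The 80% and 90% thresholds and the function names are assumptions, and `check_path` is only meaningful when run on a system where the Datastore's backing storage is mounted, never inside the PAM guest.

```python
import shutil

# Hypothetical thresholds; choose values that suit your environment.
WARN_PCT = 80.0
CRIT_PCT = 90.0


def classify_usage(total_bytes: int, free_bytes: int,
                   warn_pct: float = WARN_PCT, crit_pct: float = CRIT_PCT) -> str:
    """Return 'ok', 'warning', or 'critical' based on the used percentage."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    used_pct = 100.0 * (total_bytes - free_bytes) / total_bytes
    if used_pct >= crit_pct:
        return "critical"
    if used_pct >= warn_pct:
        return "warning"
    return "ok"


def check_path(path: str) -> str:
    """Classify usage of the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return classify_usage(usage.total, usage.free)


if __name__ == "__main__":
    # "/" is used here only so the example runs anywhere.
    print(check_path("/"))
```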

If this has occurred, space must first be cleared on the Datastore.

Then the PAM nodes that are stuck at "Booting the kernel" need to be reverted to a previous snapshot, as their current state may already be damaged by unexpected corruption.

You can take a DB export at this point from a working node in case it needs to be restored at the Primary Site.

Once you revert the damaged node to its snapshot, it should try to sync up with the other nodes and recover itself.

If the majority of the Primary Site was lost, check whether the synced DB content is up to date; if it is not, you may need to restore the DB.

If there are still signs that the affected nodes are not functioning correctly, those nodes may need to be reverted to a previous snapshot as well.