Users access the internet via Cloud SWG using the IPSEC access method.
IPSEC tunnels run from Palo Alto firewalls to locations across all GEOs, with primary and secondary tunnels for redundancy (all traffic uses the primary tunnel unless it fails).
At one specific location, users were impacted (after hours) during a maintenance activity. A maintenance alert had been published, but it did not document any actions to be taken.
Users were unable to access the internet for about 30 minutes before it recovered.
The maintenance activity was performed pod by pod (no data center outage), and hence no action was required on the IPSEC firewall side.
Palo Alto Firewall.
IPSEC IKEv1 access method into Cloud SWG.
A Tunnel Monitor profile existed, but it was not attached to the impacted firewall at the problem location.
Make sure that both liveness checks and a tunnel monitoring profile exist and are attached to the IPSEC tunnel configuration into Cloud SWG, so that connectivity and IKE protocol events are detected transparently.
A liveness check enables Dead Peer Detection (DPD), the functionality documented in RFC 3706 for detecting dead Internet Key Exchange (IKE/Phase 1) peers.
A tunnel monitoring profile allows you to verify connectivity between the VPN peers; you can configure the tunnel interface to ping a destination IP address at a specified interval and specify the action if the communication across the tunnel is broken.
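The behaviour of a tunnel monitoring profile can be illustrated with a small Python model. This is only a sketch of the general idea (probe at an interval, count consecutive misses, trigger an action at a threshold), not the firewall's actual implementation. The class name, the function name, the action label and the interval and threshold values are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class TunnelMonitor:
    """Toy model of a tunnel monitor: act after N consecutive missed probes."""
    interval_s: int   # seconds between probes (illustrative value)
    threshold: int    # consecutive misses before the tunnel is declared down
    action: str       # label for the configured action (illustrative)
    misses: int = 0

    def record_probe(self, reply_received: bool) -> Optional[str]:
        """Record one probe result; return the action once the threshold is hit."""
        if reply_received:
            self.misses = 0  # any reply resets the miss counter
            return None
        self.misses += 1
        if self.misses >= self.threshold:
            return self.action
        return None


def probes_until_action(monitor: TunnelMonitor,
                        results: Iterable[bool]) -> Optional[int]:
    """Return the 1-based index of the probe that triggers the action, if any."""
    for index, ok in enumerate(results, start=1):
        if monitor.record_probe(ok) is not None:
            return index
    return None


monitor = TunnelMonitor(interval_s=3, threshold=5, action="fail-over")
# One early blip recovers; the action only fires after five misses in a row.
history = [True, False, False, True, False, False, False, False, False]
triggered_at = probes_until_action(monitor, history)
```

With these assumed values, a dead path is acted on after threshold × interval seconds of silence (5 × 3 = 15 seconds here), which is the order of magnitude to expect when the profile is attached to the tunnel.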
When a POD or POP goes down for any reason, it is imperative that the IPSEC firewall is set up to detect this condition.
During the Cloud SWG maintenance activity, PODs were temporarily taken out of rotation and any active connections to the disconnected POD should have been detected on the Palo IPSEC firewall.
Typically, during maintenance activities where no action is required for IPSEC, one pod (e.g. DP1) is taken out of load balancer rotation, the necessary updates are applied to it, and once they are complete it is added back into rotation before the same is done with the next pod (e.g. DP2). When DP1 is out of rotation, the tunnel should have switched to DP2 automatically, and you may have noticed a small impact (for example, needing to refresh the browser, or to reconnect a Teams call, since Teams is badly affected by a change of egress IP address).
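The pod-by-pod sequence described above can be sketched in Python as follows. This is a simplified model that assumes exactly one pod is out of rotation at a time and that traffic moves to the first pod still in rotation. The real load balancer's selection logic is not described here, and the function names are hypothetical.

```python
from typing import List, Optional, Tuple


def serving_pod(pods: List[str], out_of_rotation: str) -> Optional[str]:
    """Return the first pod still in rotation, or None if no pod is left."""
    for pod in pods:
        if pod != out_of_rotation:
            return pod
    return None


def rolling_maintenance(pods: List[str]) -> List[Tuple[str, Optional[str]]]:
    """For each step, pair the pod under maintenance with the pod serving traffic."""
    return [(pod, serving_pod(pods, pod)) for pod in pods]


# While DP1 is updated, DP2 serves traffic, and then the roles swap.
steps = rolling_maintenance(["DP1", "DP2"])
```

At every step a pod remains in rotation, which is why no data center outage occurs. The firewall still has to notice that its current pod has gone away and re-establish the tunnel.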
Liveness checks, performed at the DPD layer, should have detected the change and negotiated a new session into Cloud SWG. However, the Palo logs indicate that this did not happen until about 45 minutes (16:57) after the POD went down (16:11 UTC) as shown below:
Since the liveness checks and the tunnel monitoring profile are responsible for detecting such events, we focussed our attention there before realising that the tunnel monitoring profile was not applied correctly.
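A gap like this, where a profile exists but is not attached, can be caught by a periodic audit of the tunnel inventory. The Python sketch below assumes the tunnel settings have already been exported into a plain dictionary. The key names (`liveness_check`, `monitor_profile`), the tunnel names and the profile name are hypothetical and do not correspond to any firewall API or export format.

```python
from typing import Dict, List


def audit_tunnels(tunnels: Dict[str, dict]) -> List[str]:
    """Report tunnels that lack a liveness check or an attached monitor profile."""
    findings = []
    for name, cfg in sorted(tunnels.items()):
        if not cfg.get("liveness_check"):
            findings.append(f"{name}: liveness check (DPD) is not enabled")
        if not cfg.get("monitor_profile"):
            findings.append(f"{name}: no tunnel monitor profile attached")
    return findings


# Hypothetical inventory: the second tunnel mirrors the problem location,
# where the profile was defined but never attached to the tunnel.
inventory = {
    "site-a-primary": {"liveness_check": True, "monitor_profile": "swg-monitor"},
    "site-b-primary": {"liveness_check": True, "monitor_profile": None},
}
findings = audit_tunnels(inventory)
```

Running such a check across all locations before a published maintenance window would flag any tunnel that cannot detect a pod going out of rotation.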