Symptoms:
Pivotal has released stemcells to mitigate security vulnerabilities due to CVE-2017-5754, CVE-2017-5715, and CVE-2017-5753 (aka Spectre and Meltdown).
Pivotal Engineering is currently carrying out performance testing to measure the impact of the new stemcell versions (stemcells that have the Spectre patches) on Pivotal Cloud Foundry (PCF) components. Once the results of the tests become available, we will update this article with additional guidance on determining and mitigating the performance impact on these components.
Currently, we have seen multiple PCF installations facing the following issues as a result of the Spectre and Meltdown mitigations.
Diego Performance
The stemcell versions that address Meltdown vulnerabilities impact the performance of the Diego BBS and Diego Cell VMs. Pivotal recommends scaling these Diego components before deploying the new stemcell.
Diego BBS Scaling Recommendations:
Depending on the configuration of the VM hypervisor, CPU usage on the Diego BBS VMs has been observed to increase by between 10% and 50%.
Because the Diego BBS component has only one active instance at a time, operators should scale the Diego BBS VMs in Pivotal Application Service (formerly Elastic Runtime) vertically to compensate for this CPU usage increase. If the total CPU usage on the Diego BBS VMs is greater than 60% before hypervisor and stemcell updates, operators should increase the CPU count on these VMs until CPU usage is 60% or lower and then upgrade the stemcell. If there is sufficient CPU usage headroom after applying the updates, the BBS VMs can then be scaled down.
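As a rough illustration of the vertical-scaling guidance above, the following Python sketch estimates the smallest CPU count that brings BBS CPU usage to 60% or lower. The function name min_cpu_count is hypothetical, and the calculation assumes that the total CPU work stays constant, so usage falls in proportion to the number of CPUs added. Real workloads may not scale this cleanly, so treat the result as a starting point to verify against observed metrics.

```python
import math


def min_cpu_count(current_cpus: int, usage_pct: float, target_pct: float = 60.0) -> int:
    """Estimate the smallest CPU count that keeps usage at or below target_pct.

    Assumption (not from the article): total CPU work is constant, so the
    usage percentage scales inversely with the number of CPUs.
    """
    if current_cpus < 1:
        raise ValueError("current_cpus must be at least 1")
    if not 0 <= usage_pct <= 100:
        raise ValueError("usage_pct must be between 0 and 100")
    if usage_pct <= target_pct:
        # Already at or below the target: no scaling needed before the update.
        return current_cpus
    # Work expressed in "CPU-percent", divided by the per-CPU target.
    return math.ceil(current_cpus * usage_pct / target_pct)


if __name__ == "__main__":
    # Example: a 4-CPU BBS VM running at 75% total CPU usage.
    print(min_cpu_count(4, 75.0))
```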
Diego Cell Scaling Recommendations:
Operators can also expect application instances to increase their CPU usage.
To compensate for this increase, Pivotal recommends scaling up the Diego Cell count to maintain 65% CPU usage or lower before hypervisor and stemcell updates are applied. If average CPU usage across the Cells remains below 85% after the updates, the Cell VM count can then be decreased.
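The Cell guidance above can be sketched in Python as follows. The function names and the per-Cell usage readings are hypothetical, and the estimate assumes that application load spreads evenly across Cells, which the scheduler does not guarantee. The 65% and 85% thresholds are the ones quoted in this article.

```python
import math
from typing import Sequence


def required_cell_count(cell_usage_pct: Sequence[float], target_pct: float = 65.0) -> int:
    """Estimate how many Cells keep average CPU usage at or below target_pct.

    Assumption (not from the article): load redistributes evenly when
    Cells are added, so average usage scales inversely with Cell count.
    """
    if not cell_usage_pct:
        raise ValueError("at least one Cell reading is required")
    current = len(cell_usage_pct)
    needed = math.ceil(sum(cell_usage_pct) / target_pct)
    # Never recommend fewer Cells than are currently deployed.
    return max(current, needed)


def may_scale_down(post_update_usage_pct: Sequence[float], limit_pct: float = 85.0) -> bool:
    """Return True if average post-update usage is below the article's limit."""
    average = sum(post_update_usage_pct) / len(post_update_usage_pct)
    return average < limit_pct


if __name__ == "__main__":
    readings = [70.0, 80.0, 75.0, 75.0]  # hypothetical pre-update readings
    print(required_cell_count(readings))
```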
Application Start Timeouts
Because of this performance degradation, application developers may observe an increase in the time required to start their application instances. Developers should determine how long their application instances take to become healthy by looking at the time elapsed between the "Starting health monitoring of container" and "Container became healthy" log messages for the starting instance. In cases where the instance start time is already close to the default 60-second limit that the Cloud Controller specifies or to the developer-configured value for that application, the developer may need to increase this timeout to accommodate a longer startup duration. This can be accomplished by using the -t option of the cf push command. The maximum value for the timeout on Pivotal Cloud Foundry is 600 seconds.
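To make the startup measurement concrete, here is a minimal Python sketch that computes the elapsed time between the two log messages named above. It assumes that each log line begins with a timestamp such as 2018-01-15T10:00:05.10-0800, as in typical cf logs output; adjust TIMESTAMP_FORMAT if your log lines differ. The helper names, the sample lines, and the 80% margin used to decide what counts as "close to" the timeout are all illustrative assumptions.

```python
from datetime import datetime
from typing import Iterable, Optional

START_MARKER = "Starting health monitoring of container"
HEALTHY_MARKER = "Container became healthy"
# Assumed timestamp layout at the start of each log line.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"


def parse_timestamp(line: str) -> datetime:
    """Parse the leading whitespace-delimited token of a log line."""
    return datetime.strptime(line.split(maxsplit=1)[0], TIMESTAMP_FORMAT)


def startup_seconds(lines: Iterable[str]) -> Optional[float]:
    """Seconds between the start and healthy messages, or None if not found."""
    started = None
    for line in lines:
        if START_MARKER in line:
            started = parse_timestamp(line)
        elif HEALTHY_MARKER in line and started is not None:
            return (parse_timestamp(line) - started).total_seconds()
    return None


def near_timeout(startup: float, timeout: float = 60.0, margin: float = 0.8) -> bool:
    """True if startup uses at least `margin` of the timeout (hypothetical rule)."""
    return startup >= timeout * margin


if __name__ == "__main__":
    sample = [  # hypothetical log lines
        "2018-01-15T10:00:05.10-0800 [CELL/0] OUT Starting health monitoring of container",
        "2018-01-15T10:00:57.60-0800 [CELL/0] OUT Container became healthy",
    ]
    elapsed = startup_seconds(sample)
    print(elapsed, near_timeout(elapsed))
```

If the measured startup time is near the limit, the timeout can be raised at push time, for example with cf push my-app -t 120, where my-app is a placeholder application name and 120 is an illustrative value in seconds.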
Router Performance
Upgrading to the updated stemcell versions impacts the performance of the Gorouter. Pivotal has observed a 10% decrease in the maximum throughput of the Gorouter as well as a marginal increase in its latency at high throughput rates. These numbers may vary depending on the IaaS.
Router Scaling Recommendations:
Pivotal recommends keeping the CPU utilization of the Gorouter VM within 60-70%. For more information, see Key Capacity Scaling Indicators.
If your Gorouter CPU utilization is currently greater than 60%, Pivotal recommends scaling the Gorouter VMs vertically or horizontally before upgrading to the new stemcell, and then monitoring the latency, throughput, and CPU usage of the Gorouter. If CPU utilization remains below the suggested range after the update, you can scale the VMs back down.
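The Gorouter guidance can be expressed as a simple check against the 60-70% range. The Python sketch below is only an illustration: the function name and the returned labels are hypothetical, and a real decision should also take the observed latency and throughput into account.

```python
def gorouter_scaling_hint(cpu_pct: float, low: float = 60.0, high: float = 70.0) -> str:
    """Classify Gorouter CPU utilization against the recommended range.

    The 60-70% range comes from this article; the labels are illustrative.
    """
    if not 0 <= cpu_pct <= 100:
        raise ValueError("cpu_pct must be between 0 and 100")
    if cpu_pct > high:
        return "scale up"
    if cpu_pct < low:
        return "candidate for scaling down"
    return "within range"


if __name__ == "__main__":
    for reading in (45.0, 65.0, 82.0):  # hypothetical post-update readings
        print(reading, gorouter_scaling_hint(reading))
```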