Significant Drop in the Test Execution on BlazeMeter on August 13, 2020 - High Node CPU Usage in Multiple Identity Nodes
Updated On: 17-08-2020 11:31
We saw a drop in the number of test executions and were also unable to log in for 30 minutes, from 7:15 PM to 7:45 PM IST; we had seen a similar issue this morning as well.
Test executions are now back to normal.
Can you please let me know what the issue was?
Incident details with time:
08/13/2020 10:26PM PDT to 10:56PM PDT => 5 incidents were triggered for high node CPU usage in multiple Identity nodes: sev-1
08/13/2020 10:42PM PDT => 1 incident triggered for drop in API TPS: sev-1
Observed “OperationalError: (psycopg2.OperationalError) ERROR: no such user: xxxxxxxx” exception in Identity service
Observed “health-check” failures in the Identity service caused by the error above
Observed Identity POD CPU spikes and random restarts; dependent services, including API, failed to connect to Identity, which disrupted API traffic inflow
Observed drops in TPS for API
How we came to know about the problem:
Runscope monitoring system detected identity service POD restarts and triggered a sev-2 slack notification
Runscope monitoring system detected identity service POD high CPU usage (> 70%) and triggered a sev-2 slack notification
Runscope monitoring system detected identity service POD high CPU usage (> 90%) and triggered a sev-1 PagerDuty alert (Please refer item #1 from above Incidents section for details)
Runscope monitoring system detected a sev-1 breach for API traffic drop and triggered a sev-1 alert (Please refer item #2 from above Incidents section for details)
Within a few seconds after the initial sev-2 was triggered, we observed a “no such user” exception in the Identity service
Root cause of the problem
The Identity POD restarts due to “health-check” failures, the subsequent CPU spike alerts, and the “no such user: xxxxxx” error in the Identity logs while connecting to the Identity DB pointed us to the DB layer for further investigation
We triaged the DB layer and initially found that the “pgbouncer” hosts were at 100% disk usage – we added additional disk space and quickly restarted all “pgbouncer” hosts
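A disk-usage check of the kind later added to monitoring can be sketched as follows; the 90% threshold and the `/` mount point are illustrative assumptions, not the exact production monitoring config:

```shell
#!/bin/sh
# Hedged sketch: flag a filesystem when usage crosses a threshold.
# THRESHOLD and MOUNT are illustrative assumptions.
THRESHOLD=90
MOUNT=/

# df -P gives POSIX-stable output; field 5 of line 2 is the use percentage.
usage=$(df -P "$MOUNT" | awk 'NR==2 { sub(/%/, "", $5); print $5 }')

if [ "$usage" -ge "$THRESHOLD" ]; then
    echo "ALERT: $MOUNT at ${usage}% - pgbouncer may fail to write or rotate files"
else
    echo "OK: $MOUNT at ${usage}%"
fi
```

A check like this, run periodically on each pgbouncer host, would have surfaced the full disk before it affected connection pooling.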
However, the Identity service POD restarts, CPU spikes, and the other error conditions continued to occur; the issue remained with pgbouncer, which was unable to connect to the Postgres DB and provide pooled connections to the Identity service
This forced us to investigate the pgbouncer servers further, and we found that the pgbouncer PROD auth-config had either been wiped out or renamed with an “.rpmsave” suffix. We quickly fixed this by recreating/renaming the auth-config on the impacted servers, but the issue kept recurring randomly on a few of those hosts. So, to avoid further widespread impact to the system, we removed the impacted pgbouncer hosts from the Identity pool of pgbouncer servers – this brought API traffic back to normal, though the Identity servers were still overloaded
Continuing to triage the pgbouncer servers, we found that the yum package manager had auto-updated the pgbouncer package and, during the upgrade, overwritten/renamed the PROD auth-config with the default config template. This left pgbouncer unable to reliably look up the PROD DB user from the PROD configs, hence the error and the other side effects
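When rpm upgrades a package whose config file has local changes, it can preserve the old file under an “.rpmsave” suffix, leaving the default template in place. Detecting that rename is mechanical; a minimal sketch (a temp directory stands in for the real /etc/pgbouncer so the example is self-contained):

```shell
#!/bin/sh
# Hedged sketch: detect configs that rpm set aside with an ".rpmsave"
# suffix during a package upgrade. A temp directory stands in for
# /etc/pgbouncer; the file names are illustrative.
CONF_DIR=$(mktemp -d)
touch "$CONF_DIR/pgbouncer.ini" "$CONF_DIR/userlist.txt.rpmsave"

saved=$(find "$CONF_DIR" -name '*.rpmsave')
if [ -n "$saved" ]; then
    echo "WARN: rpm has set aside the following config file(s):"
    echo "$saved"
    # Restore step (illustrative, after reviewing the diff):
    #   mv "$f" "${f%.rpmsave}"
fi
```

Run against the real config directory after any package upgrade, this flags configs that rpm has replaced before they cause runtime lookup failures.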
So, to fix this pgbouncer issue and bring Identity back to normal, we manually updated the pgbouncer package on all impacted hosts and restored pgbouncer with a clean PROD auth-config. This brought the Identity service back to a healthy state, and the other reported side-effect errors stopped
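pgbouncer resolves client users from the file named by its `auth_file` setting (conventionally userlist.txt); when the upgrade replaced that file, lookups for the PROD user failed with “no such user”. A hedged sketch of a pre-restart sanity check – the `identity_app` user name, file contents, and path are hypothetical stand-ins, not the real PROD values:

```shell
#!/bin/sh
# Hedged sketch: verify the DB user exists in pgbouncer's auth_file
# before (re)starting pgbouncer. A temp file stands in for
# /etc/pgbouncer/userlist.txt; "identity_app" is a hypothetical user.
USERLIST=$(mktemp)
printf '%s\n' '"identity_app" "md5d0e1f2..."' > "$USERLIST"

DB_USER=identity_app
if grep -q "^\"$DB_USER\"" "$USERLIST"; then
    result="present"
    echo "OK: \"$DB_USER\" present in auth_file"
else
    result="missing"
    echo "FAIL: \"$DB_USER\" missing - pgbouncer would report 'no such user'"
fi
```

Gating a pgbouncer restart on a check like this prevents bringing up a pooler whose auth-config has silently reverted to the default template.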
Release : SAAS
Component : BLAZEMETER GENERAL PLATFORM ISSUE
What we are doing to avoid recurrences in the future:
We added the pgbouncer hosts to our monitoring for disk usage and other regular system metrics
We disabled yum auto-upgrade for the pgbouncer package to prevent unattended updates.
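One common way to keep yum from touching a package is to add it to the `exclude` list in yum's main config. A sketch, using a temp file in place of the real /etc/yum.conf so it is self-contained:

```shell
#!/bin/sh
# Hedged sketch: pin pgbouncer by excluding it from yum updates.
# A temp file stands in for /etc/yum.conf.
YUM_CONF=$(mktemp)
printf '[main]\ngpgcheck=1\n' > "$YUM_CONF"

# Append the exclude only if no pgbouncer exclude is already present.
grep -q '^exclude=.*pgbouncer' "$YUM_CONF" || \
    echo 'exclude=pgbouncer*' >> "$YUM_CONF"

cat "$YUM_CONF"
```

An alternative with finer control is the yum versionlock plugin, which pins the package to a specific installed version rather than blocking it entirely.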