Significant Drop in the Test Execution on BlazeMeter on August 13, 2020 - High Node CPU Usage in Multiple Identity Nodes


Article ID: 197469


Updated On:




I have seen a drop in the number of test executions and was also unable to log in for 30 minutes, from 7:15 PM to 7:45 PM IST; we saw a similar issue this morning as well.
The test executions are now back to normal.
Can you please let me know what the issue is?



Incident details with time: 

  1. 08/13/2020 10:26PM PDT to 10:56PM PDT => 5 incidents were triggered for high node CPU usage in multiple Identity nodes: sev-1
  2. 08/13/2020 10:42PM PDT => 1 incident triggered for drop in API TPS: sev-1

Problem symptoms:

  1. Observed the exception “OperationalError: (psycopg2.OperationalError) ERROR: no such user: xxxxxxxx” in the Identity service
  2. Observed “health-check” failures in the Identity service caused by the error in item 1
  3. Observed Identity POD CPU spikes and random restarts; dependent services, including API, failed to connect to Identity, which disrupted API traffic inflow
  4. Observed drops in TPS for the API

How we came to know about the problem:

  1. The Runscope monitoring system detected Identity service POD restarts and triggered a sev-2 Slack notification
  2. The Runscope monitoring system detected Identity service POD high CPU usage (> 70%) and triggered a sev-2 Slack notification
  3. The Runscope monitoring system detected Identity service POD high CPU usage (> 90%) and triggered sev-1 PagerDuty alerts (see item 1 in the Incident details section above)
  4. The Runscope monitoring system detected a sev-1 breach for the API traffic drop and triggered a sev-1 alert (see item 2 in the Incident details section above)
  5. Within a few seconds of the initial sev-2, we observed a “no such user” exception in the Identity service

Root cause of the problem:

  1. The Identity POD restarts due to “health-check” failures, together with the subsequent CPU spike alerts and the error “no such user: xxxxxx” in the Identity logs while connecting to the Identity DB, pointed us to the DB layer for further investigation
  2. Triaging the DB layer, we initially found that the “pgbouncer” hosts were at 100% disk usage. We added additional disk space and quickly restarted all “pgbouncer” hosts
  3. However, the Identity service POD restarts, CPU spikes, and the other error conditions continued to occur: pgbouncer was still unable to connect to the Postgres DB and provide pooled connections to the Identity service
  4. This forced us to investigate the pgbouncer servers further, and we found that the pgbouncer PROD auth config had either been wiped out or renamed to an “rpmsave” file. We quickly fixed this by recreating/renaming the auth config on the impacted servers, but we observed the issue recurring randomly on a few of those hosts. To avoid further widespread impact to the system, we removed the impacted pgbouncer hosts from the Identity pool of pgbouncer servers; this brought API traffic back to normal, although the Identity servers were still overloaded
  5. Continuing to triage the pgbouncer servers, we found that the yum package manager had auto-updated the pgbouncer package and, in the upgrade process, had overwritten/renamed its PROD auth config with the default config template. This left pgbouncer unable to look up the PROD DB user reliably from the PROD configs, hence the error and the other side effects
  6. To fix this pgbouncer issue and bring Identity back to normal, we manually updated the pgbouncer package on all impacted hosts and restored pgbouncer with a clean PROD auth config. This brought the Identity service back to a healthy state and stopped the other reported side-effect errors



Release: SAAS




What we are doing to avoid recurrences in the future:

  1. We added the pgbouncer hosts to monitoring for disk usage and other standard system metrics
  2. We disabled yum “auto-upgrade” for the pgbouncer package to prevent unattended updates