TAS / PCF Metrics fails to ingest logs with too many clients error

Article ID: 293660


Products

Operations Manager

Issue/Introduction

This article explains what to do when logs have gone missing in the Metrics UI and why the logs-queue app reports the "too many clients" error in Tanzu Application Service (TAS), formerly Pivotal Cloud Foundry (PCF).

This article will also help operators understand why app logs do not go back as far as the configured retention level.

Environment

Product Version: 2.6

Resolution

The following symptoms are experienced: 
 
  • Logs disappear or are not available in the Metrics UI. 
  • Logs do not go back to the configured retention window. For example, Metrics only has 3 days of logs when the retention level is set to 14 days.
The logs-queue app shows the following message:
func2 ERROR: Failed to bulk save logs into persistent data store : pq: sorry, too many clients already

The Postgres VM may show one of the errors below in the postgresql.log file:
ERROR:  could not extend file "base/16385/39278931.3": No space left on device

or
ERROR:  new row for relation "app_log_day0_hour23" violates check constraint "app_log_day0_hour23_timestamp_millis_check"

According to the Metrics Architecture Documentation, app logs flow from the Ingestor App to Redis, to the logs-queue app, and finally to the PostgreSQL instance. In situations of extremely high app log ingress, the logs-queue app can become unstable if the internal database pruning algorithm is not able to keep the PostgreSQL disk usage sufficiently low.

By default, the logs-queue app checks whether pruning is needed once every hour, and it only prunes the database if disk usage reaches 85%. During short periods of high log ingress, it is possible to fill up the database faster than the logs-queue app can prune.
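
If you want to see how close the log store currently is to that threshold, a quick check from a shell on the postgresql VM (assuming the standard BOSH persistent disk mount point) is:
$> df -h /var/vcap/store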

Additionally, if you have recently upgraded Metrics for PCF from 1.5.2 to 1.6.0, your PostgreSQL database may still fill up even after tuning the appropriate pruning parameters. Please review the known issues in this doc.
 

Before Making Changes

Before tuning the logs-queue pruning parameters, you first need to understand why the log volume is so high. This is especially true if you have already gone through the effort of scaling your Loggregator components to meet your ingress demands. Often, a small subset of apps is generating thousands of logs per second. We advise that you first identify why the log volume is so high and then determine which steps can be taken to limit the app log ingress.

There are several ways to identify noisy apps; one example query against the log store is sketched after the note below.
  Note: If you find an app emitting millions of logs in just a few minutes, you may need to engage the app developer in order to understand why the app is so chatty.
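
For example, one rough check is to count rows per application in one of the busier hourly partitions of the log store. This is only a sketch: the column that identifies the app is not documented here, so app_guid below is an assumption; list the actual columns of the app_log table first and substitute the correct name.
-- List the app_log columns to find the app identifier column
select column_name from information_schema.columns where table_name = 'app_log';
-- Count rows per app in a busy hourly partition (app_guid is an assumed column name)
select app_guid, count(*) as log_rows from app_log_day1_hour4 group by app_guid order by log_rows desc limit 10;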


Workaround

While you investigate why your log volume is high, you can tune the logs-queue pruning environment parameters to make the pruning algorithm more aggressive so that it keeps up with the ingress volume.

First, identify the log volume in gigabytes per day. SSH into the postgresql VM in the PCF metrics deployment and connect to the database.

Note: The table names app_log_dayX do not map directly to specific days of the week. The app uses an internal algorithm to determine which table to use based on the current epoch date, so day1 is not necessarily the first table to be used; the first table could be day5 or day11.
$> sudo su - 
$> /var/vcap/packages/postgres-*/bin/psql -p 5524 -U pgadmin metrics

Next, run the query below to dump the sizes of all tables in the database. This will give you a sense of how much log data per hour is being saved into the Postgres database.
select table_name, pg_size_pretty(pg_relation_size(quote_ident(table_name))) from information_schema.tables where table_schema = 'public' order by 2 DESC;

      table_name       | pg_size_pretty
-----------------------+----------------
 app_log_day1_hour1    | 96 MB
 flyway_schema_history | 8192 bytes
 app_log_day9_hour15   | 8040 kB
 app_log_day1_hour2    | 76 MB
 app_log_day7_hour16   | 655 MB
 app_log_day1_hour0    | 54 MB
 app_log_day1_hour4    | 308 MB
 app_log_day12_hour18  | 23 MB

You can choose one of the larger daily tables and sum up all of its hourly partitions. For example, if you want to know how much log data per day you generate, choose app_log_day1 and make sure that you have app_log_day1_hour% in the where filter, as in the following example:
select sum(h.size) || ' MB' as size from (select table_name, pg_relation_size(quote_ident(table_name)) /1024/1024 as size  from information_schema.tables where table_schema = 'public' and table_name like 'app_log_day1_hour%') h;

  size
--------
 888 MB
(1 row)

Once you know how much log data you need to keep up with, you can determine the best way to tune the logs-queue pruning parameters. You can also review the formulas documented here to determine whether you need to scale the postgres log store to keep up with the amount of log data.
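
As a rough worked example using the sample output above (illustrative only, not a sizing recommendation): at roughly 888 MB of app log data per day and a 14-day retention level, the log store needs to hold about 888 MB × 14 ≈ 12.4 GB of app log data. Because pruning only starts once disk usage reaches the configured threshold, the persistent disk needs to be comfortably larger than that, before accounting for indexes, WAL, and other tables.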

Normally, Support would recommend the settings below. If you choose these settings, it is important to consider whether the log ingress volume can exceed 25% of your postgres VM persistent disk capacity within 15 minutes; if it can, you may need to either scale up the disk or tune these parameters further to meet your specific needs.
MAX_RETENTION_PERCENTAGE=75
PG_GROOM_DISK_SIZE_INTERVAL=15m
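
To make that consideration concrete, here is an illustrative calculation (the 50 GB disk size is a made-up figure, not a recommendation): with MAX_RETENTION_PERCENTAGE=75, pruning aims to keep usage at or below 75% of the disk, leaving roughly 25% of the disk as headroom between pruning passes. On a 50 GB persistent disk that headroom is about 12.5 GB, so with PG_GROOM_DISK_SIZE_INTERVAL=15m, sustained ingress above roughly 12.5 GB per 15 minutes (about 830 MB per minute) could still fill the disk before the next pruning pass.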

You can set these parameters from the Operations Manager Metrics for PCF tile -> Metrics Component Configs. Once configured in Ops Manager, you will need to "Apply Changes" with the push apps errand enabled.
 
  • Logs Disk Size Pruning Interval ==  PG_GROOM_DISK_SIZE_INTERVAL
  • Logs Max Retention Percentage == MAX_RETENTION_PERCENTAGE

Alternatively, set these parameters with the cf CLI. This step may be required if you cannot edit the settings via Ops Manager and successfully execute the Metrics for PCF push apps errand. You must still modify the same settings in the Ops Manager tile, otherwise a tile update via Ops Manager will revert the settings applied via the cf CLI.
cf target -o system -s metrics-v1-6
cf set-env logs-queue MAX_RETENTION_PERCENTAGE 75
cf set-env logs-queue PG_GROOM_DISK_SIZE_INTERVAL 15m
cf restage logs-queue
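
After the restage completes, you can confirm that the values took effect; one quick way is to filter the output of cf env:
cf env logs-queue | grep -E 'MAX_RETENTION_PERCENTAGE|PG_GROOM_DISK_SIZE_INTERVAL'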


Workaround for the Check Constraint Error

Further correction might be required if the "too many clients" problem continues after you have resolved the app ingress issues above and the following error is still observed in the logs-queue app logs.

pq: new row for relation "app_log_day1_hour16" violates check constraint "app_log_day1_hour16_timestamp_millis_check"


The above error can occur when the database disk was once full and logs-queue pruned out the current day's logs. Once the current day's logs have been pruned out, this constraint error appears. To resolve this condition, you have to DELETE all of the app log data and recreate the app log table metadata using the following procedure.
 

1. Stop the metrics-ingestor app
cf stop metrics-ingestor

2. Stop the logs-queue app
cf stop logs-queue

3. Connect to the metrics database
/var/vcap/packages/postgres-*/bin/psql -p 5524 -U pgadmin metrics

4. Truncate the app_log table and call the procedure that recreates the app log table metadata
Note: This deletes all existing app log data, which cannot be recovered.
truncate app_log;
SELECT create_all_app_log_days(current_date);

5. Start the logs-queue app
cf start logs-queue

6. Start the metrics-ingestor app
cf start metrics-ingestor

7. Verify from the Metrics UI that app log data is visible.
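
If you also want to confirm at the database level that the hourly partitions were recreated, one quick check from the same psql session used in step 3 is:
select count(*) from information_schema.tables where table_schema = 'public' and table_name like 'app_log_day%';
A non-zero count indicates that the app log partitions exist again.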