A query fails, producing the error message: "VM Protect failed to allocate %d bytes, %d MB available".
This document explains the error message and some basic troubleshooting steps that can be taken.
The gp_vmem_protect_limit server configuration parameter limits the amount of memory that queries can allocate on each segment. If a memory allocation for a query would cause gp_vmem_protect_limit to be exceeded, the query is canceled.
Note: The error message will be recorded in the database logs.
This is a self-defense mechanism introduced to prevent Greenplum Database (GPDB) queries from over-consuming memory.
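To see the limit currently in effect, the parameter can be checked from the master host with gpconfig (the exact output format depends on your GPDB version):

gpconfig -s gp_vmem_protect_limit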
Note: Always validate that the tables are vacuumed and analyzed.
Queries running on a system can produce this error for multiple reasons. First, try to determine whether or not the issue is attributable to concurrency.
1. Analyze the master and segment logs for the relevant time window to isolate whether the issue is specific to one query or whether most queries failed with the same error message.
egrep "Out of memory|VM Protect" gpdb-YYYY-MM-DD_nnnnnn.log
2. If only a limited number of queries failed with the error message, identify and analyze those queries. Execute them on the database to determine whether the error is reproducible.
If the issue is reproducible, work through the following checks to identify why the query consumes such a high amount of memory.
a. Verify that all the tables used in the query are analyzed.
b. Review the explain plan of the query for any abnormalities, specifically:
Nested Loops: If the explain plan shows that the query is using nested loop joins, try disabling them to force a hash join.
set enable_nestloop=off ; explain analyze <query>;
Join order: You can also force the planner to follow the join order written in the query by lowering the collapse limits:
set from_collapse_limit=1; set join_collapse_limit=1;
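As a minimal sketch of these checks in a psql session (the schema, table, and query names are placeholders, and gp_toolkit.gp_stats_missing is just one convenient way to find tables without statistics):

-- list tables for which the optimizer has no statistics
SELECT * FROM gp_toolkit.gp_stats_missing;

-- refresh statistics on the tables used by the failing query
ANALYZE myschema.fact_table;
ANALYZE myschema.dim_table;

-- re-run the query with nested loops disabled and join reordering limited
SET enable_nestloop = off;
SET from_collapse_limit = 1;
SET join_collapse_limit = 1;
EXPLAIN ANALYZE <query>;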
3. If several queries failed, or the issue could not be reproduced, a more extensive exercise is required to identify the problem.
In most cases, higher query concurrency is the issue; in other words, the number of queries running simultaneously on the system is high enough that their combined memory requirements hit or exceed gp_vmem_protect_limit. In such cases, follow the instructions below:
a. Consult with end users to determine whether "ad-hoc" queries are allowed on the system or only "scheduled" loads run on it. If "ad-hoc" queries run on the system, there is a high chance that a user triggered a bad query that caused the issue, and it is hard to predict when it will reoccur. If only "scheduled" loads run on the system, the issue will usually reoccur around the same fixed time.
b. Execute the gpmemwatcher and gpmemreport utilities to track the memory consumption of the queries on the segments.
c. Set gp_max_plan_size to avoid queries with excessively large plans running the system out of memory. To start, set it to 200MB; later, review whether it needs to be increased or decreased.
gpconfig -c gp_max_plan_size -v "200MB"
gpstop -u
gpconfig -s gp_max_plan_size
Note: The amount of memory required to handle plans is not accounted for by the gp_vmem_protect_limit.
d. Once the issue reoccurs while gpmemwatcher is running in the background, stop gpmemwatcher and review the collected data with the gpmemreport utility.
Based on the output of gpmemreport, decide whether the memory consumption of the queries is acceptable. Determine whether the issue is concurrency or whether only a few queries occupy high amounts of memory and need to be evaluated for further tuning.
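A rough outline of the collection workflow is shown below; the exact options and output file names vary between versions, so verify them against the utility reference before relying on them:

# start collecting memory data on all hosts (run from the master as gpadmin)
gpmemwatcher

# after the error reoccurs, stop the collection and package the results
gpmemwatcher --stop

# summarize a collected file for one host (file name depends on the version)
gpmemreport <hostname_file>.gz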
e. Set the "runaway_detector_activation_percent" parameter to limit excessive usage of gp_vmem_protect_limit. When the threshold is crossed, the queries consuming the most memory are canceled at runtime; review these queries to identify the cause of the high memory usage. If a query still reaches the limit, an error similar to the one discussed here is produced.
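For example, to have the runaway detector begin terminating the top memory consumers once 90% of gp_vmem_protect_limit is in use (90 is only an illustrative value; a reload with gpstop -u is normally sufficient, but confirm against the documentation for your version):

gpconfig -c runaway_detector_activation_percent -v 90
gpstop -u
gpconfig -s runaway_detector_activation_percent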
Please refer to the documentation for more details.
Based on the analysis, there are a few courses of action.
1. Some queries are badly written or lack up-to-date statistics and require tuning. Once GPDB best practices are applied to such a query, it should run successfully.
2. All queries were tuned, but concurrency is the issue.
In this case, the options come down to managing the workload, for example by limiting how many memory-intensive queries run at the same time or by spreading scheduled loads apart; one way to limit concurrency is sketched below.
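If resource queues are in use, one way to cap concurrency is to limit the number of active statements for the roles that submit the heavy queries; the queue name, limits, and role below are purely illustrative:

CREATE RESOURCE QUEUE adhoc_queue WITH (ACTIVE_STATEMENTS=5, MEMORY_LIMIT='4000MB');
ALTER ROLE report_user RESOURCE QUEUE adhoc_queue;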
3. If unexpected behaviour is encountered, please open a ticket with our support team.