A query fails, producing the error message: "VM Protect failed to allocate %d bytes, %d MB available".
This document explains the error message and some basic troubleshooting steps that can be taken.
The gp_vmem_protect_limit server configuration parameter limits the amount of memory that queries can allocate on each segment. If a memory allocation for a query would cause gp_vmem_protect_limit to be exceeded, the query is canceled.
Note: The error message will be recorded in the database logs.
This is a self-defense mechanism introduced to prevent Greenplum Database (GPDB) queries from over-consuming memory.
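To see the limit currently in effect, the parameter can be checked from the master host with gpconfig (the exact output format depends on your GPDB version):

gpconfig -s gp_vmem_protect_limit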
Note: Always validate that the tables are vacuumed and analyzed.
Queries running on a system can produce this error for multiple reasons. First, try to determine whether or not the issue is attributable to concurrency.
1. Analyze the master and segment logs for the relevant time window to isolate whether the issue is specific to one query or whether most queries failed with the same error message.
egrep "Out of memory|VM Protect" gpdb-YYYY-MM-DD_nnnnnn.log
2. If only a limited number of queries failed with the error message, identify and analyze those queries. Execute them on the database to determine whether the error is reproducible.
If the issue is reproducible, work through the following checks to identify why the query consumes such a high amount of memory.
a. Verify that all the tables used in the query are analyzed.
b. Review the explain plan of the query for any abnormalities, specifically:
Nested Loops: If the explain plan shows that the query is using nested loop joins, try disabling them to force a hash join.
set enable_nestloop=off ; explain analyze <query>;
Join order: You can also force the planner to follow the join order written in the query by lowering the collapse limits:
set from_collapse_limit=1; set join_collapse_limit=1;
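As a minimal sketch of these checks in a psql session (the schema, table, and query names are placeholders, and gp_toolkit.gp_stats_missing is just one convenient way to find tables without statistics):

-- list tables for which the optimizer has no statistics
SELECT * FROM gp_toolkit.gp_stats_missing;

-- refresh statistics on the tables used by the failing query
ANALYZE myschema.fact_table;
ANALYZE myschema.dim_table;

-- re-run the query with nested loops disabled and join reordering limited
SET enable_nestloop = off;
SET from_collapse_limit = 1;
SET join_collapse_limit = 1;
EXPLAIN ANALYZE <query>;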
3. If several queries failed, or the issue could not be reproduced, a more extensive exercise is required to identify the problem.
In most cases, higher query concurrency is the issue; in other words, the number of queries running simultaneously on the system is high enough that their combined memory requirements hit or exceed gp_vmem_protect_limit. In such cases, follow the instructions below:
a. Consult with end users to determine whether "ad-hoc" queries are allowed on the system or only "scheduled" loads run on it. If "ad-hoc" queries run on the system, there is a high chance that a user triggered a bad query that caused the issue, and it is hard to predict when it will reoccur. If only "scheduled" loads run on the system, the issue will usually reoccur around the same fixed time.
b. Execute the gpmemwatcher and gpmemreport utilities to track the memory consumption of the queries on the segments.
c. Set gp_max_plan_size to avoid queries with excessively large plans running the system out of memory. To start, set it to 200MB; later, review whether it needs to be increased or decreased.
gpconfig -c gp_max_plan_size -v "200MB"
gpstop -u
gpconfig -s gp_max_plan_size
Note: The amount of memory required to handle plans is not accounted for by the gp_vmem_protect_limit.
d. Once the issue reoccurs while gpmemwatcher is running in the background, stop gpmemwatcher and review the collected data with the gpmemreport utility.
Based on the output of gpmemreport, decide whether the memory consumption of the queries is acceptable. Determine whether the issue is concurrency or whether only a few queries occupy high amounts of memory and need to be evaluated for further tuning.
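A rough outline of the collection workflow is shown below; the exact options and output file names vary between versions, so verify them against the utility reference before relying on them:

# start collecting memory data on all hosts (run from the master as gpadmin)
gpmemwatcher

# after the error reoccurs, stop the collection and package the results
gpmemwatcher --stop

# summarize a collected file for one host (file name depends on the version)
gpmemreport <hostname_file>.gz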
e. Set the "runaway_detector_activation_percent" parameter to limit excessive usage of gp_vmem_protect_limit. When the threshold is crossed, the queries consuming the most memory are canceled at runtime; review these queries to identify the cause of the high memory usage. If a query still reaches the limit, an error similar to the one discussed here is produced.
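For example, to have the runaway detector begin terminating the top memory consumers once 90% of gp_vmem_protect_limit is in use (90 is only an illustrative value; a reload with gpstop -u is normally sufficient, but confirm against the documentation for your version):

gpconfig -c runaway_detector_activation_percent -v 90
gpstop -u
gpconfig -s runaway_detector_activation_percent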
Please refer to the documentation for more details.
Based on the analysis, there are a few courses of action.
1. Some queries are badly written or lack up-to-date statistics and require tuning. Once GPDB best practices are applied to such a query, it should run successfully.
2. All queries were tuned, but concurrency is the issue.
In this case, the options come down to managing the workload, for example by limiting how many memory-intensive queries run at the same time or by spreading scheduled loads apart; one way to limit concurrency is sketched below.
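If resource queues are in use, one way to cap concurrency is to limit the number of active statements for the roles that submit the heavy queries; the queue name, limits, and role below are purely illustrative:

CREATE RESOURCE QUEUE adhoc_queue WITH (ACTIVE_STATEMENTS=5, MEMORY_LIMIT='4000MB');
ALTER ROLE report_user RESOURCE QUEUE adhoc_queue;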
3. If unexpected behaviour is encountered, please open a ticket with our support team.