Queries failed due to the following error message:
"VM protect failed to allocate %d bytes from the system, VM Protect %d MB available"
What does the error message mean and where can I check and understand the reason for the error?
What does "VM protect failed to allocate %d bytes from the system, VM Protect %d MB available" mean?
The amount of virtual memory used by Greenplum queries is less than the limit defined by gp_vmem_protect_limit.
The error occurs when an attempt is made to allocate memory for a query and the operating system fails to provide it, even though the memory used by Greenplum after the allocation would still be below the limit defined by gp_vmem_protect_limit.
This suggests that the system does not have enough memory for Greenplum queries. Possible causes include misconfiguration, high memory usage on the servers by non-GPDB processes, or some unaccounted-for memory-consuming process.
Because the system runs out of memory, the database segment will PANIC and go into recovery mode to recover the failed transaction. This will cause all existing queries to fail.
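To confirm which limit is currently in effect, you can check the setting from a psql session. The database name below comes from the examples later in this article, and the value returned is purely illustrative (gp_vmem_protect_limit is reported in MB):
warehouse=# show gp_vmem_protect_limit;
 gp_vmem_protect_limit
-----------------------
 8192
(1 row)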
Are there statements or executions that can't be controlled via the resource queue, gp_vmem_protect_limit, or any other parameters?
Yes, there are certain memory allocations made for query execution that are not controlled by resource queues, gp_vmem_protect_limit, or statement_mem. These can consume memory directly from the OS. Such allocations primarily include the memory used to hold the generated plan that is dispatched for the query.
To find the amount of memory consumed by a query plan, set the gp_log_gang GUC to debug and review the logs:
warehouse=# set gp_log_gang=debug;
SET
warehouse=# explain analyze select * from abc t1, abc t2 where t1.a=t2.a;
After you run the query above, the logs will include an entry similar to the one below:
2015-03-11 06:03:20.288583 PDT,"bhuvnesh","warehouse",p5919,th-1598946860,"[local]",,2015-03-11 06:03:07 PDT,13258,con14,cmd3,seg-1,,dx52,x13258,sx1,"LOG","00000","Query plan size to dispatch: 3KB",,,,,,"explain analyze select * from abc t1, abc t2 where t1.a=t2.a ;",0,,"cdbdisp.c",4013,
The maximum allowed plan size is controlled by the GUC gp_max_plan_size. If any query's uncompressed plan size exceeds gp_max_plan_size, the query will not be dispatched to the segments or executed, and an error similar to the one below will be reported:
warehouse=# set gp_max_plan_size='2kB';
SET
warehouse=# explain analyze select * from abc t1, abc t2 where t1.a=t2.a;
ERROR: Query plan size limit exceeded, current size: 3KB, max allowed size: 2KB
HINT: Size controlled by gp_max_plan_size
Depending on the average size of the plans used by queries on the database, you can set gp_max_plan_size to a value that identifies any queries whose plan size could have a heavy memory impact.
Keep this aspect in mind during analysis; it is often a good idea to first rule out plan size as the cause of the OOM.
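As a sketch only (the 200MB value is an illustrative assumption, not a recommendation), such a limit can also be applied beyond a single session, for example as a database-level default, so that unusually large plans are flagged for every session connecting to that database:
warehouse=# alter database warehouse set gp_max_plan_size = '200MB';
ALTER DATABASE
New sessions pick up this default; it can be removed again with ALTER DATABASE warehouse RESET gp_max_plan_size.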
Some basic troubleshooting steps
There are several factors which can result in OS OOM issues.
Refer to the list below for a series of checks that need to be done while troubleshooting.
1. Validate the value of gp_vmem_protect_limit. The maximum value of gp_vmem_protect_limit must be calculated as described in the Greenplum documentation.
Example:
(SWAP + (RAM * vm.overcommit_ratio)) * 0.9 / number_of_primary_segments_per_server
Please note that in the above example, a multiplication factor of 0.9 is used, which leaves 10% of virtual memory available for other processes (such as the kernel, segment mirrors, etc.).
However, if there are several non-GPDB processes, for example Java programs or other memory-consuming processes, reduce 0.9 to a lower value to leave more memory for them.
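For illustration only (these numbers are assumptions, not taken from this article): on a segment host with 64 GB of RAM, 64 GB of swap, vm.overcommit_ratio set to 50 (i.e. 0.5), and 8 primary segments, the formula gives (64 + (64 * 0.5)) * 0.9 / 8 = 10.8 GB, so gp_vmem_protect_limit would be set to roughly 11000 MB per segment. If non-GPDB processes also run on the host, using a factor lower than 0.9 reduces this value accordingly.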
2. If the number of active primaries on a segment server is higher than the default number of primaries and gp_vmem_protect_limit is configured using a multiplication factor of 0.9, the server will be short on memory and might not be able to provide the required gp_vmem_protect_limit to all segments because of temporarily activated primaries (for example, when some primaries failed and their mirrors are now acting as primaries on that server).
You can check gp_segment_configuration and gp_configuration_history to review the status of the segments around the time of the OOM (see the gp_configuration_history sketch at the end of this step).
select * from gp_segment_configuration where role<>preferred_role;
In this case, the OOM is expected, and the immediate resolution is to recover the failed segments and bring them back to their preferred roles.
If the servers have large amounts of memory, such as 128 GB, 256 GB, or 512 GB of RAM or more, you can revisit the value of gp_vmem_protect_limit to accommodate these situations.
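As referenced above, a sketch of a query against gp_configuration_history to review segment role changes (the one-day window is only an assumption; adjust it to cover the time of the OOM):
select * from gp_configuration_history
where time > now() - interval '1 day'
order by time desc;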
3. Verify whether other non-GPDB processes are running on the master or segment servers. Memory-consuming processes (for example, Java programs) might have taken up memory, causing GPDB queries to fail. However, you will not be able to tell whether anything else was running when the OOM failures were observed, because these processes are not recorded in the database logs. The users who administer the system may have information about which processes were running; otherwise, monitoring needs to be scheduled.
You can use the ps or top commands to check which processes are occupying the most memory.
Example commands are listed below:
ps -eo pid,pmem,rss,vsz,cmd --sort=vsz
top (then press F, then q, then Enter; this sorts the output by RSS memory utilization in descending order)
4. Using the sar -r output, verify the amount of free memory around the time the OOM was reported. sar samples memory every 10 minutes, so this information may not have been captured if the memory consumption was only momentarily high between samples.
sar -r -f /var/log/sa/saXX
(where XX is the day of the month; by default sar creates a file under /var/log/sa for each day, e.g. sa05, sa06, etc. You may be able to see files for the last month.)
The output will be similar to the following:
12:00:01 AM kbmemfree kbmemused %memused kbbuffers  kbcached  kbcommit  %commit
12:10:01 AM  33650484  32235996    48.93    207844  25584116  11850416    11.92
In the output above, kbmemfree is the amount of free memory in KB at 12:10:01 AM. This output does not include swap space.
Based on the analysis, one of the following conclusions is likely:
- gp_vmem_protect_limit needs to be revisited.
- Based on memory monitoring (for example, mem_watcher output), gp_vmem_protect_limit is not respected due to unaccounted processes during query processing that consume OS memory, resulting in the OOM. In this case, the issue should be reported to support.