Difference in Diego Cell memory usage reported between Ops Manager and cfdot

Article ID: 293859

Updated On:

Products

Operations Manager

Issue/Introduction

Sometimes, in the VM status view in Operations Manager (Ops Manager), one or more Diego Cells report memory usage of 85% or above, while others reach no more than 20%.

However, other metrics show that the loaded cells are hosting 6 Java VMs (JVMs) with an average size of 2 GB, and cfdot reports that 48 GB of the 64 GB on the Diego Cell is free.

The combined size of the JVMs is about 8 GB.

Why does the Diego Cell report 85% utilization? 

Running ps -ef on the Diego Cell provides no insight into what is causing the problem. We have 16 Diego Cells that each host about 6 JVMs but report over 70% utilization.

The remaining 40 Diego Cells report less than 25% utilization.

We currently use the memory value in Ops Manager to decide whether we need more or fewer Diego Cells, so this value is critical to managing our Diego Cell estate. At the moment, it does not match our understanding of the actual usage.

Environment

Product Version: 2.10

Resolution

Note: Ops Manager VM status metrics are intended for observation and troubleshooting. They are not intended for scaling the environment.

For information about how to identify signs of memory contention, refer to Key Performance Indicators.

To analyze the memory consumption further, follow these steps.

1. Run this command to calculate the memory used by all running processes, in GB (it sums the RSS column of ps, which is reported in KB):

ps aux | awk 'BEGIN {sum=0} {sum +=$6} END {print sum/1024/1024}'


2. To get the used memory for all Diego Cells, run this command:

bosh -d cf-<ID> ssh diego_cell -c 'ps aux | awk '\''BEGIN {sum=0} {sum +=$6} END {print sum/1024/1024}'\''' |  grep -vE 'Unauthorized use is strictly prohibited|is subject to logging and monitoring|Connection to.*closed'


3. The output of these commands is the sum of the memory used by all running processes. Check whether this value differs from the used memory reported by these commands:

free -g

cat /proc/meminfo


If the two values differ, some memory might be in use as cache. In vSphere environments, ballooning might also be in effect. To check the balloon size on all Diego Cells, run this command:

bosh -d cf-<ID> ssh diego_cell -c 'vmware-toolbox-cmd stat balloon' | grep -vE 'Unauthorized use is strictly prohibited|is subject to logging and monitoring|Connection to.*closed'


4. To verify which process is utilizing the most memory, run this command:

ps aux --sort=-%mem | head


5. For more detailed information about processes and threads, run the following commands to check their CPU and memory usage:

ps aux 
ps axjfww  
ps -o pid,user,%mem,command ax | sort -b -k3 -r

pmap <PID>
pmap <PID> | tail -n 1

You can also pass several PIDs to pmap to list how much memory is used by multiple processes:

sudo pmap <PID1> <PID2> | grep total
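The comparison in steps 1 to 3 can be sketched as a small script. This is only an illustration, assuming a Linux host with procps and awk; the function names process_rss_gib and meminfo_summary are made up for this example and are not part of any product tooling. Treating Buffers plus Cached as "cache" is an approximation of what free reports.

```shell
#!/usr/bin/env bash
# Sketch: compare the memory attributed to processes with what the kernel
# reports. Function names are illustrative, not part of any product.

# Sum the resident set size (RSS, reported in KiB) of all processes, in GiB.
# Shared pages are counted once per process, so this can overestimate.
process_rss_gib() {
  ps -eo rss= | awk '{sum += $1} END {printf "%.2f\n", sum / 1024 / 1024}'
}

# Summarize a meminfo-format file (default: /proc/meminfo) in GiB.
# "cache" here is Buffers + Cached, which is an approximation.
meminfo_summary() {
  awk '
    /^MemTotal:/     {total = $2}
    /^MemFree:/      {memfree = $2}
    /^MemAvailable:/ {avail = $2}
    /^Buffers:/      {buffers = $2}
    /^Cached:/       {cached = $2}
    END {
      gib = 1024 * 1024
      printf "total_gib=%.2f\n", total / gib
      printf "free_gib=%.2f\n", memfree / gib
      printf "cache_gib=%.2f\n", (buffers + cached) / gib
      printf "used_excl_cache_gib=%.2f\n", (total - memfree - buffers - cached) / gib
      printf "available_gib=%.2f\n", avail / gib
    }
  ' "${1:-/proc/meminfo}"
}
```

On a Diego Cell, running process_rss_gib followed by meminfo_summary gives both sides of the comparison. If used_excl_cache_gib is much larger than the process total, the difference is not held by cache or by any visible process, which is the situation the ballooning check is meant to explain.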

 


Ballooning

The main purpose of Diego Cells is to run the applications pushed with Cloud Foundry (CF).

Without these app processes, the overall memory consumption should stay close to the minimum. However, if the hypervisor needs to reclaim memory due to contention, the IaaS might reclaim it by ballooning.

Ballooning is performed by a driver that, when ESXi host memory utilization rises above 90%, starts requesting memory from the guest system, up to 65% of its memory capacity. When needed, the balloon will "pop", releasing the memory and allowing the hypervisor to reclaim it.

This process is not triggered unless a memory contention event is present. As a result, it can leave signs of memory usage even if no real process is using this memory.
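To see whether ballooning accounts for the gap between used memory and process memory, the balloon size can be subtracted from the difference. The following is a minimal sketch, assuming that vmware-toolbox-cmd stat balloon prints a value such as "40960 MB" (verify the format on your own cells). The function name balloon_gap and the numbers in the usage note are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: check how much of the "used" memory is explained by the balloon.
# Usage: balloon_gap <used_gib> <process_rss_gib> "<balloon output>"
# Assumption: the balloon output starts with a size in MB, e.g. "40960 MB".
balloon_gap() {
  awk -v used="$1" -v procs="$2" -v balloon="$3" 'BEGIN {
    # Adding 0 converts the leading number of "40960 MB" to 40960.
    balloon_gib = (balloon + 0) / 1024
    printf "balloon_gib=%.2f\n", balloon_gib
    # Memory that neither processes nor the balloon account for.
    printf "unexplained_gib=%.2f\n", used - procs - balloon_gib
  }'
}
```

For example, balloon_gap 54 12 "40960 MB" reports a 40 GiB balloon and 2 GiB left unexplained. On a live cell, the third argument could be "$(vmware-toolbox-cmd stat balloon)". A large balloon with a small unexplained remainder suggests that the reported utilization comes from hypervisor reclamation rather than from application processes.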