Examples of performance issues include, but are not limited to:
- Needing to restart CDP frequently for any reason
- High resource utilization – Memory / CPU / Disk IO
- System performance is slow (slow response time)
- System outage (CDP shuts down unexpectedly)
Data to collect:
- Infrastructure and Environment configuration
- Network Topology (Number of forward proxies, number of reverse proxies, load balancers, etc.)
- Services running on each node (Container, MTA, MC, Cluster Manager)
- Confirm that they are running one node with MC and container only
- Confirm they are running one reverse proxy on a separate node
- Server roles for containers: Traffic handling / generic / background service
- Get the arguments files and properties files for all services
Management Console screenshots:
- Proxy module configuration page (Server and Client tabs)
- Instance configuration page (take multiple screenshots to capture whole page)
- Search Engine page
- Job Management page
JMX Data: Make sure JMX Monitoring script is running continuously in the background and capturing data properly. Provide the csv data file.
- Confirm script is running and collecting data by doing a “tail -f <hostname>.txt”. Make sure it is writing data every minute and the data looks good (not zeroes, not JMXNOK).
- Provide the <hostname>.txt
- If not already present, add the following line to the container arguments file: -XX:+HeapDumpOnOutOfMemoryError and restart the container
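The JMX file check described above can be scripted. The following is a minimal sketch, not part of the monitoring script itself. It assumes the data file is plain text with one sample per line and that a failed sample contains the literal string JMXNOK. The function name check_jmx_file and the 2-minute staleness threshold are illustrative choices. It does not detect all-zero samples, so a visual check with tail is still needed.

```shell
# Sketch: sanity-check the JMX monitoring data file.
# Assumptions (hypothetical, adjust to the real file format):
#   - one sample per line, written about once a minute
#   - a failed sample contains the literal string JMXNOK
check_jmx_file() {
    local file="$1"

    # The file must exist and contain data.
    if [ ! -s "$file" ]; then
        echo "MISSING_OR_EMPTY"
        return 1
    fi

    # The file should have been modified within the last 2 minutes.
    if [ -z "$(find "$file" -mmin -2)" ]; then
        echo "STALE"
        return 1
    fi

    # None of the 10 most recent samples should report JMXNOK.
    if tail -n 10 "$file" | grep -q "JMXNOK"; then
        echo "JMXNOK"
        return 1
    fi

    echo "OK"
}
```

Example: check_jmx_file "$(hostname).txt" prints OK, STALE, JMXNOK, or MISSING_OR_EMPTY, assuming the file is named after the host as described above.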
Logs: Zip up the entire logs directory on each node, containing the container, access, cassandra, and console logs
- Output of top command
- Output of free -h
- Output of ulimit -a
- Thread Dump: Create a thread dump by issuing the following command: kill -3 <pid of container> (this sends SIGQUIT; the JVM writes the thread dump to its standard output and keeps running)
- Create a thread dump every 4 hours for 24 hours, and during a system performance event, if possible
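The thread dump schedule above can be automated with a small loop. This is a sketch only: the function name collect_thread_dumps and its parameters are hypothetical, and the default of 7 dumps assumes that "every 4 hours for 24 hours" means dumps at hours 0, 4, ..., 24. Where the dump text ends up depends on where the container's standard output is redirected, which is not specified here.

```shell
# Sketch: request a JVM thread dump at a fixed interval.
#   $1 = pid of the container process
#   $2 = seconds between dumps (default 14400 = 4 hours)
#   $3 = number of dumps (default 7, an assumption: hours 0, 4, ..., 24)
collect_thread_dumps() {
    local pid="$1"
    local interval_secs="${2:-14400}"
    local count="${3:-7}"
    local i=1

    while [ "$i" -le "$count" ]; do
        # kill -3 sends SIGQUIT; the JVM prints a thread dump to its
        # standard output and continues running.
        kill -3 "$pid" || return 1
        echo "thread dump $i of $count requested at $(date '+%F %T')"

        # Wait between dumps, but not after the last one.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval_secs"
        fi
        i=$((i + 1))
    done
}
```

Example: nohup-style background use would be collect_thread_dumps <pid of container> & so the loop keeps running for the full 24 hours; for an ad hoc dump during a performance event, run kill -3 <pid of container> directly.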
Heap Dumps: There are two types of heap dumps – one that contains everything currently in the heap, and one that runs garbage collection first and then dumps only the live objects. Both are useful.
- Command for normal heap dump: jmap -dump:format=b,file=container.hprof <pid of container>
- Command for garbage collected heap dump: jmap -dump:live,format=b,file=container_live.hprof <pid of container>
- Please run both commands during normal operations to get a baseline, and then try to run both commands during a system performance event, if possible
- The “HeapDumpOnOutOfMemoryError” argument automatically creates a heap dump on out of memory error. Grab this as well.
- These memory dumps will be called something like “java_pid3756.hprof” located in the container folder. Any new ones created will have a recent timestamp.
- Make sure there is enough disk space for all of this
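The two jmap commands and the disk space check can be combined as sketched below. The function names (free_kb, has_free_kb, take_heap_dumps) and the timestamped file names are illustrative. The required free space is left to the caller; a reasonable assumption is at least twice the container's maximum heap size, since two dumps are written and each can approach the size of the heap.

```shell
# Sketch: take both heap dumps only if the target directory has enough room.

# Print the free space, in KB, on the filesystem that holds directory $1.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed if directory $1 has at least $2 KB free.
has_free_kb() {
    [ "$(free_kb "$1")" -ge "$2" ]
}

#   $1 = pid of the container process
#   $2 = output directory
#   $3 = minimum free space required, in KB (caller's estimate)
take_heap_dumps() {
    local pid="$1"
    local outdir="$2"
    local need_kb="$3"
    local stamp
    stamp=$(date '+%Y%m%d_%H%M%S')

    if ! has_free_kb "$outdir" "$need_kb"; then
        echo "not enough free space in $outdir" >&2
        return 1
    fi

    # Normal heap dump: everything currently in the heap.
    jmap -dump:format=b,file="$outdir/container_$stamp.hprof" "$pid"
    # Live heap dump: only reachable objects, taken after a garbage collection.
    jmap -dump:live,format=b,file="$outdir/container_live_$stamp.hprof" "$pid"
}
```

Example: take_heap_dumps <pid of container> /path/with/space 16777216 requires 16 GB free before dumping. Run jmap as the same user that owns the container process.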
As for memory configuration, the only change we recommend is to reduce the memory allocated to Cassandra from 8 GB to 4 GB. Here is how to do that:
- In the cassandra/conf/cassandra-env.sh file there are two options, commented out by default:
#MAX_HEAP_SIZE="4G"
#HEAP_NEWSIZE="800M"
- Uncomment these and change them to the following:
MAX_HEAP_SIZE="4G"
HEAP_NEWSIZE="1G"
- Restart Cassandra. Cassandra could take up to 2 hours to start, so this should be done during a maintenance window
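The edit to cassandra-env.sh can also be made with sed, as in the sketch below. The function name set_cassandra_heap is hypothetical, and the patterns assume the two options start at the beginning of a line, either commented out with # or already uncommented. A backup copy is written next to the file with a .bak suffix.

```shell
# Sketch: uncomment and set the two heap options in cassandra-env.sh.
#   $1 = path to cassandra-env.sh
#   $2 = value for MAX_HEAP_SIZE (for example 4G)
#   $3 = value for HEAP_NEWSIZE (for example 1G)
set_cassandra_heap() {
    local file="$1"
    local max_heap="$2"
    local new_size="$3"

    # -i.bak edits in place and keeps the original as <file>.bak.
    # ^#* matches the option whether or not it is commented out.
    sed -i.bak \
        -e "s/^#*MAX_HEAP_SIZE=.*/MAX_HEAP_SIZE=\"$max_heap\"/" \
        -e "s/^#*HEAP_NEWSIZE=.*/HEAP_NEWSIZE=\"$new_size\"/" \
        "$file"
}
```

Example: set_cassandra_heap cassandra/conf/cassandra-env.sh 4G 1G, then confirm with grep -n 'HEAP' cassandra/conf/cassandra-env.sh before restarting. The values must be wrapped in plain ASCII double quotes; typographic quotes pasted from a document will be read by the shell as part of the value.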