Connection problems between the Data Aggregator (DA) and the Data Repository (DR/Vertica) occur repeatedly:
ERROR | DataAggregator | 2024-06-11T09:53:37,171 | ConnectionPool | .tomcat.jdbc.pool.ConnectionPool 500 | org.apache.tomcat.jdbc | | Unable to create initial connections of pool.
java.sql.SQLNonTransientConnectionException: [Vertica][VJDBC](100176) Failed to connect to host DR2 on port 5433. Reason: Failed to establish a connection to the primary server or any backup address.
Errors similar to the following are also seen:
ERROR | heduler_Worker-1 | 2024-06-04T02:05:03,836 | ExceptionLog | .ca.im.core.util.ExceptionLogger 104 | m.ca.im.common.core.util | | An existing application exception RECURRED (Key=9afb25a97eae6f0ca1cde7e7da2851fac158ae39), Recurrence count=14 : Failed to drop partition(s) for table: ifstats_rate, min key: 1712880000, max key: 1712880000 : StatementCallback; uncategorized SQLException for SQL [SELECT DROP_PARTITIONS('ifstats_rate','1712880000','1712880000',true)]; SQL state [55V03]; error code [5157]; [Vertica][VJDBC](5157) ERROR: Unavailable: [Txn 0xb000000254caaa] O lock table - timeout error Timed out O locking Table:dragg.ifstats_rate. I held by [user dragg (COPY ifstats_rate (dcm_id, pollgroup_id, item_id, rollup_type, tstamp, dto_sequence_id, rinterval, duration, thresh_duration, thresh_count, im_IPSecEncodingFailures, im_SAPolicyDenialPacketsDropped, im_FramesOut, std_im_FramesOut, im_Bytes, im_PctDiscardsIn, std_im_PctDiscardsIn, im_FrameErrors, im_Discards, std_im_Discards, im_AvgInboundOctetRateforInterface, im_CellErrorRatio, im_NoAssocSAPolicyPacketsDr
and
ERROR | c53-fb2bc9f6adef | 2024-06-11T09:40:14,706 | RunnableTimedExecution | oncurrent.RunnableTimedExecution 101 | emini.blueprint.extender | | Closing runnable for context OsgiBundleXmlApplicationContext(bundle=com.ca.im.data-mgmt.common, config=osgibundle:/META-INF/spring/*.xml) did not finish in 10000ms; consider taking a snapshot and then shutdown the VM in case the thread still hangs
This results in data gaps in some dashboards because the related poll data cannot be processed.
DX NetOps CAPM, all currently supported releases.
The DR likely has inadequate resources and is not configured for fault tolerance, in which case the following is seen in vertica.log:
2024-07-12 08:00:00.132 Init Session:0x7fe339fd0700-a0000012e32772 <WARNING> @v_drdata_node0001: V1002/2957: Current system KSAFE level is not fault tolerant
This shows there are only 2 nodes in the system.
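As a quick check, vertica.log can be scanned for this warning. The minimal Python sketch below uses the sample log line from above; in practice you would read the real vertica.log from your DR node's catalog directory (the path varies by install, so it is left as an assumption to fill in):

```python
import re

# Sample line taken from the vertica.log excerpt above; in practice, read the
# real file, e.g. open("<catalog_dir>/vertica.log") on the DR node.
log_lines = [
    "2024-07-12 08:00:00.132 Init Session:0x7fe339fd0700-a0000012e32772 "
    "<WARNING> @v_drdata_node0001: V1002/2957: Current system KSAFE level "
    "is not fault tolerant",
]

pattern = re.compile(r"Current system KSAFE level is not fault tolerant")
hits = [line for line in log_lines if pattern.search(line)]

# Any hits mean the cluster is currently running without fault tolerance.
print(len(hits))  # → 1 for the sample line above
```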
TechDocs : DX NetOps CAPM 23.3 : Add a Node to the Data Repository Cluster
There is no failover (K-safety) with a 2-node DR, per the following settings:
- 1 or 2 nodes (K-safety 0): there is no copy of the data
- 3 or 4 nodes (K-safety 1): there is 1 copy of the data
- 5 or more nodes (K-safety 2): there are 2 copies of the data
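The node-count table above can be expressed as a small helper; this is just a sketch of that mapping (the function name is illustrative, not a Vertica API):

```python
def max_k_safety(nodes: int) -> int:
    """Highest K-safety a cluster of this size supports,
    per the node-count table above."""
    if nodes <= 2:
        return 0  # no redundant copy of the data
    if nodes <= 4:
        return 1  # one redundant copy
    return 2      # two redundant copies (Vertica's maximum)

for n in (2, 3, 5):
    print(f"{n} nodes -> K-safety {max_k_safety(n)}")
```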
So with only 2 nodes, if one goes down there is no possibility of recovery: there is no second copy of the data, and you cannot even switch to the other node manually.
Also, commands may not complete fast enough to release the locks shared between queries and tasks.
Adding disk space only provides more room to store data and to perform temporary work that cannot be done in memory.
Adding CPUs (or cores) does not speed things up or really allow more work to be done. All cores share the same memory and disk resources, both of which are affected by the number of consumers (such as items, threshold calls, etc.), so adding CPUs just puts a bigger load on the limited resources.
You can run the following in Vertica (/opt/vertica/bin/adminTools -> Connect to DB):
SELECT node_name, storage_path, storage_status, storage_usage, disk_space_free_mb, disk_space_free_percent FROM disk_storage;
This shows the disk usage of the data and catalog partitions. Vertica recommends always keeping at least 40% free disk space, which is used for temporary operations such as loads and deletes. It also leaves enough free space for a spill-to-disk operation if a query runs out of available RAM during execution.
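As a sketch of applying the 40% rule to that query's output, the snippet below checks the disk_space_free_percent column for each node. The row values and node names are illustrative only; in practice they would come from the disk_storage query above:

```python
# Illustrative rows mimicking the disk_storage query output; real values
# would come from running the SELECT against the Data Repository.
rows = [
    {"node_name": "v_drdata_node0001", "disk_space_free_percent": 55},
    {"node_name": "v_drdata_node0002", "disk_space_free_percent": 31},
]

MIN_FREE_PERCENT = 40  # Vertica's recommended minimum free disk space

# Flag any node below the recommendation.
low = [r for r in rows if r["disk_space_free_percent"] < MIN_FREE_PERCENT]
for r in low:
    print(f"{r['node_name']}: only {r['disk_space_free_percent']}% free "
          f"(below the {MIN_FREE_PERCENT}% recommendation)")
```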