Connection problems between the Data Aggregator (DA) and the Data Repository (DR/Vertica) occur repeatedly:
ERROR | DataAggregator | 2024-06-11T09:53:37,171 | ConnectionPool | .tomcat.jdbc.pool.ConnectionPool 500 | org.apache.tomcat.jdbc | | Unable to create initial connections of pool.
java.sql.SQLNonTransientConnectionException: [Vertica][VJDBC](100176) Failed to connect to host DR2 on port 5433. Reason: Failed to establish a connection to the primary server or any backup address.
Errors similar to the following are also seen:
ERROR | heduler_Worker-1 | 2024-06-04T02:05:03,836 | ExceptionLog | .ca.im.core.util.ExceptionLogger 104 | m.ca.im.common.core.util | | An existing application exception RECURRED (Key=9afb25a97eae6f0ca1cde7e7da2851fac158ae39), Recurrence count=14 : Failed to drop partition(s) for table: ifstats_rate, min key: 1712880000, max key: 1712880000 : StatementCallback; uncategorized SQLException for SQL [SELECT DROP_PARTITIONS('ifstats_rate','1712880000','1712880000',true)]; SQL state [55V03]; error code [5157]; [Vertica][VJDBC](5157) ERROR: Unavailable: [Txn 0xb000000254caaa] O lock table - timeout error Timed out O locking Table:dragg.ifstats_rate. I held by [user dragg (COPY ifstats_rate (dcm_id, pollgroup_id, item_id, rollup_type, tstamp, dto_sequence_id, rinterval, duration, thresh_duration, thresh_count, im_IPSecEncodingFailures, im_SAPolicyDenialPacketsDropped, im_FramesOut, std_im_FramesOut, im_Bytes, im_PctDiscardsIn, std_im_PctDiscardsIn, im_FrameErrors, im_Discards, std_im_Discards, im_AvgInboundOctetRateforInterface, im_CellErrorRatio, im_NoAssocSAPolicyPacketsDr
and
ERROR | c53-fb2bc9f6adef | 2024-06-11T09:40:14,706 | RunnableTimedExecution | oncurrent.RunnableTimedExecution 101 | emini.blueprint.extender | | Closing runnable for context OsgiBundleXmlApplicationContext(bundle=com.ca.im.data-mgmt.common, config=osgibundle:/META-INF/spring/*.xml) did not finish in 10000ms; consider taking a snapshot and then shutdown the VM in case the thread still hangs
This results in data gaps in some dashboards because the related poll data cannot be processed.
DX NetOps CAPM, all currently supported releases.
The DR likely has inadequate resources and is not configured for fault tolerance, in which case the following is seen in vertica.log:
2024-07-12 08:00:00.132 Init Session:0x7fe339fd0700-a0000012e32772 <WARNING> @v_drdata_node0001: V1002/2957: Current system KSAFE level is not fault tolerant
This shows there are only 2 nodes in the system.
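As a quick check, vertica.log can be scanned for this warning. The minimal Python sketch below uses the sample log line from above; in practice you would read the real vertica.log from your DR node's catalog directory (the path varies by install, so it is left as an assumption to fill in):

```python
import re

# Sample line taken from the vertica.log excerpt above; in practice, read the
# real file, e.g. open("<catalog_dir>/vertica.log") on the DR node.
log_lines = [
    "2024-07-12 08:00:00.132 Init Session:0x7fe339fd0700-a0000012e32772 "
    "<WARNING> @v_drdata_node0001: V1002/2957: Current system KSAFE level "
    "is not fault tolerant",
]

pattern = re.compile(r"Current system KSAFE level is not fault tolerant")
hits = [line for line in log_lines if pattern.search(line)]

# Any hits mean the cluster is currently running without fault tolerance.
print(len(hits))  # → 1 for the sample line above
```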
TechDocs : DX NetOps CAPM 23.3 : Add a Node to the Data Repository Cluster
There is no failover (K-safety) with a 2-node DR, per the following settings:
- 1 or 2 nodes (K-safety 0): there is no copy of the data
- 3 or 4 nodes (K-safety 1): there is 1 copy of the data
- 5 or more nodes (K-safety 2): there are 2 copies of the data
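The node-count table above can be expressed as a small helper; this is just a sketch of that mapping (the function name is illustrative, not a Vertica API):

```python
def max_k_safety(nodes: int) -> int:
    """Highest K-safety a cluster of this size supports,
    per the node-count table above."""
    if nodes <= 2:
        return 0  # no redundant copy of the data
    if nodes <= 4:
        return 1  # one redundant copy
    return 2      # two redundant copies (Vertica's maximum)

for n in (2, 3, 5):
    print(f"{n} nodes -> K-safety {max_k_safety(n)}")
```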
So with only 2 nodes, if one goes down there is no possibility of recovery: there is no second copy of the data, and you cannot even switch to the other node manually.
Also, commands may not complete fast enough to release the locks shared between queries and tasks.
Adding disk space only provides more room to store data and to perform temporary work that cannot be done in memory.
Adding CPUs (or cores) does not speed things up or really allow more work to be done. All cores share the same memory and disk resources, both of which are affected by the number of consumers (such as items, threshold calls, etc.), so adding CPUs just puts a bigger load on the limited resources.
You can run the following in Vertica (/opt/vertica/bin/adminTools -> Connect to DB):
SELECT node_name, storage_path, storage_status, storage_usage, disk_space_free_mb, disk_space_free_percent FROM disk_storage;
This shows the disk usage of the data and catalog partitions. Vertica recommends always keeping at least 40% free disk space, which is used for temporary operations such as loads and deletes. It also leaves enough free space for a spill-to-disk operation if a query runs out of available RAM during execution.
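As a sketch of applying the 40% rule to that query's output, the snippet below checks the disk_space_free_percent column for each node. The row values and node names are illustrative only; in practice they would come from the disk_storage query above:

```python
# Illustrative rows mimicking the disk_storage query output; real values
# would come from running the SELECT against the Data Repository.
rows = [
    {"node_name": "v_drdata_node0001", "disk_space_free_percent": 55},
    {"node_name": "v_drdata_node0002", "disk_space_free_percent": 31},
]

MIN_FREE_PERCENT = 40  # Vertica's recommended minimum free disk space

# Flag any node below the recommendation.
low = [r for r in rows if r["disk_space_free_percent"] < MIN_FREE_PERCENT]
for r in low:
    print(f"{r['node_name']}: only {r['disk_space_free_percent']}% free "
          f"(below the {MIN_FREE_PERCENT}% recommendation)")
```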