The following errors appear in the Data Aggregator (DA) karaf.log file, found by default in /opt/IMDataAggregator/apache-karaf-2.4.3/data/log.
The first is a RIB Query Exception from Vertica that states:
We have seen in-house that slow disks can lead to stability issues with Vertica. When disk I/O performance is low, the following can be done to help minimize Vertica crashes and some resource-related errors during normal product operation.
The underlying concept is dirty pages. Linux uses memory to cache file writes: it stores them in the Page Cache before writing them to disk. Each cached page that has not yet been written to disk is called a dirty page.
There are settings that control when Linux starts writing this data to disk and when the cache is considered too full and must be flushed to disk. If the maximum is reached, all new writes are halted until the cache is flushed.
Depending on the size of the cache and the speed of the disks, this can take some time. We believe this is the issue that causes Vertica to crash.
If Linux is busy flushing dirty pages to disk and the flush takes more than 2 minutes, Vertica could crash.
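To see how much dirty data the Page Cache is holding at a given moment, the Dirty and Writeback counters in /proc/meminfo can be read. The following is a minimal Python sketch, not part of the product; the function name and the sample values are made up for illustration, and the sample text is only used when /proc/meminfo cannot be read.

```python
import re


def dirty_page_stats(meminfo_text):
    """Return the Dirty and Writeback counters (in kB) found in /proc/meminfo text."""
    stats = {}
    for line in meminfo_text.splitlines():
        # Match only the exact "Dirty:" and "Writeback:" rows (not "WritebackTmp:").
        match = re.match(r"^(Dirty|Writeback):\s+(\d+)\s+kB", line)
        if match:
            stats[match.group(1)] = int(match.group(2))
    return stats


# Illustrative sample only; these numbers are invented.
SAMPLE = """MemTotal:       65808340 kB
Dirty:            182340 kB
Writeback:          1024 kB
WritebackTmp:          0 kB
"""

if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as handle:
            text = handle.read()
    except OSError:
        text = SAMPLE  # fall back to the sample on non-Linux systems
    print(dirty_page_stats(text))
```

A Dirty value that stays high for long periods suggests the disks are not keeping up with the write load.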
The first Linux setting is vm.dirty_ratio. It is the ratio of Page Cache in memory compared to overall memory. When this ratio is reached, Linux halts all new write requests and flushes the Page Cache to disk. Out of the box (OOTB), this value is 20 (20% of overall memory).
The second Linux setting is vm.dirty_background_ratio. It is the ratio of the Page Cache at which Linux starts writing the data to disk in the background. OOTB, this value is 10 (10% of overall memory).
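To make the two percentages concrete, the sketch below converts them into approximate amounts of memory. It is an estimate under stated assumptions: the function name is hypothetical, the 64 GiB host is an example, and the calculation uses total memory as described above. The kernel actually applies the percentages to available (free plus reclaimable) memory, so the real thresholds are somewhat lower.

```python
def dirty_thresholds_kb(mem_total_kb, dirty_ratio=20, dirty_background_ratio=10):
    """Estimate the background and blocking dirty-page thresholds in kB.

    mem_total_kb would come from the MemTotal row of /proc/meminfo; the ratios
    are the values of vm.dirty_background_ratio and vm.dirty_ratio.
    """
    background_kb = mem_total_kb * dirty_background_ratio // 100
    blocking_kb = mem_total_kb * dirty_ratio // 100
    return background_kb, blocking_kb


if __name__ == "__main__":
    # Example host with 64 GiB of RAM (assumed value, for illustration).
    background, blocking = dirty_thresholds_kb(64 * 1024 * 1024)
    print("background flush starts near", background, "kB")
    print("writes are halted near", blocking, "kB")
```

On such a host with the OOTB values, roughly 12.8 GiB of dirty data could accumulate before writes are halted, which slow disks may need a long time to flush.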
All supported Performance Management releases
The optimal setting allows the Page Cache to be flushed from full to empty in 30 seconds, so that pending writes are not held up for more than 30 seconds.
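The 30-second target can be turned into a rough vm.dirty_ratio estimate once a sustained write throughput for the data disks is known. The sketch below is an illustration of that arithmetic only, not the documented procedure; the function name, the throughput figure and the memory size are assumptions.

```python
import math


def suggest_dirty_ratio(mem_total_mb, write_mb_per_sec, flush_seconds=30):
    """Estimate a vm.dirty_ratio whose full cache can be flushed in flush_seconds.

    write_mb_per_sec is a measured sustained write rate for the data disks
    (hypothetical input here). The result is clamped to the range 1..100.
    """
    # Amount of dirty data the disks can absorb within the target window.
    flushable_mb = write_mb_per_sec * flush_seconds
    ratio = math.floor(flushable_mb / mem_total_mb * 100)
    return max(1, min(ratio, 100))


if __name__ == "__main__":
    # Assumed example: 64 GiB of RAM and disks that sustain 200 MB/s of writes.
    print(suggest_dirty_ratio(65536, 200))
```

With these example inputs the disks can absorb about 6000 MB in 30 seconds, which is a little over 9% of memory, so the estimate rounds down to 9. vm.dirty_background_ratio would then be set lower than this value so that background flushing starts before writes are halted.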
To determine the best values for these settings, the vioperf script needs to be run against the Data Repository (DR) database data disks. From that data:
These changes may not fully eliminate the errors unless I/O is also improved for the data and catalog disks used by the Data Repository.