data_engine 9.02HF3 intermittently raising "Bulk load" alarms

Article ID: 134195

Products

DX Unified Infrastructure Management (Nimsoft / UIM)

Issue/Introduction

data_engine 9.02HF3 intermittently raises the following alarm, which is cleared 20 to 30 seconds later: "Bulk load data was expected but not sent. The batch will be terminated." At the same time, the data_engine memory consumption slowly increases over a period of a few weeks until the Primary robot has to be restarted.

Cause

- data_engine message processing and configuration

Environment

Release: 9.0.2HF3

Component: UIM - DATA_ENGINE

Resolution

The data_engine may generate an alarm that says "Insert bulk failed due to a schema change of the target table." This is most likely to happen during the housekeeping procedures which, among other things, reindex the tables at the same time the data_engine is bulk-inserting new records into them. For example, the bulk insert may contain a row for an insert on a locked object (one being reindexed). The data_engine detects this and saves all of the existing bulk insert packages. It then reinitializes the connection to the database (to make sure there isn't a problem), inserts the saved bulk packages, and continues processing the data. No data is lost in this process.


The data_engine may also generate an alarm that says "Bulk load data was expected but not sent. The batch will be terminated." This alarm comes from the ODBC driver that the probe uses when attempting to create a rowset that is blocked for similar reasons. When the data_engine generates either of these alarms, the commit fails and the data_engine restores the records back to the unprocessed queue. These records will be processed in the next cycle. This behavior is by design and was implemented to catch the several hundred error conditions that the ADO/Native Client layer can generate if something goes wrong.


This alarm can also occur when partitioning is enabled, even if automatic reindexing is disabled in the data_engine probe's configuration: the data_engine maintenance job creates new partitions, and re-indexing is performed during partition creation. By default, these alarms are issued with severity = critical. If desirable, you can set the alarm_severity flag in the raw-configure module of the data_engine to raise these alarms at a different level (e.g., Informational). The drawback is, of course, that the ADO/Native Client layer might then generate a generic error message that deserves attention at a lowered severity. Unfortunately, we are unable to control the messages these layers provide, so we simply present them as they come.
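For example, a minimal raw-configure sketch of this change (the section placement and the numeric severity mapping below are assumptions; verify them against your data_engine version before applying):

data_engine.cfg (via Raw Configure)

<setup> section:

alarm_severity = 1

In the standard UIM severity scale, 1 typically corresponds to Informational (0 = clear, 2 = warning, 3 = minor, 4 = major, 5 = critical).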


In general, as long as you don't see any gaps in your data, you can choose to ignore these alarms and set up a nas pre-processing rule to exclude them.
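As an illustration only (the field names below are hypothetical; in practice you create the rule through the nas GUI under Auto-Operator > Pre-processing Rules), such an exclude rule would match on the alarm text:

Action: exclude

Message match: *Bulk load data was expected but not sent*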


In the data_engine, once you are done with any debugging, check the following settings (a consolidated sketch follows this list):


Set table_maintenance_loglevel = 0 (not 5) when you are not debugging the data_engine. The log still gets written to, but this places less load on the data_engine.


Set loglevel = 3 (level 5 is only needed when debugging with support).


Check that queue_limit_total = 100000 (the default).
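Taken together, a minimal sketch of how these keys would look in the data_engine configuration (section placement is an assumption; set them through Raw Configure rather than editing the file by hand):

data_engine.cfg

<setup> section:

loglevel = 3

table_maintenance_loglevel = 0

queue_limit_total = 100000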


The data_engine commits based on one of two factors: time or volume. The raw-configure value "queue_limit_total" controls this rate: when the queue hits 100,000 messages, the data_engine dumps the contents of memory and logs an entry. Check the data_engine.log for messages similar to the following, where the bulk buffers surpass 100k:

de: qos_data_thread - messages on hub: 7, messages in data_engine: 108,178 ( bulk buffers: 108,178, process queue: 0, limit: 100,000 )

If you see such messages, try setting the data_engine 'queue_limit_total' to 90000.


The queue_limit_total defines how many messages can be queued internally in the data_engine (waiting for insert) before the data_engine goes back to the hub for more messages. This value is used to throttle data after enabling bulk commits via thread_count_insert. 100000 is the recommended value, but it can be adjusted if necessary. Setting it to 50000, for example, will cause the data_engine to commit data to the database more frequently and keep fewer messages in its memory buffer.
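For example, assuming bulk commits are already enabled, a hedged sketch of a throttled configuration (the thread_count_insert value shown is illustrative, not a recommendation; the 90000 figure is the value suggested above):

data_engine.cfg

<setup> section:

thread_count_insert = 4

queue_limit_total = 90000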


Also check how many core/virtual processors you have for the Primary hub, and ask support to help you assess whether you need more, or discuss this with your virtualization administrator to see if it makes sense to add more to enhance data throughput.


Make sure your Primary hub is configured with these settings:


hub.cfg 

<hub> section: 

check_spooler_sessions = 1 

<tunnel> section: 

protocol_mode = 3 

max_heartbeat = 30 

postroute_interval = 120 

postroute_reply_timeout = 300 

postroute_passive_timeout = 300 

hub_request_timeout = 120 

tunnel_hang_timeout = 300 

tunnel_hang_retries = 3 

If you have an HA node, don't forget to set up the HA hub with identical settings. Also set cache_remote_nametoip=no on the HA node.

Changes in the robot.cfg: 

<controller> section: 

reuse_async_session=1

Additional Information

a) For "Bulk load" alarms, see KB Article 34364: "What is the Meaning of the alarm: Insert bulk failed due to a schema change of the target table"


b) For the "Free database memory" message logged intermittently in the data_engine.log file, for example:

May 14 09:57:00:645 [30996] 1 de: Monitor - prdNimsoftUIM has -1 MB free space and disk S: has 486616 MB free space

This is Defect #DE417760: the probe improperly handles the disk space check exception when there is a deadlock in the database. The error does not raise any alarms and does not affect the performance of the data_engine probe. It will be addressed in a future release of the data_engine probe.