Vertica is down and when we restart it, it stops again in a few moments.

Article ID: 103269

Products

CA Infrastructure Management
CA Performance Management - Usage and Administration
CA Performance Management - Data Polling

Issue/Introduction

Vertica went down overnight.
Each time Vertica is restarted, it fails again within a few moments.
 

Cause

In Vertica.log, one node leaves the cluster, then a second node leaves the cluster, and Vertica then shuts down to preserve K-safety.
dmesg also shows OS-level errors on the bonded network interfaces.
Some sample lines:
 
[ 7303.350454] bond0: Removing slave eth0
[ 7303.350523] bond0: Releasing backup interface eth0
[ 7303.350526] bond0: the permanent HWaddr of eth0 - **:f2:e9:bd:9c:70 - is still in use by bond0 - set the HWaddr of eth0 to a different address to avoid conflicts
[ 7303.507399] bond0: Removing slave eth1
[ 7303.507457] bond0: Removing an active aggregator
[ 7303.507459] bond0: Releasing backup interface eth1
[ 7303.944889] IPv6: ADDRCONF(NETDEV_UP): bond0: link is not ready
[ 7303.972597] bond0: Adding slave eth0
[ 7303.972661] tg3 0000:16:00.0: irq 53 for MSI/MSI-X
[ 7303.972665] tg3 0000:16:00.0: irq 54 for MSI/MSI-X
[ 7303.972669] tg3 0000:16:00.0: irq 55 for MSI/MSI-X
[ 7303.972673] tg3 0000:16:00.0: irq 56 for MSI/MSI-X
[ 7303.972677] tg3 0000:16:00.0: irq 57 for MSI/MSI-X
[ 7304.087939] bond0: Enslaving eth0 as a backup interface with a down link
[ 7304.114156] bond0: Adding slave eth1
[ 7304.114213] tg3 0000:16:00.1: irq 58 for MSI/MSI-X
[ 7304.114217] tg3 0000:16:00.1: irq 59 for MSI/MSI-X
[ 7304.114221] tg3 0000:16:00.1: irq 60 for MSI/MSI-X
[ 7304.114225] tg3 0000:16:00.1: irq 61 for MSI/MSI-X
[ 7304.114229] tg3 0000:16:00.1: irq 62 for MSI/MSI-X
[ 7304.229236] bond0: Enslaving eth1 as a backup interface with a down link
[ 7304.234371] IPv6: ADDRCONF(NETDEV_UP): bond0: link is not ready
[ 7316.890718] tg3 0000:16:00.0 eth0: Link is up at 1000 Mbps, full duplex
[ 7316.890724] tg3 0000:16:00.0 eth0: Flow control is off for TX and off for RX
[ 7316.890727] tg3 0000:16:00.0 eth0: EEE is disabled
[ 7316.933455] bond0: link status definitely up for interface eth0, 1000 Mbps full duplex
[ 7316.933462] bond0: Warning: No 802.3ad response from the link partner for any adapters in the bond
[ 7316.933472] bond0: first active interface up!
[ 7316.933507] IPv6: ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
[ 7317.042805] tg3 0000:16:00.1 eth1: Link is up at 1000 Mbps, full duplex
[ 7317.042811] tg3 0000:16:00.1 eth1: Flow control is off for TX and off for RX
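
Picking these events out of a long dmesg capture by hand is tedious. The following is a minimal sketch of how the bonding events could be filtered out, assuming the log lines follow the same layout as the sample above. The function name `find_bond_events` and the event labels are hypothetical and chosen only for illustration; the patterns are derived solely from the sample lines in this article and may need adjusting for other kernel versions.

```python
import re

# Matches the "[ 7303.350454] message" layout seen in the sample above.
LINE_RE = re.compile(r"^\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<msg>.*)$")

# Hypothetical event labels; patterns are taken from the sample lines only.
EVENT_PATTERNS = {
    "slave_removed": re.compile(r"^(?P<bond>\S+): Removing slave (?P<iface>\S+)$"),
    "slave_added": re.compile(r"^(?P<bond>\S+): Adding slave (?P<iface>\S+)$"),
    "no_lacp_partner": re.compile(
        r"^(?P<bond>\S+): Warning: No 802\.3ad response from the link partner"
    ),
}


def find_bond_events(lines):
    """Return (timestamp, event_label, bond_name) tuples for matching lines."""
    events = []
    for line in lines:
        parsed = LINE_RE.match(line.strip())
        if not parsed:
            continue  # Not in the expected dmesg layout; skip it.
        for label, pattern in EVENT_PATTERNS.items():
            hit = pattern.search(parsed.group("msg"))
            if hit:
                events.append((float(parsed.group("ts")), label, hit.group("bond")))
                break
    return events


if __name__ == "__main__":
    sample = [
        "[ 7303.350454] bond0: Removing slave eth0",
        "[ 7303.972597] bond0: Adding slave eth0",
        "[ 7316.933462] bond0: Warning: No 802.3ad response from the link "
        "partner for any adapters in the bond",
    ]
    for event in find_bond_events(sample):
        print(event)
```

A `no_lacp_partner` event that follows slave removal and re-add events, as in the sample, suggests the bond is coming back up without completing LACP negotiation with the switch.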
 

Environment

CAPM 3.5
Vertica 8.1.0-4
3-node Vertica cluster on Linux

 

Resolution

The Cisco FabricPath switch and the server were not communicating correctly over LACP. The port channel (LACP) definition on the Cisco switch had to be removed and re-added to restore switch-to-server communication.
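
After a switch-side change like this, it can help to confirm from the server that the bond now sees an LACP partner. The sketch below is one possible check, not part of the original fix. It assumes the Linux bonding driver status file (commonly `/proc/net/bonding/bond0`) reports a `Bonding Mode` line and one or more `Partner Mac Address` lines, and that an all-zero partner MAC means no LACP partner has responded. The exact field layout varies by kernel version, so treat the field names as assumptions. The function name `lacp_partner_missing` is hypothetical.

```python
ZERO_MAC = "00:00:00:00:00:00"


def lacp_partner_missing(bonding_text):
    """Return True if the status text shows 802.3ad mode with only all-zero partner MACs.

    Assumes "Key: value" lines as commonly found in /proc/net/bonding/<bond>.
    Returns False when the mode is not 802.3ad or no partner MAC line is present.
    """
    mode_is_lacp = False
    partner_macs = []
    for raw in bonding_text.splitlines():
        # Split on the first colon only, so MAC addresses stay intact in the value.
        key, sep, value = raw.strip().partition(":")
        if not sep:
            continue
        key = key.strip().lower()
        value = value.strip()
        if key == "bonding mode" and "802.3ad" in value:
            mode_is_lacp = True
        elif key == "partner mac address":
            partner_macs.append(value.lower())
    return mode_is_lacp and bool(partner_macs) and all(
        mac == ZERO_MAC for mac in partner_macs
    )


if __name__ == "__main__":
    # Assumed path; adjust the bond name to match the server.
    with open("/proc/net/bonding/bond0") as status_file:
        if lacp_partner_missing(status_file.read()):
            print("bond0: no LACP partner detected")
        else:
            print("bond0: LACP partner present, or bond is not in 802.3ad mode")
```

The check only reads local bond status. It cannot tell whether the switch port channel is configured correctly, only whether the server currently sees a partner.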