Pods stuck in "ContainerCreating" and "Init:0/1" state after the datastore ran out of space

Article ID: 399193

Updated On:

Products

Tanzu Kubernetes Runtime

Issue/Introduction

The pods are stuck in the "ContainerCreating" or "Init:0/1" state.
In the "Related Tasks" section of the vSphere UI, "Detach a virtual disk" tasks fail continuously with the error "Database temporarily unavailable or has network problems". The disks being detached are the FCDs (First Class Disks) created by PV/PVC provisioning on the guest cluster virtual machines.
In the csi-attacher logs of the CSI controller, the volume attachment object fails to detach because the database is temporarily unavailable. The relevant log snippet is shown below.

Fault: (*types.DatabaseError)(0xc000120520)({
RuntimeFault: (types.RuntimeFault) {
MethodFault: (types.MethodFault) {
FaultCause: (*types.LocalizedMethodFault)(<nil>),
FaultMessage: ([]types.LocalizableMessage) <nil>
}
}
}),
LocalizedMessage: (string) (len=57) "Database temporarily unavailable or has network problems."})

The /var/log/vmware/vpxd/vpxd.log file on the vCenter Server shows the same error.

error vpxd[] [Originator@6876 sub=Default opID=<op-ID>] [VpxLRO] -- ERROR task-<ID>) -- vm-<ID> -- vim.VirtualMachine.detachDisk: :vim.fault.DatabaseError
--> Result:
--> (vim.fault.DatabaseError) {
--> faultCause = (vmodl.MethodFault) null,
--> faultMessage = <unset>
--> msg = "Received SOAP response fault from [<<io_obj p:0x00007effe843cfe8, h:52, <UNIX ''>, <UNIX '/var/run/envoy-hgw/hgw-pipe'>>, /hgw/host-11012/vpxa>]: retrieveVStorageObjectPathAndCrypto
--> Received SOAP response fault from [<<io_obj p:0x00000043cfb9b008, h:19, <TCP '127.0.0.1 : 40647'>, <TCP '127.0.0.1 : 8307'>>, /sdk>]: retrieveVStorageObjectPathAndCrypto
--> Database temporarily unavailable or has network problems."
--> }
--> Args:
-->
--> Arg diskId:

According to /var/run/log/hostd.log on the ESXi host, the tidy file is corrupt and its version no longer matches the expected one.

In() Hostd[]: [Originator@6876 sub=Libs opID=<ID> sid=<ID> user=vpxuser:VSPHERE.LOCAL\vpxd-extension-<ID>] FCDLIB: fcd-catalog: Catalog::Get start
In() Hostd[]: [Originator@6876 sub=Default opID=<ID> sid=<ID> user=vpxuser:VSPHERE.LOCAL\vpxd-extension-<ID>] Transfer to exception eraro code: 601, message: Wrong tidy version: mTidyVersion = 1, header.mVersion = 0
In() Hostd[]: [Originator@6876 sub=AdapterServer opID=<ID> sid=<ID> user=vpxuser:VSPHERE.LOCAL\vpxd-extension-<ID>] AdapterServer caught exception; <<521a062e-7629-ff27-4448-1ea96d18ebb9, <TCP '127.0.0.1 : 8307'>, <TCP '127.0.0.1 : 20973'>>, ha-vstorage-object-manager, vim.vslm.host.VStorageObjectManager.retrieveVStorageObjectPathAndCrypto>,N3Vim5Fault13DatabaseError9ExceptionE(Fault cause: vim.fault.DatabaseError
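The two signatures quoted in the excerpts above can be searched for in a copy of the logs. The following is a minimal sketch, not an official tool: the function names `scan_log` and `scan_file` are hypothetical, and the patterns are taken only from the messages shown in this article, so other builds may word them differently.

```python
import re

# Signatures quoted from the log excerpts in this article (assumed wording).
SIGNATURES = {
    "database_error": re.compile(
        r"Database temporarily unavailable or has network problems"
    ),
    "tidy_version": re.compile(
        r"Wrong tidy version: mTidyVersion = (\d+), header\.mVersion = (\d+)"
    ),
}


def scan_log(lines):
    """Return a dict mapping each signature name to the list of matching lines."""
    hits = {name: [] for name in SIGNATURES}
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                hits[name].append(line.rstrip("\n"))
    return hits


def scan_file(path):
    """Scan a log file; the path is supplied by the caller (hypothetical helper)."""
    with open(path, errors="replace") as handle:
        return scan_log(handle)
```

For example, `scan_file("hostd.log")` on a copy of the host log returns the matching lines grouped by signature. A non-empty "tidy_version" list matches the corrupt tidy file condition described here.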

Environment

vSphere with Tanzu
VMware vCenter server
VMware vSphere ESXi

Cause

The tidy file of the datastore can become corrupted when the datastore runs out of space. In some cases, the tidy file is not recovered automatically even after enough free space is made available on the datastore.

Resolution

To fix the issue, rebuild the FCD catalog for the affected datastore. The detailed steps are in the KB article: https://knowledge.broadcom.com/external/article/320790/persistent-volume-is-failing-to-attach-w.html

Ensure that no FCD-related operations take place on the affected datastore until the catalog rebuild is complete.
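To check which pods are still in the stuck states described in this article, before or after the rebuild, the output of `kubectl get pods -A --no-headers` can be filtered. This is a rough sketch under stated assumptions: the function name `find_stuck_pods` is hypothetical, and it assumes the default column layout (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE).

```python
# Pod statuses named in this article.
STUCK_STATUSES = ("ContainerCreating", "Init:0/1")


def find_stuck_pods(kubectl_output):
    """Return (namespace, name, status) tuples for pods in a stuck status.

    Expects the text printed by `kubectl get pods -A --no-headers`,
    assuming the default columns: NAMESPACE NAME READY STATUS RESTARTS AGE.
    """
    stuck = []
    for line in kubectl_output.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        namespace, name, _ready, status = fields[:4]
        if status in STUCK_STATUSES:
            stuck.append((namespace, name, status))
    return stuck
```

The text can be captured with `subprocess.run(["kubectl", "get", "pods", "-A", "--no-headers"], capture_output=True, text=True).stdout` and passed to the function. An empty result suggests that no pod is reporting either of the two states.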
