Datastore usage and total size of PVs on the datastore showing different utilisation


Article ID: 379915


Updated On:

Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

The datastore space utilisation in vSphere shows that X amount of space is consumed on the datastore.

However, from a Kubernetes perspective, the total size of the persistent volumes on the datastore is Y.

This KB will help identify the source of the discrepancy.

Environment

Tanzu Kubernetes cluster with CSI PVs backed by NFS datastore

Cause

Check backup solution

If there are any backup solutions, confirm they are working as expected and that there are no stale backups.

For Velero backups, check the backup status and logs:

velero backup get

velero backup describe <backup name>

velero backup logs <backup name>
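To spot problem backups quickly in a long list, the STATUS column of `velero backup get` can be filtered with awk. This is a sketch; the sample text below stands in for real `velero backup get` output and the backup names are hypothetical.

```shell
# Flag any Velero backups that are not in Completed state.
# The sample variable below is illustrative output, not from a real cluster.
sample='NAME       STATUS            ERRORS   WARNINGS   CREATED
daily-01   Completed         0        0          2024-01-01
daily-02   PartiallyFailed   3        1          2024-01-02'

# Skip the header row, print name and status of non-Completed backups
echo "$sample" | awk 'NR > 1 && $2 != "Completed" { print $1 ": " $2 }'
```

Against a live cluster, pipe `velero backup get` directly into the same awk filter.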

 

Check for any stale CSI snapshots

kubectl get volumesnapshot -A

kubectl get volumesnapshotcontent -A

 

 

Check disk usage on ESX Host

Identify the datastore usage and UUID. These can be retrieved from the vSphere UI, or with esxcli or esxcfg-info:

esxcli storage filesystem list

Mount Point                      Volume Name   UUID               Mounted  Type  Size            Free
-------------------------------  ------------  -----------------  -------  ----  --------------  --------------
/vmfs/volumes/abcd1234-abcd1234  datastore_1   abcd1234-abcd1234  true     NFS   38482906972160  31134093824000

 

For this datastore, Usage = Size - Free = 38482906972160 - 31134093824000 = 7348813148160 bytes.

Converted to TB: 7348813148160 / (1024 * 1024 * 1024 * 1024) ≈ 6.68 TB
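The arithmetic above can be reproduced on the ESX host shell (or any POSIX shell); the byte values are the ones from the esxcli output in this example.

```shell
# Values from the esxcli storage filesystem list output above
size=38482906972160
free=31134093824000

# Usage in bytes
usage=$((size - free))
echo "Usage: $usage bytes"

# Convert to TB (awk handles the fractional division)
awk -v b="$usage" 'BEGIN { printf "Usage: %.2f TB\n", b / (1024 * 1024 * 1024 * 1024) }'
```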

 

esxcfg-info -a | grep -A15 <UUID>

               |----Volume UUID.....................................abcd1234-abcd1234
               |----Volume Name.....................................datastore_1
               |----LVM Name........................................10.##.##.## /datastore_1
               |----Type............................................NFS
               |----Head Extent.....................................nfs:abcd1234-abcd1234
               |----Console Path..................................../vmfs/volumes/abcd1234-abcd1234
               |----Block Size......................................4096
               |----Total Blocks....................................9395240960
               |----Logical Disk Block Size.........................512
               |----Physical Disk Block Size........................512
               |----isSw512e........................................false
               |----Blocks Used.....................................1793955619
               |----Size............................................38482906972160
               |----Usage...........................................7348042215424

 

 

Check disk usage for datastore directory on ESX Host 

Check disk usage

du -sh /vmfs/volumes/<UUID>/

du -sh /vmfs/volumes/abcd1234-abcd1234/

3.6T    /vmfs/volumes/abcd1234-abcd1234/

 

Check size of all files on datastore

ls -Risla  /vmfs/volumes/abcd1234-abcd1234/ > ls-datastore.txt

grep total ls-datastore.txt | awk '{ sum += $2 } END { print sum }'

This returns 3908174964 KB in this example, which is approximately 3.6 TB.
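The summing pipeline can be sanity-checked on a small sample. The two "total" lines below are illustrative stand-ins for the per-directory totals that `ls -Risla` emits, not output from a real datastore.

```shell
# Minimal demonstration of the summing pipeline on sample `ls -Risla`-style output
cat > /tmp/ls-sample.txt <<'EOF'
/vmfs/volumes/abcd1234-abcd1234/dirA:
total 1024
/vmfs/volumes/abcd1234-abcd1234/dirB:
total 2048
EOF

# Sum the second field of every "total" line
grep total /tmp/ls-sample.txt | awk '{ sum += $2 } END { print sum " KB" }'
```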

 

Check free space 

df -h

Filesystem   Size    Used Available Use% Mounted on

NFS         35.0T    6.7T     28.3T  19% /vmfs/volumes/datastore_1

 

In this example, the actual disk usage on the datastore (3.6 TB from du) and the usage reported by the NFS server (6.7 TB from df) are not consistent.

 

 

Check datastore usage reported by NFS server

On the ESX host, capture a packet trace with tcpdump-uw on the NFS port and on the vmknic through which the NFS server is connected, then run df -h while the capture is running:

tcpdump-uw -i <vmknic> port 2049 -w /vmfs/volumes/df.pcap

 

Analyse the packet capture using the steps below.

Select any directory from the datastore (see ls-datastore.txt above for the full list).

/vmfs/volumes/abcd1234-abcd1234/vm-0079a88c-e317-440a-bf4e-01c28b01b152 is selected in this example

 

Using the directory name from above, identify the file handle hash for the datastore:

tshark -2 -Tfields -e frame.number -e frame.time_relative -e rpc.xid -e nfs.name -e nfs.fh.hash -Y 'rpc.procedure == 3 && rpc.msgtyp == 0' -r df.pcap | grep vm-0079a88c-e317-440a-bf4e-01c28b01b152

45460    22.057884000    0x35164b64    vm-0079a88c-e317-440a-bf4e-01c28b01b152    0xc4045aca

 

The file handle hash for the datastore is 0xc4045aca. Use this file handle hash to find the xid of the FSSTAT call:

tshark -2 -Tfields -e frame.number -e frame.time_relative -e rpc.xid -Y 'rpc.procedure == 18 && rpc.msgtyp == 0 && nfs.fh.hash == 0xc4045aca' -r df.pcap

57824    28.540564000    0x35165c96

The xid is 0x35165c96. Use this to filter the FSSTAT reply:

tshark -2 -Tfields -e frame.number -e frame.time_relative -e rpc.xid -e nfs.fsstat3_resok.tbytes -e nfs.fsstat3_resok.fbytes -Y 'rpc.procedure == 18 && rpc.msgtyp == 1 && rpc.xid == 0x35165c96' -r df.pcap

57825    28.540948000    0x35165c96    38482906972160    31118049619968

Total bytes: 38482906972160 (35 TB)
Free bytes:  31118049619968 (~28.30 TB)
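The TB conversions for the FSSTAT values can be checked the same way as earlier; the byte counts are the tbytes/fbytes fields from the capture in this example.

```shell
# FSSTAT reply values from the packet capture above
tbytes=38482906972160
fbytes=31118049619968

# Convert both to TB for comparison with df -h output
awk -v t="$tbytes" -v f="$fbytes" \
    'BEGIN { printf "Total: %.1f TB, Free: %.2f TB\n", t / 1024^4, f / 1024^4 }'
```

These match the Size and Available columns that df -h reported, confirming that df is simply relaying the NFS server's FSSTAT numbers.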

 

In this example, the NFS server is reporting inaccurate free space; the discrepancy therefore originates on the NFS server side.

Resolution

Once the source of the discrepancy is identified, engage the appropriate team for further assistance.