When investigating why an NFS volume is failing to mount, we start by looking at the /var/vcap/sys/log/nfsv3driver and rep logs on the Diego cell where the application container crashed.
The following API calls help identify which Diego cell that is.
First obtain the
app guid of the crashing application:
cf app <app-name> --guid
Then substitute GUID in the following API call with the GUID obtained above:
cf curl "/v2/events?q=actee:GUID&results-per-page=100&order-direction=desc&page=1" > /var/tmp/appevents.log
Open
/var/tmp/appevents.log in a text editor and find the latest crash event:
"resources": [
{
"metadata": {
## REMOVED FOR BREVITY ##
},
"entity": {
## REMOVED FOR BREVITY ##
"metadata": {
## REMOVED FOR BREVITY ##
"cell_id": "e1f9aecb-240e-4300-b704-27b532f24efa",
"exit_description": "failed to mount volume",
"reason": "CRASHED"
},
## REMOVED FOR BREVITY ##
},
From the crash event we can see that the cell_id in this example is e1f9aecb-240e-4300-b704-27b532f24efa.
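If you prefer not to hunt through the file by hand, the cell_id of the latest crash event can also be extracted from the command line; a minimal sketch, assuming jq is available on your workstation and following the field layout of the /v2/events payload shown above:
# Events are ordered newest first, so the first CRASHED entry is the latest crash
jq -r '[.resources[] | select(.entity.metadata.reason? == "CRASHED")][0].entity.metadata.cell_id' /var/tmp/appevents.log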
If using NFS:
Obtain the logs for that Diego cell and review the rep and/or nfsv3driver logs.
The error in the
nfsv3driver logs will look like the following:
{"timestamp":"2021-03-16T14:22:53.478362393Z","level":"error","source":"nfs-driver-server","message":"nfs-driver-server.server.handle-mount.with-cancel.mount.mount.invoke-mount-failed","data":{"error":"exit status 32","session":"2.106082.1.1.5","volume":"<volume-dir>"}}
The error in the rep logs will look like the following:
{"timestamp":"2021-03-16T14:22:53.479122035Z","level":"error","source":"rep","message":"rep.executing-container-operation.ordinary-lrp-processor.process-reserved-container.run-container.containerstore-create.node-create.mount.mount.remoteclient-mount.failed-mounting-volume","data":{"container-guid":"<container-guid>","container-state":"reserved","error":"{\"SafeDescription\":\"exit status 32\"}","guid":"<guid>","lrp-instance-key":{"instance_guid":"<instance-guid>","cell_id":"e1f9aecb-240e-4300-b704-27b532f24efa"},"lrp-key":{"process_guid":"<proc-guid>","index":0,"domain":"cf-apps"},"mount_request":{"Name":"<nfs-dir>"},"session":"10269.1.1.3.2.1.2.1.2"}}
Both logs indicate that the volume mount failed with exit status 32.
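To locate these entries on the cell, a quick approach is to grep the relevant log directories (a sketch; the exact file names under these directories vary by release, so the search targets the directories instead):
# Search the driver and rep logs for the failure messages shown above
grep -r "invoke-mount-failed" /var/vcap/sys/log/nfsv3driver/
grep -r "failed-mounting-volume" /var/vcap/sys/log/rep/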
One reason we have found for this is that the NFS URL is not resolvable from the Diego cell.
For example, if your service instance points at the remote NFS server fs-********.efs.us-east-1.amazonaws.com, then that URL must be resolvable by the nfsv3driver. Confirm this by running nslookup on the URL from the Diego cell.
First SSH into the Diego cell (for example, via bosh ssh using the cell_id found above), then run the following:
nslookup fs-xxxxxxxx.efs.us-east-1.amazonaws.com
Example:
diego_cell/e1f9aecb-240e-4300-b704-27b532f24efa:~$ nslookup fs-xxxxxxxx.efs.us-east-1.amazonaws.com
;; Got recursion not available from 169.254.0.2, trying next server
Server: 10.10.10.10
Address: 10.10.10.10#53
** server can't find fs-xxxxxxxx.efs.us-east-1.amazonaws.com: NXDOMAIN
This confirms that the NFS URL is not resolvable, so the nfsv3driver will hit the same issue when trying to mount the volume.
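It can also help to check which DNS servers the cell is consulting (the 169.254.0.2 and 10.10.10.10 addresses in the example output above) by inspecting the resolver configuration on the cell:
cat /etc/resolv.conf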
An additional troubleshooting step that may help is to run the mount command manually on the Diego cell. For example:
sudo mkdir /var/vcap/data/volumes/nfs/local_test_dir
sudo mount -t nfs -o "rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,actimeo=0" fs-********.efs.us-east-1.amazonaws.com:/ /var/vcap/data/volumes/nfs/local_test_dir
In our example we would see the following from the manual mount:
diego_cell/e1f9aecb-240e-4300-b704-27b532f24efa:~$ sudo mount -t nfs -o "rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,actimeo=0" fs-xxxxxxxx.efs.us-east-1.amazonaws.com:/ /var/vcap/data/volumes/nfs/local_test_dir
mount.nfs: Failed to resolve server fs-xxxxxxxx.efs.us-east-1.amazonaws.com: Name or service not known
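If the manual mount succeeds instead, the issue likely lies elsewhere; in that case unmount and remove the test directory so the cell is left clean (paths as used above):
sudo umount /var/vcap/data/volumes/nfs/local_test_dir
sudo rmdir /var/vcap/data/volumes/nfs/local_test_dir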
If using SMB:
Obtain the logs for that Diego cell and review the rep and/or /var/vcap/sys/log/smbdriver logs.
The error in the
smbdriver logs will look like the following:
{"timestamp":"2021-08-06T12:08:12.887919156Z","level":"error","source":"smb-driver-server","message":"smb-driver-server.server.handle-mount.with-cancel.mount.mount.mount-failed: ","data":{"error":"exit status 32","session":"2.22.1.1.4","source":"//smbserver/sharepoint","target":"/dir-mount-point","volume":"aea47d29-7323-4409-a30b-91737c22377c-692b950d4b0629b8d448ae1dfcbcf1aa_ee5c73da-ab95-483b-5580-3857"}}
The smbdriver log entry does not give much information on its own, so we need to gather more.
First SSH into the Diego cell, then try the mount manually from the command line using the same parameters you supplied when you created the SMB service instance. In this example we need to provide username and password credentials to connect to our SMB server:
sudo mount -t cifs -o username=<username> //smbserver/sharepoint <dir-mount-point>
Password: *****
Permission denied
In the above example the mount failed because the credentials used are incorrect or do not have the necessary permissions to access the SMB server.
If your volume mount is failing with exit status 32 but the information above does not explain why, please contact Support.