Troubleshooting vSAN Encryption

Article ID: 326769

Updated On: 01-06-2025

Products

VMware vSAN

Issue/Introduction

This article assists in troubleshooting vSAN Encryption issues.

Symptoms:

- Unable to configure encryption.

- Disks with vSAN encryption failed to mount.

- vSAN health warnings for KMS configuration.

Environment

VMware vSAN 7.x
VMware vSAN 8.x

Cause

The documented steps are intended to troubleshoot vSAN encryption issues.

Resolution

NOTICE: If the vSAN cluster is configured with vSAN encryption and any of the KMS servers has failed or is experiencing communication issues, PLEASE DO NOT REBOOT ANY OF THE vSAN ESXi HOSTS CONFIGURED WITH vSAN ENCRYPTION. THIS MAY LEAD TO DATA UNAVAILABILITY OR LOSS. Please contact Broadcom Support to investigate and assist further with the issue.

Troubleshooting vSAN Encryption

Checklist

  • Ensure the KMS server is reachable and responding on the KMIP port (5696 by default). For initial configuration of vSAN Encryption, the vCenter and the ESXi hosts in the cluster will require connectivity, but for ongoing operation, only the hosts require it. vCenter is only required when configuration needs to be changed or when enabling/disabling Encryption.
  • Ensure the host has the right credentials to communicate with the KMS – i.e. the client cert exists and is the right type to establish trust with the KMS.
  • Ensure the host can enter crypto-safe mode. To do this, it requires access to its HostKey, which is a different key from the one required to mount any encrypted disk groups. Without this key, the host is not deemed secure enough to host encrypted disk groups or VMs.
  • Ensure the host has access to and can retrieve the required vSAN KEK.

It is important to remember that once a host has possession of a key, the key is kept in memory until the host is rebooted. A loss of connectivity to the KMS or vCenter therefore causes no issues unless the host is rebooted. If the KMS has suffered a permanent failure and the keys cannot be retrieved again, DO NOT reboot any hosts.

 

Commands to verify the configured KMS servers and encryption status on ESXi hosts (a sketch that captures all of these outputs in one pass follows the list).

  • Get vSAN encryption information.

esxcli vsan encryption info get                

  • List the KMS configurations used for vSAN encryption.

esxcli vsan encryption kms list                                 

  • Get host key from keycache used for vSAN encryption.

esxcli vsan encryption hostkey get

  • List encryption certificate file paths.

esxcli vsan encryption cert path list

  • Get encryption KMS server certificate contents.

esxcli vsan encryption cert get
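
When working with Broadcom Support or comparing hosts, it can be convenient to capture all of the above commands in one pass. A minimal sketch (the output file name is arbitrary):

for cmd in "info get" "kms list" "hostkey get" "cert path list" "cert get"; do
   echo "### esxcli vsan encryption $cmd"
   esxcli vsan encryption $cmd
done > /tmp/vsan-encryption-info.txt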

 

Review KMS configuration on ESXi hosts.

  • For ESXi hosts running version 6.x:

In the absence of vCenter, you will need to verify which KMS servers the ESXi server will attempt to contact to retrieve any keys. To do this, grep for ‘kmip’ in the esx.conf file.

[root@hostname:~] grep kmip /etc/vmware/esx.conf
/vsan/kmipServer/child[0001]/old = "false"
/vsan/kmipServer/child[0001]/port = "5696"
/vsan/kmipServer/child[0001]/address = "####.####.####.####" <--- IP Address of KMS Servers
/vsan/kmipServer/child[0001]/name = "KMS#"
/vsan/kmipServer/child[0001]/kmipClusterId = "KMSCluster"
/vsan/kmipServer/child[0001]/kmskey = "KMSCluster/KMS#"
/vsan/kmipServer/child[0000]/kmskey = "KMSCluster/KMS#"
/vsan/kmipServer/child[0000]/kmipClusterId = "KMSCluster"
/vsan/kmipServer/child[0000]/name = "KMS#"
/vsan/kmipServer/child[0000]/address = "####.####.####.####" <--- IP Address of KMS Servers
/vsan/kmipServer/child[0000]/port = "5696"
/vsan/kmipServer/child[0000]/old = "false"
/vsan/kmipClusterId = "KMSCluster"

In this example, there is one KMIP cluster, called ‘KMSCluster’, and two KMIP servers in the cluster, indicated by the child[0000] and child[0001] entries. You can validate the IP/FQDN and ports by checking these entries.

If the original KMS Server had been removed from vCenter, it must be added back to vCenter using exactly the same kmipClusterId, or the hosts will assume it is a brand new cluster and any keys referencing the Cluster as the source will not be retrievable.
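
To quickly confirm that every 6.x host references the same KMIP cluster ID, only that entry needs to be checked. For example:

grep kmipClusterId /etc/vmware/esx.conf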

  • For ESXi hosts running version 7.x and later:

As of ESXi 7.0, the configuration moved from esx.conf to configstorecli. To get the KMS information from a host running 7.0 or later that uses a Native Key Provider, run the following command:

[root@hostname:~] configstorecli config current get -c esx -g trusted_infrastructure -k "kms_providers"
[
{
     "native_provider": {
        "key_derivation_key": "*******",
        "key_id": "########-####-####-####-############"
     },
     "provider": "<name>",
     "type": "NATIVE"
  }
]
  "providers": [
     {
        "key_server": {
           "connection_timeout": -1,
           "kmip_server": {
              "servers": [
                 {
                    "hostname": "<providername>:kmx",
                    "name": "NativeKeyProvider",
                    "port": 0
                 }
              ],
              "username": ""
           },
           "proxy_server": {
              "hostname": "",
              "port": -1
           },
           "type": "KMIP"
        },
        "master_key_id": "<keyid>",
        "old": false,
        "provider": "NativeKeyProvider"
      }


Note: Output modified to show only the relevant encryption information.

  • If the cluster is using a third-party KMS, run the following command. (The output below has been modified to show only the relevant encryption information; the full output is around 471 lines, which is why it is piped to less.)

[root@hostname:~] configstorecli config current get -c vsan -g system -k "vsan"|less
  "enabled": true,
  "encryption": {
     "changing": false,
     "dek_generation_id": 1,
     "enabled": true,
     "erase_disks_before_use": false,
     "host_key_id": "<host key ID for the host the command was run on>",
     "kek_id": "<vSAN KEK>",
     "kmip_cluster_id": "CloudLink Cluster"
  "in_transit_encryption": {
     "enabled": false,
     "rekey_interval": 1440,
     "state": "SETTLED"
  "providers": [
     {
        "key_server": {
           "connection_timeout": -1,
           "kmip_server": {
              "credential": "*******",
              "servers": [
                 {
                    "hostname": "hostname.com",
                    "name": "hostname",
                    "port": 5696
                 }
              ],
              "username": "kmip_user"
           },
           "proxy_server": {
              "hostname": "",
              "port": -1
           },
           "type": "KMIP"
        },
        "master_key_id": "CloudLink Cluster",
        "old": false,
        "provider": "hostname"
     },
     {
        "key_server": {
           "connection_timeout": -1,
           "kmip_server": {
              "credential": "*******",
              "servers": [
                 {
                    "hostname": "hostname.com",
                    "name": "hostname",
                    "port": 5696
                 }
              ],
              "username": "kmip_user"
           },
           "proxy_server": {
              "hostname": "",
              "port": -1
           },
           "type": "KMIP"
        },
        "master_key_id": "CloudLink Cluster",
        "old": false,
        "provider": "hostname"
      },

Note: The preceding command output excerpts are only examples; values will vary depending on your environment.
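
Because the full configstorecli output is long, it may help to filter it down to the encryption-related fields shown in the excerpt above. A minimal sketch (field names taken from the example output):

configstorecli config current get -c vsan -g system -k "vsan" | grep -E '"enabled"|host_key_id|kek_id|kmip_cluster_id|hostname|port'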

 

Locating the KMIP client certificate.

[root@hostname:/var/log] cd /etc/vmware/ssl/

[root@hostname:/etc/vmware/ssl] ls
castore.pem               openssl.cnf               rui.crt                   vsan_kms_castore.pem      vsan_kms_client.crt       vsan_kms_client_old.crt   vsanvp_castore.pem
iofiltervp.pem            rui.bak                   rui.key                   vsan_kms_castore_old.pem  vsan_kms_client.key       vsan_kms_client_old.key

  •  Check the /etc/vmware/ssl folder on the host to ensure that a copy of the vsan_kms_client.crt exists along with a copy of the private key (vsan_kms_client.key). These files should be identical on all hosts in the cluster.

  • The vsan_kms_castore.pem file is a copy of the server certificate that the host uses to compare with the certificate returned by the KMIP server during the initial SSL handshake. If the server certificate has been changed and does not match what ESXi has stored here, the connection will not be established (see the verification sketch after this list).

  • If vCenter is available and the host is missing any of this information, vCenter will provide the host with copies of the certificates it has stored in VECS.
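
One way to sanity-check the stored server certificate is to inspect it with openssl from the ESXi shell. A minimal sketch (note that openssl x509 only prints the first certificate if several are appended to the file):

openssl x509 -in /etc/vmware/ssl/vsan_kms_castore.pem -noout -subject -issuer -dates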

 

To test connectivity, you can use the nc command:

nc -z <KMS Server Address> 5696

Sample output:

# nc -z 192.xx.xx.xxx 5696
Connection to 192.xx.xx.xxx 5696 port [tcp/http] succeeded!
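
If the cluster has more than one KMS server, each one can be checked in a single pass. A minimal sketch (substitute the real KMS addresses):

for kms in <KMS1_Address> <KMS2_Address>; do nc -z $kms 5696; done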

 

Ensure the host can enter crypto-safe mode.

Verify the host is in crypto-safe mode.

  • To enter crypto-safe mode, the host must be able to retrieve a special key called the HostKey.

  • This key is separate from any other keys that would be required to encrypt VMs or the vSAN datastore.  This key is used by the host to encrypt core dumps. Without access to this key, the host will be unable to request any other keys from the KMS server, even if it is accessible.

  • When vSAN Encryption was first enabled on the cluster, the host transitioned to ‘crypto-safe’ mode for the first time and was assigned a key to install as its HostKey. The host will always look for this key, based on the key identifier, when booting up. The host will NOT attempt to retrieve, nor will it request, a different key if the original key is not available. So for the host to re-enter crypto-safe mode, this key MUST be available.

  • To determine if a HostKey has been installed (i.e. the host is crypto-safe), you can use the UI (if available).

  • Check that Encryption Mode is enabled.  If it is not, attempt to enable through the UI. 

  • If the host will not enter encryption mode, then it cannot retrieve its HostKey.

  •  If the UI is not available, you can use the crypto-util utility on the host to see if a HostKey has been installed or not.

    [root@hostname:~] crypto-util keys getkidbyname HostKey
    vmware:key/fqid/<VMWARE-NULL>/HyTrust/04f631cc%2d84dd%2d11e8%2d8194%2d00505698ddb6

  • If a key value is returned, the host is in crypto-safe mode. If the message indicates that a HostKey has not been established, then the host is not in crypto-safe mode.

  • To determine which key the host requires to enter crypto-safe mode, look in the vCenter MOB. (The host MOB is no longer available, but the same information can be accessed via vCenter.)

    To find the host ID, navigate to the vCenter Server inventory and click the hostname in the cluster; the host ID (for example, host-18022) can be seen in the browser address bar.

  • Navigate to the host page in the MOB, for example: https://vcsa.domain.local/mob/?moid=host-18022
    1. Click Runtime.
    2. Click CryptoKeyId. The keyId value shown is the UUID of the key the host requires to enter crypto-safe mode.
    3. Click ProviderId to see the key provider (KMS cluster) that holds the key.

Log Identification:

If vCenter is not available, the HostKey identifier can only be gathered through log review.

In Hostd.log:

  • Grep for the term ‘CryptoManager’ in the hostd.log to see the host adding keys to its keyCache. For example, a host logged the following when it successfully added the HostKey to the cache:

[root@hostname:~] grep CryptoManager /var/log/hostd.log
2018-07-11T07:37:45.992Z info hostd[2099589] [Originator@6876 sub=Solo.Vmomi opID=4b3daa3a-84dd-11e8-4b-bc3b user=:com.vmware.vsan.health] Activation [N5Vmomi10ActivationE:0x000000a14601e520] : Invoke done [IsEnabled] on [vim.encryption.CryptoManagerHost:ha-crypto-manager]
-->    object = 'vim.encryption.CryptoManagerHost:ha-crypto-manager',
2018-07-11T07:37:46.159Z info hostd[2099206] [Originator@6876 sub=Hostsvc.CryptoManager opID=4b3daa3a-84dd-11e8-4b-bc43 user=vpxuser:com.vmware.vsan.health] Host has been placed in Crypto-prepared state
2018-07-11T07:37:46.166Z info hostd[2099589] [Originator@6876 sub=Hostsvc.CryptoManager opID=4b3daa3a-84dd-11e8-4b-bc45 user=vpxuser:com.vmware.vsan.health] Adding host key 04f631cc-84dd-11e8-8194-0xxxxxxxxxx6 to the Key Cache
2018-07-11T07:37:46.166Z info hostd[2099589] [Originator@6876 sub=Hostsvc.CryptoManager opID=4b3daa3a-84dd-11e8-4b-bc45 user=vpxuser:com.vmware.vsan.health] Host has been placed in Crypto-safe state

  • Syslog.log can also be reviewed for TCP communication errors (for example, the port is blocked or the KMS server is not responding), as demonstrated below:

2018-07-11T09:27:32Z jumpstart[2097479]: VsanUtil: Failed to connect to key server, Err:QLC_ERR_COMMUNICATE Failed to establish TCP connection to server
2018-07-11T09:27:32Z jumpstart[2097479]: 2018-07-11T09:27:32Z jumpstart[2097479]: VsanInfoImpl: Failed to retrieve key 04f631cc-84dd-11e8-8194-0##########6 from KMS KMS1: Err:QLC_ERR_COMMUNICATE Failed to establish TCP connection to server
2018-07-11T09:28:32Z jumpstart[2097479]: VsanUtil: Failed to connect to key server, Err:QLC_ERR_COMMUNICATE Failed to establish TCP connection to server
2018-07-11T09:28:32Z jumpstart[2097479]: 2018-07-11T09:28:32Z jumpstart[2097479]: VsanInfoImpl: Failed to retrieve key 04f631cc-84dd-11e8-8194-0##########6 from KMS KMS2: Err:QLC_ERR_COMMUNICATE Failed to establish TCP connection to server
2018-07-11T09:28:32Z jumpstart[2097479]: 2018-07-11T09:28:32Z jumpstart[2097479]: VsanInfoImpl: Failed to retrieve host key from KMS: Failed to retrieve key from key management server cluster HyTrust. Will have 1 retries.
2018-07-11T09:28:37Z jumpstart[2097479]: VsanUtil: Get kms client key and cert, old:1
2018-07-11T09:29:37Z jumpstart[2097479]: VsanUtil: Failed to connect to key server, Err:QLC_ERR_COMMUNICATE Failed to establish TCP connection to server
2018-07-11T09:29:37Z jumpstart[2097479]: 2018-07-11T09:29:37Z jumpstart[2097479]: VsanInfoImpl: Failed to retrieve key 04f631cc-84dd-11e8-8194-0##########6 from KMS KMS1: Err:QLC_ERR_COMMUNICATE Failed to establish TCP connection to server
2018-07-11T09:30:37Z jumpstart[2097479]: VsanUtil: Failed to connect to key server, Err:QLC_ERR_COMMUNICATE Failed to establish TCP connection to server
2018-07-11T09:30:37Z jumpstart[2097479]: 2018-07-11T09:30:37Z jumpstart[2097479]: VsanInfoImpl: Failed to retrieve key 04f631cc-84dd-11e8-8194-0##########6 from KMS KMS2: Err:QLC_ERR_COMMUNICATE Failed to establish TCP connection to server
2018-07-11T09:30:37Z jumpstart[2097479]: 2018-07-11T09:30:37Z jumpstart[2097479]: VsanInfoImpl: Failed to retrieve host key from KMS: Failed to retrieve key from key management server cluster HyTrust. Will have 0 retries.
2018-07-11T09:30:37Z jumpstart[2097479]: VsanInfoImpl: Failed to load DEKs: Failed to retrieve key from key management server cluster HyTrust

NOTE: Since the host will attempt to communicate with each server in the KMS cluster, the presence of ‘QLC_ERR_COMMUNICATE’ typically indicates a network communication issue between the host and the KMS servers.
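
To pull the relevant vSAN encryption entries out of syslog.log in one pass, a grep based on the log tags shown above may help:

grep -E 'VsanUtil|VsanInfoImpl|QLC_ERR' /var/log/syslog.log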

  • If the problem originates from the client certificate or private key, syslog.log can be reviewed for errors such as the following (see the sketch after the log excerpt):

2018-07-11T10:19:16Z jumpstart[2097479]: VsanUtil: Get kms client key and cert, old:0
2018-07-11T10:19:16Z jumpstart[2097479]: VsanInfoImpl: Joining vSAN cluster 52faacd9-6a43-a600-e0b8-0##########b
2018-07-11T10:19:16Z jumpstart[2097479]: VsanInfoImpl: SyncConfigurationCallback called
2018-07-11T10:19:16Z jumpstart[2097479]: VsanSysinfo: Loading module cmmds
2018-07-11T10:19:16Z jumpstart[2097479]: VsanInfoImpl: Retrieving the host key with keyId: 04f631cc-84dd-11e8-8194-0##########6
2018-07-11T10:19:16Z jumpstart[2097479]: VsanUtil: Get kms client key and cert, old:1
2018-07-11T10:19:16Z jumpstart[2097479]: VsanUtil: GetKmsServerCerts Old KMS certs not found
2018-07-11T10:19:16Z jumpstart[2097479]: VsanInfoImpl: Failed to retrieve host key from KMS: Invalid key or certs. Will have 1 retries.
2018-07-11T10:19:21Z jumpstart[2097479]: VsanUtil: Get kms client key and cert, old:1
2018-07-11T10:19:21Z jumpstart[2097479]: VsanUtil: GetKmsServerCerts Old KMS certs not found
2018-07-11T10:19:21Z jumpstart[2097479]: VsanInfoImpl: Failed to retrieve host key from KMS: Invalid key or certs. Will have 0 retries.
2018-07-11T10:19:21Z jumpstart[2097479]: VsanInfoImpl: Failed to load DEKs: Invalid key or certs
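
When ‘Invalid key or certs’ is reported, one quick sanity check is to confirm that the client certificate and private key actually form a matching pair. A minimal sketch, assuming an RSA key (the two digests should be identical; the files should also match across all hosts in the cluster):

openssl x509 -noout -modulus -in /etc/vmware/ssl/vsan_kms_client.crt | openssl md5
openssl rsa -noout -modulus -in /etc/vmware/ssl/vsan_kms_client.key | openssl md5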

  • If there is a problem validating the server’s certificate, the issue will show up slightly differently. Syslog.log will show something like this if there is no KMS server cert saved in this location:

2018-07-11T10:45:50Z jumpstart[2097479]: VsanUtil: Get kms client key and cert, old:0
2018-07-11T10:45:50Z jumpstart[2097479]: VsanUtil: GetKmsServerCerts KMS certs not found
2018-07-11T10:45:50Z jumpstart[2097479]: VsanInfoImpl: Joining vSAN cluster 52faacd9-6a43-a600-e0b8-0##########b
2018-07-11T10:45:50Z jumpstart[2097479]: VsanInfoImpl: SyncConfigurationCallback called
2018-07-11T10:45:50Z jumpstart[2097479]: VsanSysinfo: Loading module cmmds
2018-07-11T10:45:50Z jumpstart[2097479]: VsanInfoImpl: Retrieving the host key with keyId: 04f631cc-84dd-11e8-8194-0##########6
2018-07-11T10:45:50Z jumpstart[2097479]: VsanUtil: Get kms client key and cert, old:1
2018-07-11T10:45:50Z jumpstart[2097479]: VsanUtil: GetKmsServerCerts Old KMS certs not found
2018-07-11T10:45:50Z jumpstart[2097479]: VsanInfoImpl: Failed to retrieve host key from KMS: Invalid key or certs. Will have 1 retries.
2018-07-11T10:45:55Z jumpstart[2097479]: VsanUtil: Get kms client key and cert, old:1
2018-07-11T10:45:55Z jumpstart[2097479]: VsanUtil: GetKmsServerCerts Old KMS certs not found
2018-07-11T10:45:55Z jumpstart[2097479]: VsanInfoImpl: Failed to retrieve host key from KMS: Invalid key or certs. Will have 0 retries.
2018-07-11T10:45:55Z jumpstart[2097479]: VsanInfoImpl: Failed to load DEKs: Invalid key or certs

This message indicates that the KMS’s cert has not been stored.
 
Restoring the KMS server certificate:
 

  • On a properly functioning host, there should be a copy of the server certificate for each server in the KMS cluster providing the key. For example:

[root@hostname:/etc/vmware/ssl] cat vsan_kms_castore.pem
-----BEGIN CERTIFICATE-----
MIIDvTCCAqWgAwIBAgIFANEDIiYwDQYJKoZIhvcNAQELBQAwVzELMAkGA1UEBhMC
VVMxFTATBgNVBAoTDEh5VHJ1c3QgSW5jLjExMC8GA1UEAxMoSHlUcnVzdCBLZXlD
<<__snip__>>
SpQQLt8G3Zk9Yz75yfjSREHbJ0XHLqX25k9SwJaP20vf+Bz/tQFilpg+To6plw2z
xYzApJGjNEL0+k7W5YquUr5foFjAlrNW3GNzzYtt3CqKDSt201BchE82UYBgTzlb
MA==
-----END CERTIFICATE-----
-----BEGIN CERTIFICATE-----
MIIDvTCCAqWgAwIBAgIFAM+OvdYwDQYJKoZIhvcNAQELBQAwVzELMAkGA1UEBhMC
VVMxFTATBgNVBAoTDEh5VHJ1c3QgSW5jLjExMC8GA1UEAxMoSHlUcnVzdCBLZXlD
<<__snip__>>
m6hsrmBfRTSTbPpRimDXXQ7weBehjCHkIpKOqBUtNRVN4qArvkSO/cwZCB/7y7Gr
3A==
-----END CERTIFICATE-----

 
Option 1:

Repopulate this file by opening a browser, pointing it at https://<KMS_Address>:5696, and copying the certificate presented by the browser.

  • The certificate will need to be converted to a PEM file and copied into the vsan_kms_castore.pem file (see the sketch after this list).
    • If more than one server exists per cluster, append any additional certificates so they appear one after another, with no spaces, in the vsan_kms_castore.pem file. (You will need to use this option if the server certificate has been changed.)
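
As an alternative to the browser method, the certificate can often be retrieved in PEM form directly from the ESXi shell with openssl. A minimal sketch, assuming the KMS presents its server certificate during the handshake (verify the certificate before trusting it, then append it to vsan_kms_castore.pem as described above):

openssl s_client -connect <KMS_Address>:5696 -showcerts </dev/null 2>/dev/null | openssl x509 -outform PEM > /tmp/kms_server_cert.pem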


Option 2: copy the file from a working host if the server certificate has not been changed.
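
If SSH is enabled on both hosts, the file can be copied directly from a working host. A minimal sketch (run from the working host; substitute the real host name):

scp /etc/vmware/ssl/vsan_kms_castore.pem root@<problem_host>:/etc/vmware/ssl/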
 

  • If there is a server certificate saved in the /etc/vmware/ssl folder, but it is not the correct certificate, you should see errors like the following in the syslog.log:

2018-07-11T10:59:34Z jumpstart[2097476]: VsanUtil: Failed to connect to key server, QLC_ERR_NEED_AUTH
2018-07-11T10:59:34Z jumpstart[2097476]: VsanInfoImpl: Failed to retrieve key 04f631cc-84dd-11e8-8194-0##########6 from KMS KMS1: QLC_ERR_NEED_AUTH

  • QLC_ERR_NEED_AUTH is a clear indication that the host’s copy of the server certificate does not match the certificate the server presents during the SSL handshake. If this is the case and vCenter is not available, you will have to use Option 1 above.
  • If vCenter is available, use the UI options to re-establish trust with the KMS.

    This action will need to be performed for each KMS server individually.  

Note: Please be careful when rebooting hosts, as this can negatively impact troubleshooting and recovery efforts.

Additional Information