Assist in troubleshooting vSAN Encryption issues.
- Unable to configure encryption.
- Disks with vSAN encryption failed to mount.
- vSAN health warnings for KMS configuration.
VMware vSAN 7.x
VMware vSAN 8.x
The documented steps are intended to troubleshoot vSAN encryption issues.
Remember that once a host has possession of a key, the key is kept in memory until the host is rebooted. A loss of connectivity to the KMS or vCenter therefore causes no issues unless the host is rebooted. If the KMS has suffered a permanent failure and you know the keys cannot be retrieved again, DO NOT reboot any hosts.
esxcli vsan encryption info get
esxcli vsan encryption kms list
esxcli vsan encryption hostkey get
esxcli vsan encryption cert path list
esxcli vsan encryption cert get
In the absence of vCenter, you will need to verify which KMS servers the ESXi server will attempt to contact to retrieve any keys. To do this, grep for ‘kmip’ in the esx.conf file.
[root@hostname:~] grep kmip /etc/vmware/esx.conf
/vsan/kmipServer/child[0001]/old = "false"
/vsan/kmipServer/child[0001]/port = "5696"
/vsan/kmipServer/child[0001]/address = "####.####.####.####" <--- IP Address of KMS Servers
/vsan/kmipServer/child[0001]/name = "KMS#"
/vsan/kmipServer/child[0001]/kmipClusterId = "KMSCluster"
/vsan/kmipServer/child[0001]/kmskey = "KMSCluster/KMS#"
/vsan/kmipServer/child[0000]/kmskey = "KMSCluster/KMS#"
/vsan/kmipServer/child[0000]/kmipClusterId = "KMSCluster"
/vsan/kmipServer/child[0000]/name = "KMS#"
/vsan/kmipServer/child[0000]/address = "####.####.####.####" <--- IP Address of KMS Servers
/vsan/kmipServer/child[0000]/port = "5696"
/vsan/kmipServer/child[0000]/old = "false"
/vsan/kmipClusterId = "KMSCluster"
In this example, there is one KMIP cluster, called ‘KMSCluster’, and 2 KMIP servers in the cluster, indicated by the child[0000] and child[0001] entries. You can validate the IP/FQDN and ports by checking these entries.
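As a convenience, the child entries can be collapsed into one row per KMS server. The following is a minimal sketch, assuming the esx.conf lines have exactly the layout shown above; the function name list_kms_servers is hypothetical and not part of ESXi.

```shell
# list_kms_servers: read esx.conf-style lines on stdin and print one
# "index name address port" row per /vsan/kmipServer/child[NNNN] entry.
# Assumes the key/value layout shown in the example output above.
list_kms_servers() {
    awk -F'"' '
        /^\/vsan\/kmipServer\/child\[[0-9]+\]\// {
            # Pull the numeric index out of child[NNNN].
            match($1, /child\[[0-9]+\]/)
            idx = substr($1, RSTART + 6, RLENGTH - 7)
            # Pull the attribute name (port, address, name, ...).
            field = $1
            sub(/^.*\]\//, "", field)
            sub(/ .*$/, "", field)
            # The value is the text between the first pair of double quotes.
            val[idx, field] = $2
            seen[idx] = 1
        }
        END {
            for (idx in seen)
                printf "%s %s %s %s\n", idx, val[idx, "name"], val[idx, "address"], val[idx, "port"]
        }
    ' | sort
}
```

On a host this could be used as: grep kmip /etc/vmware/esx.conf | list_kms_servers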
If the original KMS Server had been removed from vCenter, it must be added back to vCenter using exactly the same kmipClusterId, or the hosts will assume it is a brand new cluster and any keys referencing the Cluster as the source will not be retrievable.
As of ESXi 7.0, this configuration is read with configstorecli instead of from esx.conf. To get the KMS info from a host running 7.0 or higher that uses a Native Key Provider, run the following command:
[root@hostname:~] configstorecli config current get -c esx -g trusted_infrastructure -k "kms_providers"
[
{
"native_provider": {
"key_derivation_key": "*******",
"key_id": "########-####-####-####-############"
},
"provider": "<name>",
"type": "NATIVE"
}
]
"providers": [
{
"key_server": {
"connection_timeout": -1,
"kmip_server": {
"servers": [
{
"hostname": "<providername>:kmx",
"name": "NativeKeyProvider",
"port": 0
}
],
"username": ""
},
"proxy_server": {
"hostname": "",
"port": -1
},
"type": "KMIP"
},
"master_key_id": "<keyid>",
"old": false,
"provider": "NativeKeyProvider"
}
Note: Output modified to show only the relevant encryption information.
[root@hostname:~] configstorecli config current get -c vsan -g system -k "vsan"|less
"enabled": true,
"encryption": {
"changing": false,
"dek_generation_id": 1,
"enabled": true,
"erase_disks_before_use": false,
"host_key_id": "<host key ID for the host the command was run on>",
"kek_id": "<vSAN KEK>",
"kmip_cluster_id": "CloudLink Cluster"
"in_transit_encryption": {
"enabled": false,
"rekey_interval": 1440,
"state": "SETTLED"
"providers": [
{
"key_server": {
"connection_timeout": -1,
"kmip_server": {
"credential": "*******",
"servers": [
{
"hostname": "hostname.com",
"name": "hostname",
"port": 5696
}
],
"username": "kmip_user"
},
"proxy_server": {
"hostname": "",
"port": -1
},
"type": "KMIP"
},
"master_key_id": "CloudLink Cluster",
"old": false,
"provider": "hostname"
},
{
"key_server": {
"connection_timeout": -1,
"kmip_server": {
"credential": "*******",
"servers": [
{
"hostname": "hostname.com",
"name": "hostname",
"port": 5696
}
],
"username": "kmip_user"
},
"proxy_server": {
"hostname": "",
"port": -1
},
"type": "KMIP"
},
"master_key_id": "CloudLink Cluster",
"old": false,
"provider": "hostname"
},
Note: The preceding command output excerpts are only examples. Values such as hostnames, key IDs, and cluster names will vary depending on your environment.
[root@hostname:/var/log] cd /etc/vmware/ssl/
[root@hostname:/etc/vmware/ssl] ls
castore.pem openssl.cnf rui.crt vsan_kms_castore.pem vsan_kms_client.crt vsan_kms_client_old.crt vsanvp_castore.pem
iofiltervp.pem rui.bak rui.key vsan_kms_castore_old.pem vsan_kms_client.key vsan_kms_client_old.key
Check the /etc/vmware/ssl folder on the host to ensure that a copy of the vsan_kms_client.crt exists along with a copy of the private key (vsan_kms_client.key). These files should be identical on all hosts in the cluster.
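One way to confirm that the files are identical is to compare checksums gathered from each host. The following is a minimal sketch, assuming sha256sum is available in the host shell and that its output from every host has been collected into one stream; the function name digests_match is hypothetical.

```shell
# digests_match: read "<digest>  <file>" lines (sha256sum output, one line
# per host) on stdin and succeed only if every line carries the same digest.
# Empty input is treated as a failure.
digests_match() {
    awk '
        { seen[$1] = 1 }
        END {
            n = 0
            for (d in seen) n++
            exit (n == 1 ? 0 : 1)
        }
    '
}
```

For example, run sha256sum /etc/vmware/ssl/vsan_kms_client.crt on each host, paste the resulting lines into a file, and pipe that file into digests_match; a non-zero exit status means at least one host differs.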
The vsan_kms_castore.pem file is a copy of the server certificate that the host uses to compare with the cert returned by the KMIP server during initial SSL handshake. If the server cert has been changed and does not match what ESXi has stored here, the connection will not be established.
If vCenter is available and the host is missing any of this information, vCenter will provide the host with copies of the certificates it has stored in VECS. The certificates that will be provided to the host can be found in the VECS.
nc -z <KMS Server Address> 5696
Sample output:
# nc -z 192.xx.xx.xxx 5696
Connection to 192.xx.xx.xxx 5696 port [tcp/http] succeeded!
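When a cluster has several KMS servers, the same check can be repeated for each address. The following is a minimal sketch that uses only the nc -z form shown above; the function name check_kms_ports and the addresses in the usage line are placeholders.

```shell
# check_kms_ports: attempt a TCP connection to each KMS address on the given
# port using "nc -z", print one status line per address, and return non-zero
# if any address could not be reached.
check_kms_ports() {
    port=$1
    shift
    rc=0
    for addr in "$@"; do
        if nc -z "$addr" "$port" >/dev/null 2>&1; then
            echo "$addr:$port reachable"
        else
            echo "$addr:$port NOT reachable"
            rc=1
        fi
    done
    return $rc
}
```

Usage: check_kms_ports 5696 <KMS1 Address> <KMS2 Address>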
Verify the host is in crypto-safe mode.
To enter crypto-safe mode, the host must be able to retrieve a special key called the HostKey.
This key is separate from any other keys that would be required to encrypt VMs or the vSAN datastore. This key is used by the host to encrypt core dumps. Without access to this key, the host will be unable to request any other keys from the KMS server, even if it is accessible.
When vSAN Encryption was first enabled on the cluster, the host transitioned to ‘crypto-safe’ mode for the first time and was assigned a key to install as its HostKey. The host will always look for this key, based on the key identifier, when booting up. The host will NOT attempt to retrieve, nor will it request, a different key if the original key is not available. So for the host to re-enter crypto-safe, this key MUST be available.
To determine if a HostKey has been installed (i.e. the host is crypto-safe), you can use the UI (if available).
Check that Encryption Mode is enabled. If it is not, attempt to enable through the UI.
If the host will not enter encryption mode, then it cannot retrieve its HostKey.
If the UI is not available, you can use the crypto-util utility on the host to see if a HostKey has been installed or not.
[root@hostname:~] crypto-util keys getkidbyname HostKey
vmware:key/fqid/<VMWARE-NULL>/HyTrust/04f631cc%2d84dd%2d11e8%2d8194%2d00505698ddb6
If a key value is returned, the host is in crypto-safe mode. If the message indicates that a HostKey has not been established, then the host is not in crypto-safe mode.
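The returned value URL-encodes the hyphens of the key UUID as %2d, which makes it harder to compare with the key IDs that appear in logs. The following is a minimal sketch that converts it, assuming the single-line output format shown above; the function name hostkey_uuid is hypothetical.

```shell
# hostkey_uuid: read a key identifier of the form
#   vmware:key/fqid/<...>/<provider>/<url-encoded uuid>
# on stdin and print only the UUID, with %2d decoded back to "-".
hostkey_uuid() {
    sed -e 's|.*/||' -e 's/%2[dD]/-/g'
}
```

Usage: crypto-util keys getkidbyname HostKey | hostkey_uuid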
To find out which key the host requires to enter crypto-safe mode, look in the vCenter MOB. (The host MOB is no longer available directly, but the host's objects can be reached through vCenter.)
1. Navigate to the host page in the MOB, for example https://vcsa.domain.local/mob/?moid=host-18022. To find the host-id, navigate to vCenter Server and click the hostname in the cluster; the host-id (for example, host-18022) is part of the resulting URL.
2. Click CryptoKeyId
The value shown is the UUID of the key the host will require to enter crypto-safe mode.
3. Click ProviderId.
If vCenter is not available the HostKey identifier can only be gathered through log review.
In hostd.log:
Grep for ‘CryptoManager’ in the hostd.log to see the host adding keys to its keyCache. For example, a host logs the following when it successfully adds the HostKey to the cache:
[root@hostname:~] grep CryptoManager /var/log/hostd.log
2018-07-11T07:37:45.992Z info hostd[2099589] [Originator@6876 sub=Solo.Vmomi opID=4b3daa3a-84dd-11e8-4b-bc3b user=:com.vmware.vsan.health] Activation [N5Vmomi10ActivationE:0x000000a14601e520] : Invoke done [IsEnabled] on [vim.encryption.CryptoManagerHost:ha-crypto-manager]
--> object = 'vim.encryption.CryptoManagerHost:ha-crypto-manager',
2018-07-11T07:37:46.159Z info hostd[2099206] [Originator@6876 sub=Hostsvc.CryptoManager opID=4b3daa3a-84dd-11e8-4b-bc43 user=vpxuser:com.vmware.vsan.health] Host has been placed in Crypto-prepared state
2018-07-11T07:37:46.166Z info hostd[2099589] [Originator@6876 sub=Hostsvc.CryptoManager opID=4b3daa3a-84dd-11e8-4b-bc45 user=vpxuser:com.vmware.vsan.health] Adding host key 04f631cc-84dd-11e8-8194-0xxxxxxxxxx6 to the Key Cache
2018-07-11T07:37:46.166Z info hostd[2099589] [Originator@6876 sub=Hostsvc.CryptoManager opID=4b3daa3a-84dd-11e8-4b-bc45 user=vpxuser:com.vmware.vsan.health] Host has been placed in Crypto-safe state
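To pull just the HostKey identifier out of these entries, the "Adding host key" lines can be filtered. The following is a minimal sketch, assuming the message wording shown in the example above; the function name cached_hostkey_ids is hypothetical.

```shell
# cached_hostkey_ids: read hostd.log lines on stdin and print each distinct
# key ID that appears in an "Adding host key <id> to the Key Cache" message.
cached_hostkey_ids() {
    sed -n 's/.*Adding host key \([^ ]*\) to the Key Cache.*/\1/p' | sort -u
}
```

Usage: grep CryptoManager /var/log/hostd.log | cached_hostkey_ids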
2018-07-11T09:27:32Z jumpstart[2097479]: VsanUtil: Failed to connect to key server, Err:QLC_ERR_COMMUNICATE Failed to establish TCP connection to server
2018-07-11T09:27:32Z jumpstart[2097479]: 2018-07-11T09:27:32Z jumpstart[2097479]: VsanInfoImpl: Failed to retrieve key 04f631cc-84dd-11e8-8194-0##########6 from KMS KMS1: Err:QLC_ERR_COMMUNICATE Failed to establish TCP connection to server
2018-07-11T09:28:32Z jumpstart[2097479]: VsanUtil: Failed to connect to key server, Err:QLC_ERR_COMMUNICATE Failed to establish TCP connection to server
2018-07-11T09:28:32Z jumpstart[2097479]: 2018-07-11T09:28:32Z jumpstart[2097479]: VsanInfoImpl: Failed to retrieve key 04f631cc-84dd-11e8-8194-0##########6 from KMS KMS2: Err:QLC_ERR_COMMUNICATE Failed to establish TCP connection to server
2018-07-11T09:28:32Z jumpstart[2097479]: 2018-07-11T09:28:32Z jumpstart[2097479]: VsanInfoImpl: Failed to retrieve host key from KMS: Failed to retrieve key from key management server cluster HyTrust. Will have 1 retries.
2018-07-11T09:28:37Z jumpstart[2097479]: VsanUtil: Get kms client key and cert, old:1
2018-07-11T09:29:37Z jumpstart[2097479]: VsanUtil: Failed to connect to key server, Err:QLC_ERR_COMMUNICATE Failed to establish TCP connection to server
2018-07-11T09:29:37Z jumpstart[2097479]: 2018-07-11T09:29:37Z jumpstart[2097479]: VsanInfoImpl: Failed to retrieve key 04f631cc-84dd-11e8-8194-0##########6 from KMS KMS1: Err:QLC_ERR_COMMUNICATE Failed to establish TCP connection to server
2018-07-11T09:30:37Z jumpstart[2097479]: VsanUtil: Failed to connect to key server, Err:QLC_ERR_COMMUNICATE Failed to establish TCP connection to server
2018-07-11T09:30:37Z jumpstart[2097479]: 2018-07-11T09:30:37Z jumpstart[2097479]: VsanInfoImpl: Failed to retrieve key 04f631cc-84dd-11e8-8194-0##########6 from KMS KMS2: Err:QLC_ERR_COMMUNICATE Failed to establish TCP connection to server
2018-07-11T09:30:37Z jumpstart[2097479]: 2018-07-11T09:30:37Z jumpstart[2097479]: VsanInfoImpl: Failed to retrieve host key from KMS: Failed to retrieve key from key management server cluster HyTrust. Will have 0 retries.
2018-07-11T09:30:37Z jumpstart[2097479]: VsanInfoImpl: Failed to load DEKs: Failed to retrieve key from key management server cluster HyTrust
NOTE: Since the hosts will attempt to communicate with each server in the cluster, the presence of ‘QLC_ERR_COMMUNICATE’ typically indicates a network communication issue between the host and the KMS servers.
2018-07-11T10:19:16Z jumpstart[2097479]: VsanUtil: Get kms client key and cert, old:0
2018-07-11T10:19:16Z jumpstart[2097479]: VsanInfoImpl: Joining vSAN cluster 52faacd9-6a43-a600-e0b8-0b##########
2018-07-11T10:19:16Z jumpstart[2097479]: VsanInfoImpl: SyncConfigurationCallback called
2018-07-11T10:19:16Z jumpstart[2097479]: VsanSysinfo: Loading module cmmds
2018-07-11T10:19:16Z jumpstart[2097479]: VsanInfoImpl: Retrieving the host key with keyId: 04f631cc-84dd-11e8-8194-06##########
2018-07-11T10:19:16Z jumpstart[2097479]: VsanUtil: Get kms client key and cert, old:1
2018-07-11T10:19:16Z jumpstart[2097479]: VsanUtil: GetKmsServerCerts Old KMS certs not found
2018-07-11T10:19:16Z jumpstart[2097479]: VsanInfoImpl: Failed to retrieve host key from KMS: Invalid key or certs. Will have 1 retries.
2018-07-11T10:19:21Z jumpstart[2097479]: VsanUtil: Get kms client key and cert, old:1
2018-07-11T10:19:21Z jumpstart[2097479]: VsanUtil: GetKmsServerCerts Old KMS certs not found
2018-07-11T10:19:21Z jumpstart[2097479]: VsanInfoImpl: Failed to retrieve host key from KMS: Invalid key or certs. Will have 0 retries.
2018-07-11T10:19:21Z jumpstart[2097479]: VsanInfoImpl: Failed to load DEKs: Invalid key or certs
2018-07-11T10:45:50Z jumpstart[2097479]: VsanUtil: Get kms client key and cert, old:0
2018-07-11T10:45:50Z jumpstart[2097479]: VsanUtil: GetKmsServerCerts KMS certs not found
2018-07-11T10:45:50Z jumpstart[2097479]: VsanInfoImpl: Joining vSAN cluster 52faacd9-6a43-a600-e0b8-0##########b
2018-07-11T10:45:50Z jumpstart[2097479]: VsanInfoImpl: SyncConfigurationCallback called
2018-07-11T10:45:50Z jumpstart[2097479]: VsanSysinfo: Loading module cmmds
2018-07-11T10:45:50Z jumpstart[2097479]: VsanInfoImpl: Retrieving the host key with keyId: 04f631cc-84dd-11e8-8194-0##########6
2018-07-11T10:45:50Z jumpstart[2097479]: VsanUtil: Get kms client key and cert, old:1
2018-07-11T10:45:50Z jumpstart[2097479]: VsanUtil: GetKmsServerCerts Old KMS certs not found
2018-07-11T10:45:50Z jumpstart[2097479]: VsanInfoImpl: Failed to retrieve host key from KMS: Invalid key or certs. Will have 1 retries.
2018-07-11T10:45:55Z jumpstart[2097479]: VsanUtil: Get kms client key and cert, old:1
2018-07-11T10:45:55Z jumpstart[2097479]: VsanUtil: GetKmsServerCerts Old KMS certs not found
2018-07-11T10:45:55Z jumpstart[2097479]: VsanInfoImpl: Failed to retrieve host key from KMS: Invalid key or certs. Will have 0 retries.
2018-07-11T10:45:55Z jumpstart[2097479]: VsanInfoImpl: Failed to load DEKs: Invalid key or certs
This message indicates that the KMS’s cert has not been stored.
Copying the certificate from a working host:
[root@hostname:/etc/vmware/ssl] cat vsan_kms_castore.pem
-----BEGIN CERTIFICATE-----
MIIDvTCCAqWgAwIBAgIFANEDIiYwDQYJKoZIhvcNAQELBQAwVzELMAkGA1UEBhMC
VVMxFTATBgNVBAoTDEh5VHJ1c3QgSW5jLjExMC8GA1UEAxMoSHlUcnVzdCBLZXlD
<<__snip__>>
SpQQLt8G3Zk9Yz75yfjSREHbJ0XHLqX25k9SwJaP20vf+Bz/tQFilpg+To6plw2z
xYzApJGjNEL0+k7W5YquUr5foFjAlrNW3GNzzYtt3CqKDSt201BchE82UYBgTzlb
MA==
-----END CERTIFICATE-----
-----BEGIN CERTIFICATE-----
MIIDvTCCAqWgAwIBAgIFAM+OvdYwDQYJKoZIhvcNAQELBQAwVzELMAkGA1UEBhMC
VVMxFTATBgNVBAoTDEh5VHJ1c3QgSW5jLjExMC8GA1UEAxMoSHlUcnVzdCBLZXlD
<<__snip__>>
m6hsrmBfRTSTbPpRimDXXQ7weBehjCHkIpKOqBUtNRVN4qArvkSO/cwZCB/7y7Gr
3A==
-----END CERTIFICATE-----
Option 1: Repopulate this file by opening a browser, pointing it at https://<KMS_Address>:5696, and copying the cert presented by the browser into the vsan_kms_castore.pem file.
Option 2: Copy the file from a working host if the server cert has not been changed.
2018-07-11T10:59:34Z jumpstart[2097476]: VsanUtil: Failed to connect to key server, QLC_ERR_NEED_AUTH
2018-07-11T10:59:34Z jumpstart[2097476]: VsanInfoImpl: Failed to retrieve key 04f631cc-84dd-11e8-8194-0##########6 from KMS KMS1: QLC_ERR_NEED_AUTH
QLC_ERR_NEED_AUTH is a clear indication that the host’s copy of the server cert does not match the cert the server presents during the SSL handshake. If this is the case and vCenter is not available, you will have to use Option 1 above.
Note: Be careful when rebooting hosts, as this can negatively impact troubleshooting and recovery efforts.
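The log signatures described in this article can be tallied in one pass to suggest which failure to look at first. The following is a minimal sketch that only counts the message strings quoted in the examples above; the function name classify_kms_errors and its output format are hypothetical.

```shell
# classify_kms_errors: read log lines on stdin and print how many lines carry
# each failure signature discussed above:
#   communicate     - QLC_ERR_COMMUNICATE (network path to the KMS)
#   need_auth       - QLC_ERR_NEED_AUTH (stored server cert does not match)
#   certs_not_found - GetKmsServerCerts reporting that certs were not found
classify_kms_errors() {
    awk '
        /QLC_ERR_COMMUNICATE/                   { comm++ }
        /QLC_ERR_NEED_AUTH/                     { auth++ }
        /GetKmsServerCerts.*certs not found/    { nocert++ }
        END {
            printf "communicate=%d need_auth=%d certs_not_found=%d\n", comm, auth, nocert
        }
    '
}
```

Pipe the jumpstart log lines from the affected host into classify_kms_errors; a non-zero counter points to the corresponding section of this article.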