HCX - Health Check and Best Practices

Article ID: 371941


Updated On:

Products

VMware HCX

Issue/Introduction

This article provides recommendations for conducting a health check of HCX products, along with general best practices. These are general recommendations; they may not apply to every scenario or cover every aspect of health checks or best practices. Links to additional documentation are provided for further detail on the topics discussed. For a comprehensive health check and detailed best-practice guidance, please engage VMware by Broadcom Professional Services.

Please note: In public cloud or hybrid deployments, it may not be possible to implement the recommendations presented in this article.

Environment

HCX

Resolution

General Guidelines:

  • Ensure interoperability of HCX with all products in your environment (e.g., vCenter, ESXi, NSX, SRM, Cloud Director, etc.) by referencing the Interoperability Matrix.
  • It is recommended to upgrade both HCX Managers (Connector and Cloud) and Fleet Appliances to the latest available release, as the update may include new features, software fixes and security patches. Confirm that the target HCX version is compatible with other products in the environment, and ensure the HCX upgrade path is supported by reviewing the release notes of the target version.
  • For information regarding support and versions in legacy environments (vSphere 6.5 / vSphere 6.7), consult: HCX Support Policy for Legacy vSphere Environments.
  • Ensure your system meets the necessary requirements to run HCX Manager and Fleet Appliances.
  • Ensure all required ports are open. For information on the required ports, access VMware Ports and Protocols and Network Diagrams for VMware HCX. A basic connectivity sketch is included after this list.
  • Ensure that you do not exceed supported configurations for your environment and stay within the limits supported by HCX (e.g., HCX Sites and Managed Services, Migrations, Network Extension, etc.). Please consult the VMware Configuration Maximums for guidance.
  • Backup HCX Manager by configuring backups using the appliance management interface at <https://hcx-ip-or-fqdn:9443>. Navigate to Administration -> Troubleshooting > Backup & Restore. The best practice is to schedule Daily backups. Restoring from backup files that are more than two days old is not supported due to potential inventory changes between the time of the backup and the present. For more information, access Backing Up HCX Manager.
  • Verify the HCX Manager system reports healthy connections to: vCenter Server, NSX Manager (if applicable), Cloud Director/RMQ (if applicable). Use the appliance management interface at <https://hcx-ip-or-fqdn:9443>. Navigate to Dashboard and confirm the status is healthy (green).
  • Verify that the HCX Manager reports healthy connections to the HCX Interconnect service components. Navigate to HCX Manager UI -> Interconnect > Service Mesh and confirm the status is healthy (green).
  • Verify that Site Pair configurations are healthy. Navigate to HCX Manager UI -> Site Pairing. For troubleshooting Site Pairing issues, refer to KB: HCX - Site Pairing Connectivity Diagnostics.
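
The following is a minimal sketch of a port reachability check supporting the guidelines above. The hostnames and port list are placeholders (9443 is the appliance management interface referenced above; take the authoritative list from VMware Ports and Protocols and Network Diagrams for VMware HCX), and it only validates TCP reachability; UDP-based tunnel ports and the Service Mesh data path should be verified with the built-in Run Diagnostics instead.

```python
#!/usr/bin/env python3
"""Quick TCP reachability check toward HCX management endpoints (sketch).

Hostnames and ports below are placeholders; use the authoritative list from
"VMware Ports and Protocols" and "Network Diagrams for VMware HCX".
"""
import socket

# Hypothetical endpoints -- replace with your HCX Manager / peer addresses.
CHECKS = [
    ("hcx-connector.example.com", 443),   # HTTPS to the HCX Manager UI/API (assumed)
    ("hcx-connector.example.com", 9443),  # appliance management interface (per this article)
    ("hcx-cloud.example.com", 443),       # remote HCX Cloud Manager (assumed)
]


def tcp_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for host, port in CHECKS:
        state = "open" if tcp_reachable(host, port) else "unreachable"
        print(f"{host}:{port} -> {state}")
```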

Migration Recommendations:

  • Navigate to HCX Manager UI -> Interconnect > Service Mesh > Run Diagnostics and review the results for any errors. The diagnostics will test connectivity from the IX appliance to the required components (e.g., vCenter, ESXi hosts, etc.) and identify any issues related to network communication. If there are any errors related to closed ports, review the network and firewall configuration. For more information on the required ports, refer to the VMware Ports and Protocols and Network Diagrams for VMware HCX.
  • In the HCX Manager UI, navigate to Transport Analytics to verify underlay network performance for Service Mesh uplinks. Based on the bandwidth reported by the uplink test and the bandwidth requirement of the HCX service, HCX calculates the underlay network performance associated with the uplink. From that calculation, HCX recommends a number of simultaneous migrations (currently available for Bulk Migration only) that the underlay network can support at the point when the test was run. For more information, consult Verifying Underlay Network Performance for Service Mesh Uplinks.
  • Ensure you meet the minimum network underlay requirements for HCX Migrations. For more details, visit Network Underlay Minimum Requirements.
  • Allocate IP addresses for the HCX migration appliance (HCX-IX) from existing Management or Replication networks to optimize the data path and simplify troubleshooting.
  • Use the HCX compute profile to configure CPU and Memory reservations. Resource reservations configured directly in vCenter Server do not persist across HCX lifecycle operations.
  • Avoid the use of Fully Automated DRS mode, as it can impact migration by disrupting checksum operations and potentially triggering a full synchronization, which delays migration progress. Note: If the Fully Automated DRS configuration must be used, the Service Mesh appliances (IX/WO) should be excluded to maximize service stability. This exclusion does not persist across upgrade and redeploy operations. Additional considerations for DRS rules include:
    • HCX migration appliances (IX) may benefit from anti-affinity rules to place IX appliances on different hosts when multiple service meshes are deployed. This allows vMotion/RAV (Replication Assisted vMotion) operations to be executed in parallel instead of queuing for serial execution (see the anti-affinity sketch after this list).
    • HCX WAN Optimization appliance may benefit from an affinity rule that places the HCX-WO and HCX-IX on the same host. This rule simplifies and optimizes the service chained traffic data path between HCX-WO and HCX-IX.
  • HCX WAN Optimization increases efficiency by providing data reduction when deployed on network underlays with bandwidth below 1 Gbps. When WAN Optimization is configured, ensure that the storage supports 5000 IOPS per WO appliance.
  • When using Bulk or RAV (Replication Assisted vMotion) migration, enable 'Seed Checkpoint'. In the event that a migration is unsuccessful or canceled, Seed Checkpoint retains the target disks created at the target site. Without Seed Checkpoint, the HCX roll back process cleans up the target disks created during the migration and all transferred data is lost. 
  • Ensure there is sufficient space in the target datastore. Up to 20% extra space may be used temporarily during the migration (see the headroom sketch after this list).
  • Running backup services that use VM snapshots, or taking snapshots of VMs while they are being migrated, can disrupt the migration process. For more information, consult HCX Interoperability with Backup Solutions.
  • Quiesce the VM before scheduling the migration to minimize data churn.
  • Overestimate the switchover window to accommodate the lengthy data checksum process and the instantiation of the VM on the target.
  • In certain cases, the Bulk migration workflow may take more time to complete the switchover stage when an extremely large VM is being migrated with HCX.
  • If a VM takes too long to shut down or cannot be shut down gracefully from the guest OS, the recommendation is to enable "Force Power Off" when scheduling the migrations. Refer to KB: HCX - Bulk Migration may fail due to "Invalid Power State" of VM.
  • Do not restart the app/web engine on the source or target HCX Manager during an ongoing migration, as it may impact the migration workflow.
  • Do not restart vCenter services during an ongoing migration.
  • Do not manually power off the source VM after the initial base sync completes, as it may impact the offline sync workflow.
  • For detailed information on Bulk migration operations and best practices, visit HCX: Bulk Migration Operations and Best Practices.
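
The anti-affinity recommendation above can be applied with any vSphere automation tooling. Below is a minimal pyVmomi sketch, using hypothetical vCenter, cluster, and appliance names, that submits a non-mandatory anti-affinity rule for two HCX-IX appliances. Because HCX upgrade and redeploy operations replace the appliances, the rule must be re-applied afterwards.

```python
#!/usr/bin/env python3
"""Sketch: DRS anti-affinity rule that keeps two HCX-IX appliances on separate hosts.

vCenter address, credentials, cluster name, and appliance names are hypothetical.
Rules created directly in vCenter are removed when HCX redeploys the appliances,
so re-apply them after upgrade or redeploy operations.
"""
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

VCENTER, USER, PWD = "vcenter.example.com", "administrator@vsphere.local", "********"
CLUSTER_NAME = "Compute-Cluster"            # hypothetical cluster name
IX_VM_NAMES = ["HCX-IX-I1", "HCX-IX-I2"]    # hypothetical IX appliance VM names


def find_by_name(content, vim_type, name):
    """Return the first managed object of the given type whose name matches."""
    view = content.viewManager.CreateContainerView(content.rootFolder, [vim_type], True)
    try:
        match = next((obj for obj in view.view if obj.name == name), None)
    finally:
        view.DestroyView()
    if match is None:
        raise LookupError(f"{vim_type.__name__} named {name!r} was not found")
    return match


ctx = ssl._create_unverified_context()      # lab convenience only; validate certificates in production
si = SmartConnect(host=VCENTER, user=USER, pwd=PWD, sslContext=ctx)
try:
    content = si.RetrieveContent()
    cluster = find_by_name(content, vim.ClusterComputeResource, CLUSTER_NAME)
    vms = [find_by_name(content, vim.VirtualMachine, n) for n in IX_VM_NAMES]

    # Non-mandatory anti-affinity rule: DRS separates the IX appliances when possible.
    rule = vim.cluster.AntiAffinityRuleSpec(
        name="hcx-ix-anti-affinity", enabled=True, mandatory=False,
        userCreated=True, vm=vms)
    spec = vim.cluster.ConfigSpecEx(
        rulesSpec=[vim.cluster.RuleSpec(info=rule, operation="add")])
    cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)
    print("Rule submitted; verify under Cluster > Configure > VM/Host Rules.")
finally:
    Disconnect(si)
```

An affinity rule that keeps HCX-WO and HCX-IX on the same host can be built the same way with vim.cluster.AffinityRuleSpec.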
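
As a worked example of the datastore sizing point above, the headroom sketch below adds the temporary ~20% overhead mentioned in this article to the combined size of a migration wave; all figures are hypothetical and should be replaced with the provisioned sizes of the VMs being migrated and the actual free space of the target datastore.

```python
# Illustrative arithmetic only: datastore headroom check for a migration wave.
HEADROOM = 0.20                         # up to 20% extra space may be used temporarily
vm_provisioned_gb = [500, 750, 1200]    # hypothetical provisioned disk sizes (GB)
datastore_free_gb = 3000                # hypothetical free space on the target datastore (GB)

required_gb = sum(vm_provisioned_gb) * (1 + HEADROOM)
print(f"Required with headroom: {required_gb:.0f} GB / Free: {datastore_free_gb} GB")
print("OK" if datastore_free_gb >= required_gb else "Insufficient space for this wave")
```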

Network Extension Recommendations:

  • Navigate to HCX Manager UI -> Interconnect > Service Mesh > Run Diagnostics and review the results for any errors. The diagnostics will test connectivity between the HCX NE tunnel endpoints and provide a traceroute for the data path or pinpoint the location of the traffic interruption.
  • Measure the network underlay's bandwidth using 'perftest all' from the HCX Central CLI. For details on running the perftest and interpreting the results, refer to the Network Underlay Characterization and HCX Performance Outcomes document, specifically pages 11 to 13. Performance can vary depending on factors such as MTU, latency, environment traffic, network bandwidth, CPU capabilities, and memory resources.
  • Ensure you meet the General Network Underlay Requirements. For more details, visit Network Underlay Minimum Requirements.
  • Use the HCX compute profile to configure CPU and Memory reservations. Resource reservations configured directly in vCenter Server do not persist across HCX lifecycle operations.
  • Before proceeding with a network extension, review the Network Extension documentation.
  • When using NE HA, it is recommended to place the management and uplink interfaces on different network ranges and to use different physical vmnic uplinks for these port groups, providing an additional path for BFD heartbeats.
  • Avoid the use of Fully Automated DRS mode, as excessive network path changes can result in the Network Extension flooding RARP for the VM traffic path adjustments. Note: If the Fully Automated DRS configuration must be used, the Service Mesh appliances (NE) should be excluded to maximize service stability. This exclusion does not persist across upgrade and redeploy operations.
  • Enable Traffic Engineering in HCX Enterprise: Application Path Resiliency and TCP Flow Conditioning:
    • The Application Path Resiliency service creates multiple tunnel flows, for both Interconnect and Network Extension traffic, which may follow multiple paths across the network infrastructure from the source to the destination data center. The service then intelligently forwards traffic through the tunnel over the optimal path and dynamically switches between tunnels depending on traffic conditions. Application Path Resiliency forwards traffic over one tunnel at a time and does not load balance across multiple paths.
    • The TCP Flow Conditioning service adjusts the segment size during the TCP connection handshake between endpoints across the Network Extension. This optimizes the average packet size to reduce fragmentation and lower the overall packet rate (see the MSS sizing sketch after this list).
  • When necessary, enable Mobility Optimized Networking (MON) to improve network performance and reduce latency for virtual machines that have been migrated to the cloud on an extended L2 segment. MON provides these improvements by allowing more granular control of routing to and from those virtual machines in the cloud. For more information, review the Understanding Network Extension with Mobility Optimized Networking documentation. For troubleshooting, refer to KB: HCX - Mobility Optimized Networking (MON) Troubleshooting Guide.
  • When modifying configurations in a production environment, exercise caution and schedule a maintenance window to avoid potential network outages. If required, please open a Proactive HCX Maintenance Window case.
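
To illustrate the segment-size adjustment performed by TCP Flow Conditioning, the MSS sizing sketch below shows the arithmetic of clamping the TCP MSS so that encapsulated frames fit the underlay MTU without fragmentation. The 150-byte tunnel overhead is an assumed placeholder, not a published HCX figure; the service performs the equivalent adjustment automatically during the TCP handshake.

```python
# Illustrative arithmetic only: why MSS clamping avoids fragmentation on an
# encapsulated path. Overhead values are assumptions, not HCX specifications.
UNDERLAY_MTU = 1500        # MTU of the underlay network (assumed)
TUNNEL_OVERHEAD = 150      # assumed encapsulation/encryption overhead, in bytes
IP_HEADER, TCP_HEADER = 20, 20

default_mss = 1500 - IP_HEADER - TCP_HEADER                    # negotiated on a plain 1500-MTU LAN
clamped_mss = UNDERLAY_MTU - TUNNEL_OVERHEAD - IP_HEADER - TCP_HEADER

print(f"Default MSS: {default_mss} bytes -> oversized once encapsulated, causes fragmentation")
print(f"Clamped MSS: {clamped_mss} bytes -> fits in one underlay frame after encapsulation")
```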

 

Additional Information