HCX - SM/NE appliance configuration issues related to Kafka - RecordTooLargeException

Article ID: 321660

Updated On:

Products

VMware HCX
VMware Cloud on AWS

Issue/Introduction

This article describes a configuration failure that affects HCX Service Mesh fleet appliances and explains how to recover from it.

Symptoms:
During an upgrade, redeploy, resync, or any other configuration operation on a Service Mesh (SM) that contains more than 13 fleet appliances (counting IX/NE/WO/SGW/SDR), the following exception is written to the log and the appliance lifecycle workflow does not proceed:

2022-08-01 21:05:17.106 UTC [RemotingService_SvcThread-32926, Ent: HybridityAdmin, , TxId: ########-####-####-####-########b813] WARN  c.v.v.h.m.k.KafkaProducerDelegate- Publish failed and will retry
java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.RecordTooLargeException: The request included a message larger than the max message size the server will accept.

Location of App Engine log:

  • HCX Manager : /common/log/admin/app.log
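
To confirm the symptom, the App Engine log can be searched for the exception name. The following is a minimal sketch in Python, assuming the script runs somewhere the log file is readable; the function names `find_record_too_large` and `scan_log` are illustrative and are not part of any HCX tooling.

```python
from typing import Iterable, List

# Exception name taken from the log excerpt in the Symptoms section.
MARKER = "RecordTooLargeException"


def find_record_too_large(lines: Iterable[str]) -> List[str]:
    """Return every line that mentions the Kafka RecordTooLargeException."""
    return [line.rstrip("\n") for line in lines if MARKER in line]


def scan_log(path: str = "/common/log/admin/app.log") -> List[str]:
    """Scan the HCX Manager App Engine log (path as listed above) for the exception."""
    # errors="replace" keeps the scan going if the log contains odd bytes.
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_record_too_large(handle)
```

Calling `scan_log()` returns the matching lines; an empty list suggests the failure has a different cause.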



Resolution

This behavior is expected given the Kafka message size allocated per SM in the backend. The recommendation is to follow the HCX Service Appliance Provisioning Concurrency limit specified in the Config Max Guide.

Workaround:
For greenfield/brownfield deployments, if more Network Extension (NE) appliances are needed to meet the required number of PG/extensions, the suggestion is as follows:

  • Deploy additional SMs per HCX site-paired manager, up to 32.
  • Distribute the additional NE appliances across several SMs, keeping to the limit of 13 appliances per SM specified above.
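
As a rough planning aid, the number of Service Meshes needed for a given NE count can be estimated with simple arithmetic. The sketch below assumes each SM carries at most 10 NE appliances (the 13-appliance limit with one slot each reserved for IX, WO, and SGW/SDR, as broken down under Impact/Risks) and at most 32 SMs per site pair; the function name is hypothetical.

```python
import math

MAX_NE_PER_SM = 10         # assumption: 13 minus 1 IX, 1 WO, 1 SGW/SDR
MAX_SM_PER_SITE_PAIR = 32  # upper bound stated in the workaround


def service_meshes_needed(ne_count: int) -> int:
    """Estimate how many Service Meshes are needed to host ne_count NE appliances."""
    if ne_count < 0:
        raise ValueError("ne_count must not be negative")
    meshes = math.ceil(ne_count / MAX_NE_PER_SM)
    if meshes > MAX_SM_PER_SITE_PAIR:
        raise ValueError(
            f"{ne_count} NE appliances would need {meshes} Service Meshes, "
            f"above the limit of {MAX_SM_PER_SITE_PAIR}"
        )
    return meshes
```

For example, `service_meshes_needed(25)` returns 3, since two SMs can hold only 20 NE appliances under this assumption.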

For brownfield-specific deployments, where a given SM already contains more than 13 appliances and the NE appliance count alone exceeds 10, follow the steps below:

  • If an SM appliance UPGRADE needs to be performed:
      • Go to Interconnect >> Service Mesh >> UPDATE APPLIANCES.
      • Untick all appliances in the window, then select up to 13 appliances together (including IX/WO/NE, etc.) for one concurrent upgrade task.
  • If an SM appliance REDEPLOY needs to be performed:
      • Go to Interconnect >> Service Mesh >> VIEW APPLIANCES.
      • Untick all appliances in the window, then select up to 13 appliances together (including IX/WO/NE, etc.) for one concurrent redeploy task.
  • For all other SM appliance lifecycle jobs, such as MTU modifications under a Network Profile (NP) or Compute Profile (CP) updates where an SM "RESYNC" needs to be performed, open a service request with the VMware support team for further assistance.
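
When an SM holds more than 13 appliances, it can help to plan the upgrade or redeploy rounds before opening the selection window. The following sketch only splits a list of appliance names into groups of at most 13; the names are made up for illustration, and the selection itself is still done manually in the HCX UI.

```python
from typing import Iterator, List, Sequence

MAX_CONCURRENT = 13  # per-SM concurrency limit described in this article


def batches(appliances: Sequence[str], size: int = MAX_CONCURRENT) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` appliances."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(appliances), size):
        yield list(appliances[start:start + size])


# Hypothetical SM with 1 IX, 1 WO and 14 NE appliances (16 in total).
example = ["IX-1", "WO-1"] + [f"NE-{i}" for i in range(1, 15)]
plan = list(batches(example))
```

Here `plan` contains two rounds, one with 13 appliances and one with the remaining 3, each of which would be selected as a separate concurrent task.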



Additional Information

Impact/Risks:
This issue may affect both new and existing Service Mesh operations when the number of fleet appliances per SM exceeds the recommended value of 13 [1 IX + 1 WO + 1 SGW/SDR + 10 NE], as specified in the VMware HCX Config Max Guide.
IX - Interconnect Appliance
WO - WAN Optimization Appliance
NE - Network Extension Appliance
SGW - Sentinel Gateway Appliance at Source for OSAM
SDR - Sentinel Receiver Appliance at Target for OSAM
This affects all SM appliance lifecycle jobs, including Upgrade, Redeploy, and Resync operations triggered by configuration changes such as MTU modifications under a Network Profile or changes to Compute Profile parameters.