Diego v2.49 and later versions are incompatible with older Diego versions in Tanzu Application Service for VMs

Article ID: 298082

Products

VMware Tanzu Application Service for VMs

Issue/Introduction

Starting with Diego v2.49, Diego is built with Go (GOLANG) 1.15, whose net/http package no longer accepts the Transfer-Encoding: identity header. Diego v2.48 and earlier are built with Go 1.14, which still accepts it, and some Diego servers, such as the Bulletin Board System (BBS), set Transfer-Encoding: identity on the streaming messages they send. As a result, a Diego server and client are incompatible when one of them runs Diego v2.49 or later and the other runs Diego v2.48 or earlier.
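
The following Go sketch illustrates the client-side half of this change. It feeds two hand-written HTTP/1.1 responses (hypothetical examples, not captured BBS traffic) to the standard library's http.ReadResponse, the parser an http.Client's Transport also uses for responses. Built with Go 1.15 or later, the response carrying Transfer-Encoding: identity should fail to parse, while the chunked one should parse normally. The names identityResponse, chunkedResponse, and parseResponse are illustrative only.

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

// identityResponse is a hypothetical response that uses the
// "Transfer-Encoding: identity" header described in this article.
const identityResponse = "HTTP/1.1 200 OK\r\n" +
	"Transfer-Encoding: identity\r\n" +
	"\r\n" +
	"event data\n"

// chunkedResponse carries the same 11-byte body ("b" in hex) using
// chunked encoding, which Go 1.15+ clients still accept.
const chunkedResponse = "HTTP/1.1 200 OK\r\n" +
	"Transfer-Encoding: chunked\r\n" +
	"\r\n" +
	"b\r\nevent data\n\r\n" +
	"0\r\n\r\n"

// parseResponse runs raw bytes through net/http's HTTP/1.x response parser.
// A nil request means ReadResponse assumes the response is to a GET.
func parseResponse(raw string) error {
	resp, err := http.ReadResponse(bufio.NewReader(strings.NewReader(raw)), nil)
	if err != nil {
		return err
	}
	// Closing drains the body, so a malformed chunked body would surface here.
	return resp.Body.Close()
}

func main() {
	cases := []struct {
		name string
		raw  string
	}{
		{"identity", identityResponse},
		{"chunked", chunkedResponse},
	}
	for _, c := range cases {
		if err := parseResponse(c.raw); err != nil {
			fmt.Printf("%s: rejected: %v\n", c.name, err)
		} else {
			fmt.Printf("%s: accepted\n", c.name)
		}
	}
}
```

With a Go 1.15+ toolchain the identity case should report an "unsupported transfer encoding" error; Go 1.14 still accepted identity, which mirrors the split between Diego v2.48 and v2.49.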

For example, this situation can occur after upgrading the Tanzu Application Service for VMs (TAS for VMs) and Isolation Segment (ISO) tiles: TAS for VMs v2.9.19 includes Diego v2.48, while ISO v2.9.9 includes Diego v2.49. After the upgrade, the route-emitter on the ISO diego_cell attempts to establish a socket connection to the BBS on TAS for VMs every second.

On the ISO diego_cell, many socket connections to the BBS (port 8889) are in the FIN_WAIT2 state, and their source ports change constantly:
tcp        0      0 10.0.0.1:57040     10.0.0.2:8889      FIN_WAIT2   -
tcp        0      0 10.0.0.1:57032     10.0.0.2:8889      FIN_WAIT2   -
tcp        0      0 10.0.0.1:56978     10.0.0.2:8889      FIN_WAIT2   -
tcp        0      0 10.0.0.1:40344     10.0.0.2:8889      ESTABLISHED 2421440/route-emitt
tcp        0      0 10.0.0.1:56988     10.0.0.2:8889      FIN_WAIT2   -
tcp        0      0 10.0.0.1:40346     10.0.0.2:8889      ESTABLISHED 2421440/route-emitt
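
To check whether a diego_cell is in this state, the states of connections to the BBS port can be tallied. The Go sketch below parses netstat output in the column layout shown above (the exact netstat command used is not stated in this article) and counts states for sockets whose local or foreign address uses a given port. The function name countStatesForPort is hypothetical, and the column positions are an assumption based on the sample.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strings"
)

// sampleNetstat copies the connection lines shown above from the ISO diego_cell.
const sampleNetstat = `tcp        0      0 10.0.0.1:57040     10.0.0.2:8889      FIN_WAIT2   -
tcp        0      0 10.0.0.1:57032     10.0.0.2:8889      FIN_WAIT2   -
tcp        0      0 10.0.0.1:56978     10.0.0.2:8889      FIN_WAIT2   -
tcp        0      0 10.0.0.1:40344     10.0.0.2:8889      ESTABLISHED 2421440/route-emitt
tcp        0      0 10.0.0.1:56988     10.0.0.2:8889      FIN_WAIT2   -
tcp        0      0 10.0.0.1:40346     10.0.0.2:8889      ESTABLISHED 2421440/route-emitt`

// countStatesForPort tallies TCP states for sockets whose local or foreign
// address ends in ":<port>". Assumed column order (from the sample):
// proto, recv-q, send-q, local address, foreign address, state, pid/program.
func countStatesForPort(output, port string) map[string]int {
	counts := make(map[string]int)
	suffix := ":" + port
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		// Skip header lines and anything that is not a TCP socket row.
		if len(fields) < 6 || !strings.HasPrefix(fields[0], "tcp") {
			continue
		}
		if strings.HasSuffix(fields[3], suffix) || strings.HasSuffix(fields[4], suffix) {
			counts[fields[5]]++
		}
	}
	return counts
}

func main() {
	counts := countStatesForPort(sampleNetstat, "8889")
	// Print states in a stable, sorted order.
	states := make([]string, 0, len(counts))
	for s := range counts {
		states = append(states, s)
	}
	sort.Strings(states)
	for _, s := range states {
		fmt.Printf("%-12s %d\n", s, counts[s])
	}
}
```

On the sample it should report four FIN_WAIT2 and two ESTABLISHED connections. Because it matches either end of the socket, the same function could also be pointed at netstat output from the BBS VM to count CLOSE_WAIT connections on port 8889.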

A TCP packet capture on the ISO diego_cell showed packets with the SYN flag being sent to the active BBS instance every second.

With debug logging enabled on the active BBS instance, the logs show the ISO diego_cell sending a "/v1/events/lrp_instances.r1" request every second:
{"timestamp":"2021-10-26T06:54:39.981561459Z","level":"debug","source":"bbs","message":"bbs.request.serving","data":{"method":"POST","remote_addr":"10.0.0.2:36504","request":"/v1/events/lrp_instances.r1","session":"2054"}}
{"timestamp":"2021-10-26T06:54:40.984055612Z","level":"debug","source":"bbs","message":"bbs.request.serving","data":{"method":"POST","remote_addr":"10.0.0.2:36518","request":"/v1/events/lrp_instances.r1","session":"2067"}}
{"timestamp":"2021-10-26T06:54:41.986528557Z","level":"debug","source":"bbs","message":"bbs.request.serving","data":{"method":"POST","remote_addr":"10.0.0.2:36522","request":"/v1/events/lrp_instances.r1","session":"2079"}} 

Because of a separate known issue in which BBS socket connections could be kept alive unnecessarily, many connections in the CLOSE_WAIT state accumulated on the active BBS. Eventually, the available file descriptors were exhausted and the BBS logged a "too many open files" error:
2021/09/20 20:15:26 http: Accept error: accept tcp 0.0.0.0:8889: accept4: too many open files; retrying in 1s  
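
Exhaustion of this kind can be anticipated by comparing a process's open file descriptors with its soft limit. On Linux, /proc/<pid>/fd lists a process's open descriptors and /proc/<pid>/limits includes a "Max open files" row. The Go sketch below reads both; fdUsage is a hypothetical helper, it is Linux-only, and inspecting the BBS process would require its PID and sufficient privileges (typically root or the same user).

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// fdUsage reports how many file descriptors a Linux process has open and its
// soft "Max open files" limit. Pass "self" to inspect the current process
// (the count then includes the descriptor used to list the directory).
// A soft limit of "unlimited" is reported as -1.
func fdUsage(pid string) (open int, softLimit int64, err error) {
	entries, err := os.ReadDir("/proc/" + pid + "/fd")
	if err != nil {
		return 0, 0, err
	}
	limits, err := os.ReadFile("/proc/" + pid + "/limits")
	if err != nil {
		return 0, 0, err
	}
	for _, line := range strings.Split(string(limits), "\n") {
		if !strings.HasPrefix(line, "Max open files") {
			continue
		}
		// Remaining columns: soft limit, hard limit, units.
		fields := strings.Fields(strings.TrimPrefix(line, "Max open files"))
		if len(fields) == 0 {
			break
		}
		if fields[0] == "unlimited" {
			return len(entries), -1, nil
		}
		soft, err := strconv.ParseInt(fields[0], 10, 64)
		if err != nil {
			return 0, 0, err
		}
		return len(entries), soft, nil
	}
	return 0, 0, fmt.Errorf("no \"Max open files\" row in /proc/%s/limits", pid)
}

func main() {
	// "self" inspects this program; on the BBS VM you would pass the BBS PID.
	open, soft, err := fdUsage("self")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("open file descriptors: %d, soft limit: %d\n", open, soft)
}
```

A count that keeps climbing toward the soft limit while CLOSE_WAIT connections accumulate is consistent with the failure shown in the log line above.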


Environment

Product Version: 2.9

Resolution

Normally, the route-emitter establishes a stable socket connection to the active BBS instance and submits a "/v1/events/lrp_instances.r1" request. Because of the incompatibility described above, however, the route-emitter closes the socket connection as soon as it receives the first response from the BBS, and then retries the same operation every second indefinitely.

Currently, only the TAS for VMs, ISO, and TAS for Windows tiles include this Diego release. To avoid this incompatibility, do not install these tiles on the same foundation with a mix of Diego releases v2.48 or earlier and v2.49 or later.
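
The rule above can be expressed as a small check over the Diego release bundled in each tile. In this Go sketch, the tile-to-Diego-version map is supplied by hand (hypothetical input; the sketch does not query Ops Manager or any other API), and mixedDiego reports whether the versions fall on both sides of the v2.48/v2.49 boundary.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMajorMinor extracts major and minor numbers from versions such as
// "2.48" or "v2.49.1"; any patch component is ignored.
func parseMajorMinor(v string) (int, int, error) {
	parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
	if len(parts) < 2 {
		return 0, 0, fmt.Errorf("version %q: want at least major.minor", v)
	}
	major, err := strconv.Atoi(parts[0])
	if err != nil {
		return 0, 0, fmt.Errorf("version %q: %w", v, err)
	}
	minor, err := strconv.Atoi(parts[1])
	if err != nil {
		return 0, 0, fmt.Errorf("version %q: %w", v, err)
	}
	return major, minor, nil
}

// isV249OrLater reports whether a Diego release is on the Go 1.15 side
// of the boundary described in this article.
func isV249OrLater(major, minor int) bool {
	return major > 2 || (major == 2 && minor >= 49)
}

// mixedDiego reports whether the given tiles (name -> Diego version) include
// both Diego v2.48 or earlier and Diego v2.49 or later.
func mixedDiego(tiles map[string]string) (bool, error) {
	var older, newer bool
	for tile, v := range tiles {
		major, minor, err := parseMajorMinor(v)
		if err != nil {
			return false, fmt.Errorf("%s: %w", tile, err)
		}
		if isV249OrLater(major, minor) {
			newer = true
		} else {
			older = true
		}
	}
	return older && newer, nil
}

func main() {
	// Example from this article: TAS for VMs v2.9.19 (Diego v2.48)
	// alongside ISO v2.9.9 (Diego v2.49).
	mixed, err := mixedDiego(map[string]string{
		"TAS for VMs v2.9.19": "2.48",
		"ISO v2.9.9":          "2.49",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("mixed Diego releases:", mixed)
}
```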

It is also recommended to upgrade to TAS for VMs v2.9.27 or later, v2.10.19 or later, v2.11.7 or later, or v2.12, which include the following fix: "[Security Fix] Fixes an issue where BBS socket connections could be kept alive unnecessarily."
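
The version list above can be encoded as per-release-line minimum patches. This Go sketch is a hypothetical helper, not a VMware tool: it expects plain major.minor.patch strings, treats any v2.12 patch as fixed because the article gives v2.12 without a patch number, and returns an error for release lines the article does not list rather than guessing.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// minFixedPatch maps each TAS for VMs release line listed in this article
// to the first patch that includes the BBS keep-alive fix.
// Assumption: "v2.12" with no patch number means every 2.12.x patch.
var minFixedPatch = map[string]int{
	"2.9":  27,
	"2.10": 19,
	"2.11": 7,
	"2.12": 0,
}

// hasKeepAliveFix reports whether a version such as "2.10.21" or "v2.9.27"
// meets the listed minimum for its release line.
func hasKeepAliveFix(version string) (bool, error) {
	parts := strings.Split(strings.TrimPrefix(version, "v"), ".")
	if len(parts) != 3 {
		return false, fmt.Errorf("version %q: want major.minor.patch", version)
	}
	line := parts[0] + "." + parts[1]
	minPatch, ok := minFixedPatch[line]
	if !ok {
		return false, fmt.Errorf("release line %s is not listed in this article", line)
	}
	patch, err := strconv.Atoi(parts[2])
	if err != nil {
		return false, fmt.Errorf("version %q: %w", version, err)
	}
	return patch >= minPatch, nil
}

func main() {
	for _, v := range []string{"2.9.19", "2.9.27", "2.10.18", "2.11.7", "2.12.3"} {
		ok, err := hasKeepAliveFix(v)
		if err != nil {
			fmt.Printf("%s: %v\n", v, err)
			continue
		}
		fmt.Printf("%s: includes fix = %t\n", v, ok)
	}
}
```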