1) The observed load is on par with the most heavily loaded instances we have in US SaaS production. The TAS sizing was not appropriate for this load.
Detail: The issues seen in the customer environment were due to TAS utilization of over 120%. The vertex and edge caches were at full capacity, which caused the increase in response times. We upsized the TAS instance to provide enough headroom to absorb the spikes. We found that the load on the system increased by 100% on April 8th. We also see large vertices of 200 KiB to 300 KiB.
2) Applied configuration changes to apmservices-gateway to prevent intermittent restarts.
1) Identify what changed on March 8th that caused the spike in load and the increase in the count/size of vertices.
2) Ensure we are not overwhelming TAS from backend services once it is under load and ART goes up.
3) Explore whether the queries/stores executed have low-hanging-fruit potential for optimization. (Implementing the features discussed in ITC will take more time, testing, etc.)
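For item 2) above, one common way to stop backend services from piling load onto an already-struggling TAS is a circuit breaker in front of the client calls. The sketch below is illustrative only (class and threshold names are assumptions, not anything from our codebase): after a run of failures or timeouts it stops sending traffic for a cooldown period, then allows a probe request.

```python
import time


class CircuitBreaker:
    """Minimal circuit-breaker sketch: stop sending requests to a
    struggling backend after repeated failures, retry after a cooldown.
    Thresholds here are illustrative, not tuned values."""

    def __init__(self, failure_threshold=5, cooldown_s=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None  # None => circuit closed, traffic allowed

    def allow_request(self):
        if self.opened_at is None:
            return True
        # Half-open: allow a single probe once the cooldown has expired.
        return time.monotonic() - self.opened_at >= self.cooldown_s

    def record_success(self):
        # Any success closes the circuit again.
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```

Callers would check `allow_request()` before each TAS call and shed or queue work when it returns `False`, so a loaded TAS sees less traffic instead of more.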
Technical note regarding timeouts:
Our services (even if in a different namespace) should not go through ingress but should talk to the apmservices-gateway Kubernetes service directly. Unless something is adjusted by the installer, our default nginx read/send timeouts should be 600 seconds.
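For reference, a hedged sketch of the two points above, assuming the community ingress-nginx controller is in use (the annotation names are that controller's, and the namespace in the DNS name is a placeholder):

```yaml
# Ingress path (to be avoided for internal traffic): the 600 s
# read/send timeouts would come from annotations like these.
metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
```

Internal callers should instead target the in-cluster service name directly, e.g. `http://apmservices-gateway.<namespace>.svc.cluster.local`, which bypasses ingress and its timeouts entirely.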
For internal service REST calls to apmservices-gateway, a 30-second timeout might be too aggressive. Historically, timeouts for REST calls to internal services were 300 seconds with backoff. For reactive calls, no timeouts are currently configured in general (with the exception of rate limiting on specific endpoints). The general suggestion is to move all data-store API calls to RSocket via the data-store common clients package now.
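The historical "300 seconds with backoff" pattern can be sketched as follows; this is a generic illustration (function and parameter names are assumptions), not code from our services:

```python
import time


def call_with_backoff(fn, retries=3, base_delay_s=1.0, timeout_s=300.0):
    """Call fn(timeout_s=...) with exponential backoff between attempts.

    Mirrors the historical internal-REST defaults described above:
    a generous per-request timeout plus backoff on transient failures.
    """
    for attempt in range(retries):
        try:
            return fn(timeout_s=timeout_s)
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries, surface the error
            # Exponential backoff: base, 2*base, 4*base, ...
            time.sleep(base_delay_s * (2 ** attempt))
```

In practice `fn` would wrap the actual HTTP call to apmservices-gateway, passing `timeout_s` through to the HTTP client.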