According to the Spring Cloud Services (SCS) Config Server architecture, a "mirror service" makes a full mirror of each Git repository used by a Config Server service instance. The mirror service is implemented as the mirror-service job on the SCS broker VM.
After the config-server back app starts up, it does not immediately run "git clone" against mirror-service to fetch the properties locally. Only when a client app bound to the config-server service requests config properties does the config-server back app run "git clone" to fetch a copy of the repo from mirror-service. When there are many config-server service instances on a TAS foundation and the client apps bound to these services request config properties at almost the same time, the result is a high volume of git operations against mirror-service, which might overload it. This usually occurs during a TAS or Isolation segment tile upgrade in which many diego_cell instances are updated simultaneously (a large max-in-flight setting), which restarts many client apps due to evacuation.
For example, if a config-server back app has 6 instances, it could send 6 simultaneous git operations to mirror-service. If there are many config-server service instances with a large number of application instances, mirror-service could receive a high volume of git operations (clone/fetch) within a very short period and become overloaded.
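As a rough illustration of the arithmetic above, the worst-case burst can be estimated by multiplying the number of config-server service instances by the number of back-app instances each one runs. The figure of 50 service instances below is a hypothetical value chosen only for this example, and the function name estimate_git_burst is likewise illustrative; the 6 instances per back app come from the example above.

```shell
#!/usr/bin/env bash
# Rough worst-case estimate of simultaneous git operations hitting mirror-service.
# Usage: estimate_git_burst SERVICE_INSTANCES INSTANCES_PER_BACK_APP
estimate_git_burst() {
  local service_instances="$1"
  local instances_per_app="$2"
  echo $(( service_instances * instances_per_app ))
}

# Hypothetical foundation: 50 config-server service instances, 6 back-app instances each.
estimate_git_burst 50 6   # prints 300
```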
When mirror-service is overloaded, the config-server app may report errors when cloning the repo to its local directory, as shown below. Client apps are then unable to start up because they fail to load properties from config-server.
......
0b795621-####-####-####-46160ef3c44e APP/PROC/WEB/0 2024-10-27T01:02:08.033611626Z OUT [http-nio-8080-exec-1] WARN o.s.c.c.s.e.MultipleJGitEnvironmentRepository.cloneToBasedir - Error occured cloning to base directory.
0b795621-####-####-####-46160ef3c44e APP/PROC/WEB/0 2024-10-27T01:02:08.033628708Z OUT org.eclipse.jgit.api.errors.TransportException: Read timed out after 5,000 ms
0b795621-####-####-####-46160ef3c44e APP/PROC/WEB/0 2024-10-27T01:02:08.033631914Z OUT  at org.eclipse.jgit.api.FetchCommand.call(FetchCommand.java:249)
0b795621-####-####-####-46160ef3c44e APP/PROC/WEB/0 2024-10-27T01:02:08.033634048Z OUT  at org.eclipse.jgit.api.CloneCommand.fetch(CloneCommand.java:319)
0b795621-####-####-####-46160ef3c44e APP/PROC/WEB/0 2024-10-27T01:02:08.033636062Z OUT  at org.eclipse.jgit.api.CloneCommand.call(CloneCommand.java:189)
0b795621-####-####-####-46160ef3c44e APP/PROC/WEB/0 2024-10-27T01:02:08.033638847Z OUT  at org.springframework.cloud.config.server.environment.JGitEnvironmentRepository.cloneToBasedir(JGitEnvironmentRepository.java:657)
0b795621-####-####-####-46160ef3c44e APP/PROC/WEB/0 2024-10-27T01:02:08.033641272Z OUT  at org.springframework.cloud.config.server.environment.JGitEnvironmentRepository.copyRepository(JGitEnvironmentRepository.java:632)
0b795621-####-####-####-46160ef3c44e APP/PROC/WEB/0 2024-10-27T01:02:08.033643516Z OUT  at org.springframework.cloud.config.server.environment.JGitEnvironmentRepository.createGitClient(JGitEnvironmentRepository.java:615)
0b795621-####-####-####-46160ef3c44e APP/PROC/WEB/0 2024-10-27T01:02:08.033646421Z OUT  at org.springframework.cloud.config.server.environment.JGitEnvironmentRepository.refresh(JGitEnvironmentRepository.java:295)
......
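To gauge how widespread the problem is, the timeout signature from the log excerpt above can be counted in recent logs. The sketch below is a minimal helper that reads log lines on standard input; the function name count_git_timeouts is hypothetical, and the app name passed to cf logs in the usage comment is a placeholder.

```shell
#!/usr/bin/env bash
# Count log lines carrying the JGit read-timeout signature shown in the excerpt above.
# Reads log text on stdin and prints the number of matching lines (0 if none).
count_git_timeouts() {
  # grep -c exits non-zero when nothing matches; "|| true" keeps the exit status clean.
  grep -c 'TransportException: Read timed out' || true
}

# Usage (placeholder app name; target the org/space that hosts the back app first):
#   cf logs <config-server-back-app> --recent | count_git_timeouts
```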
Spring Cloud Services (SCS) v3.3.8 changed the default refresh-rate from 0 to 60 seconds and the default git timeout from 5 to 30 seconds. SCS v3.3.9 introduces the capability to tune these settings globally.
For large environments, it is recommended to upgrade SCS to v3.3.8 or higher and run the upgrade-all errand BEFORE upgrading TPCF or performing any action that will restart Diego cells. This allows the Git refresh rate to be updated to the new value before application instances restart (which triggers a refresh).
SCS doesn't support High Availability (HA), so it is not possible to configure multiple SCS broker VMs to share the high volume of git requests coming from the config-server side. The following measures can be considered to mitigate the issue.
1) Increase the timeout. "Read timed out after 5,000 ms" indicates that a timeout occurred after 5 seconds (the default value) while config-server attempted a "git clone" from mirror-service. When mirror-service is overloaded, a "git clone" could take longer than 5 seconds to complete. Try increasing the timeout value by adjusting the SPRING_CLOUD_CONFIG_SERVER_GIT_TIMEOUT environment variable for each config-server back app, or run the following command if you are on SCS 3.2+.
cf update-service <SI> -c '{ "git": { "timeout": 30, "uri": "bitbucket", ...the rest of git config} }'
Refer to this KB article for more details.
2) Scale up the SCS broker VM by adding more CPU cores, giving the VM more compute power to accommodate the high volume of incoming git operations.
3) If many client apps are hosted on isolation segments, try to manually refresh the mirror of each config-server instance after the TAS upgrade completes but before starting the upgrade of the isolation segment tiles. During the TAS tile update, all config-server back apps are restarted with an empty local repo. When client apps on isolation segments later request properties from config-server, config-server triggers a "git clone" from mirror-service (since it doesn't have a local repo). Refreshing the mirror makes the config-server app obtain a local repo from mirror-service in advance, so it can serve client app requests from its local repo instead of interacting with mirror-service. Refer to the document for different options of refreshing the mirror of a config-server instance.
4) Set the max-in-flight parameter to a reasonable value when updating TAS or isolation segment tiles. This helps limit the number of concurrent app restarts during the update of diego_cell instances.
5) Set the RefreshRate parameter of the config-server instance to a positive value rather than 0, the default prior to SCS v3.3.8. With a value of 0, config-server triggers a "git fetch" request to mirror-service whenever a client app asks it for configuration parameters, even if there is a local repo. This parameter can be set with the cf set-env command. For example,
cf set-env config-server spring.cloud.config.server.git.refreshRate 30
Run the above command for all config-server back apps, then restage them. With this setting, config-server won't perform a "git fetch" within 30 seconds of the previous "git fetch".
6) Check whether cleanup could be done to shrink the Git repo size, which allows git requests to be processed more quickly.
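For mitigation 5, the cf set-env and restage steps have to be repeated for every config-server back app. A minimal sketch of such a loop follows; the function name set_refresh_rate and the app names in the usage comment are hypothetical, and the script assumes the cf CLI is already logged in and targeted at the org and space that host the back apps.

```shell
#!/usr/bin/env bash
# Apply a git refresh rate (in seconds) to each listed config-server back app,
# then restage it so the new value takes effect. Stops at the first failing cf call.
# Usage: set_refresh_rate SECONDS APP [APP ...]
set_refresh_rate() {
  local rate="$1"
  shift
  local app
  for app in "$@"; do
    cf set-env "$app" spring.cloud.config.server.git.refreshRate "$rate" || return 1
    cf restage "$app" || return 1
  done
}

# Usage (hypothetical app names):
#   set_refresh_rate 30 config-server-a config-server-b
```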