The following output is from the /var/vcap/sys/log/auctioneer/auctioneer.stdout.log:
{"timestamp":"2019-11-12T07:17:15.391878171Z","level":"error","source":"auctioneer","message":"auctioneer.auction.failed-to-get-state","data":{"cell-guid":"b37a41cc-7150-4b18-b183-1d5f5718fa77","duration_ns":5000200574,"error":"Get https://a48b32dd-8261-5v78-b183-1d5f5718fa77.cell.service.cf.internal:1801/state: dial tcp: i/o timeout","session":"834"}}
[RequestLoggerHandler] 2019/12/11 20:29:52 INFO - handlers.DiscoveryHandler Request [28] [b37a41cc-7150-4b18-b183-1d5f5718fa77.cell.service.cf.internal.] 0 410000ns
~# ps -C bosh-dns PID TTY TIME CMD 14553 ? 00:02:09 bosh-dns2. Check if there are any UDP drops by reading /proc/<PID OF BOSH-DNS>/net/udp file and grep for port 53 ( 0x0035 ). The last column in the output "drops" is how many bytes have been dropped. This should normally be 0, however to confirm this is the same problem, this value should be incrementing every time there is a cell state timeout error.
~# cat /proc/14553/net/udp | egrep "local|:0035 000" sl local_address rem_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode ref pointer drops 796: 0200FEA9:0035 00000000:0000 07 00000000:00000000 00:00000000 00000000 1000 0 1225946 2 ffff96c5b88b4c00 72657
In most cases scailing up cpu on the diego brain will help mitigate this issue. Alternatively you can apply the following workaround where Operators can adjust the kernel rmem_max and rmem_default socket buffers to help avoid UDP frame loss. These change should be applied to all diego_brain VMs. The default values for these parameters are below:
~# sysctl net.core.rmem_max net.core.rmem_max = 212992 ~# sysctl net.core.rmem_default net.core.rmem_default = 212992
You can try doubling the default value of 212992 bytes to 425984 bytes and see if that helps avoid UDP frame loss. You can continue to increase the buffer size until the problem goes away, however we would not advise increasing this buffer larger than 2MB as it could actually contribute to frame loss when the buffer is too large. The smallest possible value is best.
The following is the procedure to increase rmem socket buffer for the bosh-dns process running on the diego_brain:
1. Increase the buffers using sysctl:
sysctl -w net.core.rmem_max=425984 sysctl -w net.core.rmem_default=425984
2. Restart bosh-dns:
monit restart bosh-dns
3. Confirm bosh-dns is using the increased buffer size. In the below command output, skmem has a value of "rb425984" which is the socket read buffer size.
~# ss -m --info -nlu UNCONN 0 0 169.###.#.#:53 *:* skmem:(r33024,rb425984,t0,tb212992,f3840,w0,o0,bl0)
The above procedure will only temporarily increase the socket buffers. If the VM is restarted or recreated these changes will go away. To make these settings permanent, Operators can install the os-conf bosh add-on.