When the upstream endpoints are down, you may see the otel-collector process being restarted frequently due to out-of-memory (OOM) errors. The following log excerpts show the symptoms.
2025-10-23T04:05:15.758Z warn [email protected]/clientconn.go:1379 [core] [Channel #1 SubChannel #5]grpc: addrConn.createTransport failed to connect to {Addr: "192.168.5.138:4317", ServerName: "192.168.5.138:4317", }. Err: connection error: desc = "transport: Error while dialing: dial tcp ###.###.###.###:4317: connect: connection refused" {"grpc_log": true}
2025-10-23T04:05:21.185Z info internal/retry_sender.go:118 Exporting failed. Will retry the request after interval. {"kind": "exporter", "data_type": "metrics", "name": "otlp", "error": "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp ###.###.###.###:4317: connect: connection refused\"", "interval": "25.977785579s"}
otel-collector.stderr.log:
2025-10-03T08:33:02.430Z info internal/retry_sender.go:118 Exporting failed. Will retry the request after interval. {"kind": "exporter", "data_type": "metrics", "name": "otlp/mm-nonprod", "error": "rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: tls: first record does not look like a TLS handshake\"", "interval": "21.974611857s"}
2025-10-03T08:33:03.503Z warn [email protected]/clientconn.go:1379 [core] [Channel #2 SubChannel #6]grpc: addrConn.createTransport failed to connect to {Addr: "###.###.###.###:443", ServerName: "###.###.###.###:443", }. Err: connection error: desc = "transport: authentication handshake failed: tls: first record does not look like a TLS handshake" {"grpc_log": true}
otel-collector.stderr.log:
2025-10-03T04:18:47.314Z info [email protected]/service.go:208 Starting cf-otel-collector... {"Version": "0.11.4", "NumCPU": 4}
2025-10-03T04:18:47.314Z info extensions/extensions.go:39 Starting extensions...
2025-10-03T04:18:47.314Z info [email protected]/otlp.go:112 Starting GRPC server {"kind": "receiver", "name": "otlp/cf-internal-local", "data_type": "metrics", "endpoint": "127.0.0.1:9100"}
...
2025-10-03T04:23:59.006Z info [email protected]/service.go:208 Starting cf-otel-collector... {"Version": "0.11.4", "NumCPU": 4}
2025-10-03T04:23:59.006Z info extensions/extensions.go:39 Starting extensions...
2025-10-03T04:23:59.010Z info [email protected]/otlp.go:112 Starting GRPC server {"kind": "receiver", "name": "otlp/cf-internal-local", "data_type": "logs", "endpoint": "127.0.0.1:9100"}
syslog:
2025-10-02T20:23:13.507668+00:00 ########-####-####-####-######## kernel: otel-collector invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
2025-10-02T20:23:13.507779+00:00 5ed86e45-da6f-497d-b1fb-125546ac75cd kernel: oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=runc-bpm-otel-collector.scope,mems_allowed=0,oom_memcg=/system.slice/runc-bpm-otel-collector.scope,task_memcg=/system.slice/runc-bpm-otel-collector.scope,task=otel-collector,pid=3000726,uid=1000
2025-10-02T20:23:13.507780+00:00 5ed86e45-da6f-497d-b1fb-125546ac75cd kernel: Memory cgroup out of memory: Killed process 3000726 (otel-collector) total-vm:1739492kB, anon-rss:521712kB, file-rss:27228kB, shmem-rss:0kB, UID:1000 pgtables:1176kB oom_score_adj:0
This issue affects the otel-collector 0.11.4 release.
Add the following memory_limiter processor to your otel-collector configuration (e.g. TPCF/TAS -> System Logging -> OpenTelemetry Collector Configuration) to cap the memory the collector uses; this should resolve the issue.
processors:
  # Keep the collector well under the cgroup/BPM limit so it never OOMs
  memory_limiter:
    check_interval: 2s
    limit_mib: 380        # ≈ 75–80% of the Go heap cap if GOMEMLIMIT ≈ 409 MiB; tune to your limits
    spike_limit_mib: 64
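Note that the memory_limiter processor only takes effect if it is referenced in the collector's service pipelines. On TPCF/TAS the platform may wire the pipelines for you, so the processors fragment above may be all that is needed. For reference, a minimal standalone collector configuration is sketched below; the exporter endpoint is hypothetical and the receiver mirrors the local endpoint seen in the logs above, with memory_limiter placed first in the processor chain so back-pressure is applied before data piles up.

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 127.0.0.1:9100              # matches the local receiver endpoint in the logs above

processors:
  memory_limiter:
    check_interval: 2s
    limit_mib: 380
    spike_limit_mib: 64
  batch: {}

exporters:
  otlp:
    endpoint: upstream.example.com:4317       # hypothetical upstream OTLP endpoint

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]     # memory_limiter runs first in the pipeline
      exporters: [otlp]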
Future versions will include those settings by default to avoid the issue.