Symptom:
1. On the NSX Application Platform metrics page, the failure counts of compact and/or index_parallel tasks increase.
2. On the NSX Application Platform core services page, the disk usage of Analytics and Data Storage grows quickly.
Log:
(1) For the corrupted data issue:
NOTE:
List the names of all Druid historical pods with: napp-k get pod | grep druid-historical
For each Druid historical pod, view its logs with: napp-k logs <druid historical pod name>
In the logs of the druid-historical pods, entries like the following appear:
2024-06-24T14:07:30,891 INFO [SimpleDataSegmentChangeHandler-0] org.apache.druid.server.coordination.SegmentLoadDropHandler - Loading segment correlated_flow_viz_2024-06-22T16:00:00.000Z_2024-06-22T17:00:00.000Z_2024-06-22T16:00:01.587Z_24
2024-06-24T14:07:30,892 INFO [SimpleDataSegmentChangeHandler-0] org.apache.druid.storage.s3.S3DataSegmentPuller - Pulling index at path[CloudObjectLocation{bucket='druid', path='druid/segments/correlated_flow_viz/2024-06-22T16:00:00.000Z_2024-06-22T17:00:00.000Z/2024-06-22T16:00:01.587Z/24/3fc4d48e-7556-43f4-8b7b-5177c437396b/index.zip'}] to outDir[/data/druid/segment-cache/correlated_flow_viz/2024-06-22T16:00:00.000Z_2024-06-22T17:00:00.000Z/2024-06-22T16:00:01.587Z/24]
2024-06-24T14:07:31,332 WARN [SimpleDataSegmentChangeHandler-0] com.amazonaws.services.s3.internal.S3AbortableInputStream - Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
2024-06-24T14:07:31,332 WARN [SimpleDataSegmentChangeHandler-0] com.amazonaws.services.s3.internal.S3AbortableInputStream - Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
2024-06-24T14:07:31,332 WARN [SimpleDataSegmentChangeHandler-0] org.apache.druid.java.util.common.RetryUtils - Retrying (1 of 2) in 826ms.
java.util.zip.ZipException: invalid entry CRC (expected 0xecab1705 but got 0xe52aca55)
at java.util.zip.ZipInputStream.readEnd(Unknown Source) ~[?:?]
at java.util.zip.ZipInputStream.read(Unknown Source) ~[?:?]
at org.apache.druid.java.util.common.io.NativeIO.chunkedCopy(NativeIO.java:218) ~[druid-processing-29.0.1.jar:29.0.1]
at org.apache.druid.utils.CompressionUtils.unzip(CompressionUtils.java:356) ~[druid-processing-29.0.1.jar:29.0.1]
at org.apache.druid.utils.CompressionUtils.lambda$unzip$1(CompressionUtils.java:240) ~[druid-processing-29.0.1.jar:29.0.1]
at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:129) ~[druid-processing-29.0.1.jar:29.0.1]
at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:81) ~[druid-processing-29.0.1.jar:29.0.1]
at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:163) ~[druid-processing-29.0.1.jar:29.0.1]
at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:153) ~[druid-processing-29.0.1.jar:29.0.1]
at org.apache.druid.utils.CompressionUtils.unzip(CompressionUtils.java:239) ~[druid-processing-29.0.1.jar:29.0.1]
at org.apache.druid.storage.s3.S3DataSegmentPuller.getSegmentFiles(S3DataSegmentPuller.java:107) ~[?:?]
at org.apache.druid.storage.s3.S3LoadSpec.loadSegment(S3LoadSpec.java:61) ~[?:?]
at org.apache.druid.segment.loading.SegmentLocalCacheManager.loadInLocation(SegmentLocalCacheManager.java:343) ~[druid-server-29.0.1.jar:29.0.1]
at org.apache.druid.segment.loading.SegmentLocalCacheManager.loadInLocationWithStartMarker(SegmentLocalCacheManager.java:331) ~[druid-server-29.0.1.jar:29.0.1]
at org.apache.druid.segment.loading.SegmentLocalCacheManager.loadInLocationWithStartMarkerQuietly(SegmentLocalCacheManager.java:293) ~[druid-server-29.0.1.jar:29.0.1]
at org.apache.druid.segment.loading.SegmentLocalCacheManager.loadSegmentWithRetry(SegmentLocalCacheManager.java:272) ~[druid-server-29.0.1.jar:29.0.1]
at org.apache.druid.segment.loading.SegmentLocalCacheManager.getSegmentFiles(SegmentLocalCacheManager.java:228) ~[druid-server-29.0.1.jar:29.0.1]
at org.apache.druid.segment.loading.SegmentLocalCacheLoader.getSegment(SegmentLocalCacheLoader.java:56) ~[druid-server-29.0.1.jar:29.0.1]
at org.apache.druid.server.SegmentManager.getSegmentReference(SegmentManager.java:325) ~[druid-server-29.0.1.jar:29.0.1]
at org.apache.druid.server.SegmentManager.loadSegment(SegmentManager.java:268) ~[druid-server-29.0.1.jar:29.0.1]
at org.apache.druid.server.coordination.SegmentLoadDropHandler.loadSegment(SegmentLoadDropHandler.java:281) ~[druid-server-29.0.1.jar:29.0.1]
at org.apache.druid.server.coordination.SegmentLoadDropHandler.loadSegment(SegmentLoadDropHandler.java:266) ~[druid-server-29.0.1.jar:29.0.1]
at org.apache.druid.server.coordination.SegmentLoadDropHandler.addSegment(SegmentLoadDropHandler.java:343) ~[druid-server-29.0.1.jar:29.0.1]
at org.apache.druid.server.coordination.SegmentLoadDropHandler$1.lambda$addSegment$1(SegmentLoadDropHandler.java:572) ~[druid-server-29.0.1.jar:29.0.1]
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) ~[?:?]
at java.util.concurrent.FutureTask.run(Unknown Source) ~[?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
at java.lang.Thread.run(Unknown Source) ~[?:?]
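To check every historical pod for this error in one pass, a small loop like the one below can be used. This is only a sketch; it assumes napp-k is the kubectl wrapper used by the commands above and that grep is available on the workstation.
# Sketch: count occurrences of the CRC error in each druid-historical pod's log
for pod in $(napp-k get pod -o name | grep druid-historical); do
  echo "== ${pod} =="
  napp-k logs "${pod}" | grep -c 'invalid entry CRC' || true
done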
(2) For the insufficient disk space issue:
NOTE:
List the names of all Druid middle manager pods with: napp-k get pod | grep druid-middle-manager
For each Druid middle manager pod, view its logs with: napp-k logs <druid middle manager pod name>
In the logs of the druid-middle-manager pods, errors like the following appear:
2024-06-21T03:48:51,168 ERROR [task-runner-0-priority-0] org.apache.druid.indexing.common.task.IndexTask - Encountered exception in BUILD_SEGMENTS.
java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.io.IOException: No space left on device
at org.apache.druid.indexing.common.task.IndexTask.generateAndPublishSegments(IndexTask.java:1055) ~[druid-indexing-service-29.0.1.jar:29.0.1]
at org.apache.druid.indexing.common.task.IndexTask.runTask(IndexTask.java:548) ~[druid-indexing-service-29.0.1.jar:29.0.1]
at org.apache.druid.indexing.common.task.AbstractTask.run(AbstractTask.java:179) ~[druid-indexing-service-29.0.1.jar:29.0.1]
at org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexSupervisorTask.runSequential(ParallelIndexSupervisorTask.java:1214) ~[druid-indexing-service-29.0.1.jar:29.0.1]
at org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexSupervisorTask.runTask(ParallelIndexSupervisorTask.java:551) ~[druid-indexing-service-29.0.1.jar:29.0.1]
at org.apache.druid.indexing.common.task.AbstractTask.run(AbstractTask.java:179) ~[druid-indexing-service-29.0.1.jar:29.0.1]
at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:478) ~[druid-indexing-service-29.0.1.jar:29.0.1]
at org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:450) ~[druid-indexing-service-29.0.1.jar:29.0.1]
at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131) ~[guava-32.0.1-jre.jar:?]
at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:75) ~[guava-32.0.1-jre.jar:?]
at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82) ~[guava-32.0.1-jre.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
at java.lang.Thread.run(Unknown Source) ~[?:?]
Caused by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.io.IOException: No space left on device
at com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:592) ~[guava-32.0.1-jre.jar:?]
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:571) ~[guava-32.0.1-jre.jar:?]
at com.google.common.util.concurrent.FluentFuture$TrustedFuture.get(FluentFuture.java:91) ~[guava-32.0.1-jre.jar:?]
at org.apache.druid.segment.realtime.appenderator.BatchAppenderatorDriver.pushAndClear(BatchAppenderatorDriver.java:151) ~[druid-server-29.0.1.jar:29.0.1]
at org.apache.druid.segment.realtime.appenderator.BatchAppenderatorDriver.pushAllAndClear(BatchAppenderatorDriver.java:134) ~[druid-server-29.0.1.jar:29.0.1]
at org.apache.druid.indexing.common.task.InputSourceProcessor.process(InputSourceProcessor.java:122) ~[druid-indexing-service-29.0.1.jar:29.0.1]
at org.apache.druid.indexing.common.task.IndexTask.generateAndPublishSegments(IndexTask.java:945) ~[druid-indexing-service-29.0.1.jar:29.0.1]
... 13 more
Caused by: java.lang.RuntimeException: java.io.IOException: No space left on device
at org.apache.druid.segment.realtime.appenderator.AppenderatorImpl.mergeAndPush(AppenderatorImpl.java:998) ~[druid-server-29.0.1.jar:29.0.1]
at org.apache.druid.segment.realtime.appenderator.AppenderatorImpl.lambda$push$1(AppenderatorImpl.java:786) ~[druid-server-29.0.1.jar:29.0.1]
at com.google.common.util.concurrent.AbstractTransformFuture$TransformFuture.doTransform(AbstractTransformFuture.java:252) ~[guava-32.0.1-jre.jar:?]
at com.google.common.util.concurrent.AbstractTransformFuture$TransformFuture.doTransform(AbstractTransformFuture.java:242) ~[guava-32.0.1-jre.jar:?]
at com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:123) ~[guava-32.0.1-jre.jar:?]
... 3 more
Caused by: java.io.IOException: No space left on device
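To confirm that the task disks on the middle managers are actually full, something like the following can be run against each pod. This is a sketch only; the container name druid and the /data mount point are assumptions based on the other commands and the segment-cache path shown in the historical logs above.
# Sketch: report disk usage of the task volume on each druid-middle-manager pod
for pod in $(napp-k get pod -o name | grep druid-middle-manager); do
  echo "== ${pod} =="
  napp-k exec "${pod}" -c druid -- df -h /data || true
done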
The data corruption issue is rarely observed. Its root cause is not known; it is possibly related to infrastructure issues such as storage or networking.
The disk space issue typically occurs when there are too many unique flows in a day, so the daily re-indexing tasks cannot roll up the data efficiently and exhaust the available disk space.
There's no fix at the moment.
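Before applying the workaround for the disk space issue, a rough per-datasource size and row-count summary from the broker can help confirm which datasource is producing the poorly rolled-up segments. This reuses the SQL endpoint shown in the workaround below and is only a sketch, not part of the official procedure.
napp-k exec svc/druid-broker -c druid -- curl -ks -H 'content-type:application/json' https://localhost:8282/druid/v2/sql -d '{"query":"select datasource, sum(\"size\") as total_bytes, sum(num_rows) as total_rows from sys.segments where is_published = 1 group by datasource"}'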
Workaround:
(1) For the data corruption issue
1. Find the segment id and datasource of the segments that cannot be loaded into the Druid historical pods
napp-k exec svc/druid-broker -c druid -- curl -ks -H 'content-type:application/json' https://localhost:8282/druid/v2/sql -d '{"query":"select segment_id,datasource from sys.segments where is_published=true and is_available=false"}'
2. Delete each segment returned by the query above (see the combined sketch after these steps)
napp-k exec svc/druid-coordinator -c druid -- curl -ks -XDELETE -H 'content-type:application/json' -H 'Accept: application/json, text/plain' https://localhost:8281/druid/coordinator/v1/datasources/<datasource>/segments/<segment id>
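The two steps above can also be combined into a single pass. The sketch below assumes jq is installed on the machine where napp-k runs; it feeds each row returned by the broker query into the coordinator delete call.
napp-k exec svc/druid-broker -c druid -- curl -ks -H 'content-type:application/json' https://localhost:8282/druid/v2/sql -d '{"query":"select segment_id,datasource from sys.segments where is_published=true and is_available=false"}' \
| jq -r '.[] | [.datasource, .segment_id] | @tsv' \
| while IFS=$'\t' read -r datasource segment_id; do
    napp-k exec svc/druid-coordinator -c druid -- curl -ks -XDELETE -H 'content-type:application/json' -H 'Accept: application/json, text/plain' "https://localhost:8281/druid/coordinator/v1/datasources/${datasource}/segments/${segment_id}"
  done
Afterwards, re-run the query from step 1 to confirm it returns an empty array.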
(2) For the disk space issue
1. Get the list of PVCs for the Druid middle manager pods
napp-k get pvc | grep druid-middle-manager
2. For each PVC, increase the storage request from 16Gi to 32Gi (ensure the underlying datastore has enough free capacity; see the scripted sketch after step 3)
napp-k patch pvc <druid middle manager pvc name> -p '{"spec":{"resources":{"requests":{"storage":"32Gi"}}}}'
3. Restart all Druid middle manager pods
napp-k rollout restart sts druid-middle-manager
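Steps 1 to 3 can also be scripted. This is a sketch under the same assumptions as above (napp-k is the kubectl wrapper); it patches every middle manager PVC and then restarts the StatefulSet.
for pvc in $(napp-k get pvc -o name | grep druid-middle-manager); do
  napp-k patch "${pvc}" -p '{"spec":{"resources":{"requests":{"storage":"32Gi"}}}}'
done
napp-k rollout restart sts druid-middle-manager
Note that the resize completes online only if the storage class allows volume expansion; re-run napp-k get pvc | grep druid-middle-manager and confirm the new capacity is shown before relying on it.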