PDFExtractor 'Failed to initialize extraction instance' in ContentExtractionHost_FileReader logs

Article ID: 215681

Updated On:

Products

Data Loss Prevention, Data Loss Prevention Network Email

Issue/Introduction

You want to understand the following warning, error, and informational messages in the ContentExtractionHost_FileReader logs:

05/21/21 11:53:08 | WARN  | cehost | ImageExtractorPlugin [6132] | [11932] | Failed to initialize extraction instance. | PdfExtractor.cpp (82)
05/21/21 11:53:08 | ERROR | cehost | ImageExtractorPlugin [6132] | [11932] | File type can not be identified | ImageExtractorPluginLib.cpp (198)
05/21/21 11:53:08 | INFO  | cehost | FileTypeIdentifierRequestExecutor [6132] | [11932] | Plugin File Type Identification from stream failed in plugin ImageExtractorPlugin with retval = 1, Exception thrown from : SPIExecutor.cpp(143) | FileTypeIdentifierRequestExecutor.cpp (149)

Environment

DLP 15.x

Cause

One cause is that encrypted PDF files are sent to the Content Extraction Host (CEH) when no plugin capable of decrypting them is configured.

Resolution

To determine whether the files are encrypted, enable trace logging in the log4cxx_config_filereader.xml file, located in C:\Program Files\Symantec\DataLossPrevention\DetectionServer\<ver>\Protect\config\

Update the following nodes within the XML so that the cehost category logs at trace priority, then restart the Symantec DLP Detection Server service:

 <appender name="cehostAppender" class="org.apache.log4j.RollingFileAppender">
  <param name="file" value="C:/ProgramData/Symantec/DataLossPrevention/DetectionServer/15.7/logs/debug/ContentExtractionHost_FileReader.log" />
  <param name="append" value="true" />
  <param name="MaxFileSize" value="10240KB" />
  <param name="MaxBackupIndex" value="100" />
  <layout class="org.apache.log4j.PatternLayout">
   <param name="ConversionPattern" value="%d{%m/%d/%y %H:%M:%S} | %-5p | %c{2} | %m%n" />
  </layout>
 </appender>

 <category name="cehost">
  <priority value="trace" />
  <appender-ref ref="cehostAppender" />
 </category>
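
Before restarting the service, it can help to confirm that the edited file really sets the cehost category to trace. The following is a minimal Python sketch, not a Symantec tool. The function name category_priority and the <configuration> wrapper used in the sample fragment are hypothetical; the real file's root element and any namespace declarations are not shown in this article, so adjust as needed.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def category_priority(config_xml: str, category: str = "cehost") -> Optional[str]:
    """Return the priority value configured for the named category, or None."""
    root = ET.fromstring(config_xml)
    # Walk every <category> element anywhere under the root.
    for cat in root.iter("category"):
        if cat.get("name") == category:
            prio = cat.find("priority")
            return prio.get("value") if prio is not None else None
    return None


# Assumption: a stand-in root element wraps the nodes shown in this article.
FRAGMENT = """
<configuration>
  <category name="cehost">
    <priority value="trace" />
    <appender-ref ref="cehostAppender" />
  </category>
</configuration>
"""

print(category_priority(FRAGMENT))
```

To check the real file, pass its contents (for example, read with pathlib.Path(...).read_text()) instead of FRAGMENT.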

Sample log output showing that a particular PDF file is encrypted (see the Verity "Document is encrypted" trace entry):

05/21/21 11:53:08 | TRACE | cehost | Service [6132] | [11932] | Start of Request Id #50701 CEService Id #34 CE Request Type: FILETYPE_IDENTIFICATION | CEService.cpp (985)
05/21/21 11:53:08 | WARN  | cehost | ImageExtractorPlugin [6132] | [11932] | Failed to initialize extraction instance. | PdfExtractor.cpp (82)
05/21/21 11:53:08 | ERROR | cehost | ImageExtractorPlugin [6132] | [11932] | File type can not be identified | ImageExtractorPluginLib.cpp (198)
05/21/21 11:53:08 | TRACE | cehost | SPIExecutor [6132] | [11932] | Finished call to DetectTypeFromStream for plugin ImageExtractorPlugin in 42 milliseconds. | SPIExecutorScopedOperationLogger.cpp (45)
05/21/21 11:53:08 | INFO  | cehost | FileTypeIdentifierRequestExecutor [6132] | [11932] | Plugin File Type Identification from stream failed in plugin ImageExtractorPlugin with retval = 1, Exception thrown from : SPIExecutor.cpp(143) | FileTypeIdentifierRequestExecutor.cpp (149)
05/21/21 11:53:08 | TRACE | cehost | SPIExecutor [6132] | [11932] | Calling DetectTypeFromStream for plugin Verity. | SPIExecutorScopedOperationLogger.cpp (36)
05/21/21 11:53:08 | TRACE | cehost | Verity [6132] | [11932] | Document is encrypted | ..\DocumentIdentificationHelper.c (201)
05/21/21 11:53:08 | TRACE | cehost | Verity [6132] | [11932] | Failed to open stream | src\VerityImplInternal.c (452)
05/21/21 11:53:08 | TRACE | cehost | Service [6132] | [11932] | End of Request Id #50701 CEService Id #34 CE Request Type: FILETYPE_IDENTIFICATION Execution Time=124 MilliSeconds | CEService.cpp (995)
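
On a busy server, matching the PdfExtractor warning to the Verity "Document is encrypted" entry by eye can be tedious. The following is a minimal Python sketch of one way to correlate them. It assumes that every log line follows the pipe-delimited layout shown above, and that all entries for a request appear between its "Start of Request Id" and "End of Request Id" lines on the same process ID and thread ID pair. Both assumptions are inferred from the sample only, and the function and variable names are hypothetical.

```python
import re
from collections import defaultdict

# Layout inferred from the sample:
# timestamp | level | logger | component [pid] | [tid] | message | source
LINE_RE = re.compile(
    r"^(?P<ts>\S+ \S+) \| (?P<level>\w+)\s*\| (?P<logger>\S+) \| "
    r"(?P<component>\S+) \[(?P<pid>\d+)\] \| \[(?P<tid>\d+)\] \| "
    r"(?P<message>.*) \| (?P<source>[^|]+)$"
)
START_RE = re.compile(r"Start of Request Id #(?P<req>\d+)")
END_RE = re.compile(r"End of Request Id #(?P<req>\d+)")

PDF_WARNING = "Failed to initialize extraction instance"
ENCRYPTED = "Document is encrypted"


def find_encrypted_pdf_requests(lines):
    """Return request IDs in which both the ImageExtractorPlugin warning
    and the Verity 'Document is encrypted' trace were logged."""
    current = {}              # (pid, tid) -> request ID open on that thread
    seen = defaultdict(set)   # request ID -> markers observed so far
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines that do not fit the assumed layout
        key = (m["pid"], m["tid"])
        msg = m["message"]
        start = START_RE.match(msg)
        if start:
            current[key] = start["req"]
            continue
        if END_RE.match(msg):
            current.pop(key, None)
            continue
        req = current.get(key)
        if req is None:
            continue  # entry logged outside any Start/End pair
        if m["component"] == "ImageExtractorPlugin" and PDF_WARNING in msg:
            seen[req].add("pdf_warning")
        elif m["component"] == "Verity" and ENCRYPTED in msg:
            seen[req].add("encrypted")
    return sorted(r for r, marks in seen.items()
                  if marks == {"pdf_warning", "encrypted"})


# Four lines taken from the sample output above.
SAMPLE = [
    r"05/21/21 11:53:08 | TRACE | cehost | Service [6132] | [11932] | Start of Request Id #50701 CEService Id #34 CE Request Type: FILETYPE_IDENTIFICATION | CEService.cpp (985)",
    r"05/21/21 11:53:08 | WARN  | cehost | ImageExtractorPlugin [6132] | [11932] | Failed to initialize extraction instance. | PdfExtractor.cpp (82)",
    r"05/21/21 11:53:08 | TRACE | cehost | Verity [6132] | [11932] | Document is encrypted | ..\DocumentIdentificationHelper.c (201)",
    r"05/21/21 11:53:08 | TRACE | cehost | Service [6132] | [11932] | End of Request Id #50701 CEService Id #34 CE Request Type: FILETYPE_IDENTIFICATION Execution Time=124 MilliSeconds | CEService.cpp (995)",
]

print(find_encrypted_pdf_requests(SAMPLE))
```

A request ID in the result suggests that the warning for that request coincided with an encrypted document. A warning whose request ID is absent from the result may have a different cause.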