Errors in admin.log after recreating indexes

Article ID: 125287

Products

Clarity PPM On Premise

Issue/Introduction


I followed the steps in KB000019522 to perform a Filestore reindex, and I can access documents without any issue.
However, admin.log shows the following errors:

1) (admin) Attempt to output character of integral value 0 that is not represented in specified output encoding of UTF-8. 

2) (admin) [Fatal Error] :2:3: The markup in the document preceding the root element must be well-formed. 

3) (admin) java.lang.Throwable: Warning: You did not close the PDF Document 
1/23/19 11:01 AM (admin) at org.pdfbox.cos.COSDocument.finalize(COSDocument.java:420) 
1/23/19 11:01 AM (admin) at java.lang.System$2.invokeFinalize(System.java:1270) 
1/23/19 11:01 AM (admin) at java.lang.ref.Finalizer.runFinalizer(Finalizer.java:98) 
1/23/19 11:01 AM (admin) at java.lang.ref.Finalizer.access$100(Finalizer.java:34) 
1/23/19 11:01 AM (admin) at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:213) 

4) (admin) java.lang.NoSuchMethodException: org.openxmlformats.schemas.wordprocessingml.x2006.main.impl.CTPictureBaseImpl.<init>(org.apache.xmlbeans.SchemaType, boolean) 
1/23/19 11:10 AM (admin) at java.lang.Class.getConstructor0(Class.java:3082) 

Environment

Release: All Supported Releases
Component: PPMCOL

Resolution

These errors are 'acceptable' failures, for the following reasons:
  • Our search capabilities are text-based: words entered in the search box are used as search terms and matched against the text in the documents.
  • The indexer therefore determines the file type (text, MS Word, MS Excel, PDF, etc.) and then uses code that can read the file and extract all the text that can be indexed and used as search terms.
  • Sometimes these files also contain non-text data:
    • It may be embedded binary content (digital signatures, passwords, encrypted data such as protected worksheets, etc.).
    • It may simply be a 'format' or 'version' of the document type that our file-type parser cannot handle.
    • In other cases, a few bytes or characters in a file can genuinely be interpreted in one of two ways:
      • what looks like the start of a UTF byte sequence may in fact just happen to use the same bytes/characters for something else.
  • In all of the cases listed above, an error is reported and part or all of a document might not be indexed. They are nevertheless considered 'acceptable' failures, because the content that could not be indexed was non-text.
  • This is also true of the NoSuchMethodException error. It is probably the clearest example of the problem that occurs when a file uses a newer format version than our libraries can index (e.g. a newer Office document, feature, or add-in in a Word document that is too new for our version of PPM to parse and index).
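
The behavior described above can be illustrated with a minimal sketch. This is not the actual PPM indexer code: the names `sanitize_for_xml`, `extract_plain_text`, `EXTRACTORS` and `index_document` are hypothetical, and only a plain-text parser is registered. It shows three ideas from the list: picking a parser by file type, tolerating bytes that are not valid UTF-8 or characters (such as the character with integral value 0) that cannot be written to XML, and treating a document that cannot be parsed as a logged failure rather than stopping the reindex.

```python
import os
import re

# C0 control characters other than tab, LF and CR (this includes U+0000),
# plus U+FFFE and U+FFFF, are not legal in an XML 1.0 document.
_XML_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def sanitize_for_xml(text):
    """Drop characters that cannot appear in an XML 1.0 document."""
    return _XML_ILLEGAL.sub("", text)


def extract_plain_text(data):
    """Hypothetical plain-text parser: decode as UTF-8 and substitute
    U+FFFD for any byte sequence that is not valid UTF-8."""
    return data.decode("utf-8", errors="replace")


# Hypothetical registry: file extension -> text extractor.
# A real indexer would also register Word, Excel, PDF parsers, etc.
EXTRACTORS = {".txt": extract_plain_text}


def index_document(name, data, index, skipped):
    """Index one document; on any failure, record it and carry on."""
    ext = os.path.splitext(name)[1].lower()
    extractor = EXTRACTORS.get(ext)
    try:
        if extractor is None:
            raise ValueError("no parser for file type %r" % ext)
        index[name] = sanitize_for_xml(extractor(data))
    except Exception as exc:
        # An 'acceptable' failure: this document is not indexed,
        # but the remaining documents are still processed.
        skipped[name] = str(exc)
```

For example, calling `index_document("a.txt", b"ok\x00 text \xff", index, skipped)` stores the text with the NUL character removed and the invalid byte replaced, while a `.docx` file ends up in `skipped` because this sketch registers no parser for it.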