Queries using a Lucene index can return unexpected matches

Article ID: 294408


Updated On:

Products

VMware Tanzu GemFire

Issue/Introduction

When searching a Lucene index with a GFSH command or Java API call, the query may return unexpected results. For example, a wildcard query for "Jones*" can return rows in which a word beginning with "Jones" appears somewhere other than the start of the field.

GFSH Command:

 gfsh> search lucene --name=indexName --region=/orders --queryString="Jones*" --defaultField=customer


Java API call:

 LuceneQuery<String, Person> query = luceneService.createLuceneQueryFactory()
     .create(indexName, "orders", "Jones*", "customer");
 Collection<Person> results = query.findValues();
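
The snippet assumes a LuceneService already obtained from the cache. Below is a minimal sketch of the surrounding code (the class name, method name, and Person value class are illustrative; the index, region, and field names follow the example above). It also waits for the asynchronous Lucene index to flush before querying, so recent puts are visible to the search:

 import java.util.Collection;
 import java.util.concurrent.TimeUnit;

 import org.apache.geode.cache.Cache;
 import org.apache.geode.cache.lucene.LuceneQuery;
 import org.apache.geode.cache.lucene.LuceneQueryException;
 import org.apache.geode.cache.lucene.LuceneService;
 import org.apache.geode.cache.lucene.LuceneServiceProvider;

 public class OrderSearchExample {
   public static Collection<Person> findByCustomerPrefix(Cache cache, String indexName)
       throws LuceneQueryException, InterruptedException {
     // Look up the LuceneService for this cache.
     LuceneService luceneService = LuceneServiceProvider.get(cache);

     // Lucene indexes are updated asynchronously; wait for pending updates to be
     // written to the index so the query reflects the latest puts.
     luceneService.waitUntilFlushed(indexName, "orders", 1, TimeUnit.MINUTES);

     LuceneQuery<String, Person> query = luceneService.createLuceneQueryFactory()
         .create(indexName, "orders", "Jones*", "customer");
     return query.findValues();
   }
 }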


The cause is that the default analyzer for a simple string field tokenizes the value by word rather than indexing the string as a whole, so each word is indexed individually. For example, if the value of the customer field is 'word1 word2 word3', the index contains the separate tokens 'word1', 'word2', and 'word3' for the customer field. The search 'Jones*' is evaluated against each indexed token, so if any word in the customer field starts with "Jones", the entry matches. (Note, however, that a word like "NotAJones" would not match, because the wildcard only matches tokens that begin with "Jones".)
 

You can see this behavior in the following example:

gfsh>create lucene index --name=IdxTest --region=/test --field=__REGION_VALUE_FIELD
               Member                | Status
------------------------------------ | ---------------------------------
192.168.3.2(server1:47018)<v1>:41001 | Successfully created lucene index

gfsh>create region --name=test --type=PARTITION_REDUNDANT
Member  | Status | Message
------- | ------ | ---------------------------------
server1 | OK     | Region "/test" created on "server1"

Cluster configuration for group 'cluster' is updated.

gfsh>put --key='1' --value='abc Jones123 Jones456' --region=/test
Result      : true
Key Class   : java.lang.String
Key         : 1
Value Class : java.lang.String
Old Value   : null

gfsh>put --key='2' --value='abcJones123Jones456' --region=/test
Result      : true
Key Class   : java.lang.String
Key         : 2
Value Class : java.lang.String
Old Value   : null

gfsh>put --key='3' --value='Jones123456' --region=/test
Result      : true
Key Class   : java.lang.String
Key         : 3
Value Class : java.lang.String
Old Value   : null

gfsh>search lucene --name=IdxTest --region=/test --queryString=Jones* --defaultField=__REGION_VALUE_FIELD
key |         value         | score
--- | --------------------- | -----
3   | Jones123456           | 1.0
1   | abc Jones123 Jones456 | 1.0
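
The tokenization itself can be reproduced with a few lines of standalone Lucene code. The sketch below (class name is illustrative) runs the StandardAnalyzer, which GemFire uses when no custom analyzer is configured, over the value stored under key 1 and prints the tokens the index stores:

 import java.io.IOException;
 import java.io.StringReader;

 import org.apache.lucene.analysis.TokenStream;
 import org.apache.lucene.analysis.standard.StandardAnalyzer;
 import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

 public class ShowTokens {
   public static void main(String[] args) throws IOException {
     try (StandardAnalyzer analyzer = new StandardAnalyzer();
          TokenStream stream = analyzer.tokenStream("__REGION_VALUE_FIELD",
              new StringReader("abc Jones123 Jones456"))) {
       CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
       stream.reset();
       while (stream.incrementToken()) {
         // Prints "abc", "jones123", "jones456": each word becomes a separate,
         // lowercased token, so the wildcard query "Jones*" matches this entry.
         System.out.println(term.toString());
       }
       stream.end();
     }
   }
 }

By contrast, the value under key 2, 'abcJones123Jones456', is kept as a single token that does not start with "Jones", which is why it is absent from the search results above.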


Environment

Product Version: 9.10

Resolution

To achieve the desired behavior, you need to write a custom field analyzer that keeps the whole field value together as a single token. An example of such a custom analyzer is given below:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.Analyzer.TokenStreamComponents;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.util.CharTokenizer;

/**
 * Analyzer that emits the entire field value as a single, lowercased token,
 * so a wildcard query such as "Jones*" only matches fields that start with "Jones".
 */
public class MyCharacterAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer tokenizer = new MyCharacterTokenizer();
    TokenStream filter = new LowerCaseFilter(tokenizer);
    return new TokenStreamComponents(tokenizer, filter);
  }

  private static class MyCharacterTokenizer extends CharTokenizer {
    // Treat every character as part of the token, so the field is never split.
    @Override
    protected boolean isTokenChar(final int character) {
      return true;
    }
  }
}
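
To use the custom analyzer, associate it with the field when the Lucene index is created (remember that the Lucene index must be created before the region it indexes). Below is a minimal sketch using the Java API, assuming the MyCharacterAnalyzer class above is deployed on the servers and using the index, region, and field names from the first example:

 import org.apache.geode.cache.Cache;
 import org.apache.geode.cache.lucene.LuceneService;
 import org.apache.geode.cache.lucene.LuceneServiceProvider;

 public class CreateIndexWithCustomAnalyzer {
   public static void createIndex(Cache cache, String indexName) {
     LuceneService luceneService = LuceneServiceProvider.get(cache);

     // Index the "customer" field with the custom analyzer so the whole field
     // value is stored as a single token.
     luceneService.createIndexFactory()
         .addField("customer", new MyCharacterAnalyzer())
         .create(indexName, "orders");
   }
 }

If the index is created with GFSH instead, the create lucene index command's --analyzer option accepts a comma-separated list of fully qualified analyzer class names (one per --field entry), provided the analyzer classes are available on the servers' classpath.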