When you search a Lucene index with a gfsh command or a Java API call, the search may return unexpected results: rows that contain the query string "Jones" somewhere other than at the start of the field value.
GFSH Command:
gfsh> search lucene --name=indexName --region=/orders --queryString="Jones*" --defaultField=customer
Java API call:
LuceneQuery<String, Person> query = luceneService.createLuceneQueryFactory()
    .create(indexName, "Orders", "Jones*", "customer");
Collection<Person> results = query.findValues();
The cause is that the default analyzer for a simple string field tokenizes by word rather than treating the string as a whole, so each word is indexed individually. For example, if the value of the customer field is "word1 word2 word3", the resulting index has the form "customer: word1, word2, word3". The search "Jones*" is performed against every indexed word, so the entry matches if any word in the customer field starts with "Jones". Note, however, that a word like "NotAJones" does not match, because the wildcard only extends the end of the term.
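The matching rule described above can be illustrated with a small plain-Java model. This is only a sketch: it uses no Lucene or Geode classes, the class and method names (WordTokenPrefixDemo, wordTokens, matchesPrefix) are hypothetical, and splitting on non-alphanumeric characters is an assumed simplification of what the default analyzer does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Simplified model of word-level analysis. NOT Lucene's actual analyzer.
class WordTokenPrefixDemo {

    // Assumption: tokens are lowercase runs of letters and digits.
    static List<String> wordTokens(String value) {
        List<String> tokens = new ArrayList<>();
        for (String part : value.toLowerCase(Locale.ROOT).split("[^\\p{Alnum}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // A prefix query such as "Jones*" matches when ANY token starts with the prefix.
    static boolean matchesPrefix(String value, String prefix) {
        String p = prefix.toLowerCase(Locale.ROOT);
        for (String token : wordTokens(value)) {
            if (token.startsWith(p)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] values = {
            "abc Jones123 Jones456", "abcJones123Jones456", "Jones123456", "NotAJones"
        };
        for (String v : values) {
            System.out.println(v + " -> " + wordTokens(v)
                + " matches Jones*: " + matchesPrefix(v, "Jones"));
        }
    }
}
```

In this model, "abc Jones123 Jones456" matches because its second token starts with the prefix, while "abcJones123Jones456" is a single token that starts with "abc" and therefore does not match.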
You can see this behavior in the following example:
gfsh>create lucene index --name=IdxTest --region=/test --field=__REGION_VALUE_FIELD
Member                               | Status
------------------------------------ | ---------------------------------
192.168.3.2(server1:47018)<v1>:41001 | Successfully created lucene index

gfsh>create region --name=test --type=PARTITION_REDUNDANT
Member  | Status | Message
------- | ------ | ---------------------------------
server1 | OK     | Region "/JJ" created on "server1"

Cluster configuration for group 'cluster' is updated.

gfsh>put --key='1' --value='abc Jones123 Jones456' --region=/test
Result      : true
Key Class   : java.lang.String
Key         : 1
Value Class : java.lang.String
Old Value   : null

gfsh>put --key='2' --value='abcJones123Jones456' --region=/test
Result      : true
Key Class   : java.lang.String
Key         : 2
Value Class : java.lang.String
Old Value   : null

gfsh>put --key='3' --value='Jones123456' --region=/test
Result      : true
Key Class   : java.lang.String
Key         : 3
Value Class : java.lang.String
Old Value   : null

gfsh>search lucene --name=IdxTest --region=/test --queryString=Jones* --defaultField=__REGION_VALUE_FIELD
key | value                 | score
--- | --------------------- | -----
3   | Jones123456           | 1.0
1   | abc Jones123 Jones456 | 1.0
To achieve the desired behavior, write a custom field analyzer that keeps the whole field value together as a single token. An example of such a custom analyzer is given below:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.Analyzer.TokenStreamComponents;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.util.CharTokenizer;

public class MyCharacterAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Emit the field value as one token, then lowercase it.
        Tokenizer tokenizer = new MyCharacterTokenizer();
        TokenStream filter = new LowerCaseFilter(tokenizer);
        return new TokenStreamComponents(tokenizer, filter);
    }

    private static class MyCharacterTokenizer extends CharTokenizer {
        @Override
        protected boolean isTokenChar(final int character) {
            // Every character, including whitespace, belongs to the token,
            // so the value is never split at word boundaries.
            return true;
        }
    }
}
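The effect of keeping the field together can be sketched with a plain-Java model applied to the three values from the gfsh example. This is a rough illustration only, not a use of the analyzer itself: the class and method names are hypothetical, it assumes the query prefix is lowercased the same way as the indexed value, and it ignores any maximum token length that the tokenizer may impose.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Simplified model of whole-field analysis: one lowercase token per value.
class WholeFieldPrefixDemo {

    // Models a tokenizer that never splits, followed by lowercasing.
    static String wholeFieldToken(String value) {
        return value.toLowerCase(Locale.ROOT);
    }

    // "Jones*" now matches only when the entire value starts with the prefix.
    static boolean matchesPrefix(String value, String prefix) {
        return wholeFieldToken(value).startsWith(prefix.toLowerCase(Locale.ROOT));
    }

    // Returns the values that a prefix search would select under this model.
    static List<String> search(List<String> values, String prefix) {
        List<String> hits = new ArrayList<>();
        for (String v : values) {
            if (matchesPrefix(v, prefix)) {
                hits.add(v);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> values = new ArrayList<>();
        values.add("abc Jones123 Jones456");
        values.add("abcJones123Jones456");
        values.add("Jones123456");
        System.out.println(search(values, "Jones"));
    }
}
```

Under these assumptions only "Jones123456" is selected, because "abc Jones123 Jones456" is now treated as a single token that begins with "abc".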