When you search a Lucene index with a GFSH command or a Java API call, a prefix query such as "Jones*" may return unexpected results: rows that contain "Jones" in places other than the start of the field.
GFSH Command:
gfsh> search lucene --name=indexName --region=/orders --queryString="Jones*" --defaultField=customer
Java API call:
LuceneQuery<String, Person> query = luceneService.createLuceneQueryFactory()
    .create(indexName, "orders", "Jones*", "customer");
Collection<Person> results = query.findValues();
The cause is that the default analyzer for a simple string field tokenizes the value by word rather than treating the string as a whole, so each word is indexed individually. For example, if the value of the customer field is 'word1 word2 word3', the resulting index is of the form "customer: word1, word2, word3". The search 'Jones*' is evaluated against every indexed word, so it matches if any word in the customer field starts with "Jones". Note, however, that a word like "NotAJones" would not match, because "Jones" is not at the start of that word.
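The matching rule described above can be sketched in plain Java. This is only a simplified model, not Lucene's implementation: the class name WordTokenModel and its methods are hypothetical, and it assumes that the default analyzer splits on non-alphanumeric characters and lowercases each word, which approximates the real behavior for simple values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Simplified model of word-based analysis; NOT Lucene's actual implementation.
class WordTokenModel {

    // Split a field value into lowercase word tokens on any non-alphanumeric character.
    static List<String> tokenize(String value) {
        List<String> tokens = new ArrayList<>();
        for (String part : value.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // A prefix query such as "Jones*" matches if ANY token starts with the prefix.
    static boolean matchesPrefix(String value, String prefix) {
        String p = prefix.toLowerCase(Locale.ROOT);
        for (String token : tokenize(value)) {
            if (token.startsWith(p)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Mary Jones"));               // [mary, jones]
        System.out.println(matchesPrefix("Mary Jones", "Jones")); // true
        System.out.println(matchesPrefix("NotAJones", "Jones"));  // false
    }
}
```

In this model "Mary Jones" matches even though the field does not start with "Jones", because the second word is indexed as its own token.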
You can see this behavior in the following example:
gfsh>create lucene index --name=IdxTest --region=/test --field=__REGION_VALUE_FIELD
Member | Status
------------------------------------ | ---------------------------------
192.168.3.2(server1:47018)<v1>:41001 | Successfully created lucene index
gfsh>create region --name=test --type=PARTITION_REDUNDANT
Member | Status | Message
------- | ------ | ---------------------------------
server1 | OK | Region "/test" created on "server1"
Cluster configuration for group 'cluster' is updated.
gfsh>put --key='1' --value='abc Jones123 Jones456' --region=/test
Result : true
Key Class : java.lang.String
Key : 1
Value Class : java.lang.String
Old Value : null
gfsh>put --key='2' --value='abcJones123Jones456' --region=/test
Result : true
Key Class : java.lang.String
Key : 2
Value Class : java.lang.String
Old Value : null
gfsh>put --key='3' --value='Jones123456' --region=/test
Result : true
Key Class : java.lang.String
Key : 3
Value Class : java.lang.String
Old Value : null
gfsh>search lucene --name=IdxTest --region=/test --queryString=Jones* --defaultField=__REGION_VALUE_FIELD
key | value | score
--- | --------------------- | -----
3 | Jones123456 | 1.0
1 | abc Jones123 Jones456 | 1.0
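Keys 1 and 3 match because each contains a word that starts with "Jones". Key 2 does not match because its value is a single word that starts with "abc". To match only values that begin with the query string, you need to write a custom field analyzer that keeps the whole field together as a single token. An example of such a custom analyzer is given below: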
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.Analyzer.TokenStreamComponents;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.util.CharTokenizer;

// Indexes each field value as a single lowercase token instead of splitting it into words.
public class MyCharacterAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // Emit the whole value as one token, then lowercase it.
        Tokenizer tokenizer = new MyCharacterTokenizer();
        TokenStream filter = new LowerCaseFilter(tokenizer);
        return new TokenStreamComponents(tokenizer, filter);
    }

    private static class MyCharacterTokenizer extends CharTokenizer {
        @Override
        protected boolean isTokenChar(final int character) {
            // Every character, including whitespace, is part of the token,
            // so the tokenizer never splits on a delimiter.
            return true;
        }
    }
}
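To illustrate what changes once the whole field is kept as one token, the following plain-Java sketch replays the three sample values from the gfsh session. It is a simplified model rather than Lucene's implementation: the class name WholeFieldModel and its methods are hypothetical, and it assumes the query prefix is lowercased in the same way as the indexed token.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Simplified model of whole-field analysis; NOT Lucene's actual implementation.
class WholeFieldModel {

    // Mirrors the analyzer above: the entire value becomes one lowercase token.
    static String tokenize(String value) {
        return value.toLowerCase(Locale.ROOT);
    }

    // Returns the keys whose single token starts with the given prefix.
    static List<String> search(Map<String, String> entries, String prefix) {
        String p = prefix.toLowerCase(Locale.ROOT);
        List<String> keys = new ArrayList<>();
        for (Map.Entry<String, String> e : entries.entrySet()) {
            if (tokenize(e.getValue()).startsWith(p)) {
                keys.add(e.getKey());
            }
        }
        return keys;
    }

    // The three values put into the /test region in the gfsh session.
    static Map<String, String> sampleEntries() {
        Map<String, String> entries = new LinkedHashMap<>();
        entries.put("1", "abc Jones123 Jones456");
        entries.put("2", "abcJones123Jones456");
        entries.put("3", "Jones123456");
        return entries;
    }

    public static void main(String[] args) {
        System.out.println(search(sampleEntries(), "Jones")); // [3]
    }
}
```

Under this model only key 3 is returned, because it is the only value that begins with "Jones". Key 1 no longer matches, since its single token starts with "abc".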