模糊查询
-
前缀搜索:prefix
概念:以xx开头的搜索,不计算相关度评分。
注意:
- 前缀搜索匹配的是term,而不是field。
- 前缀搜索的性能很差
- 前缀搜索没有缓存
- 前缀搜索尽可能把前缀长度设置的更长
语法:
GET <index>/_search {"query": {"prefix": {"<field>": {"value": "<word_prefix>"}}} } index_prefixes: 默认 "min_chars" : 2, "max_chars" : 5
#prefix: 前缀搜索
DELETE my_index
# elasticsearch stack
# elasticsearch search
# el
# ela
# elas elasticsearch
PUT my_index
{"mappings": {"properties": {"text": {"analyzer": "ik_max_word","type": "text","index_prefixes":{"min_chars":2,"max_chars":4},"fields": {"keyword": {"type": "keyword","ignore_above": 256}}}}}
}
GET my_index/_mapping
POST /my_index/_bulk?filter_path=items.*.error
{"index":{"_id":"1"}}
{"text":"城管打电话喊商贩去摆摊摊"}
{"index":{"_id":"2"}}
{"text":"笑果文化回应商贩老农去摆摊"}
{"index":{"_id":"3"}}
{"text":"老农耗时17年种出椅子树"}
{"index":{"_id":"4"}}
{"text":"夫妻结婚30多年AA制,被城管抓"}
{"index":{"_id":"5"}}
{"text":"黑人见义勇为阻止抢劫反被铐住"}
GET my_index/_search
GET my_index/_mapping
GET _analyze
{"text": ["夫妻结婚30多年AA制,被城管抓"]
}
GET my_index/_search
{"query": {"prefix": {"text": {"value": "城管"}}}
}
-
通配符:wildcard
概念:通配符运算符是匹配一个或多个字符的占位符。例如,*通配符运算符匹配零个或多个字符。您可以将通配符运算符与其他字符结合使用以创建通配符模式。
注意:
- 通配符匹配的也是term,而不是field
语法:
GET <index>/_search {"query": {"wildcard": {"<field>": {"value": "<word_with_wildcard>"}}} }
# 通配符
DELETE my_index
POST /my_index/_bulk
{ "index": { "_id": "1"} }
{ "text": "my english" }
{ "index": { "_id": "2"} }
{ "text": "my english is good" }
{ "index": { "_id": "3"} }
{ "text": "my chinese is good" }
{ "index": { "_id": "4"} }
{ "text": "my japanese is nice" }
{ "index": { "_id": "5"} }
{ "text": "my disk is full" }
DELETE product_en
POST /product_en/_bulk
{ "index": { "_id": "1"} }
{ "title": "my english","desc" : "shouji zhong de zhandouji","price" : 3999, "tags": [ "xingjiabi", "fashao", "buka", "1"]}
{ "index": { "_id": "2"} }
{ "title": "xiaomi nfc phone","desc" : "zhichi quangongneng nfc,shouji zhong de jianjiji","price" : 4999, "tags": [ "xingjiabi", "fashao", "gongjiaoka" , "asd2fgas"]}
{ "index": { "_id": "3"} }
{ "title": "nfc phone","desc" : "shouji zhong de hongzhaji","price" : 2999, "tags": [ "xingjiabi", "fashao", "menjinka" , "as345"]}
{ "title": { "_id": "4"} }
{ "text": "xiaomi erji","desc" : "erji zhong de huangmenji","price" : 999, "tags": [ "low", "bufangshui", "yinzhicha", "4dsg" ]}
{ "index": { "_id": "5"} }
{ "title": "hongmi erji","desc" : "erji zhong de kendeji","price" : 399, "tags": [ "lowbee", "xuhangduan", "zhiliangx" , "sdg5"]}
GET my_index/_search
GET product_en/_searchGET my_index/_search
{"query": {"wildcard": {"text.keyword": {"value": "my eng*ish"}}}
}
GET product_en/_mapping
#exact value
GET product_en/_search
{"query": {"wildcard": {"tags.keyword": {"value": "men*inka"}}}
}
-
正则:regexp
概念:regexp查询的性能可以根据提供的正则表达式而有所不同。为了提高性能,应避免使用通配符模式,如.或 .?+未经前缀或后缀
语法:
GET <index>/_search {"query": {"regexp": {"<field>": {"value": "<regex>","flags": "ALL",}}} }
#正则
GET product_en/_search
GET product_en/_search
{"query": {"regexp": {"title": "[\\s\\S]*nfc[\\s\\S]*"}}
}
GET product_en/_search
GET product_en/_search
{"query": {"regexp": {"desc": {"value": "zh~dng","flags": "COMPLEMENT"}}}
}
GET product_en/_search
{"query": {"regexp": {"tags.keyword": {"value": ".*<2-3>.*","flags": "INTERVAL"}}}
}
flags
-
ALL
启用所有可选操作符。
-
COMPLEMENT
启用操作符。可以使用对下面最短的模式进行否定。例如
a~bc # matches ‘adc’ and ‘aec’ but not ‘abc’
-
INTERVAL
启用<>操作符。可以使用<>匹配数值范围。例如
foo<1-100> # matches ‘foo1’, ‘foo2’ … ‘foo99’, ‘foo100’
foo<01-100> # matches ‘foo01’, ‘foo02’ … ‘foo99’, ‘foo100’
-
INTERSECTION
启用&操作符,它充当AND操作符。如果左边和右边的模式都匹配,则匹配成功。例如:
aaa.+&.+bbb # matches ‘aaabbb’
-
ANYSTRING
启用@操作符。您可以使用@来匹配任何整个字符串。
您可以将@操作符与&和~操作符组合起来,创建一个“everything except”逻辑。例如:@&~(abc.+) # matches everything except terms beginning with ‘abc’
-
模糊查询:fuzzy
混淆字符 (box → fox) 缺少字符 (black → lack)
多出字符 (sic → sick) 颠倒次序 (act → cat)
语法
GET <index>/_search {"query": {"fuzzy": {"<field>": {"value": "<keyword>"}}} }
# fuzzy:模糊查询
GET product_en/_search
GET product_en/_search
{"query": {"fuzzy": {"desc": {"value": "quangongneng nfc","fuzziness": "2"}}}
}GET product_en/_search
{"query": {"match": {"desc": {"query": "nfe quasdasdasdasd","fuzziness": 1}}}
}
参数:
-
value:(必须,关键词)
-
fuzziness:编辑距离,(0,1,2)并非越大越好,召回率高但结果不准确
-
两段文本之间的Damerau-Levenshtein距离是使一个字符串与另一个字符串匹配所需的插入、删除、替换和调换的数量
-
距离公式:Levenshtein是lucene的,es改进版:Damerau-Levenshtein,
axe=>aex Levenshtein=2 Damerau-Levenshtein=1
-
-
transpositions:(可选,布尔值)指示编辑是否包括两个相邻字符的变位(ab→ba)。默认为true。
-
短语前缀:match_phrase_prefix
match_phrase:
- match_phrase会分词
- 被检索字段必须包含match_phrase中的所有词项并且顺序必须是相同的
- 被检索字段包含的match_phrase中的词项之间不能有其他词项
概念:
match_phrase_prefix与match_phrase相同,但是它多了一个特性,就是它允许在文本的最后一个词项(term)上的前缀匹配,如果 是一个单词,比如a,它会匹配文档字段所有以a开头的文档,如果是一个短语,比如 “this is ma” ,他会先在倒排索引中做以ma做前缀搜索,然后在匹配到的doc中做match_phrase查询,(网上有的说是先match_phrase,然后再进行前缀搜索, 是不对的)
参数
- analyzer 指定何种分析器来对该短语进行分词处理
- max_expansions 限制匹配的最大词项
- boost 用于设置该查询的权重
- slop 允许短语间的词项(term)间隔:slop 参数告诉 match_phrase 查询词条相隔多远时仍然能将文档视为匹配 什么是相隔多远? 意思是说为了让查询和文档匹配你需要移动词条多少次?
原理解析:https://www.elastic.co/cn/blog/found-fuzzy-search#performance-considerations
# match_phrase_prefix
GET product_en/_search
{"query": {"match_phrase": {"desc": "shouji zhong de"}}
}GET product_en/_search
{"query": {"match_phrase_prefix": {"desc": {"query": "de zhong shouji hongzhaji","max_expansions": 50,"slop":3}}}
}GET product_en/_search
{"query": {"match_phrase_prefix": {"desc": {"query": "zhong hongzhaji","max_expansions": 50,"slop": 3}}}
}# source: zhong de hongzhaji
# query: zhong > hongzhaji# source: shouji zhong de hongzhaji
# query: de zhong shouji hongzhaji# de shouji/zhong hongzhaji 1次
# shouji/de zhong hongzhaji 2次
# shouji zhong/de hongzhaji 3次
# shouji zhong de hongzhaji 4次
-
N-gram和edge ngram
tokenizer
GET _analyze {"tokenizer": "ngram","text": "reba always loves me" }
token filter
GET _analyze {"tokenizer": "ik_max_word","filter": [ "ngram" ],"text": "reba always loves me" }
min_gram:创建索引所拆分字符的最小阈值
max_gram:创建索引所拆分字符的最大阈值
ngram:从每一个字符开始,按照步长,进行分词,适合前缀中缀检索
edge_ngram:从第一个字符开始,按照步长,进行分词,适合前缀匹配场景
# ngram 和 edge-ngram
#ngram min_gram =1 “max_gram”: 2
GET _analyze
{
“tokenizer”: “ik_max_word”,
“filter”: [ “edge_ngram” ],
“text”: “reba always loves me”
}
#min_gram =1 “max_gram”: 1
#r a l m
#min_gram =1 “max_gram”: 2
#r a l m
#re al lo me
#min_gram =2 “max_gram”: 3
#re al lo me
#reb alw lov me
PUT my_index
{
“settings”: {
“analysis”: {
“filter”: {
“2_3_edge_ngram”: {
“type”: “edge_ngram”,
“min_gram”: 2,
“max_gram”: 3
}
},
“analyzer”: {
“my_edge_ngram”: {
“type”:“custom”,
“tokenizer”: “standard”,
“filter”: [ “2_3_edge_ngram” ]
}
}
}
},
“mappings”: {
“properties”: {
“text”: {
“type”: “text”,
“analyzer”:“my_edge_ngram”,
“search_analyzer”: “standard”
}
}
}
}
GET /my_index/_mapping
POST /my_index/_bulk
{ “index”: { “_id”: “1”} }
{ “text”: “my english” }
{ “index”: { “_id”: “2”} }
{ “text”: “my english is good” }
{ “index”: { “_id”: “3”} }
{ “text”: “my chinese is good” }
{ “index”: { “_id”: “4”} }
{ “text”: “my japanese is nice” }
{ “index”: { “_id”: “5”} }
{ “text”: “my disk is full” }
GET /my_index/_search
GET /my_index/_mapping
GET /my_index/_search
{
“query”: {
“match_phrase”: {
“text”: “my eng is goo”
}
}
}
PUT my_index2
{
“settings”: {
“analysis”: {
“filter”: {
“2_3_grams”: {
“type”: “edge_ngram”,
“min_gram”: 2,
“max_gram”: 3
}
},
“analyzer”: {
“my_edge_ngram”: {
“type”:“custom”,
“tokenizer”: “standard”,
“filter”: [ “2_3_grams” ]
}
}
}
},
“mappings”: {
“properties”: {
“text”: {
“type”: “text”,
“analyzer”:“my_edge_ngram”,
“search_analyzer”: “standard”
}
}
}
}
GET /my_index2/_mapping
POST /my_index2/_bulk
{ “index”: { “_id”: “1”} }
{ “text”: “my english” }
{ “index”: { “_id”: “2”} }
{ “text”: “my english is good” }
{ “index”: { “_id”: “3”} }
{ “text”: “my chinese is good” }
{ “index”: { “_id”: “4”} }
{ “text”: “my japanese is nice” }
{ “index”: { “_id”: “5”} }
{ “text”: “my disk is full” }
GET /my_index2/_search
{
“query”: {
“match_phrase”: {
“text”: “my eng is goo”
}
}
}
GET _analyze
{
“tokenizer”: “ik_max_word”,
“filter”: [ “ngram” ],
“text”: “用心做皮肤,用脚做游戏”
}