ES tokenization and search problem: Chinese + digits

This topic was created 944 days ago; the information in it may have since evolved or changed.

Version: 7.17.2

Tokenization test

Request
{
  "field": "cn_name", 
  "text": "山崎 12"
}

Response
{
  "tokens" : [
    {
      "token" : "山崎",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "12",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "ARABIC",
      "position" : 1
    }
  ]
}

Query

Request
{
  "profile": true,
  "explain": true,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "cn_name": {
              "query": "山崎 12"
            }
          }
        }
      ]
    }
  },
  "from": 0,
  "size": 10
}

Response
....
"_explanation" : {
          "value" : 9.302625,
          "description" : "sum of:",
          "details" : [
            {
              "value" : 9.302625,
              "description" : "weight(cn_name:12 in 13135) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 9.302625,
                  "description" : "score(freq=1.0), computed as boost * idf * tf from:",

Only the number 12 is matched; 山崎 is not. Yet the profile output shows that the query contains both 山崎 and 12.

"profile" : {
    "shards" : [
      {
        "id" : "[x-x][_tables][0]",
        "searches" : [
          {
            "query" : [
              {
                "type" : "BooleanQuery",
                "description" : "cn_name:山崎 cn_name:12",

Request
{
  "profile": true,
  "explain": true,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "cn_name": {
              "query": "山崎 12",
              "operator": "and"
            }
          }
        }
      ]
    }
  },
  "from": 0,
  "size": 10
}

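I added the operator parameter as a test, but then nothing matches at all. Searching for 山崎 12 年 does match. What further tests should I run to verify this, and where should I start looking for the problem?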

12 replies • 2022-07-25 14:52:03 +08:00

bxb100

2022-07-25 11:11:33 +08:00

Check whether the search analyzer is standard.

Morriaty

2022-07-25 11:15:01 +08:00

GET <index_name>/_validate/query?rewrite=true shows how the query is split into terms.

fengci

2022-07-25 11:20:59 +08:00

@bxb100 I did specify the analyzer.

fengci

2022-07-25 11:21:44 +08:00

@Morriaty

```
Request
{
"query": {
"match": {
"cn_name": {
"query": "山崎 12"
}
}
}
}

Response
{
"_shards" : {
"total" : 1,
"successful" : 1,
"failed" : 0
},
"valid" : true,
"explanations" : [
{
"index" : "whiskey_depot_1658716667",
"valid" : true,
"explanation" : "cn_name:山崎 cn_name:12"
}
]
}
```
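
The `_validate/query?rewrite=true` check above can also be scripted. The following is a minimal sketch using only the Python standard library. The host `http://localhost:9200` and the helper name `build_validate_request` are assumptions for illustration, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_validate_request(base_url, index, field, text):
    """Build (but do not send) a GET <index>/_validate/query?rewrite=true request."""
    body = {"query": {"match": {field: {"query": text}}}}
    return urllib.request.Request(
        url=f"{base_url}/{index}/_validate/query?rewrite=true",
        data=json.dumps(body, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="GET",
    )


# Assumed local node; the index name is the one shown in the response above.
req = build_validate_request(
    "http://localhost:9200", "whiskey_depot_1658716667", "cn_name", "山崎 12"
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing `req` to `urllib.request.urlopen` would send it to a running node; the `explanation` field of the response is where the rewritten terms appear.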

fengci

2022-07-25 11:31:28 +08:00

山崎 comes from a custom dictionary. The field only uses ik_smart tokenization with no other filters, and both search and indexing use ik_smart.

misaka19000

2022-07-25 11:34:43 +08:00

Could the index be wrong?

fengci

2022-07-25 11:40:02 +08:00

@misaka19000 It looks like the original content was indexed without 12 as a token; 12 年 was emitted as the token instead. raw: 山崎 12 年金花标单一麦芽威士忌

{
"tokens" : [
{
"token" : "山崎",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "12 年",
"start_offset" : 3,
"end_offset" : 6,
"type" : "TYPE_CQUAN",
"position" : 1
},
{
"token" : "金花",
"start_offset" : 7,
"end_offset" : 9,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "标",
"start_offset" : 9,
"end_offset" : 10,
"type" : "CN_CHAR",
"position" : 3
},
{
"token" : "单一麦芽",
"start_offset" : 11,
"end_offset" : 15,
"type" : "CN_WORD",
"position" : 4
},
{
"token" : "威士忌",
"start_offset" : 15,
"end_offset" : 18,
"type" : "CN_WORD",
"position" : 5
}
]
}

However, I previously had a separate dictionary entry for 12; I only removed the number dictionary just now for testing.
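
Why the two searches behave differently can be sketched with plain sets. This is a toy model of term matching, not Elasticsearch's actual implementation. The token list is copied from the analysis output above, `matches` is a hypothetical helper, and the query text is assumed to be analyzed the same way as the indexed text.

```python
# Tokens stored for the document, copied from the analysis output above.
indexed_tokens = {"山崎", "12 年", "金花", "标", "单一麦芽", "威士忌"}


def matches(query_tokens, doc_tokens, operator="or"):
    """Toy model of a match query: 'or' needs any term, 'and' needs all terms."""
    hits = [token in doc_tokens for token in query_tokens]
    return all(hits) if operator == "and" else any(hits)


# "山崎 12" is analyzed into 山崎 + 12, but the document holds "12 年", not "12".
print(matches(["山崎", "12"], indexed_tokens, "and"))     # False
# "山崎 12 年" is analyzed into 山崎 + "12 年", which the document does hold.
print(matches(["山崎", "12 年"], indexed_tokens, "and"))  # True
```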

novolunt

2022-07-25 13:05:24 +08:00

{
"tokens" : [
{
"token" : "山崎",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "12",
"start_offset" : 3,
"end_offset" : 5,
"type" : "ARABIC",
"position" : 1
},
{
"token" : "年",
"start_offset" : 6,
"end_offset" : 7,
"type" : "CN_CHAR",
"position" : 2
},
{
"token" : "金花",
"start_offset" : 8,
"end_offset" : 10,
"type" : "CN_WORD",
"position" : 3
},
{
"token" : "标",
"start_offset" : 10,
"end_offset" : 11,
"type" : "CN_CHAR",
"position" : 4
},
{
"token" : "单一",
"start_offset" : 12,
"end_offset" : 14,
"type" : "CN_WORD",
"position" : 5
},
{
"token" : "麦芽",
"start_offset" : 14,
"end_offset" : 16,
"type" : "CN_WORD",
"position" : 6
},
{
"token" : "威士忌",
"start_offset" : 16,
"end_offset" : 19,
"type" : "CN_WORD",
"position" : 7
},
{
"token" : "1990",
"start_offset" : 19,
"end_offset" : 23,
"type" : "ARABIC",
"position" : 8
}
]
}

novolunt

2022-07-25 13:07:22 +08:00

bin/elasticsearch-plugin -v install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.17.2/elasticsearch-analysis-ik-7.17.2.zip

fengci

2022-07-25 14:01:48 +08:00

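@novolunt Thanks, found the problem. I had taken an extra step earlier and added 12 年 to the dictionary, so when the original content was indexed, tokenization produced 12 年 and not 12.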

WhereverYouGo

2022-07-25 14:46:03 +08:00

Try a terms query.

fengci

2022-07-25 14:52:03 +08:00

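@WhereverYouGo raw: 山崎 12 年金花标单一麦芽威士忌. My own custom dictionary has a number + 年 entry, so when the document was indexed it only got the token 12 年 and no 12 token. That is why the search found nothing. Thanks.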
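
A mismatch like this can be spotted by diffing query-time tokens against index-time tokens. Below is a rough sketch, assuming the two `_analyze` responses have already been fetched. The JSON is abbreviated from the outputs in this thread (only the `token` field is kept), and `tokens_of` and `missing_terms` are hypothetical helper names.

```python
import json

# Abbreviated _analyze responses: query text first, then the raw document text.
query_analysis = json.loads('{"tokens": [{"token": "山崎"}, {"token": "12"}]}')
index_analysis = json.loads(
    '{"tokens": [{"token": "山崎"}, {"token": "12 年"}, {"token": "金花"},'
    ' {"token": "标"}, {"token": "单一麦芽"}, {"token": "威士忌"}]}'
)


def tokens_of(analysis):
    """Extract the token strings from an _analyze response body."""
    return [entry["token"] for entry in analysis["tokens"]]


def missing_terms(query_side, index_side):
    """Return the query terms that the indexed document never produced."""
    indexed = set(tokens_of(index_side))
    return [token for token in tokens_of(query_side) if token not in indexed]


print(missing_terms(query_analysis, index_analysis))  # ['12']
```

Any term in the result is one that a query with operator set to and can never satisfy for that document.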