This article introduces DisjunctionMaxQuery in Lucene/ElasticSearch/Solr. Let me start with the description of this query from the Lucene 8.2.0 JavaDoc:

A query that generates the union of documents produced by its subqueries, and that scores each document with the maximum score for that document as produced by any subquery, plus a tie breaking increment for any additional matching subqueries. This is useful when searching for a word in multiple fields with different boost factors (so that the fields cannot be combined equivalently into a single search field). We want the primary score to be the one associated with the highest boost, not the sum of the field scores (as BooleanQuery would give). If the query is "albino elephant" this ensures that "albino" matching one field and "elephant" matching another gets a higher score than "albino" matching both fields. To get this result, use both BooleanQuery and DisjunctionMaxQuery: for each term a DisjunctionMaxQuery searches for it in each field, while the set of these DisjunctionMaxQuery's is combined into a BooleanQuery. The tie breaker capability allows results that include the same term in multiple fields to be judged better than results that include this term in only the best of those multiple fields, without confusing this with the better case of two different terms in the multiple fields.

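If you already know what DisjunctionMaxQuery means, the passage above is easy to follow: the query returns the union of the documents matched by its subqueries, and when a document matches several subqueries at once, the highest of those subquery scores becomes that document's final score. That is still a mouthful, so let's go straight to an example and see what problem this query solves. Once you have read it, the passage above will make sense.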

An example

For convenience, I use ES for the walkthrough. First create an index named dis-max-test and insert 2 documents, each with a name field and an introduction field:

// Insert data
// First document: information about Bill Gates
// Second document: information about Melinda Gates
PUT dis-max-test/_bulk
{ "index": {}}
{ "name": "William Henry Gates III, Bill Gates", "introduction": "Founder of Microsoft Corporation."}
{ "index": {}}
{ "name": "Melinda Gates", "introduction": "Wife of Gates, a former general manager at Microsoft."}

Suppose we now want to search for content related to “Bill Gates”. We can do so with the following request:

# Search request
GET dis-max-test/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "name": "Bill Gates"} },
        { "match": { "introduction": "Bill Gates"}}
      ]
    }
  }
}

This request searches for documents whose name or introduction field contains “Bill Gates”. The result is as follows:

# Search result
{
  "took" : 4,
  "timed_out" : false,
  "hits" : {
    "max_score" : 0.8281169,
    "hits" : [
      {
        "_index" : "dis-max-test",
        "_type" : "_doc",
        "_id" : "Ge7O3W0BYOeS6h1DGlUi",
        "_score" : 0.8281169,        # score
        "_source" : {
          "name" : "Melinda Gates",
          "introduction" : "Wife of Gates, a former general manager at Microsoft."
        }
      },
      {
        "_index" : "dis-max-test",
        "_type" : "_doc",
        "_id" : "GO7O3W0BYOeS6h1DGlUi",
        "_score" : 0.7952278,         # score
        "_source" : {
          "name" : "William Henry Gates III, Bill Gates",
          "introduction" : "Founder of Microsoft Corporation."
        }
      }
    ]
  }
}

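The result is correct: a match query first analyzes “Bill Gates” into the terms Bill and Gates, and every hit contains at least one of them. One thing is unsatisfying, though. Given our search intent, the second hit is the closer match, because it contains the full “Bill Gates”. Yet its score is lower than the first hit's (that is, it is considered a weaker match), so it ranks second. Before analyzing why, let's run the same search with Lucene's DisjunctionMaxQuery (called dis_max in ES):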

# Query
GET dis-max-test/_search
{
  "query": {
    "dis_max": {
      "queries": [
        { "match": { "name": "Bill Gates"} },
        { "match": { "introduction": "Bill Gates"}}
        ]
    }
  }
}

The dis_max query is made up of several match queries, and its conditions are the same as those of the bool-should query above. Here is the result:

{
  "took" : 3,
  "timed_out" : false,
  "hits" : {
    "max_score" : 0.7952278,
    "hits" : [
      {
        "_index" : "dis-max-test",
        "_type" : "_doc",
        "_id" : "GO7O3W0BYOeS6h1DGlUi",
        "_score" : 0.7952278,
        "_source" : {
          "name" : "William Henry Gates III, Bill Gates",
          "introduction" : "Founder of Microsoft Corporation."
        }
      },
      {
        "_index" : "dis-max-test",
        "_type" : "_doc",
        "_id" : "Ge7O3W0BYOeS6h1DGlUi",
        "_score" : 0.59891266,
        "_source" : {
          "name" : "Melinda Gates",
          "introduction" : "Wife of Gates, a former general manager at Microsoft."
        }
      }
    ]
  }
}

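As you can see, the query returns the same documents as before. The difference is that the document containing the full phrase “Bill Gates” now ranks first, because its score is higher than that of the Melinda Gates document, which is exactly what we want. By now you probably have a feel for it: although dis_max and bool-should use the same query conditions, they score the results differently, and dis_max seems closer to our search intent. Let's explore what causes the difference.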

How it works

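ES queries support an explain parameter. When it is set to true, the response additionally includes how each score was computed (the _explanation section). The parameter defaults to false; let's set it to true, rerun the two queries above, and look at the details behind the two different results.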
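An _explanation object is a nested tree of value, description and details, which gets hard to read once it is a few levels deep. As a rough sketch (the helper name format_explanation and the sample dictionary are my own illustration, not part of ES), a few lines of Python can flatten such a tree into an indented outline:

```python
def format_explanation(node, depth=0):
    """Return one outline line per node of an _explanation tree."""
    indent = "  " * depth
    lines = [f"{indent}{node['value']}  {node['description']}"]
    # "details" holds the child explanations; leaves may have an empty list.
    for child in node.get("details", []):
        lines.extend(format_explanation(child, depth + 1))
    return lines


# Hand-copied fragment in the shape of an _explanation object
# (leaf details are left empty for brevity).
sample = {
    "value": 0.7952278,
    "description": "sum of:",
    "details": [
        {
            "value": 0.5754429,
            "description": "weight(name:bill in 0) [PerFieldSimilarity], result of:",
            "details": [],
        },
        {
            "value": 0.21978492,
            "description": "weight(name:gates in 0) [PerFieldSimilarity], result of:",
            "details": [],
        },
    ],
}

outline = format_explanation(sample)
print("\n".join(outline))
```

In practice you would pass in the parsed `_explanation` value of a hit instead of the hand-written sample.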

First, the bool-should query:

# Query
GET dis-max-test/_search
{
  "explain": true, 
  "query": {
    "bool": {
      "should": [
        { "match": { "name": "Bill Gates"} },
        { "match": { "introduction": "Bill Gates"}}
      ]
    }
  }
}

# Query result
{
  "took" : 5,
  "timed_out" : false,
  "hits" : {
    "max_score" : 0.8281169,
    "hits" : [
      {
        "_shard" : "[dis-max-test][0]",
        "_node" : "aIzM2bJFT_afjUgEMxWosg",
        "_index" : "dis-max-test",
        "_type" : "_doc",
        "_id" : "Ge7O3W0BYOeS6h1DGlUi",
        "_score" : 0.8281169,
        "_source" : {
          "name" : "Melinda Gates",
          "introduction" : "Wife of Gates, a former general manager at Microsoft."
        },
        "_explanation" : {
          "value" : 0.8281169,
          "description" : "sum of:",        # note this line!
          "details" : [
            {
              "value" : 0.22920427,
              "description" : "sum of:",
              "details" : [
                {
                  "value" : 0.22920427,
                  "description" : "weight(name:gates in 0) [PerFieldSimilarity], result of:",
                  "details" : [/* score calculation details omitted */]
                }
              ]
            },
            {
              "value" : 0.59891266,
              "description" : "sum of:",
              "details" : [
                {
                  "value" : 0.59891266,
                  "description" : "weight(introduction:gates in 0) [PerFieldSimilarity], result of:",
                  "details" : [/* score calculation details omitted */]
                }
              ]
            }
          ]
        }
      },
      {
        "_shard" : "[dis-max-test][0]",
        "_node" : "aIzM2bJFT_afjUgEMxWosg",
        "_index" : "dis-max-test",
        "_type" : "_doc",
        "_id" : "GO7O3W0BYOeS6h1DGlUi",
        "_score" : 0.7952278,
        "_source" : {
          "name" : "William Henry Gates III, Bill Gates",
          "introduction" : "Founder of Microsoft Corporation."
        },
        "_explanation" : {
          "value" : 0.7952278,
          "description" : "sum of:",
          "details" : [
            {
              "value" : 0.7952278,
              "description" : "sum of:",
              "details" : [
                {
                  "value" : 0.5754429,
                  "description" : "weight(name:bill in 0) [PerFieldSimilarity], result of:",
                  "details" : [/* score calculation details omitted */]
                },
                {
                  "value" : 0.21978492,
                  "description" : "weight(name:gates in 0) [PerFieldSimilarity], result of:",
                  "details" : [/* score calculation details omitted */]
                }
              ]
            }
          ]
        }
      }
    ]
  }
}

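To save space and keep things readable, I have omitted the score calculation details; a separate article will cover them later. In this query, the Melinda Gates document scores 0.8281169, higher than the 0.7952278 of the Bill Gates document. In other words, for this query ES considers the Melinda Gates document a closer match for our search text “Bill Gates” than the Bill Gates document. The reason is as follows: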

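  • For the Melinda Gates document, the score 0.8281169 is the sum of the two subquery scores in the details array below it, 0.22920427 and 0.59891266 (that is what "sum of" in the description field means): 0.22920427 comes from the name field containing the search term Gates, and 0.59891266 comes from the introduction field also containing Gates.
  • For the Bill Gates document, the score 0.7952278 is the sum of 0.5754429 and 0.21978492 below it. 0.5754429 comes from name containing bill, and 0.21978492 comes from name containing gates. The introduction field has no match, so it contributes nothing.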

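Now we understand why the Bill Gates document scores lower even though it is closer to the search intent. For a Boolean query, the total score is the sum of the scores of its subqueries (each element of the outer details array in the JSON above corresponds to one subquery). The Melinda Gates document contains no bill, but Gates appears in both of its fields, so the scores add up to a higher total. In some real scenarios, though, summing all subquery scores does not accurately reflect the search intent. The Chinese proverb says three cobblers together equal one Zhuge Liang (the master strategist), but much of the time they simply do not.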
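The summation described above is easy to replay by hand. The following is only a sketch: the scores are copied from the explain output, while the dictionary layout and the function name bool_should_score are made up for illustration. ES works with 32-bit floats, so the last digits can differ slightly from what Python prints.

```python
# Per-subquery scores copied from the explain output above.
# A field that is absent means that subquery did not match the document.
subquery_scores = {
    "Melinda Gates": {"name": 0.22920427, "introduction": 0.59891266},
    # The name subquery matched both "bill" and "gates" in this document.
    "Bill Gates": {"name": 0.5754429 + 0.21978492},
}


def bool_should_score(scores):
    """bool/should: add up the scores of every matching subquery."""
    return sum(scores.values())


bool_totals = {doc: bool_should_score(s) for doc, s in subquery_scores.items()}
ranking = sorted(bool_totals, key=bool_totals.get, reverse=True)

for doc in ranking:
    print(f"{doc}: {bool_totals[doc]:.7f}")
```

The Melinda Gates document ends up first purely because it collects a score from two fields, not because either field matches well.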

DisjunctionMaxQuery, the star of this article, was created to solve this problem. Look at the following query:

GET dis-max-test/_search
{
  "explain": true, 
  "query": {
    "dis_max": {
      "queries": [
        { "match": { "name": "Bill Gates"} },
        { "match": { "introduction": "Bill Gates"}}
        ]
    }
  }
}

# Query result
{
  "took" : 5,
  "timed_out" : false,
  "hits" : {
    "max_score" : 0.7952278,
    "hits" : [
      {
        "_shard" : "[dis-max-test][0]",
        "_node" : "aIzM2bJFT_afjUgEMxWosg",
        "_index" : "dis-max-test",
        "_type" : "_doc",
        "_id" : "GO7O3W0BYOeS6h1DGlUi",
        "_score" : 0.7952278,
        "_source" : {
          "name" : "William Henry Gates III, Bill Gates",
          "introduction" : "Founder of Microsoft Corporation."
        },
        "_explanation" : {
          "value" : 0.7952278,
          "description" : "max of:",        # note this line
          "details" : [
            {
              "value" : 0.7952278,
              "description" : "sum of:",
              "details" : [
                {
                  "value" : 0.5754429,
                  "description" : "weight(name:bill in 0) [PerFieldSimilarity], result of:",
                  "details" : [/* score calculation details omitted */]
                },
                {
                  "value" : 0.21978492,
                  "description" : "weight(name:gates in 0) [PerFieldSimilarity], result of:",
                  "details" : [/* score calculation details omitted */]
                }
              ]
            }
          ]
        }
      },
      {
        "_shard" : "[dis-max-test][0]",
        "_node" : "aIzM2bJFT_afjUgEMxWosg",
        "_index" : "dis-max-test",
        "_type" : "_doc",
        "_id" : "Ge7O3W0BYOeS6h1DGlUi",
        "_score" : 0.59891266,
        "_source" : {
          "name" : "Melinda Gates",
          "introduction" : "Wife of Gates, a former general manager at Microsoft."
        },
        "_explanation" : {
          "value" : 0.59891266,
          "description" : "max of:",       # note this line
          "details" : [
            {
              "value" : 0.22920427,
              "description" : "sum of:",
              "details" : [
                {
                  "value" : 0.22920427,
                  "description" : "weight(name:gates in 0) [PerFieldSimilarity], result of:",
                  "details" : [/* score calculation details omitted */]
                }
              ]
            },
            {
              "value" : 0.59891266,
              "description" : "sum of:",
              "details" : [
                {
                  "value" : 0.59891266,
                  "description" : "weight(introduction:gates in 0) [PerFieldSimilarity], result of:",
                  "details" : [/* score calculation details omitted */]
                }
              ]
            }
          ]
        }
      }
    ]
  }
}

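This looks similar to the previous output, but when dis_max computes the final score it does not add up the matching subqueries; it takes the highest-scoring subquery's result as the final score (that is the "description" : "max of:" line). Note the distinction: the DisjunctionMaxQuery level takes the max, while each subquery still computes its own score internally as a sum.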

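At this point we understand what DisjunctionMaxQuery means. Like BooleanQuery, it consists of multiple subqueries. When BooleanQuery computes the total score of a matching document, it adds up the scores of all matching subqueries, whereas DisjunctionMaxQuery takes the score of the highest-scoring subquery as the document's final score.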
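To make the "take the max" rule concrete, here is a small sketch that replays it on the two documents. The scores come from the explain output; the data layout and the name dis_max_score are hypothetical, and the tie breaker is left at its default of 0, so it plays no role here.

```python
# Per-subquery scores copied from the explain output above.
subquery_scores = {
    "Melinda Gates": {"name": 0.22920427, "introduction": 0.59891266},
    # The name subquery itself is still a sum over its matching terms.
    "Bill Gates": {"name": 0.5754429 + 0.21978492},
}


def dis_max_score(scores):
    """dis_max with tie_breaker = 0: keep only the best subquery score."""
    return max(scores.values())


dis_max_totals = {doc: dis_max_score(s) for doc, s in subquery_scores.items()}
ranking = sorted(dis_max_totals, key=dis_max_totals.get, reverse=True)

for doc in ranking:
    print(f"{doc}: {dis_max_totals[doc]:.7f}")
```

With the max rule, the weaker name match of the Melinda Gates document is dropped, and the Bill Gates document moves to the top.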

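Back to the cobblers. Suppose Zhuge Liang's IQ is 145 and the three cobblers' IQs are 91, 85 and 84. If you ask BooleanQuery who is smarter, Zhuge Liang or the three cobblers, it will say the three cobblers, because Zhuge Liang's IQ is 145 while the cobblers' is 91+85+84=260. That is obviously wrong. Ask DisjunctionMaxQuery the same question and it will say Zhuge Liang, because his IQ is 145 while the cobblers' is 91.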

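Of course, sometimes there really is strength in numbers. Three cobblers together may not beat one Zhuge Liang, but they can usually beat any single one of themselves, so simply taking the highest and ignoring the other two IQs is not always appropriate either, especially when their skills lie in different areas. That is why DisjunctionMaxQuery also provides a tie_breaker parameter, whose valid range is [0, 1] and whose default is 0. When computing the final score, DisjunctionMaxQuery takes the highest score and adds to it the scores of the other matching subqueries multiplied by tie_breaker. In other words, rather than bluntly summing everything like BooleanQuery, it gives the non-maximum scores a weight; after all, quantity can turn into quality, and ignoring them entirely is not quite right either. What value to use for tie_breaker depends on the specific use case.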
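The rule can be written as score = max + tie_breaker × (sum of the other matching scores). Below is a minimal sketch of that formula applied to the cobblers; the function name and the range check are my own illustration, not the Lucene implementation.

```python
def dis_max_score(scores, tie_breaker=0.0):
    """Best score plus tie_breaker times the sum of all the other scores."""
    if not 0.0 <= tie_breaker <= 1.0:
        raise ValueError("tie_breaker must be in [0, 1]")
    best = max(scores)
    others = sum(scores) - best
    return best + tie_breaker * others


cobblers = [91, 85, 84]

# tie_breaker = 0 keeps only the best score, tie_breaker = 1 degenerates
# into the plain sum that BooleanQuery would produce.
for tie_breaker in (0.0, 0.5, 1.0):
    print(tie_breaker, dis_max_score(cobblers, tie_breaker))
```

So tie_breaker is a dial between the pure max behaviour and the pure sum behaviour.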

Take the dis_max query above again, but change tie_breaker from its default of 0 to 0.9, and the query result changes too:

# Query
GET dis-max-test/_search
{
  "query": {
    "dis_max": {
      "queries": [
        { "match": { "name": "Bill Gates"} },
        { "match": { "introduction": "Bill Gates"}}
        ],
        "tie_breaker": 0.9
    }
  }
}

# Query result
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.80519646,
    "hits" : [
      {
        "_index" : "dis-max-test",
        "_type" : "_doc",
        "_id" : "Ge7O3W0BYOeS6h1DGlUi",
        "_score" : 0.80519646,
        "_source" : {
          "name" : "Melinda Gates",
          "introduction" : "Wife of Gates, a former general manager at Microsoft."
        }
      },
      {
        "_index" : "dis-max-test",
        "_type" : "_doc",
        "_id" : "GO7O3W0BYOeS6h1DGlUi",
        "_score" : 0.7952278,
        "_source" : {
          "name" : "William Henry Gates III, Bill Gates",
          "introduction" : "Founder of Microsoft Corporation."
        }
      }
    ]
  }
}

Use explain to see how the scores of the query above are computed:

GET dis-max-test/_search
{
  "explain": true, 
  "query": {
    "dis_max": {
      "queries": [
        { "match": { "name": "Bill Gates"} },
        { "match": { "introduction": "Bill Gates"}}
        ],
        "tie_breaker": 0.9
    }
  }
}

# Query result
{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.80519646,
    "hits" : [
      {
        "_shard" : "[dis-max-test][0]",
        "_node" : "aIzM2bJFT_afjUgEMxWosg",
        "_index" : "dis-max-test",
        "_type" : "_doc",
        "_id" : "Ge7O3W0BYOeS6h1DGlUi",
        "_score" : 0.80519646,
        "_source" : {
          "name" : "Melinda Gates",
          "introduction" : "Wife of Gates, a former general manager at Microsoft."
        },
        "_explanation" : {
          "value" : 0.80519646,
          "description" : "max plus 0.9 times others of:",
          "details" : [
            {
              "value" : 0.22920427,
              "description" : "sum of:",
              "details" : [
                {
                  "value" : 0.22920427,
                  "description" : "weight(name:gates in 0) [PerFieldSimilarity], result of:",
                  "details" : [/* score calculation details omitted */]
                }
              ]
            },
            {
              "value" : 0.59891266,
              "description" : "sum of:",
              "details" : [
                {
                  "value" : 0.59891266,
                  "description" : "weight(introduction:gates in 0) [PerFieldSimilarity], result of:",
                  "details" : [/* score calculation details omitted */]
                }
              ]
            }
          ]
        }
      },
      {
        "_shard" : "[dis-max-test][0]",
        "_node" : "aIzM2bJFT_afjUgEMxWosg",
        "_index" : "dis-max-test",
        "_type" : "_doc",
        "_id" : "GO7O3W0BYOeS6h1DGlUi",
        "_score" : 0.7952278,
        "_source" : {
          "name" : "William Henry Gates III, Bill Gates",
          "introduction" : "Founder of Microsoft Corporation."
        },
        "_explanation" : {
          "value" : 0.7952278,
          "description" : "max plus 0.9 times others of:",
          "details" : [
            {
              "value" : 0.7952278,
              "description" : "sum of:",
              "details" : [
                {
                  "value" : 0.5754429,
                  "description" : "weight(name:bill in 0) [PerFieldSimilarity], result of:",
                  "details" : [/* score calculation details omitted */]
                },
                {
                  "value" : 0.21978492,
                  "description" : "weight(name:gates in 0) [PerFieldSimilarity], result of:",
                  "details" : [/* score calculation details omitted */]
                }
              ]
            }
          ]
        }
      }
    ]
  }
}
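The "max plus 0.9 times others of:" lines in the explain output above can be checked by hand. The sketch below plugs the reported subquery scores into max + tie_breaker × others and also estimates the tie_breaker value at which the two documents would swap places. The helper name and variable names are hypothetical, and the break-even value is my own back-of-the-envelope calculation, not something ES reports.

```python
def dis_max_score(scores, tie_breaker):
    """Best score plus tie_breaker times the sum of all the other scores."""
    best = max(scores)
    return best + tie_breaker * (sum(scores) - best)


# Subquery scores copied from the explain output above.
melinda = [0.22920427, 0.59891266]   # name, introduction
bill = [0.5754429 + 0.21978492]      # name only (bill + gates)

melinda_score = dis_max_score(melinda, 0.9)
bill_score = dis_max_score(bill, 0.9)
print(f"Melinda Gates: {melinda_score:.7f}")
print(f"Bill Gates:    {bill_score:.7f}")

# Melinda's score grows linearly with tie_breaker while Bill's stays fixed,
# so the ranking flips where the two scores are equal.
break_even = (bill_score - max(melinda)) / (sum(melinda) - max(melinda))
print(f"ranking flips at tie_breaker of about {break_even:.4f}")
```

For this tiny index, any tie_breaker below the break-even value keeps the Bill Gates document on top, which is why 0.9 is far too generous here.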

That is all for this article.

References