矩阵统计聚合

矩阵统计聚合 #

matrix_stats 矩阵统计聚合是一个多值指标聚合,以矩阵形式为两个或多个字段生成协方差统计。

matrix_stats 聚合不支持脚本。

参数说明 #

matrix_stats 聚合采用以下参数。

参数必需/可选数据类型描述
field必需String用于计算矩阵统计的一组字段。
missing可选Object用于替代缺失值的值。默认情况下,会忽略缺失值。参见缺失值。
mode可选String用作多值或数组字段样本的值。允许的值是 avgminmaxsummedian 。默认是 avg

参考样例 #

以下示例返回数据中 taxful_total_priceproducts.base_price 字段的统计信息:

GET sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "matrix_stats_taxful_total_price": {
      "matrix_stats": {
        "fields": ["taxful_total_price", "products.base_price"]
      }
    }
  }
}

返回内容包含聚合结果:

{
  "took": 250,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 4675,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "matrix_stats_taxful_total_price": {
      "doc_count": 4675,
      "fields": [
        {
          "name": "products.base_price",
          "count": 4675,
          "mean": 34.99423943014724,
          "variance": 360.5035285833702,
          "skewness": 5.530161335032689,
          "kurtosis": 131.1630632404217,
          "covariance": {
            "products.base_price": 360.5035285833702,
            "taxful_total_price": 846.6489362233169
          },
          "correlation": {
            "products.base_price": 1,
            "taxful_total_price": 0.8444765264325269
          }
        },
        {
          "name": "taxful_total_price",
          "count": 4675,
          "mean": 75.05542864304839,
          "variance": 2788.1879749835425,
          "skewness": 15.812149139923994,
          "kurtosis": 619.1235507385886,
          "covariance": {
            "products.base_price": 846.6489362233169,
            "taxful_total_price": 2788.1879749835425
          },
          "correlation": {
            "products.base_price": 0.8444765264325269,
            "taxful_total_price": 1
          }
        }
      ]
    }
  }
}

下表描述了返回内容的字段。

统计描述
count用于聚合的文档数量。
mean从样本计算得到的字段平均值。
variance均值偏差的平方,衡量数据分布的离散程度。
skewness衡量分布相对于均值的不对称性指标。
kurtosis衡量分布尾部重量的指标。随着尾部变轻,峰度减小。通过评估峰度和偏度来判断一个总体是否可能呈正态分布。
covariance衡量两个字段之间联合变差的指标。正值表示它们的值朝同一方向变动。
correlation标准化协方差,用于衡量两个字段之间关系强度的指标。可能的值范围从-1 到 1(包含两端),表示完全负相关到完全正相关。值为 0 表示变量之间没有可识别的关系。

缺省值 #

要定义如何处理缺失值,请使用 missing 参数。默认情况下,会忽略缺失值。

例如,创建一个索引,其中文档 1 缺少 gpaclass_grades 字段:

POST _bulk
{ "create": { "_index": "students", "_id": "1" } }
{ "name": "John Doe" }
{ "create": { "_index": "students", "_id": "2" } }
{ "name": "Jonathan Powers", "gpa": 3.85, "class_grades": [3.0, 3.9, 4.0] }
{ "create": { "_index": "students", "_id": "3" } }
{ "name": "Jane Doe", "gpa": 3.52, "class_grades": [3.2, 2.1, 3.8] }

首先,运行不带 missing 参数的 matrix_stats 聚合:

GET students/_search
{
  "size": 0,
  "aggs": {
    "matrix_stats_taxful_total_price": {
      "matrix_stats": {
        "fields": [
          "gpa",
          "class_grades"
        ],
        "mode": "avg"
      }
    }
  }
}

在计算矩阵统计时忽略缺失值:

{
  "took": 5,
  "timed_out": false,
  "terminated_early": true,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "matrix_stats_taxful_total_price": {
      "doc_count": 2,
      "fields": [
        {
          "name": "gpa",
          "count": 2,
          "mean": 3.684999942779541,
          "variance": 0.05444997482300096,
          "skewness": 0,
          "kurtosis": 1,
          "covariance": {
            "gpa": 0.05444997482300096,
            "class_grades": 0.09899998760223136
          },
          "correlation": {
            "gpa": 1,
            "class_grades": 0.9999999999999991
          }
        },
        {
          "name": "class_grades",
          "count": 2,
          "mean": 3.333333333333333,
          "variance": 0.1800000381469746,
          "skewness": 0,
          "kurtosis": 1,
          "covariance": {
            "gpa": 0.09899998760223136,
            "class_grades": 0.1800000381469746
          },
          "correlation": {
            "gpa": 0.9999999999999991,
            "class_grades": 1
          }
        }
      ]
    }
  }
}

要设置缺失字段为 0 ,请将 missing 参数作为键值映射提供。尽管 class_grades 是数组字段,但 matrix_stats 聚合会将多值数字字段展平为每个文档的平均值,因此您必须将单个数字作为缺失值提供:

GET students/_search
{
  "size": 0,
  "aggs": {
    "matrix_stats_taxful_total_price": {
      "matrix_stats": {
        "fields": ["gpa", "class_grades"],
        "mode": "avg",
        "missing": {
          "gpa": 0,
          "class_grades": 0
        }
      }
    }
  }
}

在计算矩阵统计时,会用 0 替换任何缺失的 gpaclass_grades 值:

{
  "took": 23,
  "timed_out": false,
  "terminated_early": true,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "matrix_stats_taxful_total_price": {
      "doc_count": 3,
      "fields": [
        {
          "name": "gpa",
          "count": 3,
          "mean": 2.456666628519694,
          "variance": 4.55363318017324,
          "skewness": -0.688130006360758,
          "kurtosis": 1.5,
          "covariance": {
            "gpa": 4.55363318017324,
            "class_grades": 4.143944374667273
          },
          "correlation": {
            "gpa": 1,
            "class_grades": 0.9970184390038257
          }
        },
        {
          "name": "class_grades",
          "count": 3,
          "mean": 2.2222222222222223,
          "variance": 3.793703722777191,
          "skewness": -0.6323693521730989,
          "kurtosis": 1.5000000000000002,
          "covariance": {
            "gpa": 4.143944374667273,
            "class_grades": 3.793703722777191
          },
          "correlation": {
            "gpa": 0.9970184390038257,
            "class_grades": 1
          }
        }
      ]
    }
  }
}