[pg15.2] Tuning a chat-history table with hundreds of millions of rows

The SQL query below is very slow...

select count(distinct "public"."tb_groupchat"."chat_id"), count("public"."tb_groupchat"."id"), count(distinct "public"."tb_groupchat"."sender_id") from "public"."tb_groupchat" where ("public"."tb_groupchat"."timestamp" >= $1 and "public"."tb_groupchat"."timestamp" < $2)

Table structure:

  "id"                serial8 PRIMARY KEY NOT NULL,
  "chat_id"           int8                NOT NULL,
  "sender_id"         int8                NOT NULL,
  "timestamp"         int8                NOT NULL

Output of explain:

[
  {
    "Plan": {
      "Node Type": "Aggregate",
      "Strategy": "Plain",
      "Partial Mode": "Simple",
      "Parallel Aware": false,
      "Async Capable": false,
      "Startup Cost": 181904.15,
      "Total Cost": 181904.16,
      "Plan Rows": 1,
      "Plan Width": 24,
      "Output": ["count(DISTINCT chat_id)", "count(id)", "count(DISTINCT sender_id)"],
      "Plans": [
        {
          "Node Type": "Index Scan",
          "Parent Relationship": "Outer",
          "Parallel Aware": false,
          "Async Capable": false,
          "Scan Direction": "Forward",
          "Index Name": "idx_timestamp",
          "Relation Name": "tb_groupchat",
          "Schema": "public",
          "Alias": "tb_groupchat",
          "Startup Cost": 0.43,
          "Total Cost": 172205.39,
          "Plan Rows": 1293168,
          "Plan Width": 24,
          "Output": ["id",  "chat_id", "sender_id", "content", "\"timestamp\""],
          "Index Cond": "((tb_groupchat.\"timestamp\" >= '1684944000000'::bigint) AND (tb_groupchat.\"timestamp\" < '1685030400000'::bigint))"
        }
      ]
    },
    "Query Identifier": 6892608323288585066,
    "JIT": {
      "Functions": 5,
      "Options": {
        "Inlining": false,
        "Optimization": false,
        "Expressions": true,
        "Deforming": true
      }
    }
  }
]
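
The plan above contains only planner estimates (costs and row counts), not measured times. A minimal sketch for capturing actual timings and buffer usage, assuming the same one-day range that appears in the Index Cond. Note that ANALYZE really executes the query, so it takes as long as the slow query itself:

```sql
-- ANALYZE runs the query and reports actual row counts and timings;
-- BUFFERS shows how many pages came from shared buffers vs. disk reads.
EXPLAIN (ANALYZE, BUFFERS, VERBOSE, FORMAT JSON)
SELECT count(DISTINCT chat_id),
       count(id),
       count(DISTINCT sender_id)
FROM   public.tb_groupchat
WHERE  "timestamp" >= 1684944000000
AND    "timestamp" <  1685030400000;
```

Comparing "Actual Rows" with the estimated 1293168 rows, and looking at the read counts under BUFFERS, should show whether the time goes into heap fetches or into the two DISTINCT aggregates.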

Does anyone have any good ideas...

superares

2023-06-01 08:33:55 +08:00

From ChatGPT 4; it sounds somewhat reasonable:
```
Your query may be slow for a few reasons:

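1. You are computing three separate counts, two of which use `DISTINCT`. This requires the database to do extra work to establish which values are unique, which can cause performance problems on a large data set.

2. Your query has a `WHERE` clause, which will scan every row of the table to determine which rows match. This can take a lot of time, especially when the table has a very large number of rows.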

With these issues in mind, I have the following suggestions:

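1. **Indexes:** According to the `EXPLAIN` output, the query is using an index named `idx_timestamp`, but because the query computes DISTINCT counts over `chat_id` and `sender_id`, you may need to create indexes on `chat_id` and `sender_id` to speed up that part of the computation.

2. **Precomputation:** If possible, consider precomputing the results. For example, you could create a trigger that updates the counts whenever a row in `tb_groupchat` is inserted, updated, or deleted. You could then read the counts directly from the precomputed values instead of computing them on every query.

3. **Optimizing the DISTINCT counts:** If you can tolerate some error, you can use an approximate distinct-count method such as the HyperLogLog algorithm, which produces an approximate result with a small memory footprint. Note that this requires PostgreSQL 9.4 or later.

4. **Splitting the query:** Consider splitting the query into several independent queries, each computing a single count. This may reduce the work the database has to do, although it adds complexity to the application.

5. **Hardware upgrade:** If none of the above meets your needs, you may need to consider upgrading hardware, for example adding memory, using faster storage, or adding processor cores.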

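I hope these suggestions help. If you can provide more information about your use case, data volume, and performance requirements, I may be able to offer more specific advice.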
```
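
For what it's worth, the posted plan already does an Index Scan on `idx_timestamp`, but its Output list pulls every column (including `content`) from the heap for an estimated 1.29 million rows. A rough sketch combining the index and query-splitting ideas: a covering index so an index-only scan becomes possible, plus a rewrite that deduplicates in subqueries. The index name `idx_ts_chat_sender` is made up for illustration, the literal bounds are the ones from the Index Cond, and none of this has been tested against this table:

```sql
-- Hypothetical index name (not from the original post). It holds every column
-- the query needs, so the planner may choose an index-only scan instead of
-- visiting the heap; that depends on an up-to-date visibility map (VACUUM).
-- CONCURRENTLY avoids blocking writes but cannot run inside a transaction block.
CREATE INDEX CONCURRENTLY idx_ts_chat_sender
    ON public.tb_groupchat ("timestamp", chat_id, sender_id);

-- Same three numbers as the original query, in the same order.
-- id is the NOT NULL primary key, so count(*) equals count(id).
-- SELECT DISTINCT lets the planner pick a hash aggregate, whereas in
-- PostgreSQL 15 count(DISTINCT ...) sorts its input inside the aggregate node.
SELECT
    (SELECT count(*)
       FROM (SELECT DISTINCT chat_id
               FROM public.tb_groupchat
              WHERE "timestamp" >= 1684944000000
                AND "timestamp" <  1685030400000) AS c) AS chat_count,
    (SELECT count(*)
       FROM public.tb_groupchat
      WHERE "timestamp" >= 1684944000000
        AND "timestamp" <  1685030400000)               AS message_count,
    (SELECT count(*)
       FROM (SELECT DISTINCT sender_id
               FROM public.tb_groupchat
              WHERE "timestamp" >= 1684944000000
                AND "timestamp" <  1685030400000) AS s) AS sender_count;
```

The rewrite reads the range three times, so it only pays off if those reads are cheap index-only scans; worth checking with EXPLAIN (ANALYZE, BUFFERS) against the original before switching.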