背景：需要从 nginx 日志中分析出一个列表页中各种商品曝光次数。数据存储在 hive 表里面。

核心字段有 3 个，request （用户请求参数），response(接口返回)，url(接口地址)，这 3 个都是 string 类型

列表接口的 response 大致长这个样子

{
  "errno": 0,
  "errmsg": "SUCCESS",
  "data": {
    "list": [
      {
        "id": 123,
        "name": "xxx",
        "other": {
          "xxx": 1
        }
      },
      {
        "id": 456,
        "name": "xxx",
        "other": {
          "xxx": 1
        }
      },
      {
        "id": 789,
        "name": "xxx",
        "other": {
          "xxx": 1
        }
      },
      {
        "id": 222,
        "name": "xxx",
        "other": {
          "xxx": 1
        }
      }
    ],
    "page":{
      "current_page":1,
      "total_page":2151,
      "count":10,
      "total_count":21508
    }
  }
}

我目前的思路：

step1:通过 get_json_object 函数取到 response 中的 list 字段

step2:正则替换掉数组的第一个'['和最后一个']'

step3:把数据分割成多个 json object 数组

step4:把 json object 数组转成列数据

step5: 再使用 get_json_object 函数取出每一行的 id

step6: group by id 然后计数

目前写出来的 SQL

select
    j.json_detail
from
    (
        select
            split(
                regexp_replace(
                    regexp_extract(
                        get_json_object(
                            response,
                            '$.data.list'
                        ),
                        '^\[(.+)\]$',
                        1
                    ),
                    '\}\}\,\{',
                    '\}\}\|\|\{'
                ),
                '\\|\\|'
            ) as info
        from
            nginx_log
        where
            concat_ws('-', year, month, day) = '${query_time}'
            and url = 'product_list'
    ) m lateral view explode(info) j as json_detail;

目前这个 SQL 写到第 4 步就不能运行了，报错

执行用时：2 分钟 22 秒
INFO  : Number of reduce tasks determined at compile time: 1
INFO  : In order to change the average load for a reducer (in bytes):
INFO  :   set hive.exec.reducers.bytes.per.reducer=
INFO  : In order to limit the maximum number of reducers:
INFO  :   set hive.exec.reducers.max=
INFO  : In order to set a constant number of reducers:
INFO  :   set mapreduce.job.reduces=
INFO  : number of splits:2
INFO  : Submitting tokens for job: job_xxx_xxx
INFO  : The url to track the job: http://xxxxx/xxxx
INFO  : Starting Job = job_xxx_xxx, Tracking URL = xxx
INFO  : Kill Command = /usr/local/hadoop-current/bin/hadoop job  -killjob_xxx_xxx
INFO  : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
INFO  : 2021-10-20 19:13:29,451 Stage-1 map = 0%,  reduce = 0%
INFO  : 2021-10-20 19:14:14,588 Stage-1 map = 100%,  reduce = 100%
ERROR : Ended Job = job_xxx_xxx with errors
Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask

stackoverflow 这个贴子说这个报错并没有指出哪里出错了，但是我也不知道去哪里查详细的日志。

最后题外话：数据分析可真难啊，我这个 CUDR boy 以前写的 SQL 最多不超过 10 行，这几天兼职数据分析师，写 SQL 动不动就几百行，我头发都要用完了

彦祖们，求指点下这个查询 SQL 该怎么写呀

最后题外话：数据分析可真难啊，我这个 CUDR boy 以前写的 SQL 最多不超过 10 行，这几天兼职数据分析师，写 SQL 动不动就几百行，我头发都要用完了