What is the most elegant way to traverse nested JSON in Python?

[
  articles: {
    article_1: {
      image: "https://example.com/1.jpg",
      reviews: [
        {
          attach: "https://example32.com/test/22.png"
        },
        {
          attaches: ["https://example12.com/test/23.png", "https://example23.com/test/77.zip"
          report_to: "https://tongji.xxx.xxx/xxx/article_1_review_2"
        }
      ]
    },
    article_2: {
      image: "https://example.com/3.jpg",
      related_posts: [
        {
          attach: ["https://example23.com/test/2113.png", "https://example22.com/test/123/77.zip"
          report_to: "https://tongji.xxx.xxx/xxx/article_3"
        }
      ]
    }
  }
]

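Every item has a completely different format. This is another site's API and it is very non-standard; the above is just a randomly picked excerpt, and some of the dicts in the middle are nested more than ten levels deep. The original requirement is to download all attachments (effectively: traverse every link and ignore the keys, because attach alone appears under 5 different names in the same JSON, probably with more that I missed; I have no idea how their client parses it) and organize them by their original path. For example, https://example.com/test/123/77.zip goes to /test/123/77.zip under the download directory (links like report_to, whose last path segment has no extension, are treated as other APIs and are ignored, not downloaded). After deserializing, recursively walking each dict would be quick to write, but it feels too inelegant, so I'm here to learn whether there is a more elegant way.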

Paths may contain Chinese characters and special symbols; some are URL-encoded and some are not, so extracting them directly with a regex is awkward.
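For the path-mapping part of the requirement, here is a minimal sketch using only the standard library. The function name local_path_for is hypothetical, and the rule "no extension on the last segment means skip" is taken from the description above. It assumes that running unquote() on an already-decoded path is harmless, which breaks if an unencoded file name contains a literal % followed by two hex digits, or a literal # or ?.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlsplit


def local_path_for(url):
    """Return the relative download path for url, or None if it should be skipped."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return None
    # unquote() decodes %XX escapes and leaves plain text alone, so
    # URL-encoded and unencoded paths end up in the same form.
    path = PurePosixPath(unquote(parts.path))
    if not path.suffix:
        return None  # last segment has no extension, e.g. a report_to endpoint
    if ".." in path.parts:
        return None  # refuse paths that could escape the download directory
    return str(path).lstrip("/")
```

Joining the result onto the download directory (for example with pathlib.Path(download_dir) / local_path_for(url)) gives the target file location; the ".." check is an extra precaution that the original post does not ask for.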

6 replies • 2022-12-05 12:57:48 +08:00

ClericPy

2022-11-27 16:55:58 +08:00

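You know JSON has JSONPath, right? Use a relative path in it to look up a field with a specific name directly; it works no matter how many levels deep the nesting goes.

Of the similar tools I have used, these three are the ones I would pick, in order of preference: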

1. jmespath
2. objectpath
3. jsonpath-ng(jsonpath-rw and jsonpath-rw-ext)

The order is a combined estimate of performance and ecosystem.

edis0n0

2022-11-27 16:58:40 +08:00

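A follow-up: the closing square bracket after attaches: is missing because I cut that list short, it was too long. The original JSON is well-formed and can be deserialized with json.load.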

edis0n0

2022-11-27 19:52:49 +08:00

In the end I went with the brute-force approach, re.findall(r'[\'"]http?([^\'"]+)', data). It isn't clean, but it works well enough.

edis0n0

2022-11-27 19:53:17 +08:00

@edis0n0 #3 The parentheses in the regex are in the wrong place; fix that yourself.
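One possible reading of that correction, as a sketch: the pattern as posted makes the "p" optional and captures only what follows it, so the group is moved to cover the whole URL and the "s" is made optional instead. This is a guess at the intended pattern, not the poster's actual code, and the sample string is hypothetical.

```python
import re

# Assumed fix: capture the whole URL, and make the "s" (not the "p") optional.
URL_IN_QUOTES = re.compile(r'[\'"](https?://[^\'"]+)')

# Hypothetical sample standing in for the raw response text.
data = '{"attach": "https://example32.com/test/22.png", "n": 1}'
urls = URL_IN_QUOTES.findall(data)
```

Note that this runs on the raw text, so it would miss or mangle links if the server escapes slashes as \/ or non-ASCII characters as \uXXXX, both of which JSON permits.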

c0nstantien

2022-11-28 17:49:17 +08:00

JSONPath can easily do what you need.

leven87

2022-12-05 12:57:48 +08:00

1. Convert the string to a dict
2. Traverse the dict
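Those two steps can be sketched as a short recursive generator. The name iter_strings and the sample payload are hypothetical; the sample is a well-formed stand-in in the spirit of the one in the question. Keys are ignored entirely, which fits the case where the attachment field goes by several names.

```python
import json


def iter_strings(node):
    """Yield every string value found anywhere inside parsed JSON, ignoring keys."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)
    # numbers, booleans and null cannot hold links, so they are skipped


# Hypothetical, well-formed sample payload.
raw = '''
{"articles": {"article_1": {
    "image": "https://example.com/1.jpg",
    "reviews": [
        {"attach": "https://example32.com/test/22.png"},
        {"attaches": ["https://example12.com/test/23.png"],
         "report_to": "https://tongji.xxx.xxx/xxx/article_1_review_2"}
    ]}}}
'''
links = [s for s in iter_strings(json.loads(raw))
         if s.startswith(("http://", "https://"))]
```

Here links still contains the report_to URL; filtering by file extension would be a separate step. Nesting of a dozen or so levels is well within Python's default recursion limit.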