A problem reading JSON data in Python

2023-02-23 14:26:24 +08:00
 0littleboy

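Recently, while processing JSON data with Pandas, I ran into ValueError: Protocol not known.

Switching to the json library fixed it, but I don't understand why.

The JSON is just a data object wrapping contestUpcomingContests, which in turn holds an array with two elements.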

import json
import pandas as pd

data = '{"data":{"contestUpcomingContests":[{"containsPremium":false,"title":"\u7b2c 99 \u573a\u53cc\u5468\u8d5b","cardImg":"https://assets.leetcode.cn/aliyun-lc-upload/contest-config/biweekly-contest-99/contest_detail/pc_card.png","titleSlug":"biweekly-contest-99","startTime":1677940200,"duration":5400,"originStartTime":1677940200},{"containsPremium":false,"title":"\u7b2c 334 \u573a\u5468\u8d5b","cardImg":"https://assets.leetcode.cn/aliyun-lc-upload/contest-config/weekly-contest-334/contest_detail/pc_card.png","titleSlug":"weekly-contest-334","startTime":1677378600,"duration":5400,"originStartTime":1677378600}]}}'

df1 = json.loads(data)
print(df1)
df2 = pd.read_json(data)
print(df2)
{'data': {'contestUpcomingContests': [{'containsPremium': False, 'title': '第 99 场双周赛', 'cardImg': 'https://assets.leetcode.cn/aliyun-lc-upload/contest-config/biweekly-contest-99/contest_detail/pc_card.png', 'titleSlug': 'biweekly-contest-99', 'startTime': 1677940200, 'duration': 5400, 'originStartTime': 1677940200}, {'containsPremium': False, 'title': '第 334 场周赛', 'cardImg': 'https://assets.leetcode.cn/aliyun-lc-upload/contest-config/weekly-contest-334/contest_detail/pc_card.png', 'titleSlug': 'weekly-contest-334', 'startTime': 1677378600, 'duration': 5400, 'originStartTime': 1677378600}]}}
Traceback (most recent call last):
  File "/Users/world/Developer/AlgorithmSharkSpider/test.py", line 8, in <module>
    df2 = pd.read_json(data)
  File "/opt/homebrew/lib/python3.9/site-packages/pandas/util/_decorators.py", line 199, in wrapper
    return func(*args, **kwargs)
  File "/opt/homebrew/lib/python3.9/site-packages/pandas/util/_decorators.py", line 299, in wrapper
    return func(*args, **kwargs)
  File "/opt/homebrew/lib/python3.9/site-packages/pandas/io/json/_json.py", line 540, in read_json
    json_reader = JsonReader(
  File "/opt/homebrew/lib/python3.9/site-packages/pandas/io/json/_json.py", line 622, in __init__
    data = self._get_data_from_filepath(filepath_or_buffer)
  File "/opt/homebrew/lib/python3.9/site-packages/pandas/io/json/_json.py", line 659, in _get_data_from_filepath
    self.handles = get_handle(
  File "/opt/homebrew/lib/python3.9/site-packages/pandas/io/common.py", line 558, in get_handle
    ioargs = _get_filepath_or_buffer(
  File "/opt/homebrew/lib/python3.9/site-packages/pandas/io/common.py", line 333, in _get_filepath_or_buffer
    file_obj = fsspec.open(
  File "/opt/homebrew/lib/python3.9/site-packages/fsspec/core.py", line 419, in open
    return open_files(
  File "/opt/homebrew/lib/python3.9/site-packages/fsspec/core.py", line 272, in open_files
    fs, fs_token, paths = get_fs_token_paths(
  File "/opt/homebrew/lib/python3.9/site-packages/fsspec/core.py", line 574, in get_fs_token_paths
    chain = _un_chain(urlpath0, storage_options or {})
  File "/opt/homebrew/lib/python3.9/site-packages/fsspec/core.py", line 315, in _un_chain
    cls = get_filesystem_class(protocol)
  File "/opt/homebrew/lib/python3.9/site-packages/fsspec/registry.py", line 208, in get_filesystem_class
    raise ValueError("Protocol not known: %s" % protocol)
ValueError: Protocol not known: {"data":{"contestUpcomingContests":[{"containsPremium":false,"title":"第 99 场双周赛","cardImg":"https
1698 views
Node: Python
2 replies
bomb77
2023-02-23 14:41:24 +08:00
python3.8
pandas 1.5.3
Tested it and got no error.

Maybe upgrade your versions, or check whether it's an encoding issue or something like that?
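
Since the result seems to depend on what is installed, it may help to print the versions on both machines before comparing notes. A minimal sketch using only the standard library; the helper name `installed_version` is made up for this example:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str) -> str:
    """Return the installed version of a package, or a placeholder if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return "not installed"


# pandas and fsspec are the two packages that appear in the traceback.
for name in ("pandas", "fsspec"):
    print(f"{name}: {installed_version(name)}")
```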
dcopen
2023-02-23 15:00:19 +08:00
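This happens when you call Pandas' read_json(): the function relies on the fsspec library for file handling and reading. As far as I can tell, fsspec introduced a new URL parsing mechanism after version 0.9.0, and that is what triggers this error.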
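
The last line of the traceback gives a hint: the text reported as the unknown protocol is everything in the JSON string before the first `://`, which comes from the `cardImg` link. The sketch below imitates that behavior with plain string operations. It is an assumption about how such a check might work, not a copy of the pandas or fsspec source; `looks_like_remote_path` and `guessed_protocol` are hypothetical names, and the sample string is a shortened stand-in for the real data.

```python
# Shortened, illustrative stand-in for the JSON string in the question.
data = '{"data":{"contestUpcomingContests":[{"cardImg":"https://example.com/pc_card.png"}]}}'


def looks_like_remote_path(text: str) -> bool:
    """Hypothetical check: any string containing '://' that is not plain HTTP(S)."""
    return "://" in text and not text.startswith(("http://", "https://"))


def guessed_protocol(text: str) -> str:
    """Everything before the first '://', i.e. what a URL parser would call the protocol."""
    return text.split("://", 1)[0]


# The JSON string passes the naive check, so it would be treated as a path.
print(looks_like_remote_path(data))
# The "protocol" has the same shape as the text in the ValueError message.
print(guessed_protocol(data))
```

If this reading is right, a JSON string with no `://` anywhere in it would not hit this code path, which may be why some inputs load without trouble.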

When working with JSON data, you can use the json library to convert it into a Python dict, then use Pandas' json_normalize() function to flatten it into a DataFrame. Here is an example:

```
import json
import pandas as pd

data = '{"data":{"contestUpcomingContests":[{"containsPremium":false,"title":"第 99 场双周赛","cardImg":"https://assets.leetcode.cn/aliyun-lc-upload/contest-config/biweekly-contest-99/contest_detail/pc_card.png","titleSlug":"biweekly-contest-99","startTime":1677940200,"duration":5400,"originStartTime":1677940200},{"containsPremium":false,"title":"第 334 场周赛","cardImg":"https://assets.leetcode.cn/aliyun-lc-upload/contest-config/weekly-contest-334/contest_detail/pc_card.png","titleSlug":"weekly-contest-334","startTime":1677378600,"duration":5400,"originStartTime":1677378600}]}}'

data_dict = json.loads(data)
df = pd.json_normalize(data_dict, record_path=['data', 'contestUpcomingContests'])
print(df)

```

Output:
```
containsPremium title cardImg titleSlug startTime duration originStartTime
0 False 第 99 场双周赛 https://assets.leetcode.cn/aliyun-lc-upload/co... biweekly-contest-99 1677940200 5400 1677940200
1 False 第 334 场周赛 https://assets.leetcode.cn/aliyun-lc-upload/co... weekly-contest-334 1677378600 5400 1677378600

```
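
Another option, not mentioned in the thread, is to keep read_json() but wrap the string in io.StringIO, so that it is handled as a file-like object rather than as a path or URL. This is a sketch that assumes your pandas version accepts file-like input here; the sample data is shortened and the `raw` and `records` names are illustrative.

```python
import io

import pandas as pd

# Shortened sample with the same nesting as the data in the question.
data = '{"data":{"contestUpcomingContests":[{"titleSlug":"biweekly-contest-99","startTime":1677940200},{"titleSlug":"weekly-contest-334","startTime":1677378600}]}}'

# A file-like object is read directly, so its contents are never parsed as a path.
raw = pd.read_json(io.StringIO(data))

# read_json does not flatten nested objects: the whole array sits in a single cell.
records = raw.loc["contestUpcomingContests", "data"]

# Build the final table from the list of contest dicts.
df = pd.DataFrame(records)
print(df)
```

Note that read_json() alone still leaves the nested array unflattened, so json_normalize() as shown above is probably the more direct route for this particular payload.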
