@
la0wei #15 装个[油猴插件](
https://greasyfork.org/zh-CN/scripts/449771-v2ex 评论 markdown 支持),就能渲染 markdown 内容了。
## 1. 为什么你的 network 时间这么长?
我认为,主要是你把数据库里的内容,全部传输到 Python ,导致长时间耗时在数据传输上。
所以,让数据库自己计算出最终结果,再保存成 csv ,应该能节省很多时间。
*(除非数据库计算性能太差,还不如传数据让 Python 来)*
## 2. SQLite 可在 30 秒内,处理一亿数据,生成所需 CSV ( 9 字段,2000 行)
### 2.1 说明
我用 SQLite 进行了下测试。预先生成**一亿**行随机数据。各字段取值范围:
- **devicecode**:[1000001, 1002000]
- **date**:[2023-01-01 00:00:00, 2023-06-25 00:00:00]
- **code**:[9520, 9529]
- **status**:[1, 4]
然后假设 CSV 的设备编号是 1000001 ~ 1002000 ,让 SQLite 计算如下字段的值:
- 设备编号
- 3 个月内数据量
- 3 个月内上传量
- 1 个月内数据量
- 1 个月内上传量
- 1 个月内 9527 数据量
- 1 个月内 9527 上传量
- 3 个月内最新数据
- 3 个月内最新上传数据
### 2.2 结果
```
[ 0.001s] 开始建表……
[319.384s] 建表完成。正在查询……
[347.006s] 查询完毕!前后五行结果:
(1000001, 26306, 6608, 8786, 2222, 827, 199, '{"业务":9524,"时间":"2023-06-24 23:59:27"}', '{"业务":9524,"时间":"2023-06-24 23:59:27"}')
(1000002, 26351, 6651, 8703, 2201, 854, 228, '{"业务":9524,"时间":"2023-06-24 23:58:17"}', '{"业务":9524,"时间":"2023-06-24 23:58:17"}')
(1000003, 26297, 6655, 8755, 2225, 891, 218, '{"业务":9529,"时间":"2023-06-24 23:47:57"}', '{"业务":9527,"时间":"2023-06-24 23:37:26"}')
(1000004, 26502, 6576, 8812, 2208, 901, 239, '{"业务":9521,"时间":"2023-06-24 23:53:41"}', '{"业务":9521,"时间":"2023-06-24 23:53:41"}')
(1000005, 26225, 6520, 8766, 2128, 902, 219, '{"业务":9527,"时间":"2023-06-24 23:49:21"}', '{"业务":9524,"时间":"2023-06-24 23:19:29"}')
……
(1001996, 26328, 6663, 8853, 2282, 899, 251, '{"业务":9521,"时间":"2023-06-24 23:51:59"}', '{"业务":9528,"时间":"2023-06-24 23:38:47"}')
(1001997, 26186, 6633, 8699, 2234, 914, 242, '{"业务":9527,"时间":"2023-06-24 23:57:44"}', '{"业务":9521,"时间":"2023-06-24 23:31:51"}')
(1001998, 25887, 6418, 8727, 2111, 897, 220, '{"业务":9521,"时间":"2023-06-24 23:46:42"}', '{"业务":9521,"时间":"2023-06-24 23:46:42"}')
(1001999, 26192, 6397, 8686, 2108, 859, 201, '{"业务":9521,"时间":"2023-06-24 23:57:37"}', '{"业务":9529,"时间":"2023-06-24 23:05:04"}')
(1002000, 26070, 6470, 8841, 2166, 857, 220, '{"业务":9528,"时间":"2023-06-24 23:59:51"}', '{"业务":9523,"时间":"2023-06-24 23:21:12"}')
```
### 2.3 Python 代码
运行下面代码大约需要 2.5 GB 内存*(因为直接在内存里建表了)*。
由于 v 站 会吃掉行首空格,所以我替换成了全角空格。若要运行,记得替换回来。
```python
import time
import sqlite3
start_time = time.time()
def debug(s):
print(f'[{time.time() - start_time:7.3f}s] {s}')
db = sqlite3.connect(':memory:')
db.execute('''
CREATE TABLE data(
id INT,
devicecode INT,
date TIMESTAMP,
code INT,
status INT,
PRIMARY KEY (devicecode, date, id)
) WITHOUT ROWID
''')
debug('开始建表……')
db.execute('''
WITH
num10(num) AS (
VALUES (0), (1), (2), (3), (4), (5), (6), (7), (8), (9)
),
num10000(num) AS (
SELECT a.num * 1000 + b.num * 100 + c.num * 10 + d.num
FROM num10 a, num10 b, num10 c, num10 d
),
generate_series(value) AS (
SELECT a.num * 10000 + b.num
FROM num10000 a, num10000 b
)
INSERT INTO data
SELECT
value,
abs(random() % 2000) + 1000001,
abs(random() % (SELECT unixepoch('2023-06-25') - unixepoch('2023-01-01'))) + unixepoch('2023-01-01'),
abs(random() % 10) + 9520,
abs(random() % 4) + 1
FROM generate_series
''')
debug('建表完成。正在查询……')
rows = db.execute('''
WITH
csv(设备编号) AS (
SELECT DISTINCT devicecode
FROM data
)
SELECT
devicecode 设备编号,
count(*) "3 个月内数据量",
IFNULL(SUM(status = 1), 0) "3 个月内上传量",
IFNULL(SUM(date >= (SELECT unixepoch('now', '-1 month'))), 0) "1 个月内数据量",
IFNULL(SUM(date >= (SELECT unixepoch('now', '-1 month')) AND status = 1), 0) "1 个月内上传量",
IFNULL(SUM(date >= (SELECT unixepoch('now', '-1 month')) AND code = 9527), 0) "1 个月内 9527 数据量",
IFNULL(SUM(date >= (SELECT unixepoch('now', '-1 month')) AND code = 9527 AND status = 1), 0) "1 个月内 9527 上传量",
(
SELECT json_object('业务', code, '时间', datetime(date, 'unixepoch'))
FROM data AS inner
WHERE inner.devicecode = data.devicecode
AND
inner.date >= (SELECT unixepoch('now', '-3 month'))
ORDER BY date DESC
LIMIT 1
) "3 个月内最新数据",
(
SELECT json_object('业务', code, '时间', datetime(date, 'unixepoch'))
FROM data AS inner
WHERE inner.devicecode = data.devicecode
AND
inner.date >= (SELECT unixepoch('now', '-3 month'))
AND inner.status = 1
ORDER BY date DESC
LIMIT 1
) "3 个月内最新上传数据"
FROM data
WHERE devicecode IN (SELECT 设备编号 FROM csv)
AND date >= unixepoch('now', '-3 month')
GROUP BY 1
''').fetchall()
debug('查询完毕!前后五行结果:')
print(*rows[:5], '……', *rows[-5:], sep='\n')
```