@
deweixu @
Features 按照 #3 楼的第二种统计方式,用 SQLite 测试了生成整张表、统计整张表(文末附上源码)。结果如下:
日期范围 新用户数 消费记录数 生成用时 统计用时 内存使用
————————————————————————————————————
30天 300万 2600万 30秒 2.4秒(单线程) 3MB
30天 3000万 2.6亿 300秒 7.7秒(四线程) 14MB
(环境:i5-8250U 轻薄本,Windows 10 。感觉速度和内存占用表现都还可以)
## 数据生成规则(以 30 天内 300W 用户 2600W 消费记录为例):
1. 每天新增 10W 用户 *(第一天新增 `user_id` 为 `[1, 10W]`,第二天新增 `uid` 为 `[10W+1, 20W]`,……)*
2. `uid` 为 `0` 的是老用户,在起始日期前一天( 1999-12-31 )注册 *(用于检查统计时,是否已把老用户数据剔除在外)*
3. 每个用户连续 10 天,每天充值 1 元 *(`uid = 0` 的老用户每天都在充值)*
4. 从第一天开始,每两天投广告 100W 元 *(即,2000-01-01 、2000-01-03 、……)*
## 统计结果预览(以 30 天内 300W 用户 2600W 消费记录为例):
日期 当天新用户收入 累计新用户收入 累计广告投入 ROI
———————————————————————————————————
01-01 10W 10W 100W 10.00%
01-02 20W 30W 100W 30.00%
01-03 30W 60W 200W 30.00%
01-04 40W 100W 200W 50.00%
01-05 50W 150W 300W 50.00%
01-06 60W 210W 300W 70.00%
01-07 70W 280W 400W 70.00%
01-08 80W 360W 400W 90.00%
01-09 90W 450W 500W 90.00%
01-10 100W 550W 500W 110.00%
01-11 100W 650W 600W 108.33%
01-12 100W 750W 600W 125.00%
……
01-28 100W 2350W 1400W 167.86%
01-29 100W 2450W 1500W 163.33%
01-30 100W 2550W 1500W 170.00%
## 源码使用方式:
去 SQLite 官网下载个 1 MB 的 sqlite3.exe ,然后保存下面的 SQLite 代码为 main.sql ,然后命令行运行:
```shell
sqlite3.exe data.db < main.sql
```
多线程用到了 Python 。在 sqlite3.exe 生成数据库后,可直接运行
## SQLite 建表和统计(单线程)代码:
*( V 站排版原因,行首有全角空格)*
```sql
PRAGMA journal_mode = off; -- 取消日志记录。这会输出个 off 。。
PRAGMA synchronous = off; -- 提交写请求给操作系统后,就可继续后续计算
.param init
-- 投资数据生成配置(日期间隔、每次投资额、日期范围)
.param set $INVEST_INTERVAL_DAYS 2
.param set $INVEST_AMOUNT_PER_DAY 1000000
.param set $INVEST_START_DATE "'2000-01-01'"
.param set $INVEST_END_DATE "'2000-01-30'"
-- 用户消费数据生成配置(消费天数、每日新增用户数、日期范围)
.param set $CONSUME_DAYS 10
.param set $DAILY_NEW_USERS 100000
.param set $CONSUME_START_DATE "'2000-01-01'"
.param set $CONSUME_END_DATE "'2000-01-30'"
-- 查询数据配置
.param set $QUERY_START_DATE "'2000-01-01'"
.param set $QUERY_END_DATE "'2000-01-30'"
-- 建表:投资表
CREATE TABLE invest (
date DATE PRIMARY KEY,
amount INT
);
-- 建表:消费记录表
CREATE TABLE consume (
uid INT,
date DATE,
reg_date DATE,
amount INT,
PRIMARY KEY (date, reg_date, uid)
) WITHOUT ROWID;
-- 添加投资数据:在指定日期范围内,每 INVEST_INTERVAL_DAYS 天投 INVEST_AMOUNT_PER_DAY 元
INSERT INTO invest (date, amount)
SELECT day.value, $INVEST_AMOUNT_PER_DAY
FROM generate_series(unixepoch($INVEST_START_DATE) / 86400, unixepoch($INVEST_END_DATE) / 86400, $INVEST_INTERVAL_DAYS) day;
-- 添加消费记录
INSERT INTO consume (amount, uid, date, reg_date)
-- 1. 从起始日期前一天开始,user_id = 0 的老用户,每天消费 1 元,直至结束日期
SELECT 1, 0, date.value, unixepoch($CONSUME_START_DATE, '-1 day') / 86400
FROM generate_series(unixepoch($CONSUME_START_DATE, '-1 day') / 86400, unixepoch($CONSUME_END_DATE) / 86400) date
UNION ALL
-- 2. 在指定日期范围内,每天有 DAILY_NEW_USERS 名新用户,连续 CONSUME_DAYS 天消费 1 元
SELECT 1,
user.value,
unixepoch($CONSUME_START_DATE, (day.value - 1) || ' days') / 86400,
unixepoch($CONSUME_START_DATE, ((user.value - 1) / $DAILY_NEW_USERS) || ' days') / 86400
FROM generate_series(1, (unixepoch($CONSUME_END_DATE) - unixepoch($CONSUME_START_DATE)) / 86400 + 1) day
JOIN generate_series(MAX(0, day.value - $CONSUME_DAYS) * $DAILY_NEW_USERS + 1, day.value * $DAILY_NEW_USERS) user;
-- 统计:指定日期范围内,新用户投资回报率
-- ( user_id = 0 的用户,在起始日期前一天注册,是老用户,故不会统计)
WITH
-- 每日新用户当天收入表
daily(date, income) AS (
SELECT date, SUM(amount)
FROM consume
WHERE reg_date BETWEEN unixepoch($QUERY_START_DATE) / 86400 AND unixepoch($QUERY_END_DATE) / 86400
GROUP BY date
)
SELECT date(
daily.date * 86400, 'unixepoch') 日期,
income 当天新用户收入,
SUM(income) OVER win 累计新用户收入,
SUM(invest.amount) 累计广告投入,
FORMAT('%.2f%%', SUM(income) OVER win * 100.0 / SUM(invest.amount)) ROI
FROM daily
LEFT JOIN invest ON
invest.date BETWEEN unixepoch($QUERY_START_DATE) / 86400 AND
daily.date GROUP BY
daily.dateWINDOW win AS (ORDER BY
daily.date);
```
## Python 多线程统计代码:
*( V 站排版原因,行首有全角空格)*
```python
import time
import sqlite3
from contextlib import closing
from datetime import date, timedelta
from concurrent.futures import ThreadPoolExecutor
THREADS = 4 # 线程数
DB_FILE = 'data.db' # 数据库路径地址
QUERY_START_DATE = '2000-01-01'
QUERY_END_DATE = '2000-01-30'
def sub(days):
with closing(sqlite3.connect(DB_FILE)) as db:
return db.execute('''
SELECT date, SUM(amount)
FROM consume
WHERE date = strftime('%s', ?) / 86400
AND reg_date BETWEEN strftime('%s', ?) / 86400 AND strftime('%s', ?) / 86400
''', [
str(date.fromisoformat(QUERY_START_DATE) + timedelta(days=days)),
QUERY_START_DATE,
QUERY_END_DATE,
]).fetchone()
def main():
with closing(sqlite3.connect(DB_FILE)) as db, ThreadPoolExecutor(max_workers=THREADS) as executor:
begin = time.time()
data = list(executor.map(sub, range((date.fromisoformat(QUERY_END_DATE) - date.fromisoformat(QUERY_START_DATE)).days + 1)))
db.execute('CREATE TEMP TABLE daily (date DATE PRIMARY KEY, income INT)')
db.executemany('INSERT INTO daily VALUES (?, ?)', data)
cursor = db.execute('''
SELECT date(
daily.date * 86400, 'unixepoch') 日期,
income 当天新用户收入,
SUM(income) OVER win 累计新用户收入,
SUM(invest.amount) 累计广告投入,
PRINTF('%.2f%%', SUM(income) OVER win * 100.0 / SUM(invest.amount)) ROI
FROM daily
LEFT JOIN invest ON
invest.date BETWEEN strftime('%s', ?) / 86400 AND
daily.date GROUP BY
daily.date WINDOW win AS (ORDER BY
daily.date)
''', [QUERY_START_DATE])
print(
f'Finished in {time.time() - begin:.2f} sec. Result:',
[col[0] for col in cursor.description],
*cursor,
sep='\n',
)
if __name__ == '__main__':
main()
```