How can I make this one line of Python code run faster?

2017-05-13 21:39:25 +08:00

kingmo888

end <str>:'2013-01-22'

window <num>:10

tmp_data = data[data['date']<=end].tail(window*251)

The code above runs in a loop roughly 2,000 times and accounts for more than 85% of the total execution time. How can I speed up this one line?

Timings from one run (in seconds):

Total loop time: 107.49908785695135

Time spent on this line: 100.13774995734286

Share of total time: 93.152%

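PS:

1. I tried trimming the front of data as the loop progresses, but that turned out even slower.

2. The original index was a datetime. On a whim I assigned date to the index, ran the computation, and then restored the original index afterwards. That roughly doubled the speed:

Total loop time: 45.289391494105644

Time spent on this line: 37.87854096882086

Share of total time: 83.637%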
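
For readers wondering how a date index could be put to work here: if the rows are already sorted by date (an assumption, the post does not say so), a label slice on the index can replace the boolean mask. The sketch below uses made-up sample data, and `by_date` and `n_rows` are illustrative names only; it is not necessarily what was done in the post.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: three rows per business day,
# dates stored as 'YYYY-MM-DD' strings and already sorted.
days = pd.date_range("2012-01-02", periods=400, freq="B").strftime("%Y-%m-%d")
data = pd.DataFrame({"date": days.repeat(3), "value": np.arange(len(days) * 3)})

end = "2013-01-22"
n_rows = 50  # stands in for window * 251

# Reference result: the one-liner from the post.
expected = data[data["date"] <= end].tail(n_rows)

# Done once, before the loop: move the date column into the index.
by_date = data.set_index("date")
assert by_date.index.is_monotonic_increasing  # label slicing below relies on this

# Inside the loop: slice up to and including `end`, then keep the last rows.
tmp_data = by_date.loc[:end].tail(n_rows)
```

Note that `date` is now the index rather than a column, so downstream code would need `reset_index()` if it expects the original layout.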

15 replies

O14

2017-05-13 22:24:22 +08:00

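Sort by the date field, binary-search for the position of end, and slice backwards from there.
It's hard to say more without the full code. I'd suggest dropping Pandas first to see where the bottleneck is.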
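
A minimal sketch of that suggestion, assuming the dates are 'YYYY-MM-DD' strings (so string order matches date order) and that the frame can be sorted once before the loop. The helper `last_rows_up_to`, the name `n_rows`, and the sample data are made up for illustration.

```python
import numpy as np
import pandas as pd

def last_rows_up_to(data, sorted_dates, end, n_rows):
    """Return the last n_rows rows whose date is <= end.

    Assumes `data` is sorted by date and `sorted_dates` is that column
    as a NumPy array, built once outside the loop.
    """
    # Binary search: position just past the last date <= end.
    stop = np.searchsorted(sorted_dates, end, side="right")
    # Clamp at 0 so a short history does not turn into a negative index.
    start = max(stop - n_rows, 0)
    return data.iloc[start:stop]

# Hypothetical sample data: random dates, then sorted once up front.
rng = np.random.default_rng(0)
days = pd.date_range("2010-01-01", "2014-12-31", freq="D").strftime("%Y-%m-%d").to_numpy()
data = pd.DataFrame({"date": rng.choice(days, size=10_000), "value": rng.random(10_000)})
data = data.sort_values("date", kind="mergesort").reset_index(drop=True)

# Fixed-width string array, so the search compares plain strings.
sorted_dates = data["date"].to_numpy().astype("U10")

end = "2013-01-22"
window = 10
tmp_data = last_rows_up_to(data, sorted_dates, end, window * 251)
```

Each lookup then costs a binary search plus a positional slice instead of a full comparison over every row.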

EmdeBoas

2017-05-13 22:48:12 +08:00

...I don't know how to post images on V2EX, so here is the code directly. 5,000 iterations finish almost instantly.
# Python 2 (xrange, print statement)
import datetime
import numpy as np

if __name__ == '__main__':
    array = np.array([])
    for i in xrange(5000):
        year = np.random.randint(100) + 2000
        month = np.random.randint(12) + 1
        day = np.random.randint(27) + 1
        array = np.append(array, datetime.datetime(year, month, day))
    end = datetime.datetime(2050, 5, 3)
    print array[array[:]>end][-5:]

kingmo888

2017-05-13 23:09:23 +08:00

@EmdeBoas Hi, the data has over a million rows.

kingmo888

2017-05-13 23:21:17 +08:00

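@O14 I suspect the bottleneck is simply that data has so many rows.
@EmdeBoas Also, the date field is a str. Converting it to an np.array, finding the matching subset, and then converting back to str costs a lot of time along the way.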

herozhang

2017-05-13 23:23:56 +08:00

Parallelize it?

kingmo888

2017-05-13 23:25:20 +08:00

@herozhang This line is already part of a sub-function that runs in parallel, so further parallelization is off the table.

EmdeBoas

2017-05-13 23:34:49 +08:00

...I bumped the data size to 1,000,000 rows and tried again:
In [8]: def test():
   ...:     array = []
   ...:     for i in xrange(1000000):
   ...:         year = np.random.randint(100) + 2000
   ...:         month = np.random.randint(12) + 1
   ...:         day = np.random.randint(28) + 1
   ...:         array.append(datetime.datetime(year, month, day))
   ...:     narray = np.array(array)
   ...:     flag = datetime.datetime(2015, 5, 3)
   ...:     print narray[narray[:]<=flag][-5:]
   ...:

In [9]: %timeit test()
[datetime.datetime(2012, 12, 11, 0, 0) datetime.datetime(2009, 5, 26, 0, 0)
datetime.datetime(2014, 6, 12, 0, 0) datetime.datetime(2008, 11, 23, 0, 0)
datetime.datetime(2010, 12, 12, 0, 0)]
[datetime.datetime(2009, 4, 8, 0, 0) datetime.datetime(2013, 2, 10, 0, 0)
datetime.datetime(2008, 2, 4, 0, 0) datetime.datetime(2010, 11, 2, 0, 0)
datetime.datetime(2005, 8, 27, 0, 0)]
[datetime.datetime(2004, 6, 19, 0, 0) datetime.datetime(2010, 5, 7, 0, 0)
datetime.datetime(2012, 5, 15, 0, 0) datetime.datetime(2012, 6, 7, 0, 0)
datetime.datetime(2000, 1, 5, 0, 0)]
[datetime.datetime(2014, 11, 6, 0, 0) datetime.datetime(2005, 9, 15, 0, 0)
datetime.datetime(2008, 11, 5, 0, 0) datetime.datetime(2007, 6, 9, 0, 0)
datetime.datetime(2003, 11, 10, 0, 0)]
1 loop, best of 3: 4.18 s per loop

kingmo888

2017-05-13 23:40:47 +08:00

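@EmdeBoas Thanks sent. This still doesn't solve my problem by a wide margin (probably because my skills are lacking, haha), but thank you all the same.

data is a pandas DataFrame, so slicing the df is part of the problem. Here is what I'm trying now:

Before the loop: a=np.array(data['date'])
Inside the loop:
b=a[a<=end]

index=len(b)

Then use that row position to look up the rows:
tmp_data = data.loc[data.index[index-window*251:index],:]

This saved about 8 seconds. The whole loop now takes a little over 36s.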
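
Since the date array is already built once before the loop, the cut-off row for every iteration could also be located in a single call. This is only a rough sketch under two assumptions that the thread does not confirm: the rows are sorted by date, and all the `end` values are known up front. `ends`, `n_rows`, and the sample data are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical data: four rows per day, already sorted by date string.
days = pd.date_range("2012-01-01", periods=500, freq="D").strftime("%Y-%m-%d")
data = pd.DataFrame({"date": days.repeat(4), "value": np.arange(len(days) * 4)})
dates = data["date"].to_numpy().astype("U10")  # built once, before the loop

ends = ["2012-03-01", "2012-09-15", "2013-01-22"]  # hypothetical loop cut-offs
n_rows = 300  # stands in for window * 251

# One vectorized binary search for all cut-offs.
stops = np.searchsorted(dates, ends, side="right")
# Clamp at 0: a negative start would silently select the wrong rows.
starts = np.maximum(stops - n_rows, 0)

# Positional slices are cheap; no boolean mask is built per iteration.
windows = [data.iloc[a:b] for a, b in zip(starts, stops)]
```

The clamp matters for early cut-offs: with fewer than `n_rows` matching rows, `index-window*251` goes negative and the slice no longer starts at the first row.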

kingmo888

2017-05-13 23:44:13 +08:00

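@EmdeBoas Since you've offered a hand, let me push my luck (facepalm...)
Do you have any contact info? Could I send you a copy of the data so you can see how far the runtime can be brought down?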

sagaxu

2017-05-13 23:56:13 +08:00

Put the data on a cloud drive and I'll take a look tomorrow.

EmdeBoas

2017-05-13 23:59:51 +08:00

Mate, stop overcomplicating things..... Use loc and iloc sparingly, they are quite inefficient.
In [17]: def test():
    ...:     array = []
    ...:     other = []
    ...:     for i in xrange(1000000):
    ...:         other.append(i)
    ...:         year = np.random.randint(100) + 2000
    ...:         month = np.random.randint(12) + 1
    ...:         day = np.random.randint(28) + 1
    ...:         array.append(datetime.datetime(year, month, day))
    ...:     narray = np.array(array)
    ...:     flag = datetime.datetime(2015, 5, 3)
    ...:     df = pd.DataFrame()
    ...:     df['date'] = narray
    ...:     df['other'] = other
    ...:     print df[narray[:]<=flag][-5:]
    ...:

In [18]: %timeit test()
date other
999979 2006-07-05 999979
999980 2012-09-19 999980
999981 2010-05-13 999981
999990 2007-10-14 999990
999996 2008-10-19 999996
date other
999979 2002-08-01 999979
999983 2001-10-01 999983
999984 2007-04-05 999984
999988 2014-04-21 999988
999991 2008-01-06 999991
date other
999977 2004-05-04 999977
999981 2004-05-05 999981
999990 2003-10-04 999990
999991 2003-03-28 999991
999992 2002-12-09 999992
date other
999964 2006-12-13 999964
999970 2012-07-07 999970
999971 2009-12-15 999971
999976 2004-07-22 999976
999982 2009-11-14 999982
1 loop, best of 3: 4.58 s per loop

In [19]:

EmdeBoas

2017-05-14 00:01:40 +08:00

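@kingmo888 (⊙﹏⊙)....I don't do Python for a living (I've never even used Python's parallel libraries). I've just done a lot of mathematical modeling contests, so I'm fairly familiar with numpy and pandas.... I might be able to help you with Hadoop on Java, though...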

herozhang

2017-05-15 14:58:09 +08:00

With 1,000,000 rows, on my MBP:
Single core:
7.87 s ± 175 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
4 cores in parallel:
1.88 s ± 9.23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

herozhang

2017-05-15 15:00:16 +08:00

For a further speedup, maybe consider PyPy?

ruoyu0088

2017-05-16 21:38:10 +08:00

What varies across the 2,000 iterations? What computations happen inside the loop?

For the one line you posted, the approach below may be faster.

data.take(np.where((data["date"] <= end).values)[0][-window*251:])
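
A small self-contained check of the `take` idea, written with `np.flatnonzero` (equivalent to `np.where(...)[0]` for a 1-D mask). It does not need sorted data. The sample frame and the name `n_rows` are made up, and whether this is actually faster on the real data would have to be measured.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, deliberately left unsorted.
rng = np.random.default_rng(1)
days = pd.date_range("2012-01-01", "2013-12-31", freq="D").strftime("%Y-%m-%d").to_numpy()
data = pd.DataFrame({"date": rng.choice(days, size=5_000), "value": np.arange(5_000)})

end = "2013-01-22"
n_rows = 100  # stands in for window * 251

# Boolean mask as a plain NumPy array.
mask = (data["date"] <= end).to_numpy()
# Positions of the matching rows; keep only the last n_rows of them.
positions = np.flatnonzero(mask)[-n_rows:]
# take() selects by position and preserves the original row labels.
tmp_data = data.take(positions)
```

The order of the indexing matters: slicing has to happen on the position array itself, otherwise every matching row is returned.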

https://www.v2ex.com/t/361147