V2EX  ›  Python

An optimization question about a Python script that reads and analyzes large files

airbob · 2014-02-27 16:53:36 +08:00 · 8391 views
My boss wants a script to analyze several files. Each file has about 28,000,000 lines (700 MB - 1 GB), and there are six similar files.
The task: for each user ID, sum the data that falls in the same month, same weekday, and same hour, then write the results out.
For example, the data at 02:00 on May 1 and 02:00 on May 8 should be added together; if a time slot has no data, output 0.
(The files cover May through December 2012.)

The script is written, and on small test files it runs fine and produces correct results. The problem is the large files: it has been running for about 20 hours and is only about a tenth of the way through, so at this rate it would take four or five days to finish.

I suspect two possible bottlenecks: one is file read/write I/O, the other is that my for loops (iterating over May-December, Monday-Sunday, hours 00-23) spend too long processing. I haven't found a way to optimize it yet.

The file format and the current script are in the gist below. Could someone take a look?

Background

Below is the input file format (*.txt):

    userID month date hour totalTW totalQs result
    21535110 05 01 02 3 2 1
    21535110 05 01 03 3 2 1
    21535110 05 01 06 1 0 0
    21535110 05 02 02 1 0 0
    21535110 05 03 05 3 2 0
    21535112 05 01 05 1 1 1

In total there are 28,000,000 lines in the file, and I have 6 files of this kind.

Objective

Write a script to process the input data:
for each user, sum up the data (totalTW, totalQs, result) within the same month, same day of the week, and same hour.
Let's say there are lines like this (the year is 2012):

    userID month date hour totalTW totalQs result
    21535110 05 01 02 3 2 1
    21535110 05 08 02 2 1 0

then these 2 data points should be summed, since both belong to a Tuesday in May and the hour is 02:

    userID month day hour totalTW totalQs result
    21535110 05 Tue 02 5 3 1
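A quick check with the standard library confirms both dates land in the same bucket:

```python
import datetime

# 2012-05-01 and 2012-05-08 both fall on a Tuesday, so rows from those
# dates with hour 02 collapse into the (month=05, day=Tue, hour=02) slot.
days = [datetime.datetime.strptime(d, "%Y-%m-%d").strftime("%a")
        for d in ("2012-05-01", "2012-05-08")]
print(days)  # ['Tue', 'Tue']
```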

    Problem

The week.py script I added to this gist works; the problem is that it seems too slow.
I ran it on a lab server for ~20 hours and it is currently at line 2,300,000 (only about 10%!).
Is there any way to optimize this script?

#!/usr/bin/python
import time
import datetime

'''
This is a modified version of week.py.
I raised this question in http://v2ex.com/t/102160 and,
thanks to the v2ex fellows, the bottleneck was mainly due to the 3 for loops (quite a dummy mistake).
With this version of the script, execution time has been reduced tremendously to ~10-20 mins, which fits my need for now.
'''

def main():
    start_time = time.time()
    weekday = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
    inputfilename = "./test_refine/test.txt"
    outputfilename = "test_week/" + inputfilename[14:]
    print outputfilename
    open_file = open(inputfilename, 'r')
    contents = open_file.readlines()
    to_file = open(outputfilename, 'w')
    i = 0
    totalLines = len(contents)
    while i < totalLines:
        outputCONTENT = ""
        print i
        if i == totalLines - 1:
            print time.time() - start_time, "seconds"
            return
        if i > 0:
            user = contents[i].split()[0]
            # find the range [i, markIndex) of consecutive lines for this user
            j = i
            nextFlag = 1
            while nextFlag == 1 and j < totalLines:
                user2 = contents[j].split()[0]
                if user != user2:
                    nextFlag = 0
                j = j + 1
            markIndex = j
            ## do the main check: a single pass over this user's lines,
            ## accumulating sums into dicts keyed by (month, weekday, hour)
            totalTW = {}
            totalQS = {}
            totalResult = {}
            for z in range(i, markIndex):
                fields = contents[z].split()
                tweetmonth, tweetday, tweethour = fields[1], fields[2], fields[3]
                tweetTW, tweetQS, tweetResult = fields[4], fields[5], fields[6]
                tweetdate = "%s-%s-%s" % ("2012", tweetmonth, tweetday)
                dayOfWeek = datetime.datetime.strptime(tweetdate, "%Y-%m-%d").strftime('%a')
                key = "%s%s%s" % (tweetmonth, dayOfWeek, tweethour)
                if key in totalTW:
                    totalTW[key] += int(tweetTW)
                    totalQS[key] += int(tweetQS)
                    totalResult[key] += int(tweetResult)
                else:
                    totalTW[key] = int(tweetTW)
                    totalQS[key] = int(tweetQS)
                    totalResult[key] = int(tweetResult)
            # emit one line per (month, weekday, hour) slot, zeros if no data
            for month in range(5, 13):
                for day in weekday:
                    for hour in range(0, 24):
                        key = "%02d%s%02d" % (month, day, hour)
                        if key in totalTW:
                            lineoutput = "%s\t%02d\t%s\t%02d\t%s\t%s\t%s\n" % (user, month, day, hour, totalTW[key], totalQS[key], totalResult[key])
                        else:
                            lineoutput = "%s\t%02d\t%s\t%02d\t%s\t%s\t%s\n" % (user, month, day, hour, 0, 0, 0)
                        outputCONTENT = outputCONTENT + lineoutput
            i = markIndex - 1
        else:
            # first line is the header; copy it through unchanged
            outputCONTENT = outputCONTENT + contents[0]
        i = i + 1
        to_file.write(outputCONTENT)
    to_file.close()
    open_file.close()

if __name__ == "__main__":
    main()
#!/usr/bin/python
import time
import datetime

'''
weekday of each month
'''

def main():
    weekday = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
    ## read stats file and filter not existing users
    inputfilename = "input.txt"
    outputfilename = "output.txt"
    print outputfilename
    open_file = open(inputfilename, 'r')
    contents = open_file.readlines()
    to_file = open(outputfilename, 'w')
    i = 0
    totalLines = len(contents)
    print "going to while loop"
    while i < totalLines:
        outputCONTENT = ""
        print i
        if i == totalLines - 1:
            return
        if i > 0:
            user = contents[i].split()[0]
            # find the range [i, markIndex) of consecutive lines for this user
            j = i
            nextFlag = 1
            while nextFlag == 1 and j < totalLines:
                user2 = contents[j].split()[0]
                if user != user2:
                    nextFlag = 0
                j = j + 1
            markIndex = j
            # for every (month, weekday, hour) slot, rescan ALL of this
            # user's lines -- this nested rescan is the bottleneck
            for month in range(5, 13):
                for day in weekday:
                    for hour in range(0, 24):
                        totalTW = 0
                        totalQS = 0
                        totalResult = 0
                        for z in range(i, markIndex):
                            fields = contents[z].split()
                            tweetmonth, tweetday, tweethour = fields[1], fields[2], fields[3]
                            tweetTW, tweetQS, tweetResult = fields[4], fields[5], fields[6]
                            tweetdate = "%s-%s-%s" % ("2012", tweetmonth, tweetday)
                            dayOfWeek = datetime.datetime.strptime(tweetdate, "%Y-%m-%d").strftime('%a')
                            if day in dayOfWeek and hour == int(tweethour) and month == int(tweetmonth):
                                totalTW += int(tweetTW)
                                totalQS += int(tweetQS)
                                totalResult += int(tweetResult)
                        lineoutput = "%s\t%02d\t%s\t%02d\t%s\t%s\t%s\n" % (user, month, day, hour, totalTW, totalQS, totalResult)
                        outputCONTENT = outputCONTENT + lineoutput
            i = markIndex - 1
        else:
            # first line is the header; copy it through unchanged
            outputCONTENT = outputCONTENT + contents[0]
        i = i + 1
        to_file.write(outputCONTENT)
    to_file.close()
    open_file.close()

if __name__ == "__main__":
    main()



Thanks!
15 replies
1 · wangfengmadking · 2014-02-27 17:11:55 +08:00
For large files like this, it's best to use:

with open("file_name", 'r') as input:
    for line in input:
        # process
2 · JerryKwan · 2014-02-27 17:14:42 +08:00
@airbob
For large files, it's best not to use readlines.
3 · exch4nge · 2014-02-27 17:19:35 +08:00
Question: does every user need an output line even for time slots with no records?

Your loop runs (number of months) * 7 * 24 * (this user's record count) iterations in total; of course that's slow.
You should loop over the user's records instead: parse each record, work out which month / weekday / hour it belongs to, and add the record's values onto the running totals.
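The single-pass accumulation suggested above can be sketched like this, using a dict keyed by (user, month, weekday, hour); the sample rows are hypothetical but follow the input format:

```python
import datetime
from collections import defaultdict

# hypothetical sample rows: userID, month, date, hour, totalTW, totalQs, result
rows = [
    ("21535110", "05", "01", "02", 3, 2, 1),
    ("21535110", "05", "08", "02", 2, 1, 0),
]

totals = defaultdict(lambda: [0, 0, 0])
for user, month, date, hour, tw, qs, result in rows:
    # one strptime-equivalent per record, one dict update -- no nested loops
    day = datetime.date(2012, int(month), int(date)).strftime("%a")
    key = (user, month, day, hour)
    totals[key][0] += tw
    totals[key][1] += qs
    totals[key][2] += result

print(dict(totals))  # {('21535110', '05', 'Tue', '02'): [5, 3, 1]}
```

This visits each record exactly once; the 8 * 7 * 24 output grid is then filled from the dict at the end.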
4 · airbob (OP) · 2014-02-27 17:28:16 +08:00
@wangfengmadking @JerryKwan OK, I'll try with open("file_name", 'r') as input: and time it.
5 · airbob (OP) · 2014-02-27 17:34:11 +08:00
@exch4nge Yes, the requirement is to log even when there is no record, so every user has the same number of data points.

You're right, good point! There's indeed no need for the 3 for loops. I can define a 2D array (3 x 8*7*24) initialized to zero to hold the records, then make a single pass and add each matching record in.

Thanks a lot!
6 · family · 2014-02-27 17:37:33 +08:00
First split the file into sub-files that are convenient to compute over -- say, 24 files split by time -- then do the statistics.
7 · exch4nge · 2014-02-27 17:53:35 +08:00
@airbob A few suggestions:

- Python supports assignments like a, b = ['a', 'b'].
- You can allocate an array for the records, but then you need a string-to-index mapping in between. Using a dict saves a lot of trouble.
- Strings in Python are immutable, so every outputCONTENT = outputCONTENT + contents[0] seems to grow memory use. In your case it's fine since memory is sufficient, but with larger data you'd need to watch out for memory.
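On the last point, collecting the pieces in a list and joining once avoids the repeated copying that string += causes (a minimal illustration, not the thread's script):

```python
# Each += on a string copies the whole accumulated string (O(n^2) overall);
# appending to a list and joining once at the end is linear.
pieces = []
for i in range(5):
    pieces.append("line %d\n" % i)
output = "".join(pieces)
print(output)
```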
8 · airbob (OP) · 2014-02-27 20:25:03 +08:00
@exch4nge @family Thanks everyone, it's fixed. A run now takes 10-20 minutes, and all 6 files are done. The updated week-modified.py is on the gist~
9 · likuku · 2014-02-28 00:53:25 +08:00
Yes, a dict (or nested dicts) is very convenient for keeping state; lookups and updates are easy too.
10 · lixm · 2014-02-28 09:46:36 +08:00
Import it into MySQL and query it directly with SQL.
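The same idea works without a server via the sqlite3 module in the standard library; the table layout and sample rows below are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stats (user TEXT, month TEXT, day TEXT, hour TEXT, "
    "tw INT, qs INT, result INT)"
)
conn.executemany(
    "INSERT INTO stats VALUES (?, ?, ?, ?, ?, ?, ?)",
    [("21535110", "05", "Tue", "02", 3, 2, 1),
     ("21535110", "05", "Tue", "02", 2, 1, 0)],
)
# let the database do the grouping and summing
rows = conn.execute(
    "SELECT user, month, day, hour, SUM(tw), SUM(qs), SUM(result) "
    "FROM stats GROUP BY user, month, day, hour"
).fetchall()
print(rows)  # [('21535110', '05', 'Tue', '02', 5, 3, 1)]
```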
11 · oio · 2014-02-28 10:16:01 +08:00
Let me give it a try too.... doesn't work on out-of-order data....

https://gist.github.com/onia/9263792
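itertools.groupby merges only consecutive items with the same key, which is presumably why this approach needs the input already sorted by userID. A minimal sketch with hypothetical sample lines:

```python
import itertools

lines = [
    "21535110 05 01 02 3 2 1",
    "21535110 05 08 02 2 1 0",
    "21535112 05 01 05 1 1 1",
]

# group consecutive lines by the first field (userID);
# out-of-order data would split one user into several groups
groups = [(user, len(list(records)))
          for user, records in itertools.groupby(lines, key=lambda l: l.split()[0])]
print(groups)  # [('21535110', 2), ('21535112', 1)]
```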
12 · exch4nge · 2014-02-28 16:14:22 +08:00
@oio Learned about csv, groupby, and other fancy stuff.
13 · anguskwan · 2014-02-28 17:10:24 +08:00
Import it into MySQL ^ ^
14 · oio · 2014-03-01 14:33:56 +08:00
@exch4nge I always find it hard to keep every line under 79 characters per PEP 8. Any tips....
15 · exch4nge · 2014-03-01 22:57:19 +08:00
@oio 2-space indentation + sensible line breaks?