async + await image crawler: timeout errors on the last few images?

    alexred · 2018-03-19 14:23:21 +08:00 · 3797 views

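    I recently tried to speed up my wallpaper crawler with coroutines, but the crawler gets slower and slower toward the end and always fails to download the last few images. If I download over my phone's hotspot (which is much faster), every image downloads successfully and the problem does not occur.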

    import os
    import aiohttp
    import asyncio
    
    # List of image URLs
    pic_list=["http://img1.gamersky.com/image2017/03/20170304_zl_91_5/gamersky_01origin_01_2017341744BB4.jpg", "http://img1.gamersky.com/image2017/03/20170304_zl_91_5/gamersky_02origin_03_2017341744FC9.jpg", "http://img1.gamersky.com/image2017/03/20170304_zl_91_5/gamersky_03origin_05_2017341744751.jpg", "http://img1.gamersky.com/image2017/03/20170304_zl_91_5/gamersky_04origin_07_2017341744DD9.jpg", "http://img1.gamersky.com/image2017/03/20170304_zl_91_5/gamersky_05origin_09_2017341744561.jpg", "http://img1.gamersky.com/image2017/03/20170304_zl_91_5/gamersky_06origin_11_2017341744BE9.jpg", "http://img1.gamersky.com/image2017/03/20170304_zl_91_5/gamersky_07origin_13_20173417443A2.jpg", "http://img1.gamersky.com/image2017/03/20170304_zl_91_5/gamersky_08origin_15_2017341744ABD.jpg", "http://img1.gamersky.com/image2017/03/20170304_zl_91_6/gamersky_02origin_03_2017341746E13.jpg", "http://img1.gamersky.com/image2017/03/20170304_zl_91_6/gamersky_03origin_05_201734174635D.jpg", "http://img1.gamersky.com/image2017/03/20170304_zl_91_6/gamersky_05origin_09_2017341746DF9.jpg", "http://img1.gamersky.com/image2017/03/20170304_zl_91_6/gamersky_06origin_11_20173417465EA.jpg", "http://img1.gamersky.com/image2017/03/20170304_zl_91_7/gamersky_01origin_01_2017341747B6B.jpg", "http://img1.gamersky.com/image2017/03/20170304_zl_91_7/gamersky_02origin_03_20173417474FD.jpg", "http://img1.gamersky.com/image2017/03/20170304_zl_91_7/gamersky_03origin_05_20173417479AF.jpg", "http://img1.gamersky.com/image2017/03/20170304_zl_91_7/gamersky_04origin_07_2017341747F66.jpg", "http://img1.gamersky.com/image2017/03/20170304_zl_91_7/gamersky_05origin_09_20173417476EE.jpg", "http://img1.gamersky.com/image2017/03/20170304_zl_91_7/gamersky_06origin_11_2017341747D76.jpg", "http://img1.gamersky.com/image2017/03/20170304_zl_91_7/gamersky_07origin_13_20173417474FE.jpg", "http://img1.gamersky.com/image2017/03/20170304_zl_91_7/gamersky_08origin_15_2017341747B86.jpg", 
"http://img1.gamersky.com/image2017/03/20170304_zl_91_8/gamersky_01origin_01_20173417485A9.jpg", "http://img1.gamersky.com/image2017/03/20170304_zl_91_8/gamersky_02origin_03_2017341748B5D.jpg", "http://img1.gamersky.com/image2017/03/20170304_zl_91_8/gamersky_03origin_05_20173417482D8.jpg", "http://img1.gamersky.com/image2017/03/20170304_zl_91_8/gamersky_04origin_07_201734174812C.jpg", "http://img1.gamersky.com/image2017/03/20170304_zl_91_8/gamersky_06origin_11_2017341748D6B.jpg", "http://img1.gamersky.com/image2017/03/20170304_zl_91_8/gamersky_07origin_13_2017341748527.jpg", "http://img1.gamersky.com/image2017/03/20170304_zl_91_8/gamersky_08origin_15_2017341748C4C.jpg", "http://img1.gamersky.com/image2017/03/20170304_zl_91_9/gamersky_01origin_01_2017341750ACF.jpg", "http://img1.gamersky.com/image2017/03/20170304_zl_91_9/gamersky_02origin_03_2017341750ED1.jpg", "http://img1.gamersky.com/image2017/03/20170304_zl_91_9/gamersky_04origin_07_2017341750B59.jpg", "http://img1.gamersky.com/image2017/03/20170304_zl_91_9/gamersky_05origin_09_201734175013F.jpg", "http://img1.gamersky.com/image2017/03/20170304_zl_91_9/gamersky_06origin_11_20173417507C7.jpg", "http://img1.gamersky.com/image2017/03/20170304_zl_91_9/gamersky_07origin_13_2017341750E4F.jpg", "http://img1.gamersky.com/image2017/03/20170304_zl_91_10/gamersky_01origin_01_2017341753936.jpg", "http://img1.gamersky.com/image2017/03/20170304_zl_91_10/gamersky_02origin_03_2017341753144.jpg", "http://img1.gamersky.com/image2017/03/20170304_zl_91_10/gamersky_03origin_05_201734175369F.jpg", "http://img1.gamersky.com/image2017/03/20170304_zl_91_10/gamersky_04origin_07_2017341753EFD.jpg", "http://img1.gamersky.com/image2017/03/20170304_zl_91_10/gamersky_05origin_09_2017341753412.jpg", "http://img1.gamersky.com/image2017/03/20170304_zl_91_10/gamersky_06origin_11_201734175329A.jpg"]
    
    # Download a single image to local disk
    async def get_html(url):
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
        async with aiohttp.ClientSession() as session:
            async with session.get(url, headers=headers) as r:
                path = os.path.join('wallpapers', url.split('/')[-1])
                fp = open(path, 'wb')
                fp.write(await r.read())
                fp.close()
    
    
    def get_many(urls):
    
        loop = asyncio.get_event_loop()
        tasks = [get_html(url) for url in urls]
    
        loop.run_until_complete(asyncio.wait(tasks))
        loop.close()
        
        
    get_many(pic_list)
    

    It raises the following error:

    Task exception was never retrieved
    future: <Task finished coro=<get_html() done, defined at D:/PycharmProjects/learning/asy_img.py:92> exception=TimeoutError()>
    Traceback (most recent call last):
      File "D:/PycharmProjects/learning/asy_img.py", line 99, in get_html
        fp.write(await r.read())
      File "C:\WinPython-64bit-3.6.3.0Qt5\python-3.6.3.amd64\lib\site-packages\aiohttp\client_reqrep.py", line 798, in read
        self._content = await self.content.read()
      File "C:\WinPython-64bit-3.6.3.0Qt5\python-3.6.3.amd64\lib\site-packages\aiohttp\streams.py", line 312, in read
        block = await self.readany()
      File "C:\WinPython-64bit-3.6.3.0Qt5\python-3.6.3.amd64\lib\site-packages\aiohttp\streams.py", line 328, in readany
        await self._wait('readany')
      File "C:\WinPython-64bit-3.6.3.0Qt5\python-3.6.3.amd64\lib\site-packages\aiohttp\streams.py", line 250, in _wait
        await waiter
      File "C:\WinPython-64bit-3.6.3.0Qt5\python-3.6.3.amd64\lib\site-packages\aiohttp\helpers.py", line 661, in __exit__
        raise asyncio.TimeoutError from None
    concurrent.futures._base.TimeoutError
    

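    I would appreciate an explanation of the cause and suggestions for optimization, thanks (or could some expert write an async crawler for the images above QAQ).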

    11 replies    2018-03-20 12:23:12 +08:00
    #1 · qfdk · 2018-03-19 14:54:31 +08:00 via iPhone
    I thought this was about JS. Since it's Python, I'll pass.
    #2 · JRyan · 2018-03-19 16:02:42 +08:00
    The request probably timed out. Catch the exception and retry.
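
A minimal sketch of the catch-and-retry idea from the reply above, using only the standard library. The helper name `retry` and its parameters `make_coro`, `attempts`, and `base_delay` are hypothetical and not from the thread; it assumes the download is wrapped in a zero-argument callable that returns a fresh coroutine, since a coroutine object cannot be awaited twice.

```python
import asyncio


async def retry(make_coro, attempts=3, base_delay=0.5):
    """Await make_coro() up to `attempts` times, retrying only on timeout."""
    for attempt in range(1, attempts + 1):
        try:
            return await make_coro()
        except asyncio.TimeoutError:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the timeout
            # Wait a little longer after each failed attempt.
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
```

With the `get_html` coroutine from the question, a call could look like `await retry(lambda: get_html(url))`. Only `asyncio.TimeoutError` is retried here; other exceptions propagate unchanged.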
    #3 · sunnyadamm · 2018-03-19 16:30:58 +08:00
    The request timed out. It's a network problem: switch networks, or just skip the last few images.
    #4 · zzj0311 · 2018-03-19 16:45:26 +08:00 via Android
    Maybe you triggered anti-crawling measures (who is going to read through a dump like this for you?)
    #5 · alexred (OP) · 2018-03-19 18:55:31 +08:00
    @zzj0311 It's not anti-crawling, because no error occurs when I use my phone's hotspot.
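    #6 · WillianZhang · 2018-03-19 20:31:35 +08:00
    Python with async is the right idea.
    In general, though, for crawler problems you need to hand-write a whole manager plus a large set of rules.

    For the specific error here, see the last line:
    `raise asyncio.TimeoutError from None`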
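
One possible reading of the symptoms, not confirmed anywhere in this thread: `get_many` in the question starts every download at once, so on a slow link all transfers share the bandwidth and the ones still running at the end may exceed aiohttp's client timeout. The faster hotspot finishing everything in time would be consistent with that. Below is a minimal sketch of the "manager" idea that caps how many downloads are in flight. `run_bounded` and `limit` are hypothetical names, and only the standard library is used.

```python
import asyncio


async def run_bounded(coro_funcs, limit=5):
    """Run zero-argument coroutine functions with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(func):
        # Each job waits here until one of the `limit` slots is free.
        async with semaphore:
            return await func()

    # gather returns results in the same order as coro_funcs.
    return await asyncio.gather(*(guarded(f) for f in coro_funcs))
```

Inside a coroutine, the question's downloads could be passed as `await run_bounded([lambda url=url: get_html(url) for url in pic_list], limit=5)`. The value 5 is an arbitrary starting point to tune against the available bandwidth.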
    #7 · asuraa · 2018-03-19 21:07:35 +08:00
    Catch the timeout error, then either pass or retry.
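
The "Task exception was never retrieved" message in the question appears because `asyncio.wait` does not retrieve the exceptions of the tasks it waits on. A minimal sketch of collecting them instead, so a timeout can be skipped or retried deliberately: `gather_report` and `named_coros` are hypothetical names, and the input is assumed to be a dict mapping a label (for example the URL) to a coroutine object.

```python
import asyncio


async def gather_report(named_coros):
    """Await every coroutine and split the outcomes into successes and failures."""
    names = list(named_coros)
    # return_exceptions=True hands exceptions back as results instead of raising.
    outcomes = await asyncio.gather(*named_coros.values(), return_exceptions=True)
    ok, failed = {}, {}
    for name, outcome in zip(names, outcomes):
        if isinstance(outcome, BaseException):
            failed[name] = outcome
        else:
            ok[name] = outcome
    return ok, failed
```

With the question's code this might be called as `ok, failed = await gather_report({url: get_html(url) for url in pic_list})`. The keys of `failed` are then the URLs to retry or to drop.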
    #8 · locoz · 2018-03-19 21:43:33 +08:00 via Android
    It's a timeout. Just add a retry mechanism.
    #9 · gouchaoer · 2018-03-19 22:25:57 +08:00 via Android
    async is contagious, just like the GPL. Use multithreading instead.
    #10 · alexred (OP) · 2018-03-19 23:55:10 +08:00
    @gouchaoer Did you mean the GIL?
    #11 · linhanqiu · 2018-03-20 12:23:12 +08:00
    Just add a retry header; see the retry mechanism in aiohttp 3.0.