轻量异步爬虫框架 aspider ,基于 asyncio
对于单页面,只要实现框架定义的 Item 就好:
import asyncio
from aspider import AttrField, TextField, Item
class HackerNewsItem(Item):
target_item = TextField(css_select='tr.athing')
title = TextField(css_select='a.storylink')
url = AttrField(css_select='a.storylink', attr='href')
async def clean_title(self, value):
return value
items = asyncio.get_event_loop().run_until_complete(HackerNewsItem.get_items(url="https://news.ycombinator.com/"))
for item in items:
print(item.title, item.url)
Notorious ‘ Hijack Factory ’ Shunned from Web https://krebsonsecurity.com/2018/07/notorious-hijack-factory-shunned-from-web/
......
对于多页面的网站,使用 Spider 即可:
import aiofiles
from aspider import AttrField, TextField, Item, Spider
class HackerNewsItem(Item):
target_item = TextField(css_select='tr.athing')
title = TextField(css_select='a.storylink')
url = AttrField(css_select='a.storylink', attr='href')
async def clean_title(self, value):
return value
class HackerNewsSpider(Spider):
start_urls = ['https://news.ycombinator.com/', 'https://news.ycombinator.com/news?p=2']
async def parse(self, res):
items = await HackerNewsItem.get_items(html=res.html)
for item in items:
async with aiofiles.open('./hacker_news.txt', 'a') as f:
await f.write(item.title + '\n')
if __name__ == '__main__':
HackerNewsSpider.start()
[2018-07-11 17:50:12,430]-aspider-INFO Spider started!
[2018-07-11 17:50:12,430]-Request-INFO <GET: https://news.ycombinator.com/>
[2018-07-11 17:50:12,456]-Request-INFO <GET: https://news.ycombinator.com/news?p=2>
[2018-07-11 17:50:14,785]-aspider-INFO Time usage: 0:00:02.355062
[2018-07-11 17:50:14,785]-aspider-INFO Spider finished!
同样支持 js 加载:
request = Request("https://www.jianshu.com/", load_js=True)
response = asyncio.get_event_loop().run_until_complete(request.fetch())
print(response.body)
在 Item 以及 Spider 中要是想加载 js,同样只要带上 load_js=True 即可
项目 Github 地址:aspider
这是一个专为移动设备优化的页面(即为了让你能够在 Google 搜索结果里秒开这个页面),如果你希望参与 V2EX 社区的讨论,你可以继续到 V2EX 上打开本讨论主题的完整版本。
V2EX 是创意工作者们的社区,是一个分享自己正在做的有趣事物、交流想法,可以遇见新朋友甚至新机会的地方。
V2EX is a community of developers, designers and creative people.