https://github.com/intohole/xspider is yet another case of reinventing the wheel! But let's get familiar with it together.
main.py:
from xspider.spider.spider import BaseSpider
from xspider.filters import urlfilter
from kuailiyu import KuaiLiYu

if __name__ == "__main__":
    spider = BaseSpider(name="kuailiyu", page_processor=KuaiLiYu(),
                        allow_site=["kuailiyu.cyzone.cn"],
                        start_urls=["http://kuailiyu.cyzone.cn/"])
    # URL filter: article pages and paginated index pages.
    spider.url_filters.append(urlfilter.UrlRegxFilter([
        r"kuailiyu.cyzone.cn/article/[0-9]*\.html$",
        r"kuailiyu.cyzone.cn/index_[0-9]+\.html$"]))
    spider.start()
kuailiyu.py:
from xspider import processor
from xspider.selector import xpath_selector
from xspider import model

class KuaiLiYu(processor.PageProcessor.PageProcessor):
    def __init__(self):
        super(KuaiLiYu, self).__init__()
        # Selector that pulls the text of the page's <title> element.
        self.title_extractor = xpath_selector.XpathSelector(path="//title/text()")

    def process(self, page, spider):
        # Collect the extracted fields for this page.
        items = model.fileds.Fileds()
        items["title"] = self.title_extractor.find(page)
        items["url"] = page.url
        return items
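
If you want to check which URLs the two patterns in main.py would let through before running a crawl, here is a rough sketch using only the standard re module. It assumes UrlRegxFilter applies its patterns with search semantics (a match anywhere in the URL), which I have not verified against the xspider source. The helper name url_allowed and the sample URLs are made up for illustration, and the patterns are written as raw strings with the literal dots escaped.

```python
import re

# The two patterns from main.py, as raw strings with literal dots escaped.
PATTERNS = [
    r"kuailiyu\.cyzone\.cn/article/[0-9]*\.html$",
    r"kuailiyu\.cyzone\.cn/index_[0-9]+\.html$",
]
COMPILED = [re.compile(p) for p in PATTERNS]


def url_allowed(url):
    """Hypothetical helper: True if any pattern matches somewhere in the URL.

    Assumes search semantics; xspider's own filter may behave differently.
    """
    return any(p.search(url) for p in COMPILED)


if __name__ == "__main__":
    # Sample URLs are invented for illustration only.
    samples = [
        "http://kuailiyu.cyzone.cn/article/12345.html",
        "http://kuailiyu.cyzone.cn/index_2.html",
        "http://kuailiyu.cyzone.cn/",
        "http://kuailiyu.cyzone.cn/article/12345.html?page=2",
    ]
    for url in samples:
        print(url, url_allowed(url))
```

Note that `[0-9]*` in the article pattern also accepts zero digits, so a URL ending in `article/.html` would pass; switching it to `[0-9]+` would require at least one digit.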