Callback in Scrapy CrawlSpider rules is never called

2020-03-13 13:14:22 +08:00
 gsz2015
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CrspiderSpider(CrawlSpider):
    name = 'crSpider'
    allowed_domains = ['china-railway.com.cn']
    start_urls = ['http://www.china-railway.com.cn/xwzx/ywsl/']

    rules = (
        Rule(LinkExtractor(allow=r'http://www.china-railway.com.cn/xwzx/[a-zA-Z]+/'), follow=True),
        Rule(LinkExtractor(allow=r'http://www.china-railway.com.cn/xwzx/[a-zA-Z]+/index_\d+.html'), follow=True),
        Rule(LinkExtractor(allow=r'http://www.china-railway.com.cn/xwzx/.+t\d{8}_\d{6}.html'), callback='parse_item')
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        print('-' * 40, 'entered callback', '-' * 40)
        newsName = response.xpath('//h1').get()
        print(newsName)

    # def parse(self, response):
    #     item = {}
    #     print('-' * 40, 'entered parse callback', '-' * 40)
    #     print(response.text)
    #     newsName = response.xpath('//h1').get()
    #     return item

2020-03-13 12:38:25 [scrapy.core.engine] INFO: Spider opened
2020-03-13 12:38:25 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-03-13 12:38:25 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6024
2020-03-13 12:38:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.china-railway.com.cn/xwzx/ywsl/> (referer: None)
2020-03-13 12:38:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.china-railway.com.cn/xwzx/ywsl/> (referer: http://www.china-railway.com.cn/xwzx/ywsl/)
2020-03-13 12:38:26 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://www.china-railway.com.cn/xwzx/ywsl/> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2020-03-13 12:38:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.china-railway.com.cn/xwzx/ywsl/202003/t20200304_101019.html> (referer: http://www.china-railway.com.cn/xwzx/ywsl/)
2020-03-13 12:38:28 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.china-railway.com.cn/xwzx/ywsl/202003/t20200305_101067.html> (referer: http://www.china-railway.com.cn/xwzx/ywsl/)
2020-03-13 12:38:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.china-railway.com.cn/xwzx/ywsl/202003/t20200305_101100.html> (referer: http://www.china-railway.com.cn/xwzx/ywsl/)
2020-03-13 12:38:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.china-railway.com.cn/xwzx/ywsl/202003/t20200306_101120.html> (referer: http://www.china-railway.com.cn/xwzx/ywsl/)
2020-03-13 12:38:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.china-railway.com.cn/xwzx/ywsl/202003/t20200307_101174.html> (referer: http://www.china-railway.com.cn/xwzx/ywsl/)
2020-03-13 12:38:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.china-railway.com.cn/xwzx/ywsl/202003/t20200310_101326.html> (referer: http://www.china-railway.com.cn/xwzx/ywsl/)
2020-03-13 12:38:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.china-railway.com.cn/xwzx/ywsl/202003/t20200311_101362.html> (referer: http://www.china-railway.com.cn/xwzx/ywsl/)
1908 views
Node: Python
5 replies
kasper4649
2020-03-13 13:24:35 +08:00
Maybe add a comma after the third rule as well?
gsz2015
2020-03-13 13:30:51 +08:00
@kasper4649 I tried it both with and without the comma 😂. Could it be a Scrapy 2.0 issue?
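
For context on the comma question: in Python, a trailing comma only changes the meaning of a parenthesised expression when there is a single element. A minimal sketch, using placeholder strings in place of the Rule objects (the names here are illustrative and not from the thread):

```python
# Placeholder strings stand in for Rule(...) objects; the names are illustrative only.
one_without_comma = ('rule_a')   # just a parenthesised string, not a tuple
one_with_comma = ('rule_a',)     # a one-element tuple

three_without_comma = ('rule_a', 'rule_b', 'rule_c')
three_with_comma = ('rule_a', 'rule_b', 'rule_c',)

print(type(one_without_comma).__name__)
print(type(one_with_comma).__name__)
print(three_without_comma == three_with_comma)
```

With three rules, as in the spider above, the tuple is the same either way, which is consistent with the comma making no difference here.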
IanPeverell
2020-03-13 16:12:40 +08:00
Try removing the single quotes. You should be passing a function, not a string.
IanPeverell
2020-03-13 16:26:53 +08:00
@IanPeverell Oh, a string works too (facepalm, runs away)
gsz2015
2020-03-13 16:35:12 +08:00
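@IanPeverell Just solved it. It was a regex problem: the first rule's regex also matches the URLs that the third regex is meant for, so those links were handled by the first rule (which has no callback) and the callback was never called 😂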
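
The overlap can be checked with the standard `re` module alone. The sketch below copies the three patterns from the spider and walks them in order, stopping at the first one that matches, which is roughly how CrawlSpider assigns a link to a rule (a link picked up by an earlier rule is not handed to a later one). The `FIXED_PATTERNS` list and the helper `first_matching_callback` are hypothetical, written for illustration; they are one possible fix and not necessarily what the original poster used.

```python
import re

# Patterns as posted in the question (dots left unescaped, no end anchor).
ORIGINAL_PATTERNS = [
    (r'http://www.china-railway.com.cn/xwzx/[a-zA-Z]+/', None),
    (r'http://www.china-railway.com.cn/xwzx/[a-zA-Z]+/index_\d+.html', None),
    (r'http://www.china-railway.com.cn/xwzx/.+t\d{8}_\d{6}.html', 'parse_item'),
]

# Hypothetical tightened patterns: escaped dots plus a '$' anchor, so the
# section pattern can no longer match a prefix of an article URL.
FIXED_PATTERNS = [
    (r'http://www\.china-railway\.com\.cn/xwzx/[a-zA-Z]+/$', None),
    (r'http://www\.china-railway\.com\.cn/xwzx/[a-zA-Z]+/index_\d+\.html$', None),
    (r'http://www\.china-railway\.com\.cn/xwzx/.+t\d{8}_\d{6}\.html$', 'parse_item'),
]

# An article URL taken from the crawl log in the question.
ARTICLE_URL = 'http://www.china-railway.com.cn/xwzx/ywsl/202003/t20200304_101019.html'


def first_matching_callback(url, patterns):
    """Return (index, callback) of the first pattern that matches url, else None.

    Uses re.search, so a pattern matches if it is found anywhere in the URL.
    """
    for index, (pattern, callback) in enumerate(patterns):
        if re.search(pattern, url):
            return index, callback
    return None


print(first_matching_callback(ARTICLE_URL, ORIGINAL_PATTERNS))
print(first_matching_callback(ARTICLE_URL, FIXED_PATTERNS))
```

With the original patterns the article URL is claimed by the first entry, which has no callback; with the anchored patterns it falls through to the third entry and `parse_item`. Moving the rule that carries the callback to the front of `rules` should have a similar effect, since the article links would then be claimed by it first.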


https://www.v2ex.com/t/652470
