[Newbie help] scrapy.Request callback is never called in a Scrapy project

2019-01-09 17:38:52 +08:00
 15874103329
import scrapy

from Demo.items import DemoItem


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quores.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = DemoItem()
            text = quote.css('.text::text').extract_first()
            author = quote.css('.author::text').extract_first()
            tags = quote.css('.tags .tag::text').extract()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            yield item

        next = response.css('.pager .next a::attr("href")').extract_first()
        url = response.urljoin(next)
        if next:
            yield scrapy.Request(url=url, callback=self.parse)
2435 views
Node: Python
10 replies
15874103329
2019-01-09 17:39:36 +08:00
I wrote this following the tutorial, but my code stops after crawling only one page. Could someone take a look?
15874103329
2019-01-09 21:38:06 +08:00
Help, please.
Leigg
2019-01-10 00:00:46 +08:00
Print out next.
carry110
2019-01-10 04:48:30 +08:00
On the line that sets next, try it without extract_first().
carry110
2019-01-10 10:55:22 +08:00
Remove the if next: check and it works. I tested it myself!
15874103329
2019-01-10 10:58:01 +08:00
@Leigg Printing next gives '/page/2/'
and url is 'http://quotes.toscrape.com/page/2/'
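The printed values suggest that building the next-page URL was not the problem. A minimal sketch of that step, using only the standard library: as far as I know, response.urljoin resolves a relative href against the response URL in the same way as urllib.parse.urljoin. The helper name next_page_url and the '/page/10/' URL are hypothetical, chosen only for illustration.

```python
from urllib.parse import urljoin


def next_page_url(page_url, href):
    """Return the absolute URL of the next page, or None if there is no link."""
    if not href:
        # When the selector finds no "next" link, extract_first() gives None.
        return None
    return urljoin(page_url, href)


print(next_page_url('http://quotes.toscrape.com/', '/page/2/'))
# http://quotes.toscrape.com/page/2/
print(next_page_url('http://quotes.toscrape.com/page/10/', None))
# None
```

Checking for a missing link before joining is the reason to keep a guard like if next: in the spider, since a page without a "next" link has nothing to follow.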
15874103329
2019-01-10 11:00:16 +08:00
@carry110
It still only crawls one page for me, not sure what is going on.
Leigg
2019-01-10 12:09:33 +08:00
Post the log from when Scrapy finishes.
15874103329
2019-01-10 12:49:54 +08:00
@Leigg
2019-01-10 11:35:18 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'http': <GET http://http//quotes.toscrape.com/page/2>
2019-01-10 11:35:18 [scrapy.core.engine] INFO: Closing spider (finished)
2019-01-10 11:35:18 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 446,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 2701,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 1, 10, 3, 35, 18, 314550),
'item_scraped_count': 10,
'log_count/DEBUG': 14,
'log_count/INFO': 7,
'offsite/domains': 1,
'offsite/filtered': 9,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2019, 1, 10, 3, 35, 14, 371325)}
2019-01-10 11:35:18 [scrapy.core.engine] INFO: Spider closed (finished)
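The first log line reports the filtered request's host as 'http'. A quick way to see why is to split the logged URL with the standard library. The URL is copied from the log. That it came from an edited version of the spider is only a guess, since the thread does not show the code that produced it.

```python
from urllib.parse import urlsplit

# URL copied from the "Filtered offsite request" log line.
logged_url = 'http://http//quotes.toscrape.com/page/2'

parts = urlsplit(logged_url)
print(parts.hostname)  # http
print(parts.path)      # //quotes.toscrape.com/page/2
```

With the scheme repeated, the real domain ends up in the path and the host becomes 'http', which no allowed_domains entry for quotes.toscrape.com would match.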
15874103329
2019-01-10 13:00:39 +08:00
Solved. I changed the code to yield scrapy.http.Request(url, callback=self.parse, dont_filter=True)
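A likely reason this works: the spider as posted sets allowed_domains to 'quores.toscrape.com', which does not match 'quotes.toscrape.com', and requests marked dont_filter=True are let through the offsite filter. The sketch below is a simplified stand-in for that check, not Scrapy's own code. The function is_allowed and its exact matching rule are assumptions made for illustration.

```python
from urllib.parse import urlsplit


def is_allowed(url, allowed_domains, dont_filter=False):
    """Simplified stand-in for an offsite check (not Scrapy's own code)."""
    if dont_filter:
        # Requests flagged dont_filter skip the domain check entirely.
        return True
    host = urlsplit(url).hostname or ''
    # Allow an exact match or any subdomain of an allowed domain.
    return any(host == d or host.endswith('.' + d) for d in allowed_domains)


url = 'http://quotes.toscrape.com/page/2/'
print(is_allowed(url, ['quores.toscrape.com']))                    # False
print(is_allowed(url, ['quores.toscrape.com'], dont_filter=True))  # True
print(is_allowed(url, ['quotes.toscrape.com']))                    # True
```

Under this reading, correcting allowed_domains to ['quotes.toscrape.com'] should let the original yield scrapy.Request(url=url, callback=self.parse) through without dont_filter. That may be the cleaner fix, because as far as I know dont_filter=True also turns off duplicate-request filtering for that request.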

https://www.v2ex.com/t/525440
