I start from an initial URL, and the page content I get back looks like this:
["boy",["boys clothes","boys shoes","boys toys","boys socks","boyfriend gifts","boys shorts","boys underwear","boys sandals","boys","boys baseball pants"],[{"nodes":[{"name":"Boys' Clothing","alias":"fashion-boys-clothing"},{"name":"Amazon Fashion","alias":"fashion-brands"},{"name":"Baby","alias":"baby-products"},{"name":"Baby Boys' Clothing & Shoes","alias":"fashion-baby-boys"}]},{},{},{},{},{},{},{},{},{}],[]]
The part marked in red (in the original post) is what I need to scrape; it is also the parameter I have to carry into the next lookup, whose result looks much like the snippet below. What I want to do is extract every one of these red-marked parts:
["boy",["boys clothes","boys shoes","boys toys","boys socks","boyfriend gifts","boys shorts","boys underwear","boys sandals","boys","boys baseball pants"],[{"nodes":[{"name":"Boys' Clothing","alias":"fashion-boys-clothing"},{"name":"Amazon Fashion","alias":"fashion-brands"},{"name":"Baby","alias":"baby-products"},{"name":"Baby Boys' Clothing & Shoes","alias":"fashion-baby-boys"}]},{},{},{},{},{},{},{},{},{}],[]]
My approach is as follows:
class MYItem(scrapy.Item):
    Keyword = scrapy.Field()
    Nodes = scrapy.Field()

class Spider(CrawlSpider):
    name = 'mySpider'
    allowed_domains = ['a.com']
    start_urls = ['http://a.com/q=boy&alias=aps']

    def parse(self, response):
        # suggestvalueArr holds a string array like
        # ["boys clothes", "boys shoes", "boys toys", "boys socks",
        #  "boyfriend gifts", "boys shorts", "boys underwear",
        #  "boys sandals", "boys", "boys baseball pants"]
        for sel in suggestvalueArr:
            item = MYItem()
            item['Keyword'] = sel
            item['Nodes'] = nodes
            yield item
        for sel in suggestvalueArr:
            tmpurl = "http://a.com&q=%s&search-alias=aps" % sel
            yield scrapy.Request(tmpurl, callback=self.parse)
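Two details in the sketch above could explain an early stop. First, "http://a.com&q=%s" has no "?" separating path from query string, and keywords containing spaces are never URL-encoded. Second, Scrapy's scheduler drops duplicate requests by default, and suggestion lists loop back on themselves (e.g. "boys" suggests keywords already seen), so the spider closes once every reachable keyword has been requested once; the Scrapy docs also warn against overriding parse on a CrawlSpider, since CrawlSpider uses that method internally. A hedged sketch of safer URL construction (a.com is the placeholder domain from the post; build_url is my own helper):

```python
from urllib.parse import urlencode

def build_url(keyword, alias="aps", base="http://a.com/"):
    # urlencode builds a proper query string and escapes spaces etc.
    return base + "?" + urlencode({"q": keyword, "search-alias": alias})

print(build_url("boys clothes"))  # http://a.com/?q=boys+clothes&search-alias=aps
```

In the spider this would be yield scrapy.Request(build_url(sel), callback=self.parse); if genuinely repeated URLs must be re-fetched, scrapy.Request also accepts dont_filter=True.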
Why does it feel like my crawl finishes before everything has been scraped? Can anyone spot the problem? Thanks.