Calling Python Scrapy experts

2016-05-01 17:03:03 +08:00
 websitevisor

I have a starting URL, and this is the content it returns:

http://a.com/q=boy&alias=aps

["boy",["boys clothes","boys shoes","boys toys","boys socks","boyfriend gifts","boys shorts","boys underwear","boys sandals","boys","boys baseball pants"],[{"nodes":[{"name":"Boys' Clothing","alias":"fashion-boys-clothing"},{"name":"Amazon Fashion","alias":"fashion-brands"},{"name":"Baby","alias":"baby-products"},{"name":"Baby Boys' Clothing & Shoes","alias":"fashion-baby-boys"}]},{},{},{},{},{},{},{},{},{}],[]]

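The part highlighted in red is what I need to scrape. It is also the parameter that the next lookup has to carry, and every result looks like the response below. What I want to do is extract all of the red parts from every response.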

["boy",["boys clothes","boys shoes","boys toys","boys socks","boyfriend gifts","boys shorts","boys underwear","boys sandals","boys","boys baseball pants"],[{"nodes":[{"name":"Boys' Clothing","alias":"fashion-boys-clothing"},{"name":"Amazon Fashion","alias":"fashion-brands"},{"name":"Baby","alias":"baby-products"},{"name":"Baby Boys' Clothing & Shoes","alias":"fashion-baby-boys"}]},{},{},{},{},{},{},{},{},{}],[]]
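For reference, a minimal sketch of how a response shaped like the sample above could be unpacked with the standard `json` module. The helper name `extract_suggestions` and the shortened `SAMPLE_BODY` are made up for illustration; the layout (query, keyword list, list of objects that may hold `nodes`) is assumed from the single sample shown here and may not hold for every response.

```python
import json

# Shortened, hypothetical copy of the sample response shown above.
SAMPLE_BODY = (
    '["boy",["boys clothes","boys shoes"],'
    '[{"nodes":[{"name":"Amazon Fashion","alias":"fashion-brands"},'
    '{"name":"Baby","alias":"baby-products"}]},{}],[]]'
)


def extract_suggestions(body):
    """Return (query, keywords, nodes) from a body shaped like the sample."""
    data = json.loads(body)
    query = data[0]      # the term that was searched, e.g. "boy"
    keywords = data[1]   # list of suggested keyword strings
    # Collect "nodes" from every entry of the third element; in the sample
    # only the first entry has any, the rest are empty objects.
    nodes = []
    for entry in data[2]:
        nodes.extend(entry.get("nodes", []))
    return query, keywords, nodes


query, keywords, nodes = extract_suggestions(SAMPLE_BODY)
print(query, keywords, [n["alias"] for n in nodes])
```

Inside a spider callback the same function could be fed the decoded response body, which would give the keyword list and the nodes without any string slicing.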

My approach is as follows:

import scrapy
from scrapy.spiders import CrawlSpider


class MYItem(scrapy.Item):
    Keyword = scrapy.Field()
    Nodes = scrapy.Field()


class Spider(CrawlSpider):
    name = 'mySpider'
    allowed_domains = ['a.com']
    start_urls = ['http://a.com/q=boy&alias=aps']

    def parse(self, response):
        # suggestvalueArr ends up as a list of strings such as
        # ["boys clothes","boys shoes","boys toys","boys socks","boyfriend gifts","boys shorts","boys underwear","boys sandals","boys","boys baseball pants"]
        # Emit one item per suggested keyword.
        for sel in suggestvalueArr:
            item = MYItem()
            item['Keyword'] = sel
            item['Nodes'] = nodes
            yield item

        # Queue a follow-up request for each suggested keyword.
        for sel in suggestvalueArr:
            tmpurl = "http://a.com&q=%s&search-alias=aps" % sel
            yield scrapy.Request(tmpurl, callback=self.parse)

Why do I get the feeling that the crawl ends before everything has been scraped? Can anyone spot the problem? Thanks.
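One thing that might be worth checking, assuming the real spider builds its follow-up URLs exactly as posted: in `"http://a.com&q=%s&search-alias=aps"` there is no `/` after the host, so the whole `a.com&q=...` string is parsed as the hostname. Scrapy's offsite filtering drops requests whose host does not match `allowed_domains`, which would stop the crawl after the first page. The keywords also contain spaces and are not URL-escaped. The sketch below uses only the standard library; `build_next_url` and `is_on_allowed_domain` are hypothetical helpers, and the corrected URL shape simply mirrors the start URL, which is a guess.

```python
from urllib.parse import quote_plus, urlparse

ALLOWED_DOMAIN = "a.com"  # same value as allowed_domains in the post


def build_next_url(keyword):
    """Follow-up URL in the same shape as the start URL, keyword escaped."""
    # Assumption: the follow-up request uses the same path and parameter
    # names as the start URL from the post.
    return "http://a.com/q=%s&alias=aps" % quote_plus(keyword)


def is_on_allowed_domain(url, domain=ALLOWED_DOMAIN):
    """Rough stand-in for the host check an offsite filter performs."""
    host = urlparse(url).hostname or ""
    return host == domain or host.endswith("." + domain)


# URL built the way the posted code builds it.
posted_style = "http://a.com&q=%s&search-alias=aps" % "boys clothes"

print(urlparse(posted_style).hostname)
print(is_on_allowed_domain(posted_style))                    # False
print(is_on_allowed_domain(build_next_url("boys clothes")))  # True
```

Separately, Scrapy's default duplicate filter silently skips requests for URLs it has already seen. Since the same suggestions come back for many different queries, a large share of the yielded requests is expected to be dropped, and that alone can make the crawl look shorter than the number of `yield` calls would suggest.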


Full version of this thread on V2EX:

https://www.v2ex.com/t/275667
