A problem with scrapy's JSON output

2013-12-26 14:33:56 +08:00
 brucebot
I'm using scrapy to scrape the titles and links of industrial robot videos on YouTube, and I want to write them to a JSON file.
Here is my code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from scrapy import log
from youtube.items import YoutubeItem

class YoutubeSpider(CrawlSpider):
    name = "youtube"
    allowed_domains = ["youtube.com"]
    start_urls = ['http://www.youtube.com/results?search_query=industrial+robot+Assembling&page=%d' %n for n in range (1,2)]
    rules = ()
    def parse(self,response):
        print "Start scrapping youtube videos info..."
        hxs=HtmlXPathSelector(response)
        bases = hxs.select('//*[@id="results"]//*[@id="search-results"]')
        items=[]
        for base in bases:
            item = YoutubeItem()
            t_title=base.select('//*[@id="search-results"]/li/div/h3/a//text()').extract()
            item['title']=map(lambda s: s.strip(), t_title)
            item['linkID'] = base.select('//*[@id="search-results"]/li/div/h3/a/@href').extract()
            #t_desc=base.select('//*[@id="search-results"]/li/div[2]/div[2]/text()')
            #t_desc="".join(base.select('//*[@id="search-results"]/li/div[2]/div[2]/text()').extract_unquoted())
            #item['description']=t_desc
            #item['thumbnail'] = base.select('//*[@id="search-results"]/li/div/a//img/@src').extract()
            items.append(item)
        return(items)

But the output is:
{'linkID': [u'/watch?v=iFKbpbe_9pw',
u'/watch?v=Fnlzl6sBOsA',
u'/watch?v=QbrqeJRy0hY',
u'/watch?v=u6-d5VkOB3I',
u'/watch?v=9--qNRr1VZI',
u'/watch?v=89prwGUZjM0',
u'/watch?v=txahbz9eswk',
u'/watch?v=52ptIgooZ64',
u'/watch?v=goNOPztC_qE',
u'/watch?v=daH5Xs11uQc',
u'/watch?v=V2V3Cu0nWvg',
u'/watch?v=TQwN-YeWXfs',
u'/watch?v=aWDAG3fz-ec',
u'/watch?v=Xmn06cpqngs',
u'/watch?v=iuaAEDrrVyg',
u'/watch?v=TG4yzjV4d8w&list=PLECC02EA2EAE0E159',
u'/watch?v=GCCW9O7IKhY',
u'/watch?v=O8HwEXDLug8',
u'/watch?v=yYCHUT79tFM',
u'/watch?v=82w_r2D1Ooo'],
'title': [u'Assembly Line Robot Arms on How Do They Do It',
u'Engine Assembly Robots - FANUC Robot Industrial Automation',
u'LR Mate 200iC USB Memory Stick Assembly Robot - FANUC Robot Industrial Automation',
u'R-2000iA Automotive Assembly Robots - FANUC Robotics Industrial Automation',
u'M-3iA Flexible Solar Collector Assembly Robot - FANUC Robotics Industrial Automation',
u'LR Mate 200iB Gas Can Assembly Robot - FANUC Robotics Industrial Automation',
u'ABB Robotics - Assembly of electrical sockets',
u'M-1iA Circuit Board Assembly Robots - FANUC Robotics Industrial Automation',
u'ABB Robotics - Assembly of digital camera',
u'M-1iA LED Lens Assembly Robots - FANUC Robotics Industrial Automation',
u'LR Mate Small Piston Engine Assembly Robots - FANUC Robotics Industrial Automation',
u'M-1iA Keyboard Assembly Robots - FANUC Robotics Industrial Automation',
u'M-1iA Ball Bearing Assembly Robot - FANUC Robotics Industrial Automation',
u'M-1iA Intelligent Gear Assembly Robot - FANUC Robotics Industrial Automation',
u'LR Mate 200iC Small Part Assembly Robots - FANUC Robotics Industrial Automation',
u'Assembly Robots - FANUC Robotics Application Videos',
u'M-1iA/LR Mate 200iC Solar Panel Assembly Robots - FANUC Robotics Industrial Automation',
u'ABB Robotics - Assembly',
u'Yaskawa Motoman SDA10 Robot Assembly Video',
u'Toyota Camry Hybrid Factory Robots']}


What I want is for each linkID to be paired with its title. What is going wrong here?
6583 views
Node: Python
11 replies
greatghoul
2013-12-26 14:44:02 +08:00
I think it would be better to put the code in a gist and post the link.
YouTube should have an API, right? Have you considered getting this done without scraping?
brucebot
2013-12-26 14:50:26 +08:00
@greatghoul I wanted to edit the post, but the edit window seems to have passed. I'm also using scrapy as a learning exercise.

The code is here:
https://gist.github.com/brucebot/734ddc9469d3970fdc02
brucebot
2013-12-26 14:50:45 +08:00
734ddc9469d3970fdc02
brucebot
2013-12-26 14:51:43 +08:00
muzuiget
2013-12-26 14:54:25 +08:00
Just stitch the two lists together with zip.
brucebot
2013-12-26 14:58:55 +08:00
@greatghoul @livid
My mistake, the gist I created was a secret one.

https://gist.github.com/brucebot/8130663
brucebot
2013-12-26 14:59:31 +08:00
@muzuiget Stitch them back together? The same example produces normal output, which is what I find strange.
muzuiget
2013-12-26 20:47:40 +08:00
@brucebot Like this: zip(result['linkID'], result['title'])

In your parse(), items is a list, but what comes back is a dict, so it must have been converted a second time somewhere.
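
A minimal sketch of the zip idea in Python 3, assuming `result` is the single item shown in the question (only its first three entries are copied here). The helper name `pair_up` is made up for illustration and is not part of scrapy.

```python
import json

# First three entries copied from the output in the question.
result = {
    "linkID": [
        "/watch?v=iFKbpbe_9pw",
        "/watch?v=Fnlzl6sBOsA",
        "/watch?v=QbrqeJRy0hY",
    ],
    "title": [
        "Assembly Line Robot Arms on How Do They Do It",
        "Engine Assembly Robots - FANUC Robot Industrial Automation",
        "LR Mate 200iC USB Memory Stick Assembly Robot - FANUC Robot Industrial Automation",
    ],
}


def pair_up(item):
    """Turn one item holding two parallel lists into one dict per video."""
    return [
        {"linkID": link, "title": title}
        for link, title in zip(item["linkID"], item["title"])
    ]


pairs = pair_up(result)
print(json.dumps(pairs, indent=2))
```

This only works when both lists line up index by index. zip stops silently at the shorter list, so if one link yields several text nodes the pairs would drift, and it is worth comparing the two lengths first.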
rayind
2013-12-27 11:20:13 +08:00
The XPath selection part is wrong.
These are the lines to change:
bases = hxs.select('//*[@id="results"]//*[@id="search-results"]/*')

t_title=base.select('div/h3/a//text()').extract()

item['linkID'] = base.select('div/h3/a/@href').extract()
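
Why this helps: in scrapy's selectors, an XPath that starts with // is evaluated from the document root even when it is called on a sub-selector, so the original loop ran once over the single container and collected every link on the page into one item. The sketch below shows the same contrast using only Python 3's standard library. The markup is a simplified, hypothetical stand-in for the results page, and both function names are made up for illustration. ElementTree does not accept absolute paths on an element, so the first function only reproduces the shape of the problem rather than the exact scrapy call.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified, well-formed stand-in for the search results page.
PAGE = """
<div id="results">
  <ol id="search-results">
    <li><div><h3><a href="/watch?v=iFKbpbe_9pw">Assembly Line Robot Arms on How Do They Do It</a></h3></div></li>
    <li><div><h3><a href="/watch?v=txahbz9eswk">ABB Robotics - Assembly of electrical sockets</a></h3></div></li>
  </ol>
</div>
"""

root = ET.fromstring(PAGE)


def parse_one_big_item(root):
    """Loop over the container: one item that holds every link on the page."""
    items = []
    for base in root.findall(".//*[@id='search-results']"):
        anchors = base.findall("./li/div/h3/a")
        items.append({
            "title": [a.text.strip() for a in anchors],
            "linkID": [a.get("href") for a in anchors],
        })
    return items


def parse_one_item_per_result(root):
    """Loop over the container's children and query relative to each one."""
    items = []
    for base in root.findall(".//*[@id='search-results']/*"):
        anchor = base.find("./div/h3/a")
        if anchor is None:
            continue  # skip entries that are not video results
        items.append({
            "title": anchor.text.strip(),
            "linkID": anchor.get("href"),
        })
    return items


print(parse_one_big_item(root))
print(parse_one_item_per_result(root))
```

The second function matches the corrected lines above: the outer query ends in /* so that each result becomes its own base, and the inner queries are relative to that base.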
brucebot
2013-12-27 13:42:59 +08:00
@rayind Thank you very much, the output is finally correct.
brucebot
2013-12-27 13:43:48 +08:00
@muzuiget Thanks all the same. I'm not very familiar with that approach, though, and @rayind's method worked.
