Crawling Sina Weibo with Scrapy: after the page is redirected, my custom parse function is never called. What is going on?

2017-01-08 17:57:03 +08:00
 XDMonkey

Crawling other URLs works fine. Is it because Sina Weibo redirects the request?

import scrapy
import re 
from scrapy.selector import Selector
from scrapy.http import Request
from tutorial.items import DmozItem
from string import maketrans
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
def extractData(regex, content, index=1): 
    r = '0' 
    p = re.compile(regex) 
    m = p.search(content) 
    if m: 
        r = m.group(index) 
    return r 
class DmozSpider(CrawlSpider):
    name = "dmoz"
    allowed_domains = ["weibo.com"]
    download_delay = 2
    rules=[
        Rule(LinkExtractor(allow=('/')),callback='parse_item',follow=True)
        ]

    headers = {
        "Accept": "*/*",
        "Accept-Encoding": "gzip, deflate, sdch, br",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Connection": "keep-alive",
        # "Host": "login.sina.com.cn",
        "Referer": "http://weibo.com/",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36"
    }
    cookies = {
        'ALF':'my cookie',
        'Apache':'my cookie',
        'SCF':'my cookie',
        'SINAGLOBAL':'my cookie',
        'SSOLoginState':'my cookie',
        'SUB':'my cookie',
        'SUBP':'my cookie',
        'SUHB':'my cookie',
        'TC-Page-G0':'my cookie',
        'TC-Ugrow-G0':'my cookie',
        'TC-V5-G0':'my cookie',
        'ULV':'my cookie',
        'UOR':'my cookie',
        'WBStorage':'my cookie',
        'YF-Page-G0':'my cookie',
        'YF-Ugrow-G0':'my cookie',
        'YF-V5-G0':'my cookie',
        '_s_tentry':'-',
        'log_sid_t':'my cookie',
        'un':'my cookie',
    }
    def start_requests(self):
        return [Request("http://weibo.com/u/2010226570?refer_flag=1001030101_&is_all=1",cookies = self.cookies,headers=self.headers)]

    def parse_item(self, response):
        print "comehere!"
        regexID=r'class=\\"username\\">(.*)\<\\/h1>'
        content=response.body
        item=DmozItem()
        ID=extractData(regexID,content,1)
        item['ID']=ID
        print ID       
        yield item

The console output is as follows:

2017-01-08 17:51:34 [scrapy.core.engine] INFO: Spider opened
2017-01-08 17:51:34 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-01-08 17:51:34 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-01-08 17:51:46 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://login.sina.com.cn/sso/login.php?url=http%3A%2F%2Fweibo.com%2Fu%2F2010226570%3Frefer_flag%3D1001030101_%26is_all%3D1&_rand=1483869098.691&gateway=1&service=miniblog&entry=miniblog&useticket=1&returntype=META&sudaref=http%3A%2F%2Fweibo.com%2F&_client_version=0.6.23> from <GET http://weibo.com/u/2010226570?refer_flag=1001030101_&is_all=1>
2017-01-08 17:51:47 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (meta refresh) to <GET http://weibo.com/u/2010226570?refer_flag=1001030101_&is_all=1&sudaref=weibo.com&retcode=6102> from <GET http://login.sina.com.cn/sso/login.php?url=http%3A%2F%2Fweibo.com%2Fu%2F2010226570%3Frefer_flag%3D1001030101_%26is_all%3D1&_rand=1483869098.691&gateway=1&service=miniblog&entry=miniblog&useticket=1&returntype=META&sudaref=http%3A%2F%2Fweibo.com%2F&_client_version=0.6.23>
2017-01-08 17:51:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://weibo.com/u/2010226570?refer_flag=1001030101_&is_all=1&sudaref=weibo.com&retcode=6102> (referer: http://weibo.com/)
2017-01-08 17:51:49 [scrapy.core.engine] INFO: Closing spider (finished)
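
To make the redirect chain in the log easier to read, here is a small sketch that decodes the two URLs using only the Python 3 standard library (the original spider is Python 2). The URLs are copied from the log above; the helper name `query_params` is made up for illustration, and the meaning of `retcode=6102` is not stated anywhere in this thread.

```python
from urllib.parse import urlsplit, parse_qs

# URLs copied from the log above.
LOGIN_REDIRECT = (
    "http://login.sina.com.cn/sso/login.php?"
    "url=http%3A%2F%2Fweibo.com%2Fu%2F2010226570%3Frefer_flag%3D1001030101_%26is_all%3D1"
    "&_rand=1483869098.691&gateway=1&service=miniblog&entry=miniblog"
    "&useticket=1&returntype=META&sudaref=http%3A%2F%2Fweibo.com%2F"
    "&_client_version=0.6.23"
)
FINAL_URL = (
    "http://weibo.com/u/2010226570?"
    "refer_flag=1001030101_&is_all=1&sudaref=weibo.com&retcode=6102"
)


def query_params(url):
    """Return the query string of `url` as a dict of single, decoded values."""
    return {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}


login_params = query_params(LOGIN_REDIRECT)
final_params = query_params(FINAL_URL)

# Where the login endpoint intends to send the client back to.
print(login_params["url"])
# How it sends the client back (matches the "meta refresh" line in the log).
print(login_params["returntype"])
# The page that was finally crawled with status 200.
print(urlsplit(FINAL_URL).path)
print(final_params["retcode"])
```

Read this way, the 302 goes to the login endpoint with the original profile URL in its `url` parameter, and the meta refresh lands back on the same `/u/2010226570` path, which is then fetched with status 200. So the page itself appears to have been downloaded. One thing that may be worth checking from here: in a `CrawlSpider`, a `Rule`'s callback is applied to responses for links that the rule extracts, while the response to a request from `start_requests` with no explicit callback goes through `parse_start_url` instead.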
7 replies
XDMonkey
2017-01-08 18:11:00 +08:00
gouchaoer
2017-01-08 18:33:22 +08:00
Don't you know that only 梁博 can handle Weibo?
hiluluke
2017-01-08 18:45:26 +08:00
Just stuff in some cookies and it won't redirect...
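
For reference, a minimal sketch of what "stuffing in cookies" amounts to at the HTTP level. It uses the Python 3 standard library rather than Scrapy, only builds the request without sending it, and the function name `build_request` and the cookie values are placeholders invented for illustration.

```python
from urllib.request import Request


def build_request(url, cookies, user_agent):
    """Build (but do not send) a GET request that carries `cookies`."""
    # A Cookie header is a "; "-separated list of name=value pairs.
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    return Request(url, headers={"Cookie": cookie_header, "User-Agent": user_agent})


# Placeholder values; real ones would come from a logged-in browser session.
req = build_request(
    "http://weibo.com/u/2010226570",
    {"SUB": "placeholder-sub", "SUBP": "placeholder-subp"},
    "Mozilla/5.0",
)
print(req.get_header("Cookie"))
```

In the spider from the original post, the same idea is expressed by passing the `cookies` dict to `Request`.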
XDMonkey
2017-01-08 19:10:03 +08:00
@hiluluke I did add cookies, and after the redirect the login succeeded. The code above includes the cookies too 😫
XDMonkey
2017-01-08 19:13:34 +08:00
@gouchaoer Huh? I don't quite follow. Could you explain?
sunwei0325
2017-01-08 23:56:30 +08:00
I'd suggest trying the WAP (mobile) version of Weibo.
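
A rough sketch of how a desktop profile URL might be rewritten for the WAP site. The reply does not say which host or path to use, so `weibo.cn` and the `/u/<uid>` path are assumptions here, and `to_wap_url` is a hypothetical helper name.

```python
import re
from urllib.parse import urlsplit


def to_wap_url(desktop_url):
    """Map a weibo.com/u/<uid> profile URL to the assumed weibo.cn form."""
    parts = urlsplit(desktop_url)
    match = re.fullmatch(r"/u/(\d+)/?", parts.path)
    if parts.hostname not in ("weibo.com", "www.weibo.com") or match is None:
        raise ValueError(f"not a weibo.com profile URL: {desktop_url}")
    # Assumption: the WAP site serves profiles under the same /u/<uid> path.
    return f"https://weibo.cn/u/{match.group(1)}"


print(to_wap_url("http://weibo.com/u/2010226570?refer_flag=1001030101_&is_all=1"))
```

The query parameters of the desktop URL are dropped on purpose, since nothing in the thread indicates that they carry over to the mobile site.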
XDMonkey
2017-07-16 13:12:00 +08:00
@sunwei0325 Thanks a lot

https://www.v2ex.com/t/333099
