How do I fix garbled topic titles when scraping AppleTuan with pyspider?

```
from pyspider.libs.base_handler import *
import re

class Handler(BaseHandler):

    '''
    this is a sample handler
    '''
    crawl_config = {
        "headers": {
            "User-Agent": "BaiDuSpider",
        }
    }

    @every(minutes=24 * 60, seconds=0)
    def on_start(self):
        self.crawl('http://www.appletuan.com/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('.topic_title a[href^="http://"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('.header h1').text(),
        }
```

The returned titles come back garbled, yet the same script works fine on v2ex. How can I fix this?

binux

2014-11-27 14:57:40 +08:00

```
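import pyquery  # add this at the top of the script

def detail_page(self, response):
    # Build the document from the already-decoded text instead of response.doc
    doc = pyquery.PyQuery(response.text)
    return {
        "url": response.url,
        "title": doc('.header h1').text(),
    }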
```

That is the annoying thing about lxml: hand it unicode and it sometimes refuses to accept it, hand it bytes and it does not handle them properly.
See https://github.com/binux/pyspider/pull/24
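A rough illustration of why decoding first can help, using only the standard library. It assumes the page is served as UTF-8 and that the parser, given raw bytes, falls back to a single-byte charset such as Latin-1; the thread does not confirm either detail, and the title string below is a made-up stand-in for the real page content.

```python
# Hypothetical title; the real page content is not shown in the thread.
title = u"苹果团"

# Bytes as they might arrive over the wire (assumption: the site serves UTF-8).
raw = title.encode("utf-8")

# Assumed failure mode: the bytes get interpreted with a single-byte charset,
# so every UTF-8 byte turns into its own (wrong) character.
garbled = raw.decode("latin-1")

# Decoding with the correct charset before parsing recovers the original text.
# This is roughly what response.text provides ahead of PyQuery.
decoded = raw.decode("utf-8")

print(garbled == title)  # False: mojibake
print(decoded == title)  # True: intact title
```

If this assumption holds, passing `response.text` to PyQuery sidesteps the parser's own charset guess, because the text is already unicode by the time lxml sees it.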

hillw4h

2014-11-27 15:13:00 +08:00

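@binux It works after making the change you suggested. Thanks.
To be honest, though, I don't understand why this change fixes it, haha. (I couldn't follow the issue either - -!!)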

hillw4h

2014-11-27 15:20:38 +08:00

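@binux I had assumed it was an encoding problem. Thanks for the answer!
BTW: I really like this pyspider you built; even a beginner like me can scrape lots of the data I want. Haha.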

https://www.v2ex.com/t/149659