Character encoding problem in Python 3.6

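I'm writing a crawler to monitor a web page: whenever the page is updated, it should collect the new content and notify me by email. I got stuck right at the start...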

Environment + IDE: Windows 10, Python 3.6.4, VS Code
URL to monitor: http://www.wh-ccic.com.cn/node_13613.htm
What I need are the images listed under each month, and all the images on the pages http://www.wh-ccic.com.cn/content/2018-05/08/content_443454.htm and http://www.wh-ccic.com.cn/content/2018-05/08/content_443453.htm, saved into one folder per month.

Problem: the Chinese in the extracted month names comes out garbled, e.g.: 2018å¹´05æ

I checked the page source and it declares charset=utf-8, and I'm on Python 3.6, so I can't see why the text would be garbled. Testing the same xpath in the Chrome console works fine.

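I searched Baidu and Google extensively. Most results say it is an encoding problem, and I read article after article about encodings until my head hurt, but none of the suggested methods fixed it. So I'm posting here to ask whether anyone has run into the same issue.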

What I have tried:

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')
Encoding conversion: text.encode('utf-8').decode('unicode_escape')

PS: when I print the text of requests.get(), all the Chinese in it is garbled.
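For reference, here is a minimal sketch of how this particular kind of garbling can arise. It assumes the body is really UTF-8 but was decoded as ISO-8859-1 somewhere before printing. The sample string "2018年05月" is a guess at the intended month label, not something taken from the page. If that assumption holds, changing the stdout encoding cannot help, because the string is already wrong before it is printed, and unicode_escape only adds a second wrong decoding.

```python
# Assumed sample text; the real page content may differ.
original = "2018年05月"

# What a server sending UTF-8 would put on the wire.
raw = original.encode("utf-8")

# Decoding UTF-8 bytes as ISO-8859-1 maps every byte to one character,
# which produces the "å¹´"-style garbage.
garbled = raw.decode("iso-8859-1")
print(repr(garbled))

# The attempted conversion re-encodes the garbage and decodes it again
# byte by byte, so it does not get back to the original text.
still_wrong = garbled.encode("utf-8").decode("unicode_escape")

# Undoing the wrong step (back to the same bytes, then UTF-8) does.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
print(repaired)
```

The round trip on the last step only works because ISO-8859-1 maps all 256 byte values, so no information is lost by the wrong decode.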

Here is the test demo:

import requests
'''
    import re
    import sys
    import io

    sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')
'''
from lxml import html

url = 'http://www.wh-ccic.com.cn/node_13613.htm'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'
}
base_url = 'http://www.wh-ccic.com.cn'
page = requests.get(url, headers=headers)
tree = html.fromstring(page.text)
print(page.text)
all_a = tree.xpath('.//*[@class="STYLE13"]/a')
for a in all_a:
    # print(a.attrs['href'])
    # href = a.attrs['href']
    # title = a.text.replace(u'\xe5', u' ')
    href = a.attrib['href']
    title = a.text
    if '\\u' in title:
        title = title.encode('utf-8').decode('unicode_escape')
        print(title)

Sylv

2018-05-11 01:44:51 +08:00

Here is what the requests documentation says about encodings:

Encodings

When you receive a response, Requests makes a guess at the encoding to use for decoding the response when you access the Response.text attribute. Requests will first check for an encoding in the HTTP header, and if none is present, will use chardet to attempt to guess the encoding.

The only time Requests will not do this is if no explicit charset is present in the HTTP headers and the Content-Type header contains text. In this situation, RFC 2616 specifies that the default charset must be ISO-8859-1. Requests follows the specification in this case. If you require a different encoding, you can manually set the Response.encoding property, or use the raw Response.content.

http://docs.python-requests.org/en/latest/user/advanced/#encodings
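
Applied to the demo, the passage above suggests decoding the raw bytes with an explicit codec rather than trusting the guessed one. Below is a rough sketch of that idea. It assumes the server omits the charset from its Content-Type header (not verified here) and that the body really is UTF-8, as the page's meta tag claims. decode_page is a hypothetical helper, and the SimpleNamespace object is an offline stand-in for the value returned by requests.get(), modelling only the two attributes used.

```python
from types import SimpleNamespace


def decode_page(response, encoding="utf-8"):
    """Decode the raw body bytes with an explicit codec (hypothetical helper)."""
    return response.content.decode(encoding)


# Offline stand-in for `page = requests.get(url, headers=headers)`.
# The sample HTML and the ISO-8859-1 fallback value are assumptions.
page = SimpleNamespace(
    content='<a class="x">2018年05月</a>'.encode("utf-8"),
    encoding="ISO-8859-1",
)

# Roughly what reading the text with the fallback encoding would give.
guessed = page.content.decode(page.encoding)

# Decoding the same bytes as UTF-8 keeps the Chinese intact.
fixed = decode_page(page)
print(fixed)
```

With a real Response object, the quoted documentation indicates that setting page.encoding = 'utf-8' before reading page.text should have the same effect. The result can then be passed to html.fromstring() as in the demo.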