Problems displaying unicode in Python 2

Users want to see u'中文' rather than u'\u4e2d\u6587', but Python 2 does not always display it that way.

Escaping

An escape character is a character that signals an alternative interpretation of the characters that follow it in a character sequence [1].

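# Holds for Python 3 str and for Python 2 u'...' literals; in a Python 2 byte str, \u is not an escape
>>> ["\u4e2d\u6587"] == ["中文"]
True
>>> '["\u4e2d\u6587"]' == '["中文"]'
True

# Raw strings leave the escapes unprocessed, so the two are no longer equal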
>>> r'["\u4e2d\u6587"]' == r'["中文"]'
False
>>> r'["\u4e2d\u6587"]'
'["\\u4e2d\\u6587"]'
>>> r'["中文"]'
'["中文"]'

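Escape mechanisms differ from language to language, so when the string '["\u4e2d\u6587"]' is sent to a browser, the browser does not interpret the escapes and displays the literal text '["\u4e2d\u6587"]'.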
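The difference between showing the text as-is and interpreting its escapes can be sketched as follows. This is only an illustration: the variable names are made up for the example, and json.loads stands in for whatever parser the receiving side would use.

```python
# -*- coding: utf-8 -*-
import json

# What goes over the wire: a raw string, so each \uXXXX is six literal characters.
wire_text = r'["\u4e2d\u6587"]'

# Displayed as plain text, the backslashes stay visible.
print(wire_text)

# Parsed as JSON, the escapes are interpreted by the JSON parser.
parsed = json.loads(wire_text)

print(len(wire_text))               # 16
print(parsed == [u'\u4e2d\u6587'])  # True
```

The escapes only turn into characters when something on the receiving side actually parses them.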

str()

In Python 2, str is bytes.

unicode is encoded to bytes
bytes are decoded to unicode

>>> b = u'中文'.encode('utf-8')
>>> type(u'中文')
<type 'unicode'>
>>> type(b)
<type 'str'>
>>> b
'\xe4\xb8\xad\xe6\x96\x87'
>>> b.decode('utf-8') == u'中文'
True

For a unicode object, str() is equivalent to encoding with the default encoding:

# -*- coding: utf-8 -*-

import sys
try:
    str(u'中文')
except UnicodeEncodeError:
    print(u'cannot encode non-{encoding} characters with {encoding}'.format(encoding=sys.getdefaultencoding()))  # cannot encode non-ascii characters with ascii

reload(sys)
sys.setdefaultencoding('UTF8')
print(sys.getdefaultencoding()) # UTF8

print(str(u'中文')) # 中文
print(str(u'中文') == u'中文'.encode(sys.getdefaultencoding())) # True

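Displaying unicode inside containers

A container is a class, data structure, or abstract data type whose instances are collections of other objects. In Python, list and dict are containers.

In Python, str(container) calls repr() rather than str() on each item to obtain its string form [2]. In Python 2, repr() returns a printable string representation of an object, but escapes the non-ASCII characters in it with \x, \u, or \U [3].

So we see this behavior: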

>>> print({u'\u4e2d\u6587': 1})
{u'\u4e2d\u6587': 1}

In Python 3, repr() keeps printable non-ASCII characters as they are and escapes only non-printable ones (PEP 3138), so in Python 3:

>>> print({u'\u4e2d\u6587': 1})
{'中文': 1}
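The rule that a container formats its items with repr() can be seen in isolation with a small class whose two string forms differ. Item is a hypothetical class written only for this demonstration.

```python
class Item(object):
    """Hypothetical class whose str() and repr() forms differ."""

    def __str__(self):
        return 'str-form'

    def __repr__(self):
        return 'repr-form'


item = Item()

print(str(item))           # str-form
print(str([item]))         # [repr-form]
print(str({'key': item}))  # {'key': repr-form}
```

The same thing happens to a unicode item: the container asks for its repr(), and in Python 2 that repr() is the escaped form.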

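What print does

Python converts the arguments of print() to a bytes str and then writes the result to sys.stdout.

I don't yet know exactly how the conversion is done, only that it does not go through str():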

# -*- coding: utf-8 -*-

print(u'中文') # 中文
print(str(u'中文')) # UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
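One possible explanation, stated here as an assumption rather than something this post establishes: when print receives a unicode object, the output stream encodes it with sys.stdout.encoding instead of the default encoding that str() uses. A minimal sketch of that step, with a 'utf-8' fallback chosen only for this example:

```python
# -*- coding: utf-8 -*-
import sys

text = u'\u4e2d\u6587'

# sys.stdout.encoding can be None (for example when output is piped in Python 2).
# Falling back to 'utf-8' is an assumption made for this sketch.
stream_encoding = getattr(sys.stdout, 'encoding', None) or 'utf-8'

# 'replace' avoids an exception if the stream encoding cannot represent the text.
payload = text.encode(stream_encoding, 'replace')

print(stream_encoding)
print(repr(payload))
```

If this is right, it would also explain why the same print can fail when the output is redirected, because the stream then has no encoding of its own.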

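Handling

When displaying, encode unicode to bytes with utf-8.

Because the default encoding in Python 2 is ASCII and str() does not accept an encoding argument, str() cannot be used for this.

Changing the default encoding is not advisable either [4].

The approach I have found so far is json.dumps(obj, ensure_ascii=False):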
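Encoding explicitly, without str() and without touching the default encoding, could look like the sketch below. to_bytes and text_type are hypothetical names introduced for this example, not part of any library.

```python
# -*- coding: utf-8 -*-

# unicode on Python 2, str on Python 3
text_type = type(u'')


def to_bytes(value, encoding='utf-8'):
    """Hypothetical helper: encode text explicitly instead of calling str()."""
    if isinstance(value, text_type):
        return value.encode(encoding)
    # Anything that is not text (already bytes, for instance) is returned unchanged.
    return value


encoded = to_bytes(u'\u4e2d\u6587')
print(repr(encoded))
print(encoded.decode('utf-8') == u'\u4e2d\u6587')  # True
```

This only covers a single string. A container still formats its items with repr(), so it does not solve the container case by itself.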

If ensure_ascii is true (the default), all non-ASCII characters in the output are escaped with \uXXXX sequences, and the result is a str instance consisting of ASCII characters only. If ensure_ascii is false, some chunks written to fp may be unicode instances.

>>> import json
>>> print(json.dumps([u'\u4e2d\u6587', ], ensure_ascii=False))
["中文"]
>>> print(json.dumps([u'\u4e2d\u6587', ], ensure_ascii=True))
["\u4e2d\u6587"]

Problems with using json for strings

Calling dumps more than once adds \ escapes

# coding: utf-8

import json

d = {
    json.dumps({u'中文': u'中文'}, ensure_ascii=False): 'value'
}
print(d)  # {u'{"\u4e2d\u6587": "\u4e2d\u6587"}': 'value'}
print(json.dumps(d, ensure_ascii=False))  # {"{\"中文\": \"中文\"}": "value"}

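There is no way to tell how many times dumps has been applied

So my question is: in Python 2, how do I convert a container to unicode and display the unicode correctly?

Please don't tell me to switch to Python 3. I want to find a good way to handle this in Python 2 and understand the problem thoroughly, rather than switching to Python 3 and being just as baffled the next time an encoding problem comes up.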
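Both problems can be reproduced with a short round trip. The variable names are chosen for this example only.

```python
# -*- coding: utf-8 -*-
import json

original = [u'\u4e2d\u6587']

once = json.dumps(original, ensure_ascii=False)
twice = json.dumps(once, ensure_ascii=False)

print(once == u'["\u4e2d\u6587"]')                # True
print(twice == u'"[\\"\u4e2d\u6587\\"]"')         # True: the quotes are now escaped
print(json.loads(twice) == once)                  # True: one loads undoes one dumps
print(json.loads(json.loads(twice)) == original)  # True: two are needed here
```

Each dumps adds one layer of quoting, and the reader has to know how many layers there are in order to call loads the same number of times.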
