Which programming languages currently support Unicode?

wwqgtxx

2014-05-02 21:16:14 +08:00

java/python3

ochapman

2014-05-02 21:23:32 +08:00

golang: UTF-8, which is one of the encodings of Unicode

zzNucker

2014-05-02 21:29:30 +08:00

javascript

Andor_Chen

2014-05-02 21:31:43 +08:00

Ruby 1.9+

jakwings

2014-05-02 21:32:50 +08:00

Let me naively add one myself: JavaScript (16-bit Unicode code units)

hazard

2014-05-02 22:44:18 +08:00

bash shell?

xierch

2014-05-02 22:47:49 +08:00

I think we need to pin down what "support" actually means here..

jakwings

2014-05-02 23:04:10 +08:00

@xierch I've added a postscript to the question~

jakwings

2014-05-02 23:10:42 +08:00

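@hazard =_= It does seem to count as supported, provided the bash version is recent enough and the environment's character encoding is UTF-8 compatible. In that case echo ${#str} also reports the length correctly.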

usedname

2014-05-03 01:06:32 +08:00

PHP 6? I think it supports it natively?

timothyqiu

2014-05-03 01:29:07 +08:00

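C++ has had UTF-8/UTF-16/UTF-32 support since C++11.

---

Here is a bonus pitfall that applies almost everywhere:

By a strict definition, many languages do not so much support Unicode as support one particular Unicode encoding.

* UTF-8 and UTF-16 are both variable-length encodings. In languages such as Python, Java and JavaScript, the length of the character "𠂊" comes out as 2, because its Unicode code point U+2008A takes 2 code units once it is encoded as UTF-16.

* Even a fixed-width encoding like UTF-32, where one code unit corresponds to one Unicode code point, still has a problem, because characters and code points do not always map one to one: a single character may correspond to several code points. Take "Ä", which is common in German. Unicode has two representations for it: the standalone character "Ä" (U+00C4), and the letter "A" (U+0041) followed by the combining character "¨" (U+0308). According to the Unicode standard, these two representations should be treated as the same character. Yet in the vast majority of languages, the string "\u0041\u0308", which uses the second representation, displays correctly as "Ä" but still reports a length of 2.

The second point matters most: at the moment almost no language can guarantee that it gets the correct number of characters out of a string.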
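
To make the two points above concrete, here is a minimal sketch in Python 3.3+. The helper name code_units is made up for this illustration. It only counts code units per encoding and then shows how NFC normalization changes the code point count of the decomposed "Ä"; it does not attempt real character (grapheme) counting.

```python
import unicodedata


def code_units(text: str, encoding: str) -> int:
    """Number of code units `text` occupies in the given UTF encoding.

    Hypothetical helper for illustration; only the three encodings
    listed below are handled.
    """
    unit_size = {"utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}[encoding]
    return len(text.encode(encoding)) // unit_size


han = "\U0002008A"           # the character from the first bullet, outside the BMP
decomposed = "\u0041\u0308"  # "A" + combining diaeresis, renders as "Ä"

for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    print(enc, code_units(han, enc), code_units(decomposed, enc))
# utf-8 4 3
# utf-16-le 2 2
# utf-32-le 1 2

# NFC normalization merges the pair into the single code point U+00C4.
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))  # 2 1
```

Normalizing helps here only because a precomposed "Ä" exists; combinations without a precomposed form keep more than one code point even after NFC.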

blacktulip

2014-05-03 01:30:59 +08:00

@timothyqiu

 irb
2.1.1 :001 > "𠂊".length
=> 1

blacktulip

2014-05-03 01:31:51 +08:00

 irb
2.1.1 :001 > "𠂊".length
=> 1
2.1.1 :002 > "Ä".length
=> 1

timothyqiu

2014-05-03 01:37:02 +08:00

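@blacktulip Right, I tried Ruby before replying, which is why I didn't list it...

In many languages the "string length" function simply returns the number of code units. Ruby either stores its strings as UTF-32, or decodes the string back into code points first when computing the length. (I've only scratched the surface of Ruby, so I'm not too sure.)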
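
As an illustration of the second possibility (counting code points at length time rather than storing UTF-32), here is a rough Python sketch. It is not a description of how Ruby is actually implemented, and the function name count_code_points_utf8 is invented for this example. It assumes the input is well-formed UTF-8.

```python
def count_code_points_utf8(data: bytes) -> int:
    """Count code points in well-formed UTF-8 without decoding it.

    Continuation bytes have the bit pattern 10xxxxxx; every other byte
    starts a new code point, so counting those bytes is enough.
    """
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


raw = "\U0002008A".encode("utf-8")  # the "𠂊" example: 4 bytes in UTF-8
print(len(raw), count_code_points_utf8(raw))  # 4 1
```

The cost of this approach is that length becomes a linear scan over the bytes instead of a stored number.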

timothyqiu

2014-05-03 01:38:55 +08:00

@blacktulip The Ä example has to be written with escape sequences. If you type Ä directly, it will probably be stored as \u00c4 straight away.

blacktulip

2014-05-03 01:45:30 +08:00

@timothyqiu Yeah, in Ruby every string carries its own encoding. Have a look at this: http://yokolet.blogspot.co.uk/2009/07/design-and-implementation-of-ruby-m17n.html

"Ruby multilingualization (M17N) of Ruby 1.9 uses the code set
independent model (CSI) while many other languages use the Unicode
normalization model."

"Under the CSI model, all encodings are handled equally, which means,
Unicode is one of character sets. The most remarkable feature of the
CSI model is that the model does not require a character code
conversion since external and internal character codes are identical.
Thus, the cost for conversion can be eliminated. Besides, we can keep
away from unexpected information loss caused by the conversion,
especially by cutting bits or bytes off. Ruby uses the CSI model, so
do Solaris, Citrus, or other system based on the C library that does
not use __STDC_ISO_10646__."

"Moreover, it is possible to handle various character sets even though
they are not based on Unicode."

skydiver

2014-05-03 01:52:04 +08:00

@timothyqiu Python 3 gets this right.

In [1]: len('𠂊')
Out[1]: 1

timothyqiu

2014-05-03 02:27:31 +08:00

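@skydiver Thanks~ I looked it up, and this should be the default behavior introduced in Python 3.3 (PEP 393).

For versions with 2.1 < Python < 3.3, a build-time option lets you choose UTF-32 instead of UTF-16 as the encoding of unicode (the so-called wide and narrow builds).

Python <= 2.1 only supports UTF-16, or more precisely, only the Unicode BMP.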
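
A quick way to check which kind of build an interpreter is, sketched under the assumption that sys.maxunicode is a good enough indicator (it is 0xFFFF on the old narrow builds and 0x10FFFF on wide builds and on everything from 3.3 on). The function name is_wide_build is just for illustration.

```python
import sys


def is_wide_build() -> bool:
    """True when the interpreter can hold any code point in one str element."""
    return sys.maxunicode > 0xFFFF


print(hex(sys.maxunicode), is_wide_build())

# On 3.3+ (and on old wide builds) a non-BMP character has length 1;
# on an old narrow build the same expression gives 2.
print(len("\U0002008A"))
```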

jakwings

2014-05-03 04:19:18 +08:00

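@blacktulip Ruby 2.1.1p76
irb> "\u00C4".length
=> 1
irb> "\u0041\u0308".length
=> 2

@skydiver Python3.3.5
print(len('\u00C4'))
#=> 1
print(len('\u0041\u0308'))
#=> 2

Looks like combining characters have to be looked up in the character tables and counted as 0... a real weak spot...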
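
A rough Python sketch of that "look it up and count it as 0" idea, using the canonical combining class from the standard unicodedata module. The name visible_length is made up here. This is only an approximation: some marks have combining class 0, and proper character counting needs the grapheme cluster rules of Unicode text segmentation, which this does not implement.

```python
import unicodedata


def visible_length(text: str) -> int:
    """Count code points, treating combining marks as length 0.

    Approximation only: a code point counts as combining when its
    canonical combining class is non-zero.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


print(visible_length("\u00C4"))        # precomposed Ä -> 1
print(visible_length("\u0041\u0308"))  # A + combining diaeresis -> 1
print(len("\u0041\u0308"))             # plain len still says 2
```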

est

2014-05-03 08:53:28 +08:00

@zzNucker You could say JavaScript only supports UCS-2 rather than Unicode.
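
For reference, here is the arithmetic behind the 16-bit units, as a small Python sketch (the function name to_surrogate_pair is invented for this example). A code point above U+FFFF does not fit in one 16-bit unit, so UTF-16 splits it into a high and a low surrogate, and a length measured in 16-bit units reports 2 for it.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a non-BMP code point into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need a surrogate pair")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)   # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits of the offset
    return high, low


high, low = to_surrogate_pair(0x2008A)  # the "𠂊" example from this thread
print(hex(high), hex(low))  # 0xd840 0xdc8a
```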