Is there a convenient way in C++ to index a UTF-8 string character by character?

2017-01-03 10:55:41 +08:00

fyyz

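I had always assumed that C++11's std::wstring was designed for UTF-8. After reading the UTF-8 specification, I realized that is not the case.

I have a UTF-8 encoded string, as follows:

abcABC123 中文

I need to index it character by character, roughly like this: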

for(auto c:str)
{
    std::cout << c << std::endl;
}

This is the result I need:

a
b
c
A
B
C
1
2
3
中
文

Is there an elegant way to do this?

====================divider====================

I tried the following; neither works.

#include <iostream>
#include <string>

int main()
{
        std::string str = "abcABC123 中文";
        for(auto c:str)
        {
                std::cout << c << std::endl;
        }
        return 0;
}

# g++ a.cpp -std=c++11
# ./a.out
a
b
c
A
B
C
1
2
3
▒
▒
▒
▒
▒
▒

#include <iostream>
#include <string>

int main()
{
        std::wstring str = L"abcABC123 中文";
        for(auto c:str)
        {
                std::wcout << c << std::endl;
        }
        return 0;
}

# g++ a.cpp -std=c++11
# ./a.out
a
b
c
A
B
C
1
2
3
?
?

14 replies

2017-01-03 11:08:20 +08:00

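1. The former (the std::string version)

C++11 seems to have some facilities for this, but I don't know the details.
I have a C version here; just call utf8_decode once per character.
https://github.com/fy0/python_lite/blob/master/src/deps/utf8_lite.h

2. The latter (the std::wstring version)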

setlocale(LC_CTYPE, "");
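
To show where that call would go, here is a minimal sketch. It assumes a glibc/libstdc++ system like the CentOS 7 machine in the question, with a UTF-8 locale set in the environment. The function name print_each is made up for illustration. The locale has to be set before the first write to std::wcout.

```cpp
#include <clocale>
#include <cstddef>
#include <iostream>
#include <string>

// Prints each wide character on its own line and returns how many
// characters were visited. print_each is a hypothetical helper name.
std::size_t print_each(const std::wstring& str)
{
    // Adopt the locale from the environment (for example en_US.UTF-8).
    // Without this the program stays in the default "C" locale, which
    // cannot convert non-ASCII wide characters for output.
    std::setlocale(LC_CTYPE, "");

    std::size_t visited = 0;
    for (wchar_t c : str) {
        std::wcout << c << L'\n';
        ++visited;
    }
    return visited;
}
```

Calling print_each(L"abcABC123中文") should then print the two Chinese characters instead of question marks, provided the terminal itself uses UTF-8. Mixing std::cout and std::wcout in the same program is best avoided.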

fengjianxinghun

2017-01-03 11:18:08 +08:00

#include <cstdlib>   // exit
#include <string>
#include <vector>

// True if c is a single-byte (ASCII) character: the top bit is 0.
static inline bool u8_is_ascii(char c)
{
    return !(c & (1 << 7));
}

// True if c is a continuation byte (bit pattern 10xxxxxx).
static inline bool u8_is_noascii_char(char c)
{
    return !((c >> 6 & 0x3) ^ 0x2);
}

// Byte length of the multi-byte sequence that c starts, or 0 if c is not a lead byte.
static inline int u8_is_noascii_head(char c)
{
    if (!((c >> 5 & 0x7) ^ 0x6)) return 2;    // 110xxxxx
    if (!((c >> 4 & 0xF) ^ 0xE)) return 3;    // 1110xxxx
    if (!((c >> 3 & 0x1F) ^ 0x1E)) return 4;  // 11110xxx
    return 0;
}

// Number of characters in str, or size_t(-1) if a byte cannot start a character.
static inline size_t u8_len(const std::string& str)
{
    size_t count = 0;
    for (size_t i = 0; i < str.size();) {
        char c = str[i];
        int size = u8_is_noascii_head(c);
        if (size > 0) {
            i += size;
        } else if (u8_is_ascii(c)) {
            ++i;
        } else {
            return static_cast<size_t>(-1);
        }
        ++count;
    }
    return count;
}

// Bytes of the index-th character; an empty string if a malformed byte is hit first.
static inline std::string u8_at(const std::string& str, int index)
{
    int current = 0;
    for (size_t i = 0; i < str.size();) {
        char c = str[i];
        int size = u8_is_noascii_head(c);
        if (size == 0) {
            if (!u8_is_ascii(c)) return std::string();  // stray continuation byte
            size = 1;
        }
        if (current == index) {
            return str.substr(i, size);  // substr clamps a truncated final sequence
        }
        i += size;
        ++current;
    }

    // index is past the last character.
    // console::error is the poster's own logging helper, not part of the standard library.
    console::error("buffer overflow");
    exit(-1);
}

// Splits str into one std::string per character; stops at the first malformed byte.
static inline std::vector<std::string> u8_each_split(const std::string& str)
{
    std::vector<std::string> re;
    for (size_t i = 0; i < str.size();) {
        char c = str[i];
        int size = u8_is_noascii_head(c);
        if (size == 0) {
            if (!u8_is_ascii(c)) break;  // stray continuation byte
            size = 1;
        }
        re.push_back(str.substr(i, size));
        i += size;
    }
    return re;
}

fengjianxinghun

2017-01-03 11:19:49 +08:00

https://github.com/fengjian/fmesh_engine/blob/master/engine/include/utils/str.h

progmboy

2017-01-03 11:27:43 +08:00

UTF8-CPP

shuax

2017-01-03 11:33:10 +08:00

Approach 2 actually works; the characters just aren't being displayed.

fyyz

2017-01-03 11:37:36 +08:00

Looking at the replies above, why is everyone rolling their own? Is there something in the STL or Boost for this? UTF-8 is a very widely used encoding, after all.

fyyz

2017-01-03 11:38:47 +08:00

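@shuax I don't know why it isn't displayed either. The system is CentOS 7 and the locale is en_US.UTF-8. vim displays the Chinese in the string fine, but as soon as I run the program it prints question marks.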

QAPTEAWH

2017-01-03 11:57:48 +08:00

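This is a variable-length multi-byte encoding, so you can't output endl after every byte. A Chinese character usually takes 3 bytes in UTF-8, and you can't use wchar for it either.
It's ugly, so go easy on me:
// utf8::next comes from UTF8-CPP (utf8.h); cout and endl assume using namespace std
for (auto it = str.begin(), it2 = str.begin(); it2 != str.end(); ) {
    utf8::next(it2, str.end());  // advance it2 past one whole code point
    while (it < it2) {           // print the bytes of that code point
        cout << *it;
        ++it;
    }
    cout << endl;
}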
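
The variable-length point can be made concrete without any library. In UTF-8, every byte after the first byte of a character has the bit pattern 10xxxxxx. Counting the bytes that do not match that pattern therefore gives the number of characters. The sketch below assumes well-formed input, and count_code_points is a name made up for illustration.

```cpp
#include <cstddef>
#include <string>

// Counts code points by counting every byte that is NOT a continuation
// byte (continuation bytes look like 10xxxxxx). Assumes well-formed UTF-8.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the string in the question without the space, size() reports 15 bytes while this function reports 11 characters. That is why the byte-wise loop printed six garbled lines for two Chinese characters.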

21grams

2017-01-03 12:00:18 +08:00

Convert the UTF-8 to wide characters first; then you can index by subscript.

2017-01-03 12:00:38 +08:00

@fyyz For the question marks, see point 2 in my earlier reply; it's a locale issue.
Also, doesn't C++11 have u8"asd" string literals? Give that a try.

dynastysea

2017-01-03 16:36:42 +08:00

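for (auto iter = str.begin(); iter != str.end(); )
{
    uint8_t chr = uint8_t(*iter);
    // Make sure a single UTF-8 character is never split.
    // Assumes str is well-formed: a truncated final sequence would push iter past str.end().
    if ((chr >> 7) == 0)
    {
        // Starts with 0: plain ASCII, 1 byte
        iter += 1;
    }
    else if ((chr >> 5) == 0x6)
    {
        // Starts with 110: 1 more byte follows
        iter += 2;
    }
    else if ((chr >> 4) == 0xE)
    {
        // Starts with 1110: 2 more bytes follow
        iter += 3;
    }
    else if ((chr >> 3) == 0x1E)
    {
        // Starts with 11110: 3 more bytes follow
        iter += 4;
    }
    else
    {
        // Should not happen for well-formed input
        iter += 1;
    }
}

The pile of code posted above is all too complicated; this is the way to go: simple, easy to use, and easy to understand.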
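
The loop above only advances the iterator. To get the per-character output the question asks for, the same lead-byte test can drive a split into substrings. This is a sketch, not a validated decoder. The names sequence_length and split_utf8 are made up for illustration, and malformed bytes are simply passed through one at a time.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Byte length of the sequence introduced by lead byte b. Returns 1 for
// ASCII and also 1 for an unexpected byte, so the caller always advances.
static std::size_t sequence_length(unsigned char b)
{
    if ((b >> 7) == 0x00) return 1;  // 0xxxxxxx
    if ((b >> 5) == 0x06) return 2;  // 110xxxxx
    if ((b >> 4) == 0x0E) return 3;  // 1110xxxx
    if ((b >> 3) == 0x1E) return 4;  // 11110xxx
    return 1;                        // stray continuation or invalid byte
}

// Splits a UTF-8 string into one std::string per character.
std::vector<std::string> split_utf8(const std::string& str)
{
    std::vector<std::string> chars;
    for (std::size_t i = 0; i < str.size();) {
        std::size_t len = sequence_length(static_cast<unsigned char>(str[i]));
        // substr clamps to the end of the string, so a truncated final
        // sequence cannot read out of bounds.
        chars.push_back(str.substr(i, len));
        i += len;
    }
    return chars;
}
```

Usage would look like `for (const auto& ch : split_utf8(str)) std::cout << ch << '\n';`, which writes each character's bytes together before the newline.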

wutiantong

2017-01-03 18:28:06 +08:00

@fyyz

http://en.cppreference.com/w/cpp/locale/codecvt

The idea is to convert to UTF-32 (std::u32string), which naturally separates the characters; you can convert back to UTF-8 afterwards if needed.
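
For readers who would rather not depend on the codecvt facilities (they are deprecated since C++17), the conversion to std::u32string can also be written by hand. The following is a minimal sketch under stated assumptions. The name utf8_to_utf32 is made up. Malformed or truncated bytes become U+FFFD. Overlong encodings and surrogate values are not rejected.

```cpp
#include <cstddef>
#include <string>

// Decodes UTF-8 into UTF-32. Malformed bytes are replaced with U+FFFD;
// overlong forms and surrogates are NOT detected (illustrative sketch).
std::u32string utf8_to_utf32(const std::string& in)
{
    const char32_t replacement = 0xFFFD;
    std::u32string out;
    for (std::size_t i = 0; i < in.size();) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len = 1;
        char32_t cp = 0;
        if (lead < 0x80)              { len = 1; cp = lead; }
        else if ((lead >> 5) == 0x06) { len = 2; cp = lead & 0x1F; }
        else if ((lead >> 4) == 0x0E) { len = 3; cp = lead & 0x0F; }
        else if ((lead >> 3) == 0x1E) { len = 4; cp = lead & 0x07; }
        else { out.push_back(replacement); ++i; continue; }  // not a lead byte

        if (i + len > in.size()) {  // sequence cut off at end of input
            out.push_back(replacement);
            break;
        }
        bool ok = true;
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }  // expected 10xxxxxx
            cp = (cp << 6) | (c & 0x3F);  // append 6 payload bits
        }
        if (ok) { out.push_back(cp); i += len; }
        else    { out.push_back(replacement); ++i; }
    }
    return out;
}
```

Once the text is in a std::u32string, operator[] and range-for both step through whole code points, so indexing by character becomes ordinary array indexing.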

lain0

2017-01-03 20:36:13 +08:00

https://stackoverflow.com/questions/26074090/iterating-through-a-utf-8-string-in-c11

try google before you ask here :)

edimetia3d

2017-01-03 20:51:17 +08:00

@wutiantong Agreed.
This is clearly a variable-length encoding problem; using a fixed-width encoding such as UTF-32 or UCS-4 solves it.

```
// needs <codecvt>, <locale> and <string>
char data[] = u8"中文";
std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> utf8_ucs4_cvt;
std::u32string ucs4_cvt_str = utf8_ucs4_cvt.from_bytes(data); // UTF-8 to UCS-4
std::string u8_str_from_ucs4 = utf8_ucs4_cvt.to_bytes(ucs4_cvt_str); // UCS-4 to UTF-8
```
For details, see: http://blog.poxiao.me/p/unicode-character-encoding-conversion-in-cpp11/


https://www.v2ex.com/t/331843
