分享一个空间利用率超高的 Base36 算法

• 请不要在回答技术问题时复制粘贴 AI 生成的内容

This topic created in 661 days ago, the information mentioned may be changed or developed.

很久以前学习 C 语言时写的一个练手项目，在常规的进制转换上做了些改进。

https://github.com/EtherDream/base36_914

进制转换一般分两种。一种将整个输入数据当做一个大数，然后不断模 N 和除 N ，将大数分解成一堆 N 以内的数据，从而实现 BaseN 。这种方案空间效率最高，借助大数库实现也不难，只是大数运算非常耗性能。

另一种方案则是每次取 4 或 8 字节到寄存器中，然后不断模 N 和除 N 进行分解。这种方案性能高，实现简单，但空间利用率可能不高。

本方案 Base36 编码时每次读取 9 字节到一个 u8 和 u64 中，利用除法和取模上小技巧，只需少数计算即可将其分解成 14 个字符。空间利用率为 64.2%，相比 Base36 的理论效率 64.6% 只差 0.5%。（不考虑尾块填充的情况下）

在线演示： https://etherdream.com/base36/

如果存在 BUG 或者有更好的改进方案，欢迎提出建议。

Base36

空间利用率

算法

16 replies • 2024-09-09 13:26:17 +08:00

tool2dx

Sep 7, 2024 via Android

厉害👍🏻，网上大部分代码都是大数除余，很不爽。

V4Exp

Sep 7, 2024

有点意思

wbrobot

Sep 7, 2024

php 是最好的语言

<?php

define('BASE36_ENCODE_TABLE_DEFAULT', array_merge(range('0', '9'), range('a', 'z')));
define('BASE36_DECODE_TABLE_DEFAULT', array_fill(0, 256, 0) + array_combine(range(ord('0'), ord('9')), range(0, 9)) + array_combine(range(ord('a'), ord('z')), range(10, 35)));

function base36_encode_block($plain, $encode_table) {
$lo = 0;
$hi = $plain[8];
for ($i = 0; $i < 8; $i++) {
$lo |= ($plain[$i] << ($i * 8));
}

$D = 36 * 36;
$remainder = $hi * (PHP_INT_MAX % $D + 1) + ($lo % $D);

$code = [];
for ($i = 0; $i < 2; $i++) {
$code[$i] = $encode_table[$remainder % 36];
$remainder = intdiv($remainder, 36);
}

$quotient = $hi * (PHP_INT_MAX / $D) + intdiv($lo, $D) + $remainder;
for ($i = 2; $i < 14; $i++) {
$code[$i] = $encode_table[$quotient % 36];
$quotient = intdiv($quotient, 36);
}

return $code;
}

function base36_decode_block($code, $decode_table) {
$lo = 0;
$hi = 0;

for ($i = 11; $i >= 0; $i--) {
$lo = $lo * 36 + $decode_table[$code[$i]];
}
for ($i = 13; $i >= 12; $i--) {
$hi = $hi * 36 + $decode_table[$code[$i]];
}

$plain = [];
$plain[0] = $lo;

$value = $hi * 18509302102818816 + intdiv($lo, 256);
for ($i = 1; $i < 9; $i++) {
$plain[$i] = ($value >> (($i - 1) * 8)) & 0xFF;
}

return $plain;
}

function base36_encode_last_block($plain, $len, $encode_table) {
$plain_tmp = array_merge(array_fill(0, 9, 0), $plain);
$code = base36_encode_block($plain_tmp, $encode_table);
$code[13] = $encode_table[27 + $len];

return $code;
}

function base36_decode_last_block($code, $decode_table) {
$flag = $decode_table[$code[13]];

if ($flag >= 28) {
$code_tmp = array_slice($code, 0, 13);
$plain_tmp = base36_decode_block($code_tmp, $decode_table);

$len = $flag - 27;
if ($len > 8) {
$len = 8;
}
return array_slice($plain_tmp, 0, $len);
}

return base36_decode_block($code, $decode_table);
}

function base36_encode($plain, $encode_table) {
$code = [];
$len = count($plain);
$src = 0;
$dst = 0;

while ($len >= 9) {
$code_part = base36_encode_block(array_slice($plain, $src, 9), $encode_table);
array_splice($code, $dst, 14, $code_part);
$src += 9;
$dst += 14;
$len -= 9;
}

if ($len > 0) {
$code_part = base36_encode_last_block(array_slice($plain, $src, $len), $len, $encode_table);
array_splice($code, $dst, 14, $code_part);
$dst += 14;
}

return array_slice($code, 0, $dst);
}

function base36_decode($code, $decode_table) {
$plain = [];
$len = count($code);

if ($len <= 0) {
return $plain;
}

$src_last = $len - 14;
$src = 0;
$dst = 0;

while ($src < $src_last) {
$plain_part = base36_decode_block(array_slice($code, $src, 14), $decode_table);
array_splice($plain, $dst, 9, $plain_part);
$src += 14;
$dst += 9;
}
$plain_part = base36_decode_last_block(array_slice($code, $src, 14), $decode_table);
array_splice($plain, $dst, count($plain_part), $plain_part);

return $plain;
}

// Example usage
$plain = array_fill(0, 9, 1);
$encoded = base36_encode($plain, BASE36_ENCODE_TABLE_DEFAULT);
$decoded = base36_decode($encoded, BASE36_DECODE_TABLE_DEFAULT);

print_r($encoded);
print_r($decoded);

?>

nagisaushio

Sep 7, 2024 via Android

base64 空间利用率不是更高吗

nagisaushio

Sep 7, 2024 via Android

好吧，忽略上一条，我理解错了

n0099

Sep 7, 2024

https://github.com/n0099/open-tbm/issues/24

cookii

Sep 7, 2024

Base32 对比 Base64 使用场景上有什么区别啊？

Projection

Sep 7, 2024

没有搞懂你说的空间利用率是什么意思。

假设我每次只读取一个字节的数据而不是 9 字节，那么刚开始会产生一个字符，照你的说法空间利用率不就是 100% 了？

每次读取 9 字节只能保证至少产生 13 个字符（比如刚开始时），有时才会产生 14 个字符。当产生 13 个字符时，你说的空间利用率就是 9 / 13 = 0.69 了。

>>> 36 ** 13 < 256 ** 9 < 36 ** 14
True

你可能是想要找到一个缓存区的大小 m ，满足 `256 ** m >= 36 ** n` 且左右两边占用 bits 差值足够小，但是概念上没有理清。我倒是觉得应该这样算：

```python
import math

base = 36

for m in range(1, 16):
n = int(math.log(256 ** m, 36))
waste = m * 8 - math.log2(36 ** n)
print(f'{m=} bytes, {n=} chars, {waste=} bits')
```

m=1 bytes, n=1 chars, waste=2.830074998557688 bits
m=2 bytes, n=3 chars, waste=0.49022499567306355 bits
m=3 bytes, n=4 chars, waste=3.3202999942307514 bits
m=4 bytes, n=6 chars, waste=0.9804499913461271 bits
m=5 bytes, n=7 chars, waste=3.8105249899038114 bits
m=6 bytes, n=9 chars, waste=1.470674987019187 bits
m=7 bytes, n=10 chars, waste=4.3007499855768785 bits
m=8 bytes, n=12 chars, waste=1.9608999826922542 bits
m=9 bytes, n=13 chars, waste=4.790974981249946 bits
m=10 bytes, n=15 chars, waste=2.451124978365314 bits
m=11 bytes, n=17 chars, waste=0.11127497548069698 bits
m=12 bytes, n=18 chars, waste=2.941349974038374 bits
m=13 bytes, n=20 chars, waste=0.601499971153757 bits
m=14 bytes, n=21 chars, waste=3.431574969711434 bits
m=15 bytes, n=23 chars, waste=1.091724966826817 bits

很明显 9 字节的空间利用率非常低。即便如此浪费的也是几个 bits 而已，在字节这个粒度下感觉空间利用率就是个伪需求。

MoYi123

Sep 7, 2024 via iPhone

base36 encode/decode 不是本来就不用额外内存吗？何来省内存的说法。
另外现在搞这种东西都是用 simd 了，一个个字节处理太慢了

iqoo

Sep 8, 2024

@Projection 写个程序把一个数据包或者一个大文件用 0-9a-z 编码就知道了，膨胀系数越小利用率就越高。

Projection

Sep 8, 2024

@iqoo 任何程序编码出来结果都是一样的，和空间利用率有什么关系

Projection

Sep 8, 2024

我想明白你怎么算的了：

```python
import math

for m in range(1, 16):
n = math.ceil(math.log(256 ** m, 36))
ratio = m / n
print(f'{m=} bytes, {n=} chars, {ratio=:.2%}')
```

m=1 bytes, n=2 chars, ratio=50.00%
m=2 bytes, n=4 chars, ratio=50.00%
m=3 bytes, n=5 chars, ratio=60.00%
m=4 bytes, n=7 chars, ratio=57.14%
m=5 bytes, n=8 chars, ratio=62.50%
m=6 bytes, n=10 chars, ratio=60.00%
m=7 bytes, n=11 chars, ratio=63.64%
m=8 bytes, n=13 chars, ratio=61.54%
m=9 bytes, n=14 chars, ratio=64.29%
m=10 bytes, n=16 chars, ratio=62.50%
m=11 bytes, n=18 chars, ratio=61.11%
m=12 bytes, n=19 chars, ratio=63.16%
m=13 bytes, n=21 chars, ratio=61.90%
m=14 bytes, n=22 chars, ratio=63.64%
m=15 bytes, n=24 chars, ratio=62.50%

Projection

Sep 8, 2024

原来是分组的 Base36 ，我还以为是常规的 Base36 ，然后优化内存使用

tty0

Sep 8, 2024 via iPhone

总结来说，Base62 提供了更大的字符集，相比 Base36 能够表示更多的数据量或更大的数值，因此在需要更紧凑的编码或处理更大范围的数据时更为适用。

leonshaw

Sep 8, 2024 via Android

分组就不是 base36 了

hxysnail

Sep 9, 2024

有个疑问，为什么一定要 36 ？ 2 的幂，比如 16 32 64 ，可以直接用位运算实现不香么？