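Has anyone noticed that each Python process runs the same program much slower when several run in parallel than when one runs alone? What causes this?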

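As the title says, I ran into this while testing how ctypes releases the GIL. Even when the GIL is released from C code, N threads do not give N times the computing throughput, even when the threads contend for no shared resources at all. That really surprised me.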

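My guess was that threads always have to synchronize some state with each other. Fine, I thought, multiple processes should be fully isolated. But the results barely changed, which was a real shock.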

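As I understand it, processes are completely independent of each other. If you have enough physical compute resources (my CPU has 8 cores and 16 threads), then 8 independent processes should run without affecting each other's speed. The experiment says otherwise. Can any experts among the V2EX folks explain why? Thanks.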
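
The expectation above depends on the process actually being allowed to use 8 separate cores. A minimal sketch for checking that, assuming only the standard library: `os.cpu_count()` reports logical CPUs (16 on an 8-core/16-thread chip), and on platforms that provide `os.sched_getaffinity` the set it returns may be smaller, for example inside a container. The helper name `usable_cpus` is made up for this illustration.

```python
import os

def usable_cpus():
    """Return (logical_cpus, cpus_this_process_may_use)."""
    logical = os.cpu_count() or 1
    # sched_getaffinity exists only on some platforms (e.g. Linux);
    # fall back to the logical count elsewhere.
    if hasattr(os, "sched_getaffinity"):
        allowed = len(os.sched_getaffinity(0))
    else:
        allowed = logical
    return logical, allowed

if __name__ == "__main__":
    logical, allowed = usable_cpus()
    print(f"logical CPUs: {logical}, usable by this process: {allowed}")
```

Neither number is the physical core count. Two workers scheduled on sibling hardware threads of one core share that core's execution units, so "8 workers" does not necessarily mean "8 cores".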

=====

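The test code is below. Since I can't upload the DLL, it uses a recursive Fibonacci function to simulate a CPU-bound task. Because this pure-Python version holds the GIL, the multithreaded run time grows linearly with the number of threads, but in theory that should not affect multiple processes. The code below uses child processes (pool workers), and I worried that state synchronization with the child processes might be causing the loss, but manually launching several separate processes from a shell gives the same results.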

Both the process pool and the thread pool below are warmed up before timing.

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor, as_completed
import time

def pre_activate(times):
    time.sleep(times)

def execution():
    # CPU-bound stand-in for the real workload: naive recursive Fibonacci.
    def fib(n):
        if n <= 1:
            return 1
        return fib(n - 1) + fib(n - 2)

    for _ in range(20):
        fib(30)

if __name__ == "__main__":

    core_num = 8
    st_time = time.time()
    execution()
    single_execute_time = time.time() - st_time
    print(f"Single thread execute time: {round(single_execute_time,4)} s")

    with ThreadPoolExecutor(max_workers=core_num) as executor:
        # Pre-activate core_num threads in the ThreadPoolExecutor.
        pre_task = [executor.submit(pre_activate, 0.5) for _ in range(core_num)]
        for future in as_completed(pre_task):
            future.result()

        st_time = time.time()
        tasks = [executor.submit(execution) for _ in range(core_num)]
        for future in as_completed(tasks):
            future.result()
        elapsed = time.time() - st_time  # measure once so both printed numbers agree
        print(f"Multi thread execute time: {round(elapsed, 4)} s",
              f", speedup: {round(core_num * single_execute_time / elapsed, 2)} x")

    with ProcessPoolExecutor(max_workers=core_num) as executor:
        # Pre-activate core_num worker processes in the ProcessPoolExecutor.
        pre_task = [executor.submit(pre_activate, 0.5) for _ in range(core_num)]
        for future in as_completed(pre_task):
            future.result()

        st_time = time.time()
        tasks = [executor.submit(execution) for _ in range(core_num)]
        for future in as_completed(tasks):
            future.result()
        elapsed = time.time() - st_time  # measure once so both printed numbers agree
        print(f"Multi Process execute time: {round(elapsed, 4)} s",
              f", speedup: {round(core_num * single_execute_time / elapsed, 2)} x")

My local results:

Single thread execute time: 4.117 s
Multi thread execute time: 32.888 s , speedup: 1.0 x
Multi Process execute time: 12.1088 s , speedup: 2.72 x
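
The manual check mentioned earlier (several unrelated interpreters instead of pool workers) can be scripted roughly as follows. This is only a sketch, not the exact procedure used for the numbers above: `run_independent` and `WORKLOAD` are names invented here, and the measured wall time includes interpreter start-up.

```python
import subprocess
import sys
import time

# Same recursive workload as in the post, run in a fresh interpreter.
# With "python -c CODE a b", sys.argv is ['-c', 'a', 'b'].
WORKLOAD = """
import sys
def fib(n):
    if n <= 1:
        return 1
    return fib(n - 1) + fib(n - 2)
for _ in range(int(sys.argv[2])):
    fib(int(sys.argv[1]))
"""

def run_independent(n_procs, fib_n=30, repeats=20):
    """Start n_procs unrelated interpreters; return wall time until all exit."""
    start = time.perf_counter()
    procs = [
        subprocess.Popen([sys.executable, "-c", WORKLOAD, str(fib_n), str(repeats)])
        for _ in range(n_procs)
    ]
    for p in procs:
        if p.wait() != 0:
            raise RuntimeError("a child interpreter exited with an error")
    return time.perf_counter() - start

if __name__ == "__main__":
    # Smaller workload than in the post so the demo finishes quickly;
    # use fib_n=30, repeats=20 to match the original test.
    one = run_independent(1, fib_n=25, repeats=5)
    many = run_independent(8, fib_n=25, repeats=5)
    print(f"1 process: {one:.3f} s, 8 processes: {many:.3f} s, ratio: {many / one:.2f}")
```

If the processes did not slow each other down, the two wall times would be close to each other on a machine with at least 8 idle cores. A ratio well above 1 means the slowdown is present even without any pool or IPC involved.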

Whatever CPU-bound task I swap in, the speedup rarely gets above 3x, even with 8-way parallelism on 8 cores. Why?
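
One way to narrow this down is to time each task inside its own worker while sweeping the worker count. The sketch below is an assumption about how such a measurement could look, not part of the original test; `timed_task`, `warm_up` and `sweep` are hypothetical helper names. If the mean per-task time grows as workers are added, the tasks themselves are running slower (shared core resources, clock speed and so on) rather than waiting on the pool.

```python
import time
from concurrent.futures import ProcessPoolExecutor

def fib(n):
    if n <= 1:
        return 1
    return fib(n - 1) + fib(n - 2)

def timed_task(fib_n, repeats):
    """Run the workload; return how long this one task took inside its worker."""
    start = time.perf_counter()
    for _ in range(repeats):
        fib(fib_n)
    return time.perf_counter() - start

def warm_up(seconds):
    time.sleep(seconds)

def sweep(worker_counts=(1, 2, 4, 8), fib_n=30, repeats=20, warmup_s=0.5):
    """Run one task per worker for each worker count.

    Returns rows of (workers, wall_seconds, mean_task_seconds).
    """
    rows = []
    for workers in worker_counts:
        with ProcessPoolExecutor(max_workers=workers) as pool:
            # Best-effort warm-up in the spirit of the post's pre_activate:
            # keep the workers busy briefly before the timed section.
            list(pool.map(warm_up, [warmup_s] * workers))
            start = time.perf_counter()
            per_task = list(pool.map(timed_task, [fib_n] * workers, [repeats] * workers))
            wall = time.perf_counter() - start
        rows.append((workers, wall, sum(per_task) / len(per_task)))
    return rows

if __name__ == "__main__":
    # Smaller workload than in the post so the demo finishes quickly;
    # call sweep() with its defaults to match the original test.
    for workers, wall, mean_task in sweep(fib_n=25, repeats=5, warmup_s=0.1):
        print(f"{workers} workers: wall {wall:.3f} s, mean per-task {mean_task:.3f} s")
```

Comparing the rows for 1, 2, 4 and 8 workers shows where the scaling starts to fall off, which a single 8-worker measurement cannot.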

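This result also reminds me of some earlier benchmarking experience. For example, since the async era began, a web service deployed on gunicorn with a single thread can usually handle more than 20,000 echo requests per second, but multi-process prefork only raises that figure by 2 to 2.5 times, not much more. I never looked into it back then, but now it doesn't seem right.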