Python 中的科学计算模块 SciPy 里的积分函数效率如何？用这个是否比自己用 Fortran 或 C 写积分程序然后用 python 调用要快？

binux

2016-05-10 17:47:41 +08:00

不会

eote

2016-05-10 19:26:26 +08:00

po 要相信自己 [拍肩]

lunaticus7

2016-05-10 19:45:55 +08:00

连接 mkl 试试
自己重写基础数学函数， 99%可能性能会下降

necomancer

2016-05-10 23:21:49 +08:00

不会更快。建议你还是用 mkl 吧。
Anaconda 现在有 mkl 优化。
当然，你如果那么追求速度，试试 numbapro 里的 cuda 优化， Anaconda 也有。显卡加速那效率果断飙升。

yanyuechuixue

2016-05-11 08:42:49 +08:00

@necomancer 好的，谢谢，我的确很追求速度.

justou

2016-05-11 11:16:18 +08:00

效率已经很高了, 但无法满足较为苛刻的计算问题, 毕竟是 python 扩展(套上了 gil 的脚镣)
scipy 的 quad 函数底层是 f77 写的, 它会根据给定的参数决定具体调用哪个 fortran 的 subroutine 积分,
def quad(func, a, b, args=(), full_output=0, epsabs=1.49e-8, epsrel=1.49e-8,
limit=50, points=None, weight=None, wvar=None, wopts=None, maxp1=50,
limlst=50):
if not isinstance(args, tuple):
args = (args,)
if (weight is None):
retval = _quad(func,a,b,args,full_output,epsabs,epsrel,limit,points)
else:
retval = _quad_weight(func,a,b,args,full_output,epsabs,epsrel,limlst,limit,maxp1,weight,wvar,wopts)
首先分有权重跟无权重积分的情形, 这里只看看无权重的情形.无权重时调用_quad 函数,
def _quad(func,a,b,args,full_output,epsabs,epsrel,limit,points):
......
省略多行处理 Inf 的代码
......
if points is None:
if infbounds == 0:
return _quadpack._qagse(func,a,b,args,full_output,epsabs,epsrel,limit)
else:
return _quadpack._qagie(func,bound,infbounds,args,full_output,epsabs,epsrel,limit)
else:
if infbounds != 0:
raise ValueError("Infinity inputs cannot be used with break points.")
else:
nl = len(points)
the_points = numpy.zeros((nl+2,), float)
the_points[:nl] = points
return _quadpack._qagpe(func,a,b,the_points,args,full_output,epsabs,epsrel,limit)
现在可以看到会调用 3 个不同的 fortran 函数 _qagse, _qagie, _qagpe 之一
找到 scipy 源码中的 f77 文件, 定位到 dqagse.f, dqagie.f, dqagpe.f 这几个文件,
发现:
1. dqagse.f 中调用的 subroutine dqk21; dqk21.f 中显示使用的积分方法是 gauss-kronrod rules( https://en.wikipedia.org/wiki/Gauss%E2%80%93Kronrod_quadrature_formula)

2. dqagie.f 中调用的 subroutine dqk15i; dqk15i.f 中显示使用的积分方法也是 gauss-kronrod rules

3. dqagpe.f 中也是调用的 subroutine dqk21

由此可知, 这几个都是用的高斯积分方法, 从数值分析可以知道, 高斯方法的确比梯形法, 辛普森, 龙格库塔法有优越性: 对被积函数的拟合精度高, 收敛快. 从 fortran 代码中可以看出, 里面做了大量的错误检查跟精度的处理, 因为要适用于各种不同的平台, 以及各种不同的奇葩函数.
如果具体问题具体分析, 抛开各种检查的步骤以及平台相关的处理, 只是实现一个积分算法比这种现成的库函数更快还是极有可能的, 我用 c 简单写了一个自适应幸普森积分方法:
double
simp(integrand func, /*in*/
double left_end, /*in*/
double right_end, /*in*/
int n /*in*/
)
{
double result = 0.0;
double result0 = func(left_end) + func(right_end);
double h = (right_end - left_end) / n;
double even_sum = 0.0;
double odd_sum = 0.0;
double x;

for (int i = 0; i < n; ++i){
x = left_end + i*h;
if (i%2){
odd_sum += func(x);
}
else{
even_sum += func(x);
}
}
result = h * (result0 + 2.0 * even_sum + 4.0 * odd_sum) / 3.0;
return result;

}

double
adaptive_simp(integrand func, /*in*/
double left_end, /*in*/
double right_end, /*in*/
double tol /*in*/
)
{
int n = 20;
int max_splits = 20;
double result0, result;

result0 = simp(func, left_end, right_end, n);

for (int i = 0; i < max_splits; ++i){
n = n * 2;
result = simp(func, left_end, right_end, n);
if (fabs(result - result0) < tol)
break;
result0 = result;
}

return result;

并不困难, 很小的两个函数.

scipy 用 quadpack.py 对_quadpack.pyd 做了一层包装, 如果从 scipy import 的 quad 的话, 会比直接调用_quadpack.pyd 中相应的函数多一点点 python 的开销, 在 cython 中使用 quad, _quadpack.pyd 中的_qagse, 以及上面 c 实现的 simp(gcc 添加-O3 --fast-math 选项编译)在给定相同的误差容限下对 sinx(使用 c 标准库 sin)在[0.0, pi]区间积分 100W 次
-------------------------------------------
run quad 1000000 times, use 5.559261s
run _qagse 1000000 times, use 4.717242s
run simp 1000000 times, use 3.680429s
result from quad: 2.0
result from _qagse: 2.0
result from simp: 2.0000000001
-------------------------------------------
可以看到, 结果还是可以的.(没用 fortran 试了, 太久没写生疏了, 而且放到 cython 中还是要转一次 c, 比较麻烦)

另外在上面的 simp c 函数中累加部分可以很容易的用 openmp 并行化,曾经我对一个物体分割后积分,最后所有积分累加,这儿的累加也可以在 cython 中释放掉 gil 利用 openmp 并行化, 充分利用多核. 我那个计算问题一个结果大概要 100W 次积分, 还要放到上万次的迭代中(个人 PC 上算). 当现有的库函数无法满足效率要求时, 手写一个是明智的. 曾经还想过利用 opencl 把这个问题放到显卡上去烧, 但发现难以实现, 因为在迭代的时候总是有太多的逻辑判断跟信息交换, 只把可以纯粹的并行的部分放上去呢, 又会大大增加 CPU 跟 GPU 的信息交换, 效率又会极大的下降, 所以还是压榨的 CPU.

necomancer

2016-05-11 13:08:46 +08:00

@yanyuechuixue
这有个例子， https://docs.continuum.io/numbapro/CUDAJit
有 Nvidia 显卡还是值得试一下的。

Neveroldmilk

2016-05-11 15:28:41 +08:00

扯 CUDA 就扯远了吧。

yanyuechuixue

2016-05-12 16:58:44 +08:00

@justou 非常感谢！也就是 Scipy 里给的函数虽然也是用 Fortran 写的，而且写的人技术也比较好，但因为要适应太多东西，所以加了很多我们用不到的代码，所以自己写的话有可能会快一点。

yanyuechuixue

2016-05-12 16:59:40 +08:00

@necomancer 谢谢，恰好是 NVIDIA 显卡。

justou

2016-05-12 19:03:21 +08:00

@yanyuechuixue 也不叫用不到, 里面每行检测的代码可能都影响着最后结果的正确性跟精确性, 如果真有效率问题就不要把计算密集部分交给 py 解释器, 直接由 c/c++/fortran 这些静态编译语言完成计算, 而且对于积分是最容易并行化处理的问题了, 大多数情况单个积分求和部分没有并行化的必要, 如果是很多积分的结果再累加, 就可以考虑并行了. 如果担心自己写的不够好, 也有很多现成的底层数值计算库可以直接使用, 比如 GNU Scientific Library, C 跟 py 几乎可以完美结合(本来 py 就是一个 C 库), 我喜欢在 cython 中将两者混合使用

srlp

2016-05-13 07:39:04 +08:00

@justou 干货，赞一个

ruoyu0088

2016-05-13 11:36:46 +08:00

帮助文档中有这么一条，因此你可以传递一个编译好的函数过去提高速度。

If the user desires improved integration performance, then f may instead be a ctypes function of the form:

f(int n, double args[n]),

where args is an array of function arguments and n is the length of args. f.argtypes should be set to (c_int, c_double), and f.restype should be (c_double,).

justou

2016-05-14 00:48:34 +08:00

@yanyuechuixue
@ruoyu0088
根据楼上的提示验证了一下(仅一维的情况):
-----------cintegrand.c--------------------------------------------------
#include <math.h>
double cintegrand_simp(double x)
{
// A simple c function
return sin(x);
}

double cintegrand_comp(double x)
{
// A seemed complicated c function
double numerator = x * exp(x) + sin(exp(x)) + pow(x, 3) + pow(x, 5);
double denominator = sin(x) + 1.0 + exp(x);
return sin(numerator / denominator);
}

-------------compile commands--------------------------------------------
gcc -c -O3 --fast-math cintegrand.c -o cintegrand.o
gcc -shared -fPIC -o cintegrand.dll cintegrand.o

--------------py script--------------------------------------------------
import ctypes
from math import exp, sin, pi
from scipy.integrate import quad

def pyfunc_simp(x):
"""A simple python integrand function."""
return sin(x)

def pyfunc_comp(x):
"""A seemed complicated python integrand function."""
numerator = x * exp(x) + sin(exp(x)) + x**3 + x**5
denominator = sin(x) + 1.0 + exp(x)
return sin(numerator / denominator)

def load_1d_cfunc(dll_name, func_name):
"""Load cytpes function."""
dll = ctypes.CDLL(dll_name)
cfunc = dll.__getattr__(func_name)
cfunc.restype = ctypes.c_double
cfunc.argtypes = (ctypes.c_double,)
return cfunc

cfunc_simp = load_1d_cfunc("cintegrand.dll", "cintegrand_simp")
cfunc_comp = load_1d_cfunc("cintegrand.dll", "cintegrand_comp")
width = 50
print "quadrature of a simple python function.".ljust(width), quad(pyfunc_simp, 0.0, pi)
print "quadrature of a simple c function.".ljust(width), quad(cfunc_simp, 0.0, pi)
print "quadrature of a complicated python function.".ljust(width), quad(pyfunc_comp, 0.0, pi)
print "quadrature of a complicated c function.".ljust(width), quad(cfunc_comp, 0.0, pi)

---------------timeit in ipython------------------------------------------
%paste 到 ipython 中得到结果:
quadrature of a simple python function. (2.0, 2.220446049250313e-14)
quadrature of a simple c function. (2.0, 2.220446049250313e-14)
quadrature of a complicated python function. (1.0010180983222745, 8.364968411064218e-09)
quadrature of a complicated c function. (1.0010180983222743, 8.364968390887983e-09)

In [3]: %timeit quad(pyfunc_simp, 0.0, pi)
100000 loops, best of 3: 7 us per loop

In [4]: %timeit quad(cfunc_simp, 0.0, pi)
100000 loops, best of 3: 3.84 us per loop

In [5]: %timeit quad(pyfunc_comp, 0.0, pi)
10000 loops, best of 3: 176 us per loop

In [6]: %timeit quad(cfunc_comp, 0.0, pi)
10000 loops, best of 3: 24.2 us per loop
-----------------------------------------------------------------------------
简单函数有~2x 的速度提升, 这儿这个复杂函数只有~7x 的速度提升.性能区别主要在函数值计算上面
http://docs.scipy.org/doc/scipy-0.15.0/reference/tutorial/integrate.html#faster-integration-using-ctypes

测试环境:gcc (tdm-1) 5.1.0, Python 2.7.9 32bit, scipy 0.17.1

另外 scipy 是 0.15.0 开始才有支持向 scipy.integrate 传入 ctypes function 的:
http://scipy.github.io/devdocs/release.0.15.0.html#scipy-integrate-improvements
开始用 0.14.0 版本的怎么也不对, wrong ctypes function signature...o(╯□╰)o

ruoyu0088

2016-05-14 05:10:47 +08:00

@justou 这么能干啊。那你再试试用 Cython 编译函数。在 Cython 中用 cdef 定义被积分函数，然后定义一个全局变量保存其地址。然后用 ctypes.cast 将地址 cast 成 ctypes 的函数，然后传递给 quad 。这样就可以使用%%cython 在 jupyter notebook 中直接编译被积分函数了。

justou

2016-05-14 10:24:45 +08:00

@ruoyu0088
对 ctypes 不是特别熟悉, ipython notebook 也很少用
翻了下文档, 如果没理解错的话:
--------------------------------------------------------------
In [1]:
%load_ext cythonmagic

In [2]:
%%cython -a
from ctypes import CFUNCTYPE, c_double, cast
from libc.math cimport sin, exp

cdef double cintegrand_simp(double x):
return sin(x)

cdef double cintegrand_comp(double x):
cdef double numerator = x * exp(x) + sin(exp(x)) + x**3 + x**5
cdef double denominator = sin(x) + 1.0 + exp(x)
return sin(numerator / denominator)

func_type = CFUNCTYPE(c_double, c_double)

# get function addresses
cdef int simp_func_addr = <int><void*>cintegrand_simp
cdef int comp_func_addr = <int><void*>cintegrand_comp

# cast the addresses to ctypes functions
simp_integrand = cast(simp_func_addr, func_type)
comp_integrand = cast(comp_func_addr, func_type)

In [3]:
from math import exp, sin, pi
from scipy.integrate import quad

def pyfunc_simp(x):
"""A simple python integrand function."""
return sin(x)

def pyfunc_comp(x):
"""A seemed complicated python integrand function."""
numerator = x * exp(x) + sin(exp(x)) + x**3 + x**5
denominator = sin(x) + 1.0 + exp(x)
return sin(numerator / denominator)

In [5]:
print quad(simp_integrand, 0.0, pi)
print quad(pyfunc_simp, 0.0, pi)
print quad(comp_integrand, 0.0, pi)
print quad(pyfunc_comp, 0.0, pi)
(2.0, 2.220446049250313e-14)
(2.0, 2.220446049250313e-14)
(1.0010180983222745, 8.364968411064218e-09)
(1.0010180983222745, 8.364968411064218e-09)
In [6]:
%timeit quad(simp_integrand, 0.0, pi)
100000 loops, best of 3: 3.55 µs per loop
In [7]:
%timeit quad(pyfunc_simp, 0.0, pi)
100000 loops, best of 3: 7.12 µs per loop
In [8]:
%timeit quad(comp_integrand, 0.0, pi)
10000 loops, best of 3: 32.6 µs per loop
In [9]:
%timeit quad(pyfunc_comp, 0.0, pi)
10000 loops, best of 3: 168 µs per loop
--------------------------------------------------------------
这儿编译好的比之前的稍显慢一点呢? 加上编译器优化选项后还是一样的结果.
翻 ctypes 文档时顺便解决了之前的一个问题:
我用 C 实现了一个算法, 算法需要接收一个可调用的函数对象,之前一直都是在 cython 中完成的,
但是有些时候又想能直接传一个 py 的函数对象过去,之前想尝试的办法是利用 Python.h 里面的
函数得到 py 函数对象的函数指针(scipy 似乎是用的这种方式),
现在发现 ctypes 可以很简单的完成这个任务!ヽ(・∀・)ノ

yanyuechuixue

2016-05-16 08:56:31 +08:00

@ruoyu0088
@justou
厉害厉害！

justou

2016-05-18 11:49:09 +08:00

@ruoyu0088
原来您说的这个技巧写在第二版书里面了啊
拿到书刚看完了 cython 那一章, 收获不小 :-D

sjlinger

2019-03-19 19:03:26 +08:00

@justou 您好，最近在用 Python 调用 Fortran 程序时发现一个很费解的问题。问题是这样的：我用 Fortran 程序写了一个三重循环大概每个循环在 1000 次左右，用 ifort 编译后运行用时 15s 左右，同一个三重循环，稍微改变下适合 f2py 编译及 Python 调用，循环则一点没变，用 f2py 编译完用 Python 调用，这三重循环用时 50s 左右，请问这是为什么？谢谢

justou

2019-03-19 19:57:50 +08:00

@sjlinger 三年前的帖子都被你挖起来了, 重新开个贴, 把问题仔细描述下吧, 附带你的代码, 怎么编译的, 怎么调用的, 怎么测试的都写清楚才知道问题在哪儿, 你这样说是猜不出来的:)