
Native numpy and scipy are now available for the M1

    YUX · PRO · Dec 9, 2020 · 8873 views

    https://github.com/conda-forge/miniforge

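    First download the matching Miniforge3 build: OS X arm64 (Apple Silicon).

    Once it is installed you have conda, and numpy, scipy, and the like installed through conda are all native arm64 builds.

    The performance gain is large, whether compared with Rosetta 2 or with an Intel i9.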
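One way to confirm that the interpreter installed this way is a native arm64 build, rather than an x86_64 build running under Rosetta 2, is to ask Python what machine it reports. This is a minimal sketch using only the standard library. The helper names `describe_interpreter` and `is_native_apple_silicon` are made up for illustration, and numpy is reported only if it happens to be installed.

```python
import platform
import sys


def describe_interpreter():
    """Collect basic facts about the running Python (hypothetical helper)."""
    info = {
        "python": platform.python_version(),
        # A native Apple Silicon build should report "arm64" here.
        "machine": platform.machine(),
        "system": platform.system(),
        "executable": sys.executable,
    }
    try:
        import numpy as np  # optional: only reported when installed
        info["numpy"] = np.__version__
    except ImportError:
        info["numpy"] = None
    return info


def is_native_apple_silicon(info):
    """True when the interpreter itself is an arm64 macOS build."""
    return info["system"] == "Darwin" and info["machine"] == "arm64"


if __name__ == "__main__":
    facts = describe_interpreter()
    for key, value in facts.items():
        print(f"{key}: {value}")
    print("native Apple Silicon build:", is_native_apple_silicon(facts))
```

Running it inside the conda environment in question shows which interpreter is active; the `executable` path helps confirm that it really is the Miniforge one.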

    Supplement 1  ·  Dec 9, 2020
    Everyone, please share your own benchmarks 😂
    42 replies    2021-04-23 04:02:49 +08:00
    pb941129
        1
    pb941129  
       Dec 9, 2020 via iPhone
    I'd like to know how much faster it is than MKL-backed numpy on an Intel i9...
    NoobX
        2
    NoobX  
       Dec 9, 2020 via iPhone
    But it tops out at 16 GB...
    Goldilocks
        3
    Goldilocks  
       Dec 9, 2020 via Android
    Looking forward to the benchmarks; my guess is that AVX-512 will crush it.
    felixcode
        4
    felixcode  
    PRO
       Dec 9, 2020 via Android
    My VRAM is bigger than your RAM.
    YUX
        5
    YUX  
    OP
    PRO
       Dec 9, 2020
    @pb941129
    @NoobX
    @Goldilocks
    @felixcode



    I found a numpy benchmark script and ran it: https://gist.github.com/markus-beuckelmann/8bc25531b11158431a5b09a45abd6276

    ```
    Dotted two 4096x4096 matrices in 0.53 s.
    Dotted two vectors of length 524288 in 0.25 ms.
    SVD of a 2048x1024 matrix in 0.59 s.
    Cholesky decomposition of a 2048x2048 matrix in 0.08 s.
    Eigendecomposition of a 2048x2048 matrix in 4.74 s.

    This was obtained using the following Numpy configuration:
    blas_info:
    libraries = ['cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/Users/yux/miniforge3/envs/maths/lib']
    include_dirs = ['/Users/yux/miniforge3/envs/maths/include']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    blas_opt_info:
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    libraries = ['cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/Users/yux/miniforge3/envs/maths/lib']
    include_dirs = ['/Users/yux/miniforge3/envs/maths/include']
    language = c
    lapack_info:
    libraries = ['lapack', 'blas', 'lapack', 'blas']
    library_dirs = ['/Users/yux/miniforge3/envs/maths/lib']
    language = f77
    lapack_opt_info:
    libraries = ['lapack', 'blas', 'lapack', 'blas', 'cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/Users/yux/miniforge3/envs/maths/lib']
    language = c
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    include_dirs = ['/Users/yux/miniforge3/envs/maths/include']
    ```




    P.S. Python version is 3.9.1 (arm64). All background apps were closed during the run.
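For readers who only want a rough feel for what such a script measures, here is a minimal sketch. It is not the linked gist. It times a matrix product with `time.perf_counter` and keeps the best of a few runs. The function name `time_matmul` and the default sizes are assumptions chosen for illustration, and the default is much smaller than the 4096x4096 case above so that it finishes quickly.

```python
import time

import numpy as np


def time_matmul(n=512, repeats=3, seed=0):
    """Return the best wall-clock time (seconds) for an n x n matrix product."""
    rng = np.random.default_rng(seed)
    a = rng.random((n, n))
    b = rng.random((n, n))
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        a @ b  # dispatched to whichever BLAS numpy is linked against
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    n = 1024
    elapsed = time_matmul(n)
    print(f"Dotted two {n}x{n} matrices in {elapsed:.3f} s.")
```

Taking the minimum over several repeats is one common way to reduce the influence of background activity; it does not remove it.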
    pb941129
        6
    pb941129  
       Dec 9, 2020   ❤️ 1
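    @YUX Thx. Here are the results from my 16-inch MBP (i9 model). Background apps were not closed. Environment: Anaconda, Python 3.8. It still looks a bit faster than the M1. (Otherwise Intel really would be in tears.)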

    ```
    Dotted two 4096x4096 matrices in 0.45 s.
    Dotted two vectors of length 524288 in 0.05 ms.
    SVD of a 2048x1024 matrix in 0.32 s.
    Cholesky decomposition of a 2048x2048 matrix in 0.08 s.
    Eigendecomposition of a 2048x2048 matrix in 3.53 s.

    This was obtained using the following Numpy configuration:
    blas_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/Users/xxx/anaconda/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/Users/xxx/anaconda/include']
    blas_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/Users/xxx/anaconda/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/Users/xxx/anaconda/include']
    lapack_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/Users/xxx/anaconda/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/Users/xxx/anaconda/include']
    lapack_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/Users/xxx/anaconda/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/Users/xxx/anaconda/include']

    ```
    changepc90
        7
    changepc90  
       Dec 9, 2020
    M1: Dotted two vectors of length 524288 in 0.25 ms.
    MBP16: Dotted two vectors of length 524288 in 0.05 ms.
    That's a big gap on this one item.
    YUX
        8
    YUX  
    OP
    PRO
       Dec 9, 2020
    @pb941129 Nice, the i9 is still stronger 😂 Were all 8 cores and 16 threads maxed out during the run?
    YUX
        9
    YUX  
    OP
    PRO
       Dec 9, 2020
    @changepc90 That's probably down to the difference in instruction sets.
    Aspector
        10
    Aspector  
       Dec 9, 2020   ❤️ 1
    i7-8550U in a T480s; the library is mkl_rt

    Dotted two 4096x4096 matrices in 1.07 s.
    Dotted two vectors of length 524288 in 0.13 ms.
    SVD of a 2048x1024 matrix in 0.53 s.
    Cholesky decomposition of a 2048x2048 matrix in 0.15 s.
    Eigendecomposition of a 2048x2048 matrix in 5.07 s.

    HWMonitor shows the 8550U drawing roughly 40-45 W in real time; the M1 is probably only around 20 W (sad).
    YUX
        11
    YUX  
    OP
    PRO
       Dec 9, 2020
    Sharing results from a friend's 16-inch with a 2.6 GHz 6-core Intel Core i7

    Dotted two 4096x4096 matrices in 0.49 s.
    Dotted two vectors of length 524288 in 0.05 ms.
    SVD of a 2048x1024 matrix in 0.32 s.
    Cholesky decomposition of a 2048x2048 matrix in 0.07 s.
    Eigendecomposition of a 2048x2048 matrix in 3.16 s.
    YUX
        12
    YUX  
    OP
    PRO
       Dec 9, 2020
    @Aspector The M1 in the Air is limited to 10 W 😂
    pb941129
        13
    pb941129  
       Dec 9, 2020 via iPhone
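    @YUX I didn't check the process monitor, but knowing how numpy tends to behave, I doubt it went that far. We could wait until lightgbm is ported and then run its CPU version together (I once ran a hyperparameter search for a small project that pegged an entire 8700K for three hours).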
    rock_cloud
        14
    rock_cloud  
       Dec 9, 2020   ❤️ 1
    2017 iMac 3.4Ghz Intel i5
    Dotted two 4096x4096 matrices in 1.04 s.
    Dotted two vectors of length 524288 in 0.17 ms.
    SVD of a 2048x1024 matrix in 0.58 s.
    Cholesky decomposition of a 2048x2048 matrix in 0.12 s.
    Eigendecomposition of a 2048x2048 matrix in 5.37 s.
    No background apps were closed.
    YUX
        15
    YUX  
    OP
    PRO
       Dec 9, 2020
    @pb941129 A three-hour stress test? Can I run it in the fridge 😂 With no fan I'm afraid it would get scorched.
    sxd96
        16
    sxd96  
       Dec 9, 2020   ❤️ 1
    2018 13-inch MBP, i5-8259U

    Dotted two 4096x4096 matrices in 0.80 s.
    Dotted two vectors of length 524288 in 0.11 ms.
    SVD of a 2048x1024 matrix in 0.35 s.
    Cholesky decomposition of a 2048x2048 matrix in 0.09 s.
    Eigendecomposition of a 2048x2048 matrix in 3.39 s.
    sxd96
        17
    sxd96  
       Dec 9, 2020
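    @sxd96 I feel a little better now. Background apps were not closed here either, and the library is MKL. I did notice that the MBP produces a slight coil whine when the cores are fully loaded. ARM may be slightly behind on this benchmark for now, but in terms of performance per watt it may not be behind at all. For mobile devices I think performance per watt is what matters.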
    Gandum
        18
    Gandum  
       Dec 9, 2020 via iPhone
    It's still an early version. It's winter now, so there's no rush; the fan isn't too noisy. I'll buy one next summer.
    FurN1
        19
    FurN1  
       Dec 9, 2020 via iPhone   ❤️ 1
    Haha, I posted about this five months ago: /t/688402
    rock_cloud
        20
    rock_cloud  
       Dec 9, 2020   ❤️ 1
    Intel Xeon Silver 4114 2.2Ghz
    Dotted two 4096x4096 matrices in 0.60 s.
    Dotted two vectors of length 524288 in 0.04 ms.
    SVD of a 2048x1024 matrix in 0.66 s.
    Cholesky decomposition of a 2048x2048 matrix in 0.26 s.
    Eigendecomposition of a 2048x2048 matrix in 6.67 s.
    YUX
        21
    YUX  
    OP
    PRO
       Dec 9, 2020   ❤️ 1
    @IgniteWhite Way ahead of the curve 😂 It really is a great tool.
    Tilie
        22
    Tilie  
       Dec 9, 2020   ❤️ 1
    Mac mini with 8th-gen i7
    Dotted two 4096x4096 matrices in 0.76 s.
    Dotted two vectors of length 524288 in 0.09 ms.
    SVD of a 2048x1024 matrix in 0.56 s.
    Cholesky decomposition of a 2048x2048 matrix in 0.09 s.
    Eigendecomposition of a 2048x2048 matrix in 5.20 s.
    YUX
        23
    YUX  
    OP
    PRO
       Dec 9, 2020
    Google Colab - 2 Intel(R) Xeon(R) CPU @ 2.20GHz

    Dotted two 4096x4096 matrices in 4.16 s.
    Dotted two vectors of length 524288 in 0.25 ms.
    SVD of a 2048x1024 matrix in 1.49 s.
    Cholesky decomposition of a 2048x2048 matrix in 0.23 s.
    Eigendecomposition of a 2048x2048 matrix in 13.11 s.
    zr86
        24
    zr86  
       Dec 9, 2020
    M1 Mac mini

    Dotted two 4096x4096 matrices in 0.69 s.
    Dotted two vectors of length 524288 in 0.25 ms.
    SVD of a 2048x1024 matrix in 0.68 s.
    Cholesky decomposition of a 2048x2048 matrix in 0.08 s.
    Eigendecomposition of a 2048x2048 matrix in 4.82 s.
    kalimpong
        25
    kalimpong  
       Dec 9, 2020
    M1 MacBook Pro

    Dotted two 4096x4096 matrices in 0.68 s.
    Dotted two vectors of length 524288 in 0.25 ms.
    SVD of a 2048x1024 matrix in 0.71 s.
    Cholesky decomposition of a 2048x2048 matrix in 0.08 s.
    Eigendecomposition of a 2048x2048 matrix in 5.03 s.

    I also measured power draw with powermetrics: about 26 W for the first two items and about 16 W for the last three.
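Combining these power readings with the timings in this reply gives a rough energy-per-task figure, which is the quantity behind the performance-per-watt argument earlier in the thread. This is a back-of-the-envelope sketch: energy is power times time, and it assumes the quoted wattage held steady for the whole of each item, which the measurement does not guarantee.

```python
# Timings (seconds) and power (watts) as quoted in this reply.
RUNS = [
    ("4096x4096 matrix dot", 0.68, 26.0),
    ("vector dot, length 524288", 0.25e-3, 26.0),
    ("SVD 2048x1024", 0.71, 16.0),
    ("Cholesky 2048x2048", 0.08, 16.0),
    ("Eigendecomposition 2048x2048", 5.03, 16.0),
]


def energy_joules(seconds, watts):
    """Energy = power x time, assuming constant power over the interval."""
    return seconds * watts


if __name__ == "__main__":
    total = 0.0
    for name, seconds, watts in RUNS:
        joules = energy_joules(seconds, watts)
        total += joules
        print(f"{name}: {joules:.4g} J")
    print(f"total: {total:.1f} J")
```

The same function can be applied to any other result in the thread that comes with a power reading, though most posts give timings only.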
    lovestudykid
        26
    lovestudykid  
       Dec 10, 2020
    This test doesn't spread the results apart much.
    MF839: only about twice as slow as the OP's M1
    Dotted two 4096x4096 matrices in 2.33 s.
    Dotted two vectors of length 524288 in 0.54 ms.
    SVD of a 2048x1024 matrix in 1.05 s.
    Cholesky decomposition of a 2048x2048 matrix in 0.20 s.
    Eigendecomposition of a 2048x2048 matrix in 8.38 s.


    Intel(R) Xeon(R) Gold 6134
    Dotted two 4096x4096 matrices in 0.32 s.
    Dotted two vectors of length 524288 in 0.05 ms.
    SVD of a 2048x1024 matrix in 0.89 s.
    Cholesky decomposition of a 2048x2048 matrix in 0.15 s.
    Eigendecomposition of a 2048x2048 matrix in 8.19 s.
    The numpy that Anaconda installed by default is not using MKL and does not have AVX-512 enabled, so this CPU is being wasted.
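Whether a given numpy is linked against MKL can be read off the "Numpy configuration" dumps that several replies include (the kind of text `numpy.show_config()` prints). The sketch below is a rough heuristic written for the dump format seen in this thread, not numpy's own logic. The function names are hypothetical, and the generic `blas`/`cblas` names do not by themselves say which implementation sits behind them.

```python
def blas_libraries(config_text):
    """Collect library and framework names from show_config-style text."""
    names = set()
    for line in config_text.splitlines():
        key, sep, value = line.partition("=")
        if not sep:
            continue  # section headers and "NOT AVAILABLE" lines
        if key.strip() in ("libraries", "extra_link_args"):
            # Strip list punctuation and keep the bare tokens.
            for token in value.replace("[", " ").replace("]", " ").split(","):
                token = token.strip().strip("'\"")
                if token:
                    names.add(token)
    return names


def guess_backend(config_text):
    """Very rough classification of the linear algebra backend."""
    names = blas_libraries(config_text)
    if any(n.startswith("mkl") for n in names):
        return "mkl"
    if any("Accelerate" in n for n in names):
        return "accelerate"
    if any("openblas" in n for n in names):
        return "openblas"
    if names & {"blas", "cblas", "lapack"}:
        return "generic blas/lapack names"
    return "unknown"


if __name__ == "__main__":
    sample = "blas_mkl_info:\nlibraries = ['mkl_rt', 'pthread']"
    print(guess_backend(sample))
```

Only `libraries` and `extra_link_args` lines are inspected, so a section header such as `openblas_info:` followed by `NOT AVAILABLE` does not count as a match.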
    pubby
        27
    pubby  
       Dec 10, 2020
    3700X Hackintosh

    Dotted two 4096x4096 matrices in 0.46 s.
    Dotted two vectors of length 524288 in 0.08 ms.
    SVD of a 2048x1024 matrix in 7.37 s.
    Cholesky decomposition of a 2048x2048 matrix in 0.82 s.
    Eigendecomposition of a 2048x2048 matrix in 49.05 s.

    This was obtained using the following Numpy configuration:
    atlas_threads_info:
    NOT AVAILABLE
    blas_opt_info:
    extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
    extra_compile_args = ['-msse3', '-I/AppleInternal/BuildRoot/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX11.0.Internal.sdk/System/Library/Frameworks/vecLib.framework/Headers']
    define_macros = [('NO_ATLAS_INFO', 3)]
    atlas_blas_threads_info:
    NOT AVAILABLE
    openblas_info:
    NOT AVAILABLE
    lapack_opt_info:
    extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
    extra_compile_args = ['-msse3']
    define_macros = [('NO_ATLAS_INFO', 3)]
    atlas_info:
    NOT AVAILABLE
    lapack_mkl_info:
    NOT AVAILABLE
    blas_mkl_info:
    NOT AVAILABLE
    atlas_blas_info:
    NOT AVAILABLE
    mkl_info:
    NOT AVAILABLE


    I'm probably not using it right....
    bnuliujing
        28
    bnuliujing  
       Dec 10, 2020
    Results for an i7-6950X

    Dotted two 4096x4096 matrices in 0.35 s.
    Dotted two vectors of length 524288 in 0.03 ms.
    SVD of a 2048x1024 matrix in 0.27 s.
    Cholesky decomposition of a 2048x2048 matrix in 0.10 s.
    Eigendecomposition of a 2048x2048 matrix in 3.39 s.
    NoobX
        29
    NoobX  
       Dec 10, 2020
    Results for the i5 Mac mini

    Dotted two 4096x4096 matrices in 0.58 s.
    Dotted two vectors of length 524288 in 0.08 ms.
    SVD of a 2048x1024 matrix in 0.32 s.
    Cholesky decomposition of a 2048x2048 matrix in 0.08 s.
    Eigendecomposition of a 2048x2048 matrix in 3.30 s.

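    The M1 results aren't that impressive either...
    The 16 GB memory cap is still a big problem, though. The system usually eats about 4 GB on its own, leaving only 12 GB of the 16 GB for datasets, which honestly isn't enough for me.
    A slightly slower processor is not a big deal, but once swap fills up, the speed is a real nightmare.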
    MisakaTian
        30
    MisakaTian  
       Dec 10, 2020
    As a data grunt, I'll switch as soon as Anaconda is sorted out.
    Goldilocks
        31
    Goldilocks  
       Dec 10, 2020
    Processor Intel(R) Xeon(R) W-2123 CPU @ 3.60GHz, 3600 Mhz, 4 Core

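    Dotted two 4096x4096 matrices in 0.33 s, about twice as fast as the M1. But the M1 has 8 cores. So at the same frequency and the same core count, Intel is still roughly 3-4x faster than the M1, and this is a three-year-old product.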
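The per-core comparison in this reply can be reproduced as simple arithmetic. The sketch below multiplies wall time by core count and compares the products. It assumes perfect scaling across cores and treats all 8 M1 cores as equal, even though the M1 pairs 4 performance cores with 4 efficiency cores. Clock speed is left out because the thread does not give the M1's. The helper names are made up for illustration.

```python
def core_seconds(elapsed_s, cores):
    """Wall time multiplied by core count: a crude per-core cost."""
    return elapsed_s * cores


def per_core_ratio(slow, fast):
    """How many times more core-seconds `slow` needs than `fast`."""
    return core_seconds(*slow) / core_seconds(*fast)


# (seconds for the 4096x4096 product, core count) as quoted in the thread.
M1_OP = (0.53, 8)        # original poster's M1
M1_MINI = (0.69, 8)      # M1 Mac mini result
XEON_W2123 = (0.33, 4)   # this reply

if __name__ == "__main__":
    print(f"{per_core_ratio(M1_OP, XEON_W2123):.2f}x")
    print(f"{per_core_ratio(M1_MINI, XEON_W2123):.2f}x")
```

Both ratios depend entirely on those two assumptions, so they are better read as an illustration of the reply's reasoning than as a measurement.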
    YUX
        32
    YUX  
    OP
    PRO
       Dec 10, 2020 via iPhone
    @MisakaTian Use mamba.
    Goldilocks
        33
    Goldilocks  
       Dec 10, 2020
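    It's 2020. If Intel released a 2-core 3.6 GHz CPU, you certainly wouldn't think much of its performance. You should be thinking about Intel's 10-core and 20-core parts. AMD is about to release a 64-core desktop CPU, and Apple is still at the level of 2 cores.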
    meloyang05
        34
    meloyang05  
       Dec 10, 2020
    @Goldilocks

    “Mac mini with 8th-gen i7
    Dotted two 4096x4096 matrices in 0.76 s.
    Dotted two vectors of length 524288 in 0.09 ms.
    SVD of a 2048x1024 matrix in 0.56 s.
    Cholesky decomposition of a 2048x2048 matrix in 0.09 s.
    Eigendecomposition of a 2048x2048 matrix in 5.20 s.

    M1 Mac mini

    Dotted two 4096x4096 matrices in 0.69 s.
    Dotted two vectors of length 524288 in 0.25 ms.
    SVD of a 2048x1024 matrix in 0.68 s.
    Cholesky decomposition of a 2048x2048 matrix in 0.08 s.
    Eigendecomposition of a 2048x2048 matrix in 4.82 s.”

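    Are you selectively ignoring the other results? Timings at the millisecond level can have large errors to begin with, and numpy for the M1 may also have a bug right now. What does singling out the vector result prove?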
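How noisy a short timing is can be checked directly by repeating it and looking at the spread. This is a minimal sketch using the standard library's `timeit.repeat`. The helper name `timing_spread` and the pure-Python stand-in workload are assumptions for illustration; substituting `np.dot` on real arrays would test numpy itself.

```python
import statistics
import timeit


def timing_spread(func, repeats=7, number=1):
    """Run func several times and summarise how much the timings vary."""
    samples = timeit.repeat(func, repeat=repeats, number=number)
    per_call = [s / number for s in samples]
    best = min(per_call)
    worst = max(per_call)
    return {
        "best_s": best,
        "median_s": statistics.median(per_call),
        "worst_s": worst,
        # Relative gap between slowest and fastest repeat.
        "spread": (worst - best) / best if best > 0 else float("inf"),
    }


if __name__ == "__main__":
    # Stand-in workload: a pure-Python dot product of two vectors.
    x = [float(i) for i in range(100_000)]
    y = [1.0] * len(x)
    stats = timing_spread(lambda: sum(a * b for a, b in zip(x, y)))
    for key, value in stats.items():
        print(f"{key}: {value:.6g}")
```

A large `spread` on a sub-millisecond operation would support the point made in this reply; a small one would support the opposite view.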
    Goldilocks
        35
    Goldilocks  
       Dec 10, 2020
    The error won't be large; it is usually within 1%, because matrix multiplication is bound by only two things:

    1. CPU FLOPS
    2. Memory bandwidth
    Goldilocks
        36
    Goldilocks  
       Dec 10, 2020
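    Numerical computation such as matrix multiplication is a mature field that has been studied thoroughly. See: https://en.wikichip.org/wiki/flops

    Assuming memory bandwidth can keep up with the CPU, the only ways to run faster are:
    1. Add more cores
    2. Widen the SIMD units

    For example, Skylake can reach 64 FLOPs/cycle, while AMD CPUs of the same era manage only 16 FLOPs/cycle. Clock speeds are all about the same, so this 4x factor accounts for most of the gap. And that gap is very hard to close; you could say there is no hope of it in a lifetime.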
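The reasoning in this reply amounts to a simple formula: theoretical peak throughput is core count times clock speed times FLOPs per cycle. The sketch below works that out, and also converts a measured 4096x4096 timing into an achieved rate by counting roughly 2n³ floating-point operations for an n x n product. The FLOPs/cycle figures are the ones quoted in this reply and are not checked here, and the function names are hypothetical.

```python
def peak_gflops(cores, ghz, flops_per_cycle):
    """Theoretical peak = cores x clock (GHz) x FLOPs per cycle per core."""
    return cores * ghz * flops_per_cycle


def matmul_gflops(n, seconds):
    """Achieved rate for an n x n product, counting about 2 * n**3 operations."""
    return 2 * n ** 3 / seconds / 1e9


if __name__ == "__main__":
    # Same core count and clock, different FLOPs/cycle (figures from this reply).
    print(peak_gflops(4, 3.6, 64) / peak_gflops(4, 3.6, 16))
    # Achieved rates for two 4096x4096 timings quoted in the thread.
    for label, seconds in [("M1 (OP)", 0.53), ("Xeon W-2123", 0.33)]:
        print(f"{label}: {matmul_gflops(4096, seconds):.0f} GFLOP/s")
```

Comparing an achieved rate with a theoretical peak only makes sense when both refer to the same floating-point precision, which the thread does not specify.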
    Harry1993
        37
    Harry1993  
       Dec 10, 2020
    I tried it with Apple's numpy (https://github.com/apple/tensorflow_macos):

    Dotted two 4096x4096 matrices in 0.84 s.
    Dotted two vectors of length 524288 in 0.11 ms.
    SVD of a 2048x1024 matrix in 0.54 s.
    Cholesky decomposition of a 2048x2048 matrix in 0.06 s.
    Eigendecomposition of a 2048x2048 matrix in 6.29 s.
    FurN1
        38
    FurN1  
       Dec 10, 2020
    @MisakaTian Isn't miniforge's package manager just conda? The only difference is that the default channel is conda-forge.
    lly0514
        39
    lly0514  
       Dec 11, 2020
    @Goldilocks In practice the variation is very large; in my own measurements the performance gap between MKL and OpenBLAS is more than 2x.
    Richardyyz
        40
    Richardyyz  
       Dec 13, 2020
    @Goldilocks Zen 2 already does 32 FLOPs/cycle. Is your lifetime that short? AVX-512, with its heavy downclocking, doesn't have much of an advantage over Zen 3.
    YUX
        41
    YUX  
    OP
    PRO
       Jan 24, 2021
    Adding one from a Raspberry Pi 😂

    Dotted two 4096x4096 matrices in 10.18 s.
    Dotted two vectors of length 524288 in 2.27 ms.
    SVD of a 2048x1024 matrix in 6.67 s.
    Cholesky decomposition of a 2048x2048 matrix in 0.85 s.
    Eigendecomposition of a 2048x2048 matrix in 37.83 s.

    This was obtained using the following Numpy configuration:
    blas_info:
    libraries = ['cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/root/mambaforge/envs/maths/lib']
    include_dirs = ['/root/mambaforge/envs/maths/include']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    blas_opt_info:
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    libraries = ['cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/root/mambaforge/envs/maths/lib']
    include_dirs = ['/root/mambaforge/envs/maths/include']
    language = c
    lapack_info:
    libraries = ['lapack', 'blas', 'lapack', 'blas']
    library_dirs = ['/root/mambaforge/envs/maths/lib']
    language = f77
    lapack_opt_info:
    libraries = ['lapack', 'blas', 'lapack', 'blas', 'cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/root/mambaforge/envs/maths/lib']
    language = c
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    include_dirs = ['/root/mambaforge/envs/maths/include']
    YRInc
        42
    YRInc  
       Apr 23, 2021
    Here is a Chinese-made chip for reference: the Kunpeng 920

    12-core Kunpeng 920, 24 GB RAM:
    -------------------
    Python 3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 15:45:16)

    Dotted two 4096x4096 matrices in 1.48 s.
    Dotted two vectors of length 524288 in 0.49 ms.
    SVD of a 2048x1024 matrix in 1.10 s.
    Cholesky decomposition of a 2048x2048 matrix in 0.14 s.
    Eigendecomposition of a 2048x2048 matrix in 8.36 s.
    -------------------


    24-core Kunpeng 920, 48 GB RAM:
    -------------------
    Dotted two 4096x4096 matrices in 0.76 s.
    Dotted two vectors of length 524288 in 0.48 ms.
    SVD of a 2048x1024 matrix in 0.93 s.
    Cholesky decomposition of a 2048x2048 matrix in 0.13 s.
    Eigendecomposition of a 2048x2048 matrix in 7.66 s.


    Same environment as on the M1 Mac (Miniforge3); the relevant acceleration libraries are:
    blas_info:
    libraries = ['cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/root/miniforge3/lib']
    include_dirs = ['/root/miniforge3/include']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    blas_opt_info:
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    libraries = ['cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/root/miniforge3/lib']
    include_dirs = ['/root/miniforge3/include']
    language = c
    lapack_info:
    libraries = ['lapack', 'blas', 'lapack', 'blas']
    library_dirs = ['/root/miniforge3/lib']
    language = f77
    lapack_opt_info:
    libraries = ['lapack', 'blas', 'lapack', 'blas', 'cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/root/miniforge3/lib']
    language = c
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    include_dirs = ['/root/miniforge3/include']