llama-bench results on Intel GPUs

12 days ago
HojiOShi

Hardware

Software

Results

llama-bench -m <model path>

test\backend   ipex-llm     sycl   vulkan
pp512            458.82   192.35    64.35
tg128              7.09     6.55    11.60
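
To make the gaps between backends easier to read, here is a minimal sketch that normalizes each row of the table above against the ipex-llm column. The numbers are copied from the table; the unit is assumed to be tokens per second (what llama-bench prints by default), and the names `RESULTS` and `relative_to` are made up for this illustration.

```python
# Throughput copied from the table above (assumed unit: tokens/s).
RESULTS = {
    "pp512": {"ipex-llm": 458.82, "sycl": 192.35, "vulkan": 64.35},
    "tg128": {"ipex-llm": 7.09, "sycl": 6.55, "vulkan": 11.60},
}


def relative_to(results, baseline):
    """Divide every backend's throughput by the baseline backend's, per test."""
    return {
        test: {backend: value / row[baseline] for backend, value in row.items()}
        for test, row in results.items()
    }


if __name__ == "__main__":
    for test, row in relative_to(RESULTS, "ipex-llm").items():
        cells = "  ".join(f"{backend}={r:.2f}x" for backend, r in row.items())
        print(f"{test}: {cells}")
```

Read this way, ipex-llm is roughly 2.4x faster than sycl and about 7.1x faster than vulkan at prompt processing (pp512), while vulkan is about 1.6x faster than ipex-llm at token generation (tg128).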

Reference results for comparison

RTX 4090 (driver 572.60) + X670E, llama-b4820-bin-win-cuda-cu12.4-x64

pp512: 2291.15, tg128: 40.55
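
As a rough way to put the reference numbers in context, the sketch below picks the fastest Intel backend for each test and divides the RTX 4090 figure by it. It assumes both runs used the same model and default llama-bench settings, which the post does not state explicitly; the names `REFERENCE`, `INTEL` and `speedup_over_best` are illustrative only.

```python
# RTX 4090 reference figures and Intel results, both copied from this post.
REFERENCE = {"pp512": 2291.15, "tg128": 40.55}
INTEL = {
    "pp512": {"ipex-llm": 458.82, "sycl": 192.35, "vulkan": 64.35},
    "tg128": {"ipex-llm": 7.09, "sycl": 6.55, "vulkan": 11.60},
}


def speedup_over_best(reference, results):
    """For each test, return (fastest backend, reference / fastest throughput)."""
    out = {}
    for test, row in results.items():
        best = max(row, key=row.get)  # backend with the highest throughput
        out[test] = (best, reference[test] / row[best])
    return out


if __name__ == "__main__":
    for test, (backend, factor) in speedup_over_best(REFERENCE, INTEL).items():
        print(f"{test}: 4090 is {factor:.1f}x the best Intel backend ({backend})")
```

With these inputs the 4090 comes out at roughly 5.0x the best Intel result on pp512 (ipex-llm) and roughly 3.5x on tg128 (vulkan).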

https://github.com/ProjectPhysX/OpenCL-Benchmark/releases/download/v1.8/OpenCL-Benchmark-Windows.exe

.-----------------------------------------------------------------------------.
|----------------.------------------------------------------------------------|
| Device ID    0 | Intel(R) Arc(TM) A770 Graphics                             |
| Device ID    1 | Intel(R) Arc(TM) A750 Graphics                             |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | Intel(R) Arc(TM) A770 Graphics                             |
| Device Vendor  | Intel(R) Corporation                                       |
| Device Driver  | 32.0.101.6559 (Windows)                                    |
| OpenCL Version | OpenCL C 3.0                                               |
| Compute Units  | 512 at 2400 MHz (4096 cores, 19.661 TFLOPs/s)              |
| Memory, Cache  | 16255 MB VRAM, 16384 KB global / 64 KB local               |
| Buffer Limits  | 4095 MB global, 4194296 KB constant                        |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| FP64  compute                                          not supported        |
| FP32  compute                                        12.196 TFLOPs/s (2/3 ) |
| FP16  compute                                        18.425 TFLOPs/s ( 1x ) |
| INT64 compute                                         1.191  TIOPs/s (1/16) |
| INT32 compute                                         5.687  TIOPs/s (1/4 ) |
| INT16 compute                                        30.045  TIOPs/s ( 2x ) |
| INT8  compute                                        29.282  TIOPs/s ( 1x ) |
| Memory Bandwidth ( coalesced read      )                        223.97 GB/s |
| Memory Bandwidth ( coalesced      write)                        432.86 GB/s |
| Memory Bandwidth (misaligned read      )                        400.16 GB/s |
| Memory Bandwidth (misaligned      write)                        438.62 GB/s |
| PCIe   Bandwidth (send                 )                          9.30 GB/s |
| PCIe   Bandwidth (   receive           )                          9.00 GB/s |
| PCIe   Bandwidth (        bidirectional)            (Gen4 x16)    9.90 GB/s |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 1                                                          |
| Device Name    | Intel(R) Arc(TM) A750 Graphics                             |
| Device Vendor  | Intel(R) Corporation                                       |
| Device Driver  | 32.0.101.6559 (Windows)                                    |
| OpenCL Version | OpenCL C 3.0                                               |
| Compute Units  | 448 at 2400 MHz (3584 cores, 17.203 TFLOPs/s)              |
| Memory, Cache  | 8095 MB VRAM, 16384 KB global / 64 KB local                |
| Buffer Limits  | 3967 MB global, 4062248 KB constant                        |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| FP64  compute                                          not supported        |
| FP32  compute                                        10.693 TFLOPs/s (2/3 ) |
| FP16  compute                                        16.177 TFLOPs/s ( 1x ) |
| INT64 compute                                         1.090  TIOPs/s (1/16) |
| INT32 compute                                         5.043  TIOPs/s (1/3 ) |
| INT16 compute                                        26.553  TIOPs/s ( 2x ) |
| INT8  compute                                        26.611  TIOPs/s ( 2x ) |
| Memory Bandwidth ( coalesced read      )                        210.06 GB/s |
| Memory Bandwidth ( coalesced      write)                        434.85 GB/s |
| Memory Bandwidth (misaligned read      )                        399.86 GB/s |
| Memory Bandwidth (misaligned      write)                        441.22 GB/s |
| PCIe   Bandwidth (send                 )                          9.35 GB/s |
| PCIe   Bandwidth (   receive           )                          9.04 GB/s |
| PCIe   Bandwidth (        bidirectional)            (Gen4 x16)    9.94 GB/s |
|-----------------------------------------------------------------------------|
|-----------------------------------------------------------------------------|
| Done. Press Enter to exit.                                                  |
'-----------------------------------------------------------------------------'

|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | NVIDIA GeForce RTX 4090                                    |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 572.60 (Windows)                                           |
| OpenCL Version | OpenCL C 3.0                                               |
| Compute Units  | 128 at 2535 MHz (16384 cores, 83.067 TFLOPs/s)             |
| Memory, Cache  | 24563 MB VRAM, 3584 KB global / 48 KB local                |
| Buffer Limits  | 6140 MB global, 64 KB constant                             |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| FP64  compute                                         1.401 TFLOPs/s (1/64) |
| FP32  compute                                        85.239 TFLOPs/s ( 1x ) |
| FP16  compute                                        88.567 TFLOPs/s ( 1x ) |
| INT64 compute                                         4.204  TIOPs/s (1/24) |
| INT32 compute                                        44.164  TIOPs/s (1/2 ) |
| INT16 compute                                        38.203  TIOPs/s (1/2 ) |
| INT8  compute                                       133.384  TIOPs/s ( 2x ) |
| Memory Bandwidth ( coalesced read      )                        925.72 GB/s |
| Memory Bandwidth ( coalesced      write)                        898.38 GB/s |
| Memory Bandwidth (misaligned read      )                        923.73 GB/s |
| Memory Bandwidth (misaligned      write)                        212.93 GB/s |
| PCIe   Bandwidth (send                 )                         15.66 GB/s |
| PCIe   Bandwidth (   receive           )                         14.80 GB/s |
| PCIe   Bandwidth (        bidirectional)            (Gen4 x16)   15.24 GB/s |
|-----------------------------------------------------------------------------|
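
The raw OpenCL-Benchmark figures can be compared the same way. As a common rule of thumb, token generation tends to be limited by memory bandwidth and prompt processing by compute, so the sketch below computes the ratio of measured FP16 throughput and coalesced read bandwidth between devices. The values are copied from the output above; the dictionary keys and the `ratio` helper are hypothetical names, and the post does not say which Intel card ran llama-bench, so any link to the llama-bench numbers is only indicative.

```python
# Measured OpenCL-Benchmark figures copied from the output above.
DEVICES = {
    "A770": {"fp16_tflops": 18.425, "read_gbps": 223.97},
    "A750": {"fp16_tflops": 16.177, "read_gbps": 210.06},
    "RTX 4090": {"fp16_tflops": 88.567, "read_gbps": 925.72},
}


def ratio(devices, numerator, denominator, metric):
    """Return how many times larger `metric` is on one device than on another."""
    return devices[numerator][metric] / devices[denominator][metric]


if __name__ == "__main__":
    for intel in ("A770", "A750"):
        fp16 = ratio(DEVICES, "RTX 4090", intel, "fp16_tflops")
        read = ratio(DEVICES, "RTX 4090", intel, "read_gbps")
        print(f"RTX 4090 vs {intel}: FP16 {fp16:.2f}x, coalesced read {read:.2f}x")
```

Against the A770, the 4090 measures about 4.8x the FP16 throughput and about 4.1x the coalesced read bandwidth.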

