llama-bench -m <model path>
test \ backend (t/s) | ipex-llm | sycl | vulkan |
---|---|---|---|
pp512 | 458.82 | 192.35 | 64.35 |
tg128 | 7.09 | 6.55 | 11.60 |
4090 (572.60) + X670E, llama-b4820-bin-win-cuda-cu12.4-x64
pp512: 2291.15, tg128: 40.55
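
To make the numbers above easier to compare, here is a minimal sketch that divides every result by one baseline backend. The figures are copied from the results above; the `relative_to` helper and the choice of `sycl` as baseline are my own, purely for illustration.

```python
# Throughput in tokens/s, copied from the llama-bench results above.
results = {
    "ipex-llm": {"pp512": 458.82, "tg128": 7.09},
    "sycl": {"pp512": 192.35, "tg128": 6.55},
    "vulkan": {"pp512": 64.35, "tg128": 11.60},
    "cuda (4090)": {"pp512": 2291.15, "tg128": 40.55},
}


def relative_to(results, baseline):
    """Return each backend's throughput divided by the baseline's, per test."""
    base = results[baseline]
    return {
        name: {test: value / base[test] for test, value in tests.items()}
        for name, tests in results.items()
    }


# "sycl" as the baseline is an arbitrary choice for illustration.
ratios = relative_to(results, "sycl")
for name, tests in ratios.items():
    print(f"{name:12s} pp512 x{tests['pp512']:.2f}  tg128 x{tests['tg128']:.2f}")
```

Of the three non-CUDA backends, vulkan is the slowest at prompt processing (pp512) but the fastest at token generation (tg128).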
https://github.com/ProjectPhysX/OpenCL-Benchmark/releases/download/v1.8/OpenCL-Benchmark-Windows.exe
.-----------------------------------------------------------------------------.
|----------------.------------------------------------------------------------|
| Device ID 0 | Intel(R) Arc(TM) A770 Graphics |
| Device ID 1 | Intel(R) Arc(TM) A750 Graphics |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | Intel(R) Arc(TM) A770 Graphics |
| Device Vendor | Intel(R) Corporation |
| Device Driver | 32.0.101.6559 (Windows) |
| OpenCL Version | OpenCL C 3.0 |
| Compute Units | 512 at 2400 MHz (4096 cores, 19.661 TFLOPs/s) |
| Memory, Cache | 16255 MB VRAM, 16384 KB global / 64 KB local |
| Buffer Limits | 4095 MB global, 4194296 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| FP64 compute not supported |
| FP32 compute 12.196 TFLOPs/s (2/3 ) |
| FP16 compute 18.425 TFLOPs/s ( 1x ) |
| INT64 compute 1.191 TIOPs/s (1/16) |
| INT32 compute 5.687 TIOPs/s (1/4 ) |
| INT16 compute 30.045 TIOPs/s ( 2x ) |
| INT8 compute 29.282 TIOPs/s ( 1x ) |
| Memory Bandwidth ( coalesced read ) 223.97 GB/s |
| Memory Bandwidth ( coalesced write) 432.86 GB/s |
| Memory Bandwidth (misaligned read ) 400.16 GB/s |
| Memory Bandwidth (misaligned write) 438.62 GB/s |
| PCIe Bandwidth (send ) 9.30 GB/s |
| PCIe Bandwidth ( receive ) 9.00 GB/s |
| PCIe Bandwidth ( bidirectional) (Gen4 x16) 9.90 GB/s |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID | 1 |
| Device Name | Intel(R) Arc(TM) A750 Graphics |
| Device Vendor | Intel(R) Corporation |
| Device Driver | 32.0.101.6559 (Windows) |
| OpenCL Version | OpenCL C 3.0 |
| Compute Units | 448 at 2400 MHz (3584 cores, 17.203 TFLOPs/s) |
| Memory, Cache | 8095 MB VRAM, 16384 KB global / 64 KB local |
| Buffer Limits | 3967 MB global, 4062248 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| FP64 compute not supported |
| FP32 compute 10.693 TFLOPs/s (2/3 ) |
| FP16 compute 16.177 TFLOPs/s ( 1x ) |
| INT64 compute 1.090 TIOPs/s (1/16) |
| INT32 compute 5.043 TIOPs/s (1/3 ) |
| INT16 compute 26.553 TIOPs/s ( 2x ) |
| INT8 compute 26.611 TIOPs/s ( 2x ) |
| Memory Bandwidth ( coalesced read ) 210.06 GB/s |
| Memory Bandwidth ( coalesced write) 434.85 GB/s |
| Memory Bandwidth (misaligned read ) 399.86 GB/s |
| Memory Bandwidth (misaligned write) 441.22 GB/s |
| PCIe Bandwidth (send ) 9.35 GB/s |
| PCIe Bandwidth ( receive ) 9.04 GB/s |
| PCIe Bandwidth ( bidirectional) (Gen4 x16) 9.94 GB/s |
|-----------------------------------------------------------------------------|
|-----------------------------------------------------------------------------|
| Done. Press Enter to exit. |
'-----------------------------------------------------------------------------'
|----------------.------------------------------------------------------------|
| Device ID | 0 |
| Device Name | NVIDIA GeForce RTX 4090 |
| Device Vendor | NVIDIA Corporation |
| Device Driver | 572.60 (Windows) |
| OpenCL Version | OpenCL C 3.0 |
| Compute Units | 128 at 2535 MHz (16384 cores, 83.067 TFLOPs/s) |
| Memory, Cache | 24563 MB VRAM, 3584 KB global / 48 KB local |
| Buffer Limits | 6140 MB global, 64 KB constant |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled. |
| FP64 compute 1.401 TFLOPs/s (1/64) |
| FP32 compute 85.239 TFLOPs/s ( 1x ) |
| FP16 compute 88.567 TFLOPs/s ( 1x ) |
| INT64 compute 4.204 TIOPs/s (1/24) |
| INT32 compute 44.164 TIOPs/s (1/2 ) |
| INT16 compute 38.203 TIOPs/s (1/2 ) |
| INT8 compute 133.384 TIOPs/s ( 2x ) |
| Memory Bandwidth ( coalesced read ) 925.72 GB/s |
| Memory Bandwidth ( coalesced write) 898.38 GB/s |
| Memory Bandwidth (misaligned read ) 923.73 GB/s |
| Memory Bandwidth (misaligned write) 212.93 GB/s |
| PCIe Bandwidth (send ) 15.66 GB/s |
| PCIe Bandwidth ( receive ) 14.80 GB/s |
| PCIe Bandwidth ( bidirectional) (Gen4 x16) 15.24 GB/s |
|-----------------------------------------------------------------------------|
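
The TFLOPs/s figure on each "Compute Units" line matches cores × clock × 2 for all three cards. A rough sketch of that arithmetic follows; counting 2 FLOPs per core per cycle (one fused multiply-add) is my assumption about how the tool derives the figure, not something stated in the output, and `peak_tflops` is a hypothetical helper name.

```python
# Core counts, clocks (MHz), the TFLOPs/s printed on the "Compute Units" line,
# and the measured FP32 result, all copied from the output above.
devices = {
    "Arc A770": {"cores": 4096, "mhz": 2400, "reported": 19.661, "fp32": 12.196},
    "Arc A750": {"cores": 3584, "mhz": 2400, "reported": 17.203, "fp32": 10.693},
    "RTX 4090": {"cores": 16384, "mhz": 2535, "reported": 83.067, "fp32": 85.239},
}


def peak_tflops(cores, mhz, flops_per_cycle=2):
    """Theoretical peak in TFLOPs/s, assuming one FMA (2 FLOPs) per core per cycle."""
    return cores * mhz * 1e6 * flops_per_cycle / 1e12


for name, d in devices.items():
    peak = peak_tflops(d["cores"], d["mhz"])
    share = d["fp32"] / peak
    print(f"{name}: peak {peak:.3f} TFLOPs/s "
          f"(reported {d['reported']}), measured FP32 = {share:.0%} of peak")
```

By this measure both Arc cards reach roughly 62% of the listed peak in the FP32 test, while the 4090 lands slightly above its listed figure.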