[Repost] Mainstream CPU Performance Comparison (Hygon 7280, Intel, AMD, Kunpeng 920, Phytium 2500)
Source: reposted from the web; contact us for removal in case of infringement | Author: 毛豆 | Published: 2024-06-17

Note

I came across this article comparing CPU performance and am reposting it for reference. The original is at

https://plantegg.github.io/2022/01/13/%E4%B8%8D%E5%90%8CCPU%E6%80%A7%E8%83%BD%E5%A4%A7PK/

The article is dated 2022-01-13, so many of its conclusions should be treated as reference only.

Related article: Mainstream CPU performance survey (Intel/AMD/Kunpeng/Hygon/Phytium) - https://zhuanlan.zhihu.com/p/540655373

The author's conclusions come first (read them carefully; they represent the author's personal opinion only).

Conclusions


  • AMD has the best single-core benchmark scores

  • In the MySQL query scenario, Intel performs much better

  • xdb outperforms the community edition

  • MySQL 8.0 beats 5.7 in multi-core lock-contention scenarios

  • Intel is best overall; AMD comes close to Intel; Hygon lags far behind Intel but is still much better than Kunpeng; Phytium is worst, and its cross-socket behavior is simply disastrous

  • Kylin OS also performs slightly worse than CentOS

  • From the perf counters, Kunpeng 920's L1d hit rate is higher than the 8163's because Kunpeng's L1 is larger; its L2 hit rate is lower than the 8163's, likewise because Kunpeng's L2 is smaller. Kunpeng's L1i is also larger than the 8163's, yet its measured L1i miss rate is higher, suggesting the ARM core uses its L1i less efficiently

Overall, with a one-generation process lead (7nm vs 14nm), AMD can finally approach Intel in the MySQL query scenario, but Hygon, Kunpeng, and Phytium still fall short.

Preface

This post compares the performance of the Hygon 7280, Intel, AMD, Kunpeng 920, and Phytium 2500.

| CPU model | Hygon 7280 | AMD 7H12 | AMD 7T83 | Intel 8163 | Kunpeng 920 | FT2500 | Yitian 710 |
|---|---|---|---|---|---|---|---|
| Physical cores | 32 | 32 | 64 | 24 | 48 | 64 | 128 |
| Threads per core | 2 | 2 | 2 | 2 | 1 | 1 | 1 |
| NUMA nodes | 8 | 2 | 4 | 2 | 4 | 16 | 2 |
| L1d | 32K | 32K | 32K | 32K | 64K | 32K | 64K |
| L2 | 512K | 512K | 512K | 1024K | 512K | 2048K | 1024K |

The AMD 7T83 has 8 dies; each die has 32 MB of L3, 4 MiB of L2, 256 KiB each of L1I/L1D, and 8 cores. Both the Zen 2 and Zen 3 generations have a separate IO die.
The Yitian 710 is a single-socket server chip with 2 symmetric dies.

Parameters of the CPUs under comparison

About IPC:

IPC (instructions per cycle) = instructions / cycles, i.e. the number of instructions executed per clock cycle; the higher, the faster the program runs.

Program execution time = instruction count / (frequency × IPC) — for a single core; divide by the core count for multi-core.
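The formula above can be sketched in a few lines (the numbers below are made up for illustration, not measurements from this article):

```python
# Execution-time formula: time = instructions / (frequency * IPC),
# divided by the core count when the work parallelizes perfectly.
def exec_time_seconds(instructions, freq_hz, ipc, cores=1):
    return instructions / (freq_hz * ipc * cores)

# e.g. 1e12 instructions at 2.5 GHz with IPC 4 on a single core:
print(exec_time_seconds(1e12, 2.5e9, 4))  # 100.0 seconds
```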

Hygon 7280

The Hygon 7280 is based on AMD's Zen architecture, with a maximum IPC of 5.



Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
Address sizes:         43 bits physical, 48 bits virtual
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    2
Core(s) per socket:    32
Socket(s):             2
NUMA node(s):          8
Vendor ID:             HygonGenuine
CPU family:            24
Model:                 1
Model name:            Hygon C86 7280 32-core Processor
Stepping:              1
CPU MHz:               2194.586
BogoMIPS:              3999.63
Virtualization:        AMD-V
L1d cache:             2 MiB
L1i cache:             4 MiB
L2 cache:              32 MiB
L3 cache:              128 MiB
NUMA node0 CPU(s):     0-7,64-71
NUMA node1 CPU(s):     8-15,72-79
NUMA node2 CPU(s):     16-23,80-87
NUMA node3 CPU(s):     24-31,88-95
NUMA node4 CPU(s):     32-39,96-103
NUMA node5 CPU(s):     40-47,104-111
NUMA node6 CPU(s):     48-55,112-119
NUMA node7 CPU(s):     56-63,120-127


Architecture notes:

Each CPU has 4 dies; each die has two CCXs (core complexes); each CCX has up to 4 cores (e.g. on the 7280/7285) sharing one L3 cache. Each die has two memory channels, so each CPU has 8 memory channels, and each channel supports up to 2 DIMMs.

Hygon 7-series architecture diagram:

Hardware layout of the Sugon H620-G30A server with Hygon 7280 CPUs (the screenshot only covers Socket 0)

AMD EPYC 7T83 (NC)

A two-socket server, 4 NUMA nodes, Zen 3 architecture.

Details:

#lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                256
On-line CPU(s) list:   0-255
Thread(s) per core:    2
Core(s) per socket:    64
Socket(s):             2
NUMA node(s):          4
Vendor ID:             AuthenticAMD
CPU family:            25
Model:                 1
Model name:            AMD EPYC 7T83 64-Core Processor
Stepping:              1
CPU MHz:               2154.005
CPU max MHz:           2550.0000
CPU min MHz:           1500.0000
BogoMIPS:              5090.93
Virtualization:        AMD-V
L1d cache:             32K
L1i cache:             32K
L2 cache:              512K
L3 cache:              32768K
NUMA node0 CPU(s):     0-31,128-159
NUMA node1 CPU(s):     32-63,160-191
NUMA node2 CPU(s):     64-95,192-223
NUMA node3 CPU(s):     96-127,224-255
#cat /sys/devices/system/cpu/cpu{0,1,8,16,30,31,32,128}/cache/index3/shared_cpu_map
00000000,00000000,00000000,000000ff,00000000,00000000,00000000,000000ff
00000000,00000000,00000000,000000ff,00000000,00000000,00000000,000000ff
00000000,00000000,00000000,0000ff00,00000000,00000000,00000000,0000ff00
00000000,00000000,00000000,00ff0000,00000000,00000000,00000000,00ff0000
00000000,00000000,00000000,ff000000,00000000,00000000,00000000,ff000000
00000000,00000000,00000000,ff000000,00000000,00000000,00000000,ff000000
00000000,00000000,000000ff,00000000,00000000,00000000,000000ff,00000000
00000000,00000000,00000000,000000ff,00000000,00000000,00000000,000000ff
#cat /sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_map
00000000,00000000,00000000,00000001,00000000,00000000,00000000,00000001

Each L3 slice is shared by 8 physical cores (16 hyperthreads), roughly 2 MB per logical CPU; one CPU has 8 L3 slices, 256 MB in total.
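The comma-separated hex masks in the shared_cpu_map output above can be decoded into CPU lists; this sketch converts one mask string into the set of CPU ids sharing a cache:

```python
# Decode a sysfs shared_cpu_map string (comma-separated 32-bit hex groups,
# most significant group first) into the list of CPU ids whose bit is set.
def mask_to_cpus(mask: str):
    bits = int(mask.replace(",", ""), 16)
    return [i for i in range(bits.bit_length()) if bits >> i & 1]

cpus = mask_to_cpus("00000000,00000000,00000000,000000ff,"
                    "00000000,00000000,00000000,000000ff")
print(cpus)  # [0, 1, ..., 7, 128, 129, ..., 135] — one L3 slice's sharers
```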

cat cpu0/cache/index3/shared_cpu_list
0-7,128-135
#cat cpu0/cache/index3/size
32768K
#cat cpu0/cache/index2/shared_cpu_list
0,128
#cat /sys/devices/system/cpu/cpu{0,1,8,16,30,31,32,128}/cache/index3/
0-7,128-135
0-7,128-135
8-15,136-143
16-23,144-151
24-31,152-159
24-31,152-159
32-39,160-167
0-7,128-135

L1D and L1I are 2 MiB each in total, i.e. 32 KB per physical core; a pure nop loop reaches an IPC of 6 (surprisingly high).

perf stat ./cpu/test

 Performance counter stats for process id '449650':

          2,574.29 msec task-clock                #    1.000 CPUs utilized
                 0      context-switches          #    0.000 K/sec
                 0      cpu-migrations            #    0.000 K/sec
                 0      page-faults               #    0.000 K/sec
     8,985,622,182      cycles                    #    3.491 GHz
         4,390,929      stalled-cycles-frontend   #    0.05% frontend cycles idle
     4,387,560,442      stalled-cycles-backend    #   48.83% backend cycles idle
    53,711,907,863      instructions              #    5.98  insn per cycle
                                                  #    0.08  stalled cycles per insn
       418,902,363      branches                  #  162.725 M/sec
            15,036      branch-misses             #    0.00% of all branches

       2.574347594 seconds time elapsed
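The IPC perf reports is simply instructions divided by cycles; recomputing it from the counters above confirms the ~6 insn/cycle figure for the nop loop:

```python
# IPC = instructions / cycles, using the two counters from the perf output.
instructions = 53_711_907_863
cycles = 8_985_622_182
print(round(instructions / cycles, 2))  # 5.98, matching perf's "insn per cycle"
```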


In sysbench the 7T83 tests slightly better than the 7H12, possibly due to ECS vs bare metal, OS differences, and so on.

Test environment: 4.19.91-011.ali4000.alios7.x86_64, 5.7.34-log MySQL Community Server (GPL).

AMD EPYC 7H12 2.5G:

| Cores tested | QPS | IPC | Notes |
|---|---|---|---|
| Single core | 24363 | 0.58 | CPU saturated |
| One HT pair | 33519 | 0.40 | CPU saturated |
| 2 physical cores (0-1) | 48423 | 0.57 | CPU saturated |
| 2 physical cores (0,32), cross-node | 46232 | 0.55 | CPU saturated |
| 2 physical cores (0,64), cross-socket | 45072 | 0.52 | CPU saturated |
| 4 physical cores (0-3) | 97759 | 0.58 | CPU saturated |
| 16 physical cores (0-15) | 367992 | 0.55 | CPU saturated, sys 20%, si 10% |
| 32 physical cores (0-31) | 686998 | 0.51 | CPU saturated, sys 20%, si 12% |
| 64 physical cores (0-63) | 1161079 | 0.50 | CPU above 95%, sys 20%, si 12% |
| 64 physical cores (0-31,64-95) | 964441 | 0.49 | the 32 cores on socket 2 stayed fairly idle; data not meaningful |
| 64 physical cores (0-31,64-95) | 1147846 | 0.48 | mysqld restarted and bound immediately; sysbench on 32-63, so cores 0-31 only reached 89% |

Note: cores were re-bound dynamically with taskset during the test, so stale data may linger in other cores' caches.

When re-binding across sockets with taskset, it takes a long while under load before the task actually migrates to the other socket, i.e. right after taskset the CPU cannot be fully loaded.


numastat -p 437803
Per-node process memory usage (in MBs) for PID 437803 (mysqld)
                           Node 0          Node 1          Node 2
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         1.15            0.00         5403.27
Stack                        0.00            0.00            0.09
Private                   1921.60           16.22        10647.66
----------------  --------------- --------------- ---------------
Total                     1922.75           16.22        16051.02
                           Node 3           Total
                  --------------- ---------------
Huge                         0.00            0.00
Heap                         0.03         5404.45
Stack                        0.00            0.09
Private                     16.20        12601.68
----------------  --------------- ---------------
Total                       16.23        18006.22


AMD EPYC 7H12 (ECS)

AMD EPYC 7H12 64-Core (an ECS instance, not bare metal), with a maximum IPC of 5.

lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                64
On-line CPU(s) list:   0-63
Thread(s) per core:    2
Core(s) per socket:    16
Socket(s):             2
NUMA node(s):          2
Vendor ID:             AuthenticAMD
CPU family:            23
Model:                 49
Model name:            AMD EPYC 7H12 64-Core Processor
Stepping:              0
CPU MHz:               2595.124
BogoMIPS:              5190.24
Virtualization:        AMD-V
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              512K
L3 cache:              16384K
NUMA node0 CPU(s):     0-31
NUMA node1 CPU(s):     32-63

AMD EPYC 7T83 ECS

cd /sys/devices/system/cpu/cpu0
cat cache/index0/size
32K
cat cache/index1/size
32K
cat cache/index2/size
512K
cat cache/index3/size
32768K
lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             1
NUMA node(s):          1
Vendor ID:             AuthenticAMD
CPU family:            25
Model:                 1
Model name:            AMD EPYC 7T83 64-Core Processor
Stepping:              1
CPU MHz:               2545.218
BogoMIPS:              5090.43
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              512K
L3 cache:              32768K
NUMA node0 CPU(s):     0-15


stream:

for i in $(seq 0 15); do echo $i; numactl -C $i -m 0 ./bin/stream -W 
0
STREAM copy latency: 0.68 nanoseconds
STREAM copy bandwidth: 23509.84 MB/sec
STREAM scale latency: 0.69 nanoseconds
STREAM scale bandwidth: 23285.51 MB/sec
STREAM add latency: 0.96 nanoseconds
STREAM add bandwidth: 25043.73 MB/sec
STREAM triad latency: 1.40 nanoseconds
STREAM triad bandwidth: 17121.79 MB/sec
1
STREAM copy latency: 0.68 nanoseconds
STREAM copy bandwidth: 23513.96 MB/sec
STREAM scale latency: 0.68 nanoseconds
STREAM scale bandwidth: 23580.06 MB/sec
STREAM add latency: 0.96 nanoseconds
STREAM add bandwidth: 25049.96 MB/sec
STREAM triad latency: 1.35 nanoseconds
STREAM triad bandwidth: 17741.93 MB/sec
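STREAM's reported bandwidth and per-element latency are two views of the same measurement. Assuming the copy kernel moves 16 bytes per element (an 8-byte double read plus an 8-byte write) and MB means 10^6 bytes, the numbers above are consistent:

```python
# Bandwidth ≈ bytes moved per element / per-element latency.
# Assumption: STREAM copy moves 16 bytes per element (read + write a double).
latency_ns = 0.68                      # copy latency reported above
bandwidth = 16 / (latency_ns * 1e-9)   # bytes per second
print(round(bandwidth / 1e6))  # ≈ 23529 MB/sec, close to the reported 23509.84
```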

Intel 8163

The Intel 8163 used in this comparison is as follows; its maximum IPC is 4:

lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                96
On-line CPU(s) list:   0-95
Thread(s) per core:    2
Core(s) per socket:    24
Socket(s):             2
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
Stepping:              4
CPU MHz:               2499.121
CPU max MHz:           3100.0000
Virtualization:        VT-x
L1i cache:             32K
L2 cache:              1024K
L3 cache:              33792K
NUMA node0 CPU(s):     0-95
-----8269CY
#lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                104
On-line CPU(s) list:   0-103
Thread(s) per core:    2
Core(s) per socket:    26
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz
Stepping:              7
CPU MHz:               3200.000
CPU max MHz:           3800.0000
CPU min MHz:           1200.0000
BogoMIPS:              4998.89
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              36608K
NUMA node0 CPU(s):     0-25,52-77
NUMA node1 CPU(s):     26-51,78-103


Differences between Intel models

The chart below shows MySQL on an 8269CY vs an E5-2682 running the same workload at the same traffic:

CPU utilization difference (in the chart, 8051C is the E5-2682; the others are 8269CY; base frequency also differs by about 30%)

Kunpeng 920



numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
node 0 size: 192832 MB
node 0 free: 146830 MB
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 1 size: 193533 MB
node 1 free: 175354 MB
node 2 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 2 size: 193533 MB
node 2 free: 175718 MB
node 3 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 3 size: 193532 MB
node 3 free: 183643 MB
node distances:
node   0   1   2   3
  0:  10  12  20  22
  1:  12  10  22  24
  2:  20  22  10  12
  3:  22  24  12  10

lscpu
Architecture:          aarch64
Byte Order:            Little Endian
CPU(s):                96
On-line CPU(s) list:   0-95
Thread(s) per core:    1
Core(s) per socket:    48
Socket(s):             2
NUMA node(s):          4
Model:                 0
CPU max MHz:           2600.0000
CPU min MHz:           200.0000
BogoMIPS:              200.00
L1d cache:             64K
L1i cache:             64K
L2 cache:              512K
L3 cache:              24576K
NUMA node0 CPU(s):     0-23
NUMA node1 CPU(s):     24-47
NUMA node2 CPU(s):     48-71
NUMA node3 CPU(s):     72-95
Flags:                 fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm


Phytium 2500

Running nop on the Phytium 2500 only reaches an IPC of 1, but other code reaches 2.33.



lscpu
Architecture:          aarch64
Byte Order:            Little Endian
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    1
Core(s) per socket:    64
Socket(s):             2
NUMA node(s):          16
Model:                 3
BogoMIPS:              100.00
L1d cache:             32K
L1i cache:             32K
L2 cache:              2048K
L3 cache:              65536K
NUMA node0 CPU(s):     0-7
NUMA node1 CPU(s):     8-15
NUMA node2 CPU(s):     16-23
NUMA node3 CPU(s):     24-31
NUMA node4 CPU(s):     32-39
NUMA node5 CPU(s):     40-47
NUMA node6 CPU(s):     48-55
NUMA node7 CPU(s):     56-63
NUMA node8 CPU(s):     64-71
NUMA node9 CPU(s):     72-79
NUMA node10 CPU(s):    80-87
NUMA node11 CPU(s):    88-95
NUMA node12 CPU(s):    96-103
NUMA node13 CPU(s):    104-111
NUMA node14 CPU(s):    112-119
NUMA node15 CPU(s):    120-127
Flags:                 fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid

perf stat ./nop
failed to read counter stalled-cycles-frontend
failed to read counter stalled-cycles-backend
failed to read counter branches

 Performance counter stats for './nop':

    78638.700540      task-clock (msec)         #    0.999 CPUs utilized
            1479      context-switches          #    0.019 K/sec
              55      cpu-migrations            #    0.001 K/sec
              37      page-faults               #    0.000 K/sec
    165127619524      cycles                    #    2.100 GHz
 <not supported>      stalled-cycles-frontend
 <not supported>      stalled-cycles-backend
    165269372437      instructions              #    1.00  insns per cycle
 <not supported>      branches
         3057191      branch-misses             #    0.00% of all branches

    78.692839007 seconds time elapsed

dmidecode -t processor
# dmidecode 3.0
Getting SMBIOS data from sysfs.
SMBIOS 3.2.0 present.
# SMBIOS implementations newer than version 3.0 are not
# fully supported by this version of dmidecode.
Handle 0x0004, DMI type 4, 48 bytes
Processor Information
	Socket Designation: BGA3576
	Type: Central Processor
	Family: <OUT OF SPEC>
	Manufacturer: PHYTIUM
	ID: 00 00 00 00 70 1F 66 22
	Version: S2500
	Voltage: 0.8 V
	External Clock: 50 MHz
	Max Speed: 2100 MHz
	Current Speed: 2100 MHz
	Status: Populated, Enabled
	Upgrade: Other
	L1 Cache Handle: 0x0005
	L2 Cache Handle: 0x0007
	L3 Cache Handle: 0x0008
	Serial Number: N/A
	Asset Tag: No Asset Tag
	Part Number: NULL
	Core Count: 64
	Core Enabled: 64
	Thread Count: 64
	Characteristics:
		64-bit capable
		Multi-Core
		Hardware Thread
		Execute Protection
		Enhanced Virtualization
		Power/Performance Control


Other

2 dies, 2 NUMA nodes.




lscpu
Architecture:          aarch64
Byte Order:            Little Endian
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    1
Core(s) per socket:    128
Socket(s):             1
NUMA node(s):          2
Model:                 0
BogoMIPS:              100.00
L1d cache:             64K
L1i cache:             64K
L2 cache:              1024K
L3 cache:              65536K //64core share
NUMA node0 CPU(s):     0-63
NUMA node1 CPU(s):     64-127
Flags:                 fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh

cat cpu{0,1,8,16,30,31,32,127}/cache/index3/shared_cpu_list
0-63
0-63
0-63
0-63
0-63
0-63
0-63
64-127

grep -E "core|64.000" lat.log
core:0
64.00000 59.653
core:8
64.00000 62.265
core:16
64.00000 59.411
core:24
64.00000 55.836
core:32
64.00000 55.909
core:40
64.00000 56.176
core:48
64.00000 57.240
core:56
64.00000 59.485
core:64
64.00000 131.818
core:72
64.00000 127.182
core:80
64.00000 122.452
core:88
64.00000 121.673
core:96
64.00000 126.533
core:104
64.00000 125.673
core:112
64.00000 124.188
core:120
64.00000 130.202

numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 0 size: 515652 MB
node 0 free: 514913 MB
node 1 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
node 1 size: 516086 MB
node 1 free: 514815 MB
node distances:
node   0   1
  0:  10  15
  1:  15  10

Single-core and HT prime-computation comparison

Judging purely from the physical specs, AMD looks much better than Intel, and its process is one generation (about 2 years) ahead. On single-core parameters it is 2.0 vs 2.5 GHz, but max IPC is 5 vs 4, so the ideal single-core performance works out exactly the same (2.0 × 5 = 2.5 × 4).
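The "ideal single-core performance" above is just frequency times maximum IPC, i.e. peak instructions per second:

```python
# Peak single-core throughput = frequency (GHz) × max IPC, in G insns/sec.
amd   = 2.0 * 5   # AMD: 2.0 GHz, max IPC 5
intel = 2.5 * 4   # Intel: 2.5 GHz, max IPC 4
print(amd, intel, amd == intel)  # 10.0 10.0 True — identical on paper
```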

Public benchmark scores also favor AMD, but how does it perform in practice?

The test command (on every CPU here, two physical cores take exactly half the time of one, so this computation parallelizes perfectly):

taskset -c 1 /usr/bin/sysbench --num-threads=1 --test=cpu --cpu-max-prime=50000 run  //single core: 1 thread bound to one core; HT: 2 threads bound to an HT pair


Results are elapsed time in seconds:

| Test | AMD EPYC 7H12 2.5G CentOS 7.9 | Hygon 7280 2.1GHz CentOS | Hygon 7280 2.1GHz Kylin | Intel 8269 2.50G | Intel 8163 2.50GHz | Intel E5-2682 v4 2.50GHz |
|---|---|---|---|---|---|---|
| Single-core prime 50000 | 59s IPC 0.56 | 77s IPC 0.55 | 89s IPC 0.56 | 83s IPC 0.41 | 105s IPC 0.41 | 109s IPC 0.39 |
| HT prime 50000 | 57s IPC 0.31 | 74s IPC 0.29 | 87s IPC 0.29 | 48s IPC 0.35 | 60s IPC 0.36 | 74s IPC 0.29 |

On the same CPU (i.e. at the same frequency), instruction count ≈ elapsed time × IPC × core count.
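That relation can be sanity-checked on the Hygon 7280 rows above: the single-core run and the HT-pair run execute the same prime workload, so time × frequency × IPC × logical-core-count should come out roughly equal:

```python
# instructions ≈ time × frequency × IPC × logical cores; compare the two
# Hygon 7280 rows, which run the identical prime-50000 workload.
freq = 2.1e9
single  = 77 * freq * 0.55 * 1    # single core: 77 s at IPC 0.55
ht_pair = 74 * freq * 0.29 * 2    # one HT pair: 74 s at IPC 0.29, 2 threads
print(round(ht_pair / single, 2))  # ≈ 1.01, i.e. nearly the same instruction count
```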

The results show that Hygon 7280 single-core compute actually beats the Intel 8163, but hyperthreading is so weak in this scenario that it might as well not exist.

Of course, computing primes is far too simple to represent complex business workloads, so next we look at the MySQL query scenario.

ARM chips are clearly better than x86 at computing primes, presumably thanks to optimized division/remainder instructions.

taskset -c 11 sysbench cpu --threads=1 --events=50000  run
sysbench 1.0.20 (using bundled LuaJIT 2.1.0-beta2)


Results are events completed in 10 seconds:

| Test | FT2500 2.1G | Kunpeng 920-4826 2.6GHz | Intel 8163 2.50GHz | Hygon C86 7280 2.1GHz | AMD 7T83 |
|---|---|---|---|---|---|
| Single-core prime, events in 10s | 21626 IPC 0.89 | 30299 IPC 1.01 | 8435 IPC 0.41 | 10349 IPC 0.63 | 40112 IPC 1.38 |

Comparing MySQL sysbench and tpcc performance

MySQL 5.7.34 Community Edition was deployed on Intel + AliOS and on Hygon 7280 + CentOS, with mysqld bound to a single core and identical load pushing the CPU to 100%, then sysbench point selects were run; HT means mysqld was bound to an HT pair.

sysbench point select

The test command looks like:

sysbench --test='/usr/share/doc/sysbench/tests/db/select.lua' --oltp_tables_count=1 --report-interval=1 --oltp-table-size=10000000  --mysql-port=3307 --mysql-db=sysbench_single --mysql-user=root --mysql-password='Bj6f9g96!@#'  --max-requests=0   --oltp_skip_trx=on --oltp_auto_inc=on  --oltp_range_size=5  --mysql-table-engine=innodb --rand-init=on   --max-time=300 --mysql-host=x86.51 --num-threads=4 run


Results (differences within the test: AMD and Hygon CPUs ran CentOS 7.9, Intel and Kunpeng 920 ran AliOS; xdb means the in-house xdb replacing community MySQL Server; Kylin is a domestic OS). Values are QPS and IPC:

| Cores | AMD EPYC 7H12 2.5G | Hygon 7280 2.1G | Hygon 7280 2.1GHz Kylin | Intel 8269 2.50G | Intel 8163 2.50G | Intel 8163 2.50G XDB5.7 | Kunpeng 920-4826 2.6G | Kunpeng 920-4826 2.6G XDB8.0 | FT2500 alisql 8.0 local --socket |
|---|---|---|---|---|---|---|---|---|---|
| Single core | 24674 0.54 | 13441 0.46 | 10236 0.39 | 28208 0.75 | 25474 0.84 | 29376 0.89 | 9694 0.49 | 8301 0.46 | 3602 0.53 |
| One HT pair | 36157 0.42 | 21747 0.38 | 19417 0.37 | 36754 0.49 | 35894 0.6 | 40601 0.65 | no HT | no HT | no HT |
| 4 physical cores | 94132 0.52 | 49822 0.46 | 38033 0.37 | 90434 0.69 350% | 87254 0.73 | 106472 0.83 | 34686 0.42 | 28407 0.39 | 14232 0.53 |
| 16 physical cores | 325409 0.48 | 171630 0.38 | 134980 0.34 | 371718 0.69 1500% | 332967 0.72 | 446290 0.85 (16 cores beat 4×4!) | 116122 0.35 | 94697 0.33 | 59199 0.6 (8 cores: 31210 0.59) |
| 32 physical cores | 542192 0.43 | 298716 0.37 | 255586 0.33 | 642548 0.64 2700% | 588318 0.67 | 598637 0.81 CPU 2400% | 228601 0.36 | 177424 0.32 | 114020 0.65 |

  • Under Kylin OS the CPU is hard to saturate, only reaching about 90-95%; Kylin ran community MySQL 5.7.29. On Phytium, pay special attention to which socket mysqld sits on; all Phytium numbers above were measured over --sock. Measured over the network, 32-core QPS is 99496 (15% network overhead).

Performance impact of which node's page cache holds the mysqld binary

On Phytium, crossing sockets matters a lot: with the mysqld binary cached on the other socket, performance drops by more than 30%.

On a two-socket Kunpeng 920 server, mysqld was bound to node0 while the mysqld binary was loaded into the page cache of each node in turn, then point selects were run:

| mysqld binary on | node0 | node1 | node2 | node3 |
|---|---|---|---|---|
| QPS | 190120 IPC 0.40 | 182518 IPC 0.39 | 189046 IPC 0.40 | 186533 IPC 0.40 |

The data shows node0 to node1 is quite slow here, surprisingly slower than crossing the socket; put another way, Kunpeng's cross-socket performance is quite good.

Loading the mysqld binary into a particular node's page cache:

systemctl stop mysql-server
vmtouch -e /usr/local/mysql/bin/mysqld
          Files: 1
    Directories: 0
  Evicted Pages: 5916 (23M)
        Elapsed: 0.00322 seconds
vmtouch -v /usr/local/mysql/bin/mysqld
/usr/local/mysql/bin/mysqld
[                                                            ] 0/5916
          Files: 1
    Directories: 0
 Resident Pages: 0/5916  0/23M  0%
        Elapsed: 0.000204 seconds
taskset -c 24 md5sum /usr/local/mysql/bin/mysqld
grep mysqld /proc/`pidof mysqld`/numa_maps  //check which node mysqld is actually mapped on
00400000 default file=/usr/local/mysql/bin/mysqld mapped=3392 active=1 N0=3392 kernelpagesize_kB=4
0199b000 default file=/usr/local/mysql/bin/mysqld anon=10 dirty=10 mapped=134 active=10 N0=134 kernelpagesize_kB=4
01a70000 default file=/usr/local/mysql/bin/mysqld anon=43 dirty=43 mapped=120 active=43 N0=120 kernelpagesize_kB=4


Performance differences from NIC placement and node distance

On Kunpeng 920 + MySQL 5.7 + AliOS, with memory allocation pinned to node0, the process was bound in turn to cores 1, 24, 48, and 72, and sysbench point selects compared:

| | Core 1 | Core 24 | Core 48 | Core 72 |
|---|---|---|---|---|
| QPS | 10800 | 10400 | 7700 | 7700 |

During the tests above, all memory allocated by the business process was restricted to node0 (the NIC-interrupt tests below use the same memory layout):

numa-maps-summary.pl </proc/123853/numa_maps
N0        :      5085548 ( 19.40 GB)
N1        :         4479 (  0.02 GB)
N2        :            1 (  0.00 GB)
active    :            0 (  0.00 GB)
anon      :      5085455 ( 19.40 GB)
dirty     :      5085455 ( 19.40 GB)
kernelpagesize_kB:         2176 (  0.01 GB)
mapmax    :          348 (  0.00 GB)
mapped    :         4626 (  0.02 GB)


As a control, memory was pinned to node3 and the tests repeated:

| | Core 1 | Core 24 | Core 48 | Core 72 |
|---|---|---|---|---|
| QPS | 10500 | 10000 | 8100 | 8000 |

numa-maps-summary.pl </proc/54478/numa_maps
N0        :           16 (  0.00 GB)
N1        :         4401 (  0.02 GB)
N2        :            1 (  0.00 GB)
N3        :      1779989 (  6.79 GB)
active    :            0 (  0.00 GB)
anon      :      1779912 (  6.79 GB)
dirty     :      1779912 (  6.79 GB)
kernelpagesize_kB:         1108 (  0.00 GB)
mapmax    :          334 (  0.00 GB)
mapped    :         4548 (  0.02 GB)


The machine's NIC, eth1, is attached to node0. The two comparisons show that NIC locality matters more than memory crossing nodes: the NIC effect is about 20%, while the memory effect is barely visible (local memory is marginally better, but not noticeably, which can only be explained by a very high cache hit rate).

At this point all softirqs ran on node0. Moving them to node3 raises core-72 QPS to 8500 and makes it very stable, while core-0 QPS drops to around 10000.

Conclusions on NIC softirqs and NIC distance

The test machine used a single NIC, attached to node0.

NIC interrupts consume some CPU. Moving them to cores on another node, tested on Kunpeng 920 with the workload on node3 (all 24 cores): with NIC interrupts on node0 vs node3, QPS was 179000 vs 175000 (putting the interrupts on node0 or on node2, the node closest to node3, made little difference).

With the workload on node0 (all 24 cores), QPS with NIC interrupts on node0 vs node1 was 204000 vs 212000.

tpcc with 1000 warehouses

Results (Hygon 7280 ran on both CentOS 7.9 and Kylin; Kunpeng/Intel CPUs ran AliOS; Kylin is a domestic OS):

tpcc results for 1000 warehouses, in tpmC (NewOrders); where no CPU usage is noted, the CPU was saturated:

| Cores | Intel 8269 2.50G | Intel 8163 2.50G | Hygon 7280 2.1GHz Kylin | Hygon 7280 2.1G CentOS 7.9 | Kunpeng 920-4826 2.6G | Kunpeng 920-4826 2.6G XDB8.0 |
|---|---|---|---|---|---|---|
| 1 physical core | 12392 | 9902 | 4706 | 7011 | 6619 | 4653 |
| One HT pair | 17892 | 15324 | 8950 | 11778 | no HT | no HT |
| 4 physical cores | 51525 | 40877 | 19387 380% | 30046 | 23959 | 20101 |
| 8 physical cores | 100792 | 81799 | 39664 750% | 60086 | 42368 | 40572 |
| 16 physical cores | 160798 (jittery) | 140488 (CPU jittery) | 75013 1400% | 106419 1300-1550% | 70581 1200% | 79844 |
| 24 physical cores | 188051 | 164757 1600-2100% | 100841 1800-2000% | 130815 1600-2100% | 88204 1600% | 115355 |
| 32 physical cores | 195292 | 185171 2000-2500% | 116071 1900-2400% | 142746 1800-2400% | 102089 1900% | 143567 |
| 48 physical cores | 199691 | 195730 2100-2600% | 128188 2100-2800% | 149782 2000-2700% | 116374 2500% | 206055 4500% |

Beyond a certain concurrency, tpcc is limited mainly by locks, so very high core counts add little.

Starting two mysqld instances on Hygon 7280 2.1GHz Kylin, each bound to its own 32 physical cores, exactly doubles the throughput.

The CPU was saturated throughout the tests (exceptions are noted). If IPC cannot rise, performance is necessarily low; hyperthreading raises total throughput but lowers IPC (see the formula earlier). Scenarios with inherently low IPC generally gain more from enabling hyperthreading.

Once the core count reaches 32, community MySQL catches up with xdb; at that point sysbench performs best with 120 client threads (AMD needs 240).

At 32 cores, compare community MySQL on the Hygon 7280 vs the Intel 8163:

Performance metrics of three CPUs

| Metric | AMD EPYC 7H12 2.5G | Hygon 7280 2.1GHz | Intel 8163 CPU @ 2.50GHz |
|---|---|---|---|
| Memory bandwidth (MiB/s) | 12190.50 | 6206.06 | 7474.45 |
| Memory latency (walking a very large array) | 0.334ms | 0.336ms | 0.429ms |

lmbench results

stream mainly measures bandwidth; the latency it reports is the latency while bandwidth is saturated.

lat_mem_rd measures latency for different working-set sizes. In short: stream for bandwidth, lat_mem_rd for latency.

Phytium 2500

Measuring bandwidth and latency with stream: bandwidth steadily decreases and latency increases with NUMA distance; the nearest NUMA node already costs about 10%, matching the distances numactl reports exactly. Cross-socket memory latency is 3× intra-node and bandwidth one third, but socket 1 performs identically to socket 0.

time for i in $(seq 7 8 128); do echo $i; numactl -C $i -m 0 ./bin/stream -W 5 -N 5 -M 64M; done
numactl -C 7 -m 0 ./bin/stream  -W 5 -N 5 -M 64M
STREAM copy latency: 2.84 nanoseconds
STREAM copy bandwidth: 5638.21 MB/sec
STREAM scale latency: 2.72 nanoseconds
STREAM scale bandwidth: 5885.97 MB/sec
STREAM add latency: 2.26 nanoseconds
STREAM add bandwidth: 10615.13 MB/sec
STREAM triad latency: 4.53 nanoseconds
STREAM triad bandwidth: 5297.93 MB/sec
numactl -C 7 -m 1 ./bin/stream  -W 5 -N 5 -M 64M
STREAM copy latency: 3.16 nanoseconds
STREAM copy bandwidth: 5058.71 MB/sec
STREAM scale latency: 3.15 nanoseconds
STREAM scale bandwidth: 5074.78 MB/sec
STREAM add latency: 2.35 nanoseconds
STREAM add bandwidth: 10197.36 MB/sec
STREAM triad latency: 5.12 nanoseconds
STREAM triad bandwidth: 4686.37 MB/sec
numactl -C 7 -m 2 ./bin/stream  -W 5 -N 5 -M 64M
STREAM copy latency: 3.85 nanoseconds
STREAM copy bandwidth: 4150.98 MB/sec
STREAM scale latency: 3.95 nanoseconds
STREAM scale bandwidth: 4054.30 MB/sec
STREAM add latency: 2.64 nanoseconds
STREAM add bandwidth: 9100.12 MB/sec
STREAM triad latency: 6.39 nanoseconds
STREAM triad bandwidth: 3757.70 MB/sec
numactl -C 7 -m 3 ./bin/stream  -W 5 -N 5 -M 64M
STREAM copy latency: 3.69 nanoseconds
STREAM copy bandwidth: 4340.24 MB/sec
STREAM scale latency: 3.62 nanoseconds
STREAM scale bandwidth: 4422.18 MB/sec
STREAM add latency: 2.47 nanoseconds
STREAM add bandwidth: 9704.82 MB/sec
STREAM triad latency: 5.74 nanoseconds
STREAM triad bandwidth: 4177.85 MB/sec

numactl -C 7 -m 7 ./bin/stream  -W 5 -N 5 -M 64M
STREAM copy latency: 3.95 nanoseconds
STREAM copy bandwidth: 4051.51 MB/sec
STREAM scale latency: 3.94 nanoseconds
STREAM scale bandwidth: 4060.63 MB/sec
STREAM add latency: 2.54 nanoseconds
STREAM add bandwidth: 9434.51 MB/sec
STREAM triad latency: 6.13 nanoseconds
STREAM triad bandwidth: 3913.36 MB/sec

numactl -C 7 -m 10 ./bin/stream  -W 5 -N 5 -M 64M
STREAM copy latency: 8.80 nanoseconds
STREAM copy bandwidth: 1817.78 MB/sec
STREAM scale latency: 8.59 nanoseconds
STREAM scale bandwidth: 1861.65 MB/sec
STREAM add latency: 5.55 nanoseconds
STREAM add bandwidth: 4320.68 MB/sec
STREAM triad latency: 13.94 nanoseconds
STREAM triad bandwidth: 1721.76 MB/sec

numactl -C 7 -m 11 ./bin/stream  -W 5 -N 5 -M 64M
STREAM copy latency: 9.27 nanoseconds
STREAM copy bandwidth: 1726.52 MB/sec
STREAM scale latency: 9.31 nanoseconds
STREAM scale bandwidth: 1718.10 MB/sec
STREAM add latency: 5.65 nanoseconds
STREAM add bandwidth: 4250.89 MB/sec
STREAM triad latency: 14.09 nanoseconds
STREAM triad bandwidth: 1703.66 MB/sec

numactl -C 88 -m 11 ./bin/stream  -W 5 -N 5 -M 64M //local numa node tested from the other socket: identical to node0 performance
STREAM copy latency: 2.93 nanoseconds
STREAM copy bandwidth: 5454.67 MB/sec
STREAM scale latency: 2.96 nanoseconds
STREAM scale bandwidth: 5400.03 MB/sec
STREAM add latency: 2.28 nanoseconds
STREAM add bandwidth: 10543.42 MB/sec
STREAM triad latency: 4.52 nanoseconds
STREAM triad bandwidth: 5308.40 MB/sec

numactl -C 7 -m 15 ./bin/stream  -W 5 -N 5 -M 64M
STREAM copy latency: 8.73 nanoseconds
STREAM copy bandwidth: 1831.77 MB/sec
STREAM scale latency: 8.81 nanoseconds
STREAM scale bandwidth: 1815.13 MB/sec
STREAM add latency: 5.63 nanoseconds
STREAM add bandwidth: 4265.21 MB/sec
STREAM triad latency: 13.09 nanoseconds
STREAM triad bandwidth: 1833.68 MB/sec


Comparing lat_mem_rd from cpu7 against node0 and node15: latency grows with the data size, reaching a 3× gap at 64M, consistent with the tests above.

In the output below, the first column is the working-set size (MB) and the second the access latency (ns). Latency typically jumps around the L1/L2/L3 cache sizes; well beyond the L3 size, the latency is simply memory latency.

numactl -C 7 -m 0 ./bin/lat_mem_rd -W 5 -N 5 -t 64M  //-C 7: cpu 7, -m 0: node0, -W: warmup, -t: stride
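The "latency jumps at cache boundaries" pattern described above can be detected mechanically. A hedged sketch (the sample data below is invented for illustration, not from the machines in this article):

```python
# Find working-set sizes where lat_mem_rd latency jumps sharply: each jump
# marks the point where the working set outgrew a cache level.
def find_jumps(samples, ratio=1.5):
    """samples: list of (size_mib, latency_ns); return sizes where latency
    grows by more than `ratio`× versus the previous sample."""
    return [sz for (_, prev), (sz, cur) in zip(samples, samples[1:])
            if cur > prev * ratio]

samples = [(0.02, 1.5), (0.5, 5.0), (8.0, 15.0), (64.0, 90.0)]  # (MiB, ns)
print(find_jumps(samples))  # [0.5, 8.0, 64.0] — candidate L1/L2/L3 boundaries
```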


On the same machine model, with NUMA disabled, both latency and bandwidth are several times worse.

With NUMA disabled the results are highly random, which must depend on where memory happens to be allocated. However, if the machine stays in that state across repeated runs, the fast cores stay fast and the slow cores stay slow: physical-address allocation follows a pattern, and with physical memory largely unchanged, the fast cores simply keep getting closer memory.

Results also vary with machine state (memory utilization).

Kunpeng 920



numactl -C 7 -m 1 ./bin/stream  -W 5 -N 5 -M 64M
STREAM copy latency: 2.05 nanoseconds
STREAM copy bandwidth: 7802.45 MB/sec
STREAM scale latency: 2.08 nanoseconds
STREAM scale bandwidth: 7681.87 MB/sec
STREAM add latency: 2.19 nanoseconds
STREAM add bandwidth: 10954.76 MB/sec
STREAM triad latency: 3.17 nanoseconds
STREAM triad bandwidth: 7559.86 MB/sec

numactl -C 7 -m 2 ./bin/stream  -W 5 -N 5 -M 64M
STREAM copy latency: 3.51 nanoseconds
STREAM copy bandwidth: 4556.86 MB/sec
STREAM scale latency: 3.58 nanoseconds
STREAM scale bandwidth: 4463.66 MB/sec
STREAM add latency: 2.71 nanoseconds
STREAM add bandwidth: 8869.79 MB/sec
STREAM triad latency: 5.92 nanoseconds
STREAM triad bandwidth: 4057.12 MB/sec

numactl -C 7 -m 3 ./bin/stream  -W 5 -N 5 -M 64M
STREAM copy latency: 3.94 nanoseconds
STREAM copy bandwidth: 4064.25 MB/sec
STREAM scale latency: 3.82 nanoseconds
STREAM scale bandwidth: 4188.67 MB/sec
STREAM add latency: 2.86 nanoseconds
STREAM add bandwidth: 8390.70 MB/sec
STREAM triad latency: 4.78 nanoseconds
STREAM triad bandwidth: 5024.25 MB/sec

numactl -C 24 -m 3 ./bin/stream  -W 5 -N 5 -M 64M
STREAM copy latency: 4.10 nanoseconds
STREAM copy bandwidth: 3904.63 MB/sec
STREAM scale latency: 4.03 nanoseconds
STREAM scale bandwidth: 3969.41 MB/sec
STREAM add latency: 3.07 nanoseconds
STREAM add bandwidth: 7816.08 MB/sec
STREAM triad latency: 5.06 nanoseconds
STREAM triad bandwidth: 4738.66 MB/sec


Hygon 7280

Crossing NUMA nodes (here, crossing to a node on the other socket) raises copy latency from about 1.5 to 2.5 ns; considerably better than the Kunpeng 920.



lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
Address sizes:         43 bits physical, 48 bits virtual
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    2
Core(s) per socket:    32
Socket(s):             2
NUMA node(s):          8
Vendor ID:             HygonGenuine
CPU family:            24
Model:                 1
Model name:            Hygon C86 7280 32-core Processor
Stepping:              1
CPU MHz:               2194.586
BogoMIPS:              3999.63
Virtualization:        AMD-V
L1d cache:             2 MiB
L1i cache:             4 MiB
L2 cache:              32 MiB
L3 cache:              128 MiB
NUMA node0 CPU(s):     0-7,64-71
NUMA node1 CPU(s):     8-15,72-79
NUMA node2 CPU(s):     16-23,80-87
NUMA node3 CPU(s):     24-31,88-95
NUMA node4 CPU(s):     32-39,96-103
NUMA node5 CPU(s):     40-47,104-111
NUMA node6 CPU(s):     48-55,112-119
NUMA node7 CPU(s):     56-63,120-127
//core 7 is clearly faster than cores 15, 23, and 31: it accesses node0 memory locally, and there is no memory interleaving across NUMA nodes (across dies)

time for i in $(seq 7 8 64); do echo $i; numactl -C $i -m 0 ./bin/stream -W 5 -N 5 -M 64M; done
7
STREAM copy latency: 1.38 nanoseconds
STREAM copy bandwidth: 11559.53 MB/sec
STREAM scale latency: 1.16 nanoseconds
STREAM scale bandwidth: 13815.87 MB/sec
STREAM add latency: 1.40 nanoseconds
STREAM add bandwidth: 17145.85 MB/sec
STREAM triad latency: 1.44 nanoseconds
STREAM triad bandwidth: 16637.18 MB/sec
15
STREAM copy latency: 1.67 nanoseconds
STREAM copy bandwidth: 9591.77 MB/sec
STREAM scale latency: 1.56 nanoseconds
STREAM scale bandwidth: 10242.50 MB/sec
STREAM add latency: 1.45 nanoseconds
STREAM add bandwidth: 16581.00 MB/sec
STREAM triad latency: 2.00 nanoseconds
STREAM triad bandwidth: 12028.83 MB/sec
23
STREAM copy latency: 1.65 nanoseconds
STREAM copy bandwidth: 9701.49 MB/sec
STREAM scale latency: 1.53 nanoseconds
STREAM scale bandwidth: 10427.98 MB/sec
STREAM add latency: 1.42 nanoseconds
STREAM add bandwidth: 16846.10 MB/sec
STREAM triad latency: 1.97 nanoseconds
STREAM triad bandwidth: 12189.72 MB/sec
31
STREAM copy latency: 1.64 nanoseconds
STREAM copy bandwidth: 9742.86 MB/sec
STREAM scale latency: 1.52 nanoseconds
STREAM scale bandwidth: 10510.80 MB/sec
STREAM add latency: 1.45 nanoseconds
STREAM add bandwidth: 16559.86 MB/sec
STREAM triad latency: 1.92 nanoseconds
STREAM triad bandwidth: 12490.01 MB/sec
39
STREAM copy latency: 2.55 nanoseconds
STREAM copy bandwidth: 6286.25 MB/sec
STREAM scale latency: 2.51 nanoseconds
STREAM scale bandwidth: 6383.11 MB/sec
STREAM add latency: 1.76 nanoseconds
STREAM add bandwidth: 13660.83 MB/sec
STREAM triad latency: 3.68 nanoseconds
STREAM triad bandwidth: 6523.02 MB/sec


If Die interleaving is enabled in the BIOS on this chip, the four dies of a socket are presented to the OS as a single NUMA node:



1 lscpu
2 架构:                           x86_64
3 CPU 运行模式:32-bit, 64-bit
4 字节序:                         Little Endian
5 Address sizes:                   43 bits physical, 48 bits virtual
6 CPU:                             128
7 在线 CPU 列表:0-127
8 每个核的线程数:                 2
9 每个座的核数:                   32
10 座:                             2
11 NUMA 节点:2
12 厂商 ID:HygonGenuine
13 CPU 系列:24
14 型号:                           1
15 型号名称:                       Hygon C86 7280 32-core Processor
16 步进:                           1
17 CPU MHz:2108.234
18 BogoMIPS:                       3999.45
19 虚拟化:                         AMD-V
20 L1d 缓存:2 MiB
21 L1i 缓存:4 MiB
22 L2 缓存:32 MiB
23 L3 缓存:128 MiB
24 //note: this no longer matches the physical topology — the BIOS has Die Interleaving enabled,
25 //so memory is interleaved across the dies within a socket and each socket appears as one big die
26 NUMA 节点0 CPU:0-31,64-95  
27 NUMA 节点1 CPU:32-63,96-127
28 //rerun the STREAM test with die interleaving enabled
29 //result: cores 7/15/23/31 now perform identically, since memory is interleaved within each NUMA node
30 //the four dies in a socket interleave their memory accesses, so the four nodes' latencies are averaged out — all slower than local access
31
32
33 time for i in $(seq 7 8 64); do echo $i; numactl -C $i -m 0 ./bin/stream -W 5 -N 5 -M 64M; done
34 7
35 STREAM copy latency: 1.48 nanoseconds
36 STREAM copy bandwidth: 10782.58 MB/sec
37 STREAM scale latency: 1.20 nanoseconds
38 STREAM scale bandwidth: 13364.38 MB/sec
39 STREAM add latency: 1.46 nanoseconds
40 STREAM add bandwidth: 16408.32 MB/sec
41 STREAM triad latency: 1.53 nanoseconds
42 STREAM triad bandwidth: 15696.00 MB/sec
43 15
44 STREAM copy latency: 1.51 nanoseconds
45 STREAM copy bandwidth: 10601.25 MB/sec
46 STREAM scale latency: 1.24 nanoseconds
47 STREAM scale bandwidth: 12855.87 MB/sec
48 STREAM add latency: 1.46 nanoseconds
49 STREAM add bandwidth: 16382.42 MB/sec
50 STREAM triad latency: 1.53 nanoseconds
51 STREAM triad bandwidth: 15691.48 MB/sec
52 23
53 STREAM copy latency: 1.50 nanoseconds
54 STREAM copy bandwidth: 10700.61 MB/sec
55 STREAM scale latency: 1.27 nanoseconds
56 STREAM scale bandwidth: 12634.63 MB/sec
57 STREAM add latency: 1.47 nanoseconds
58 STREAM add bandwidth: 16370.67 MB/sec
59 STREAM triad latency: 1.55 nanoseconds
60 STREAM triad bandwidth: 15455.75 MB/sec
61 31
62 STREAM copy latency: 1.50 nanoseconds
63 STREAM copy bandwidth: 10637.39 MB/sec
64 STREAM scale latency: 1.25 nanoseconds
65 STREAM scale bandwidth: 12778.99 MB/sec
66 STREAM add latency: 1.46 nanoseconds
67 STREAM add bandwidth: 16420.65 MB/sec
68 STREAM triad latency: 1.61 nanoseconds
69 STREAM triad bandwidth: 14946.80 MB/sec
70 39
71 STREAM copy latency: 2.35 nanoseconds
72 STREAM copy bandwidth: 6807.09 MB/sec
73 STREAM scale latency: 2.32 nanoseconds
74 STREAM scale bandwidth: 6906.93 MB/sec
75 STREAM add latency: 1.63 nanoseconds
76 STREAM add bandwidth: 14729.23 MB/sec
77 STREAM triad latency: 3.36 nanoseconds
78 STREAM triad bandwidth: 7151.67 MB/sec
79 47
80 STREAM copy latency: 2.31 nanoseconds
81 STREAM copy bandwidth: 6938.47 MB/sec


Taking the Huawei TaiShan server (Kunpeng 920) BIOS as an example:

Die Interleaving controls whether die interleaving is enabled. Enabling it makes full use of the system's DDR bandwidth, balances load across the DDR channels, and improves DDR utilization.

Hygon 5280 test data



1 for i in $(seq 0 8 24); do echo $i; numactl -C $i -m 0 ./bin/stream -W 5 -N 5 -M 64M; done
2 0
3 STREAM copy latency: 1.22 nanoseconds
4 STREAM copy bandwidth: 13166.34 MB/sec
5 STREAM scale latency: 1.13 nanoseconds
6 STREAM scale bandwidth: 14166.95 MB/sec
7 STREAM add latency: 1.15 nanoseconds
8 STREAM add bandwidth: 20818.63 MB/sec
9 STREAM triad latency: 1.39 nanoseconds
10 STREAM triad bandwidth: 17211.81 MB/sec
11 8
12 STREAM copy latency: 1.56 nanoseconds
13 STREAM copy bandwidth: 10273.07 MB/sec
14 STREAM scale latency: 1.50 nanoseconds
15 STREAM scale bandwidth: 10701.89 MB/sec
16 STREAM add latency: 1.20 nanoseconds
17 STREAM add bandwidth: 19996.68 MB/sec
18 STREAM triad latency: 1.93 nanoseconds
19 STREAM triad bandwidth: 12443.70 MB/sec
20 16
21 STREAM copy latency: 2.52 nanoseconds
22 STREAM copy bandwidth: 6357.71 MB/sec
23 STREAM scale latency: 2.48 nanoseconds
24 STREAM scale bandwidth: 6454.95 MB/sec
25 STREAM add latency: 1.67 nanoseconds
26 STREAM add bandwidth: 14362.51 MB/sec
27 STREAM triad latency: 3.65 nanoseconds
28 STREAM triad bandwidth: 6572.85 MB/sec
29 24
30 STREAM copy latency: 2.44 nanoseconds
31 STREAM copy bandwidth: 6554.24 MB/sec
32 STREAM scale latency: 2.41 nanoseconds
33 STREAM scale bandwidth: 6642.80 MB/sec
34 STREAM add latency: 1.44 nanoseconds
35 STREAM add bandwidth: 16695.82 MB/sec
36 STREAM triad latency: 3.61 nanoseconds
37 STREAM triad bandwidth: 6639.18 MB/sec
38
39 lscpu
40 架构:                           x86_64
41 CPU 运行模式:32-bit, 64-bit
42 字节序:                         Little Endian
43 Address sizes:                   43 bits physical, 48 bits virtual
44 CPU:                             64
45 在线 CPU 列表:0-63
46 每个核的线程数:                 2
47 每个座的核数:                   16
48 座:                             2
49 NUMA 节点:4
50 厂商 ID:HygonGenuine
51 CPU 系列:24
52 型号:                           1
53 型号名称:                       Hygon C86 5280 16-core Processor
54 步进:                           1
55 Frequency boost:                 enabled
56 CPU MHz:2799.311
57 CPU 最大 MHz:2500.0000
58 CPU 最小 MHz:1600.0000
59 BogoMIPS:                       4999.36
60 虚拟化:                         AMD-V
61 L1d 缓存:1 MiB
62 L1i 缓存:2 MiB
63 L2 缓存:16 MiB
64 L3 缓存:64 MiB
65 NUMA 节点0 CPU:0-7,32-39
66 NUMA 节点1 CPU:8-15,40-47
67 NUMA 节点2 CPU:16-23,48-55
68 NUMA 节点3 CPU:24-31,56-63
69 Vulnerability Itlb multihit:     Not affected
70 Vulnerability L1tf:              Not affected
71 Vulnerability Mds:               Not affected
72 Vulnerability Meltdown:          Not affected
73 Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
74 Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
75 Vulnerability Spectre v2:        Mitigation; Full AMD retpoline, IBPB conditional, STIBP disabled, RSB
76                               filling
77 Vulnerability Srbds:             Not affected
78 Vulnerability Tsx async abort:   Not affected
79 标记:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse3
80                                6 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdts
81                                cp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm
82                                aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe p
83                               opcnt xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy
84                                abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfct
85                                 r_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate sme ssbd sev
86                                 ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt s
87                                 ha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt
88                                  lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassist
89                                 s pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov suc
90                                 cor smca       
       


Intel 8269CY



1 lscpu
2 Architecture:          x86_64
3 CPU op-mode(s):        32-bit, 64-bit
4 Byte Order:            Little Endian
5 CPU(s):                104
6 On-line CPU(s) list:   0-103
7 Thread(s) per core:    2
8 Core(s) per socket:    26
9 Socket(s):             2
10 NUMA node(s):          2
11 Vendor ID:             GenuineIntel
12 CPU family:            6
13 Model:                 85
14 Model name:            Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz
15 Stepping:              7
16 CPU MHz:               3200.000
17 CPU max MHz:           3800.0000
18 CPU min MHz:           1200.0000
19 BogoMIPS:              4998.89
20 Virtualization:        VT-x
21 L1d cache:             32K
22 L1i cache:             32K
23 L2 cache:              1024K
24 L3 cache:              36608K
25 NUMA node0 CPU(s):     0-25,52-77
26 NUMA node1 CPU(s):     26-51,78-103
27
28 time for i in $(seq 0 8 51); do echo $i; numactl -C $i -m 0 ./bin/stream -W 5 -N 5 -M 64M; done
29 0
30 STREAM copy latency: 1.15 nanoseconds
31 STREAM copy bandwidth: 13941.80 MB/sec
32 STREAM scale latency: 1.16 nanoseconds
33 STREAM scale bandwidth: 13799.89 MB/sec
34 STREAM add latency: 1.31 nanoseconds
35 STREAM add bandwidth: 18318.23 MB/sec
36 STREAM triad latency: 1.56 nanoseconds
37 STREAM triad bandwidth: 15356.72 MB/sec
38 16
39 STREAM copy latency: 1.12 nanoseconds
40 STREAM copy bandwidth: 14293.68 MB/sec
41 STREAM scale latency: 1.13 nanoseconds
42 STREAM scale bandwidth: 14162.47 MB/sec
43 STREAM add latency: 1.31 nanoseconds
44 STREAM add bandwidth: 18293.27 MB/sec
45 STREAM triad latency: 1.53 nanoseconds
46 STREAM triad bandwidth: 15692.47 MB/sec
47 32
48 STREAM copy latency: 1.52 nanoseconds
49 STREAM copy bandwidth: 10551.71 MB/sec
50 STREAM scale latency: 1.52 nanoseconds
51 STREAM scale bandwidth: 10508.33 MB/sec
52 STREAM add latency: 1.38 nanoseconds
53 STREAM add bandwidth: 17363.22 MB/sec
54 STREAM triad latency: 2.00 nanoseconds
55 STREAM triad bandwidth: 12024.52 MB/sec
56 40
57 STREAM copy latency: 1.49 nanoseconds
58 STREAM copy bandwidth: 10758.50 MB/sec
59 STREAM scale latency: 1.50 nanoseconds
60 STREAM scale bandwidth: 10680.17 MB/sec
61 STREAM add latency: 1.34 nanoseconds
62 STREAM add bandwidth: 17948.34 MB/sec
63 STREAM triad latency: 1.98 nanoseconds
64 STREAM triad bandwidth: 12133.22 MB/sec
65 48
66 STREAM copy latency: 1.49 nanoseconds
67 STREAM copy bandwidth: 10736.56 MB/sec
68 STREAM scale latency: 1.50 nanoseconds
69 STREAM scale bandwidth: 10692.93 MB/sec
70 STREAM add latency: 1.34 nanoseconds
71 STREAM add bandwidth: 17902.85 MB/sec
72 STREAM triad latency: 1.96 nanoseconds
73 STREAM triad bandwidth: 12239.44 MB/sec


Intel(R) Xeon(R) CPU E5-2682 v4



1 time for i in $(seq 0 8 51); do echo $i; numactl -C $i -m 0 ./bin/stream -W 5 -N 5 -M 64M; done
2 0
3 STREAM copy latency: 1.59 nanoseconds
4 STREAM copy bandwidth: 10092.31 MB/sec
5 STREAM scale latency: 1.57 nanoseconds
6 STREAM scale bandwidth: 10169.16 MB/sec
7 STREAM add latency: 1.31 nanoseconds
8 STREAM add bandwidth: 18360.83 MB/sec
9 STREAM triad latency: 2.28 nanoseconds
10 STREAM triad bandwidth: 10503.81 MB/sec
11 8
12 STREAM copy latency: 1.55 nanoseconds
13 STREAM copy bandwidth: 10312.14 MB/sec
14 STREAM scale latency: 1.56 nanoseconds
15 STREAM scale bandwidth: 10283.70 MB/sec
16 STREAM add latency: 1.30 nanoseconds
17 STREAM add bandwidth: 18416.26 MB/sec
18 STREAM triad latency: 2.23 nanoseconds
19 STREAM triad bandwidth: 10777.08 MB/sec
20 16
21 STREAM copy latency: 2.02 nanoseconds
22 STREAM copy bandwidth: 7914.25 MB/sec
23 STREAM scale latency: 2.02 nanoseconds
24 STREAM scale bandwidth: 7919.85 MB/sec
25 STREAM add latency: 1.39 nanoseconds
26 STREAM add bandwidth: 17276.06 MB/sec
27 STREAM triad latency: 2.92 nanoseconds
28 STREAM triad bandwidth: 8231.18 MB/sec
29 24
30 STREAM copy latency: 1.99 nanoseconds
31 STREAM copy bandwidth: 8032.18 MB/sec
32 STREAM scale latency: 1.98 nanoseconds
33 STREAM scale bandwidth: 8061.12 MB/sec
34 STREAM add latency: 1.39 nanoseconds
35 STREAM add bandwidth: 17313.94 MB/sec
36 STREAM triad latency: 2.88 nanoseconds
37 STREAM triad bandwidth: 8318.93 MB/sec
38
39 lscpu
40 Architecture:          x86_64
41 CPU op-mode(s):        32-bit, 64-bit
42 Byte Order:            Little Endian
43 CPU(s):                64
44 On-line CPU(s) list:   0-63
45 Thread(s) per core:    2
46 Core(s) per socket:    16
47 Socket(s):             2
48 NUMA node(s):          2
49 Vendor ID:             GenuineIntel
50 CPU family:            6
51 Model:                 79
52 Model name:            Intel(R) Xeon(R) CPU E5-2682 v4 @ 2.50GHz
53 Stepping:              1
54 CPU MHz:               2500.000
55 CPU max MHz:           3000.0000
56 CPU min MHz:           1200.0000
57 BogoMIPS:              5000.06
58 Virtualization:        VT-x
59 L1d cache:             32K
60 L1i cache:             32K
61 L2 cache:              256K
62 L3 cache:              40960K
63 NUMA node0 CPU(s):     0-15,32-47
64 NUMA node1 CPU(s):     16-31,48-63

AMD EPYC 7T83



1 time for i in $(seq 0 8 255); do echo $i; numactl -C $i -m 0 ./bin/stream -W 5 -N 5 -M 64M; done
2 0
3 STREAM copy latency: 0.49 nanoseconds
4 STREAM copy bandwidth: 32561.30 MB/sec
5 STREAM scale latency: 0.49 nanoseconds
6 STREAM scale bandwidth: 32620.66 MB/sec
7 STREAM add latency: 0.87 nanoseconds
8 STREAM add bandwidth: 27575.20 MB/sec
9 STREAM triad latency: 0.70 nanoseconds
10 STREAM triad bandwidth: 34397.15 MB/sec
11 8
12 STREAM copy latency: 0.52 nanoseconds
13 STREAM copy bandwidth: 30764.47 MB/sec
14 STREAM scale latency: 0.53 nanoseconds
15 STREAM scale bandwidth: 30056.59 MB/sec
16 STREAM add latency: 0.87 nanoseconds
17 STREAM add bandwidth: 27575.20 MB/sec
18 STREAM triad latency: 0.69 nanoseconds
19 STREAM triad bandwidth: 34789.45 MB/sec
20 16
21 STREAM copy latency: 0.53 nanoseconds
22 STREAM copy bandwidth: 30173.15 MB/sec
23 STREAM scale latency: 0.54 nanoseconds
24 STREAM scale bandwidth: 29895.91 MB/sec
25 STREAM add latency: 0.87 nanoseconds
26 STREAM add bandwidth: 27496.11 MB/sec
27 STREAM triad latency: 0.70 nanoseconds
28 STREAM triad bandwidth: 34128.93 MB/sec
29 24
30 STREAM copy latency: 0.78 nanoseconds
31 STREAM copy bandwidth: 20417.69 MB/sec
32 STREAM scale latency: 0.51 nanoseconds
33 STREAM scale bandwidth: 31354.70 MB/sec
34 STREAM add latency: 0.87 nanoseconds
35 STREAM add bandwidth: 27548.79 MB/sec
36 STREAM triad latency: 0.69 nanoseconds
37 STREAM triad bandwidth: 34589.22 MB/sec
38 32
39 STREAM copy latency: 0.60 nanoseconds
40 STREAM copy bandwidth: 26862.34 MB/sec
41 STREAM scale latency: 0.58 nanoseconds
42 STREAM scale bandwidth: 27376.00 MB/sec
43 STREAM add latency: 0.87 nanoseconds
44 STREAM add bandwidth: 27518.66 MB/sec
45 STREAM triad latency: 0.78 nanoseconds
46 STREAM triad bandwidth: 30779.17 MB/sec
47 40
48 STREAM copy latency: 0.59 nanoseconds
49 STREAM copy bandwidth: 27230.21 MB/sec
50 STREAM scale latency: 0.59 nanoseconds
51 STREAM scale bandwidth: 27284.18 MB/sec
52 STREAM add latency: 0.87 nanoseconds
53 STREAM add bandwidth: 27503.63 MB/sec
54 STREAM triad latency: 0.77 nanoseconds
55 STREAM triad bandwidth: 31242.48 MB/sec
56 48
57 STREAM copy latency: 0.59 nanoseconds
58 STREAM copy bandwidth: 27102.37 MB/sec
59 STREAM scale latency: 0.59 nanoseconds
60 STREAM scale bandwidth: 27164.08 MB/sec
61 STREAM add latency: 0.87 nanoseconds
62 STREAM add bandwidth: 27503.63 MB/sec
63 STREAM triad latency: 0.76 nanoseconds
64 STREAM triad bandwidth: 31422.90 MB/sec
65 56
66 STREAM copy latency: 0.92 nanoseconds
67 STREAM copy bandwidth: 17453.54 MB/sec
68 STREAM scale latency: 0.59 nanoseconds
69 STREAM scale bandwidth: 27267.55 MB/sec
70 STREAM add latency: 0.87 nanoseconds
71 STREAM add bandwidth: 27488.61 MB/sec
72 STREAM triad latency: 0.77 nanoseconds
73 STREAM triad bandwidth: 31169.92 MB/sec
74 64
75 STREAM copy latency: 0.88 nanoseconds
76 STREAM copy bandwidth: 18231.15 MB/sec
77 STREAM scale latency: 0.84 nanoseconds
78 STREAM scale bandwidth: 18976.06 MB/sec
79 STREAM add latency: 0.91 nanoseconds
80 STREAM add bandwidth: 26413.87 MB/sec
81 STREAM triad latency: 1.08 nanoseconds
82 STREAM triad bandwidth: 22310.12 MB/sec
83 72
84 STREAM copy latency: 0.86 nanoseconds
85 STREAM copy bandwidth: 18552.45 MB/sec
86 STREAM scale latency: 0.84 nanoseconds
87 STREAM scale bandwidth: 19113.88 MB/sec
88 STREAM add latency: 0.91 nanoseconds
89 STREAM add bandwidth: 26375.81 MB/sec
90 STREAM triad latency: 1.08 nanoseconds
91 STREAM triad bandwidth: 22151.79 MB/sec
92 80
93 STREAM copy latency: 0.89 nanoseconds
94 STREAM copy bandwidth: 18037.59 MB/sec
95 STREAM scale latency: 0.87 nanoseconds
96 STREAM scale bandwidth: 18398.59 MB/sec
97 STREAM add latency: 0.92 nanoseconds
98 STREAM add bandwidth: 26142.91 MB/sec
99 STREAM triad latency: 1.08 nanoseconds
100 STREAM triad bandwidth: 22133.53 MB/sec
101 88
102 STREAM copy latency: 0.93 nanoseconds
103 STREAM copy bandwidth: 17119.60 MB/sec
104 STREAM scale latency: 0.94 nanoseconds
105 STREAM scale bandwidth: 17030.54 MB/sec
106 STREAM add latency: 0.92 nanoseconds
107 STREAM add bandwidth: 26146.30 MB/sec
108 STREAM triad latency: 1.08 nanoseconds
109 STREAM triad bandwidth: 22159.10 MB/sec
110 96
111 STREAM copy latency: 1.39 nanoseconds
112 STREAM copy bandwidth: 11512.93 MB/sec
113 STREAM scale latency: 0.87 nanoseconds
114 STREAM scale bandwidth: 18406.16 MB/sec
115 STREAM add latency: 0.92 nanoseconds
116 STREAM add bandwidth: 25991.03 MB/sec
117 STREAM triad latency: 1.09 nanoseconds
118 STREAM triad bandwidth: 22078.91 MB/sec
119 104
120 STREAM copy latency: 0.86 nanoseconds
121 STREAM copy bandwidth: 18546.04 MB/sec
122 STREAM scale latency: 1.39 nanoseconds
123 STREAM scale bandwidth: 11518.85 MB/sec
124 STREAM add latency: 0.91 nanoseconds
125 STREAM add bandwidth: 26300.01 MB/sec
126 STREAM triad latency: 1.06 nanoseconds
127 STREAM triad bandwidth: 22599.38 MB/sec
128 112
129 STREAM copy latency: 0.88 nanoseconds
130 STREAM copy bandwidth: 18253.46 MB/sec
131 STREAM scale latency: 0.85 nanoseconds
132 STREAM scale bandwidth: 18758.59 MB/sec
133 STREAM add latency: 0.91 nanoseconds
134 STREAM add bandwidth: 26413.87 MB/sec
135 STREAM triad latency: 1.06 nanoseconds
136 STREAM triad bandwidth: 22648.95 MB/sec
137 120
138 STREAM copy latency: 0.86 nanoseconds
139 STREAM copy bandwidth: 18607.75 MB/sec
140 STREAM scale latency: 0.84 nanoseconds
141 STREAM scale bandwidth: 18957.30 MB/sec
142 STREAM add latency: 0.91 nanoseconds
143 STREAM add bandwidth: 26427.74 MB/sec
144 STREAM triad latency: 1.08 nanoseconds
145 STREAM triad bandwidth: 22313.83 MB/sec
146 128
147 STREAM copy latency: 0.82 nanoseconds
148 STREAM copy bandwidth: 19432.13 MB/sec
149 STREAM scale latency: 0.87 nanoseconds
150 STREAM scale bandwidth: 18421.31 MB/sec
151 STREAM add latency: 0.98 nanoseconds
152 STREAM add bandwidth: 24546.03 MB/sec
153 STREAM triad latency: 1.06 nanoseconds
154 STREAM triad bandwidth: 22702.59 MB/sec
155 136
156 STREAM copy latency: 0.74 nanoseconds
157 STREAM copy bandwidth: 21568.01 MB/sec
158 STREAM scale latency: 0.74 nanoseconds
159 STREAM scale bandwidth: 21668.99 MB/sec
160 STREAM add latency: 0.90 nanoseconds
161 STREAM add bandwidth: 26697.59 MB/sec
162 STREAM triad latency: 0.91 nanoseconds
163 STREAM triad bandwidth: 26320.64 MB/sec
164 144
165 STREAM copy latency: 0.79 nanoseconds
166 STREAM copy bandwidth: 20268.45 MB/sec
167 STREAM scale latency: 0.66 nanoseconds
168 STREAM scale bandwidth: 24279.61 MB/sec
169 STREAM add latency: 0.89 nanoseconds
170 STREAM add bandwidth: 26822.08 MB/sec
171 STREAM triad latency: 0.84 nanoseconds
172 STREAM triad bandwidth: 28540.76 MB/sec
173 152
174 STREAM copy latency: 0.85 nanoseconds
175 STREAM copy bandwidth: 18903.90 MB/sec
176 STREAM scale latency: 0.56 nanoseconds
177 STREAM scale bandwidth: 28734.25 MB/sec
178 STREAM add latency: 0.88 nanoseconds
179 STREAM add bandwidth: 27335.58 MB/sec
180 STREAM triad latency: 0.75 nanoseconds
181 STREAM triad bandwidth: 31911.01 MB/sec
182 160
183 STREAM copy latency: 0.64 nanoseconds
184 STREAM copy bandwidth: 25068.68 MB/sec
185 STREAM scale latency: 0.63 nanoseconds
186 STREAM scale bandwidth: 25550.68 MB/sec
187 STREAM add latency: 0.88 nanoseconds
188 STREAM add bandwidth: 27313.33 MB/sec
189 STREAM triad latency: 0.82 nanoseconds
190 STREAM triad bandwidth: 29416.50 MB/sec
191 168
192 STREAM copy latency: 0.61 nanoseconds
193 STREAM copy bandwidth: 26232.33 MB/sec
194 STREAM scale latency: 0.60 nanoseconds
195 STREAM scale bandwidth: 26717.96 MB/sec
196 STREAM add latency: 0.88 nanoseconds
197 STREAM add bandwidth: 27398.82 MB/sec
198 STREAM triad latency: 0.79 nanoseconds
199 STREAM triad bandwidth: 30411.86 MB/sec
200 176
201 STREAM copy latency: 0.58 nanoseconds
202 STREAM copy bandwidth: 27380.19 MB/sec
203 STREAM scale latency: 0.58 nanoseconds
204 STREAM scale bandwidth: 27740.96 MB/sec
205 STREAM add latency: 0.94 nanoseconds
206 STREAM add bandwidth: 25666.31 MB/sec
207 STREAM triad latency: 0.77 nanoseconds
208 STREAM triad bandwidth: 31150.63 MB/sec
209 184
210 STREAM copy latency: 0.90 nanoseconds
211 STREAM copy bandwidth: 17730.21 MB/sec
212 STREAM scale latency: 0.57 nanoseconds
213 STREAM scale bandwidth: 27918.40 MB/sec
214 STREAM add latency: 0.87 nanoseconds
215 STREAM add bandwidth: 27458.61 MB/sec
216 STREAM triad latency: 0.76 nanoseconds
217 STREAM triad bandwidth: 31457.27 MB/sec
218 192
219 STREAM copy latency: 0.91 nanoseconds
220 STREAM copy bandwidth: 17558.57 MB/sec
221 STREAM scale latency: 0.88 nanoseconds
222 STREAM scale bandwidth: 18115.49 MB/sec
223 STREAM add latency: 0.92 nanoseconds
224 STREAM add bandwidth: 26031.36 MB/sec
225 STREAM triad latency: 1.12 nanoseconds
226 STREAM triad bandwidth: 21443.95 MB/sec
227 200
228 STREAM copy latency: 1.34 nanoseconds
229 STREAM copy bandwidth: 11911.40 MB/sec
230 STREAM scale latency: 0.85 nanoseconds
231 STREAM scale bandwidth: 18893.26 MB/sec
232 STREAM add latency: 0.91 nanoseconds
233 STREAM add bandwidth: 26306.88 MB/sec
234 STREAM triad latency: 1.09 nanoseconds
235 STREAM triad bandwidth: 22013.73 MB/sec
236 208
237 STREAM copy latency: 1.36 nanoseconds
238 STREAM copy bandwidth: 11724.12 MB/sec
239 STREAM scale latency: 0.86 nanoseconds
240 STREAM scale bandwidth: 18631.00 MB/sec
241 STREAM add latency: 0.92 nanoseconds
242 STREAM add bandwidth: 26166.69 MB/sec
243 STREAM triad latency: 1.10 nanoseconds
244 STREAM triad bandwidth: 21763.86 MB/sec
245 216
246 STREAM copy latency: 0.88 nanoseconds
247 STREAM copy bandwidth: 18270.85 MB/sec
248 STREAM scale latency: 0.85 nanoseconds
249 STREAM scale bandwidth: 18848.15 MB/sec
250 STREAM add latency: 0.92 nanoseconds
251 STREAM add bandwidth: 26176.90 MB/sec
252 STREAM triad latency: 1.10 nanoseconds
253 STREAM triad bandwidth: 21799.20 MB/sec
254 224
255 STREAM copy latency: 0.89 nanoseconds
256 STREAM copy bandwidth: 18047.29 MB/sec
257 STREAM scale latency: 0.86 nanoseconds
258 STREAM scale bandwidth: 18677.66 MB/sec
259 STREAM add latency: 0.92 nanoseconds
260 STREAM add bandwidth: 26112.39 MB/sec
261 STREAM triad latency: 1.09 nanoseconds
262 STREAM triad bandwidth: 21966.89 MB/sec
263 232
264 STREAM copy latency: 1.35 nanoseconds
265 STREAM copy bandwidth: 11818.58 MB/sec
266 STREAM scale latency: 0.82 nanoseconds
267 STREAM scale bandwidth: 19568.11 MB/sec
268 STREAM add latency: 0.91 nanoseconds
269 STREAM add bandwidth: 26469.44 MB/sec
270 STREAM triad latency: 1.06 nanoseconds
271 STREAM triad bandwidth: 22702.59 MB/sec
272 240
273 STREAM copy latency: 0.87 nanoseconds
274 STREAM copy bandwidth: 18325.74 MB/sec
275 STREAM scale latency: 0.83 nanoseconds
276 STREAM scale bandwidth: 19331.37 MB/sec
277 STREAM add latency: 0.91 nanoseconds
278 STREAM add bandwidth: 26455.52 MB/sec
279 STREAM triad latency: 1.06 nanoseconds
280 STREAM triad bandwidth: 22580.37 MB/sec
281 248
282 STREAM copy latency: 0.87 nanoseconds
283 STREAM copy bandwidth: 18418.79 MB/sec
284 STREAM scale latency: 0.84 nanoseconds
285 STREAM scale bandwidth: 19019.09 MB/sec
286 STREAM add latency: 0.91 nanoseconds
287 STREAM add bandwidth: 26483.37 MB/sec
288 STREAM triad latency: 1.08 nanoseconds
289 STREAM triad bandwidth: 22148.13 MB/sec


STREAM comparison data

Summary of the STREAM memory-access RT, jitter, and bandwidth for the CPUs above. Focus on bandwidth; latency matters less in this test.


| CPU | min RT (ns) | max RT (ns) | max copy bandwidth | min copy bandwidth |
| Sunway 3231 (2 NUMA nodes) | 7.09 | 8.75 | 2256.59 MB/sec | 1827.88 MB/sec |
| FT2500 (16 NUMA nodes) | 2.84 | 10.34 | 5638.21 MB/sec | 1546.68 MB/sec |
| Kunpeng 920 (4 NUMA nodes) | 1.84 | 3.87 | 8700.75 MB/sec | 4131.81 MB/sec |
| Hygon 7280 (8 NUMA nodes) | 1.38 | 2.58 | 11591.48 MB/sec | 6206.99 MB/sec |
| Hygon 5280 (4 NUMA nodes) | 1.22 | 2.52 | 13166.34 MB/sec | 6357.71 MB/sec |
| Intel 8269CY (2 NUMA nodes) | 1.12 | 1.52 | 14293.68 MB/sec | 10551.71 MB/sec |
| Intel E5-2682 (2 NUMA nodes) | 1.58 | 2.02 | 10092.31 MB/sec | 7914.25 MB/sec |
| AMD EPYC 7T83 (4 NUMA nodes) | 0.49 | 1.39 | 32561.30 MB/sec | 11512.93 MB/sec |
| Y7 | 1.83 | 3.48 | 8764.72 MB/sec | 4593.25 MB/sec |

From the data above, each successive CPU performs better than the previous one. On its slowest cores FT2500's latency is nearly 10× that of Intel 8269, and more than 5× on average. The latency data is consistent with single-core sysbench TPS. Performance is roughly: constant × frequency / RT.

lat_mem_rd comparison data

Run lat_mem_rd from cores on different nodes against node 0's memory, taking only the latency at the largest size (64M). The latency tracks node distance exactly.


RT (ns, at 64M)

FT2500 (16 NUMA nodes):
core:0 149.976
core:8 168.805
core:16 191.415
core:24 178.283
core:32 170.814
core:40 185.699
core:48 212.281
core:56 202.479
core:64 426.176
core:72 444.367
core:80 465.894
core:88 452.245
core:96 448.352
core:104 460.603
core:112 485.989
core:120 490.402

Kunpeng 920 (4 NUMA nodes):
core:0 117.323
core:24 135.337
core:48 197.782
core:72 219.416

Hygon 7280 (8 NUMA nodes):
numa0 106.839
numa1 168.583
numa2 163.925
numa3 163.690
numa4 289.628
numa5 288.632
numa6 236.615
numa7 291.880

Hygon 7280 with Die Interleaving enabled:
core:0 153.005
core:16 152.458
core:32 272.057
core:48 269.441

Hygon 5280 (4 NUMA nodes):
core:0 102.574
core:8 160.989
core:16 286.850
core:24 231.197

Hygon 7260 (1 NUMA node):
core:0 265

Intel 8269CY (2 NUMA nodes):
core:0 69.792
core:26 93.107

Sunway 3231 (2 NUMA nodes):
core:0 215.146
core:32 282.443

AMD EPYC 7T83 (4 NUMA nodes):
core:0 71.656
core:32 80.129
core:64 131.334
core:96 129.563

Y7 (2 dies, 2 NUMA nodes, 1 socket):
core:8 42.395
core:40 36.434
core:104 105.745
core:88 124.384

core:24 62.979
core:8 69.324
core:64 137.233
core:88 127.250

133ns 205ns (to be measured)

Test command:

1 for i in $(seq 0 8 127); do echo core:$i; numactl -C $i -m 0 ./bin/lat_mem_rd -W 5 -N 5 -t 64M; done >lat.log 2>&1

The results match the node distances reported by numactl -H exactly — presumably the chip vendors measured latency this way and wrote it in as the distance.

AMD EPYC 7T83 (4 NUMA nodes) shows relatively large latency jitter, which is related to its architecture of many small dies packaged into one CPU.

1 grep -E "core|64.00000" lat.log
2 core:0
3 64.00000 71.656
4 core:32
5 64.00000 80.129
6 core:64
7 64.00000 131.334
8 core:88
9 64.00000 136.774
10 core:96
11 64.00000 129.563
12 core:120
13 64.00000 140.151




AMD EPYC 7T83 (4 NUMA nodes) has higher latency than Intel 8269, but also much higher bandwidth.

Loongson test data

3A5000 is a Loongson part. The command was ./lat_mem_rd 128M 4096, where the 4096 argument is the stride. The basic idea: repeatedly read over a memory region of a given size at a fixed stride and measure the average time per read. If the region is smaller than the L1 cache, the time approaches L1 access latency; if it is larger than L1 but smaller than L2, it approaches L2 latency; and so on. In the figure, the x-axis is the number of bytes accessed and the y-axis is cycles per access.

Latency comparison at each level (in cycles) between 3A5000, Zen1, and Skylake under strided access

The figure below shows memory-access parallelism as measured by LMbench, using ./par_mem. Memory-access parallelism is the number of concurrent accesses that each cache level and memory can sustain. The test builds a linked list and keeps traversing to the next element, with the hop distance tied to the cache size being measured; it then launches many list traversals at once. If N list chases can complete concurrently within a given window, the memory parallelism is taken to be N.

The figure below lists functional-unit operation latencies for the three processors, measured with ./lat_ops.

Loongson STREAM data

LMbench includes the STREAM bandwidth tool for measuring sustainable memory-access bandwidth. Table 12.25 lists STREAM bandwidth for the three processors, with the STREAM array size set to 100 million elements and the OpenMP version running four threads to measure bandwidth at full load. On each test platform, each of the CPU's two memory controllers had one DIMM attached; 3A5000 and Zen1 used DDR4-3200 DIMMs, while Skylake used DDR4-2400 (the highest it supports).

The data shows that although both 3A5000 and Zen1 implement DDR4-3200 in hardware, 3A5000's measured sustainable bandwidth still lags by some margin. The memory bandwidth a user program sees depends not only on the physical DRAM frequency but also on the processor's internal memory queues, the memory controller's scheduling policy, prefetchers, and memory timing parameters; more analysis would be needed to pinpoint the actual bottleneck. Software tools like STREAM reflect the combined capability of a whole subsystem, which is why they are so widely used.

Conclusions

  • AMD has the best single-core benchmark scores

  • Intel performs much better in MySQL query workloads

  • xdb outperforms the community edition

  • MySQL 8.0 beats 5.7 under multi-core lock contention

  • Intel is best; AMD comes close to Intel; Hygon trails far behind but is still much better than Kunpeng; Phytium is worst, and its cross-socket access is simply a disaster

  • Kylin OS also performs slightly worse than CentOS

  • From the perf metrics: Kunpeng 920's L1d hit rate is higher than the 8163's because its L1 is larger; its L2 hit rate is lower, likewise because its L2 is smaller; its L1i is also larger than the 8163's, yet the measured L1i miss rate is higher, which suggests ARM makes less efficient use of L1i

Overall, with a one-generation process lead (7 nm vs 14 nm), AMD can finally approach Intel in MySQL query workloads, while Hygon, Kunpeng, and Phytium still fall short.

Appendix

perf metrics for Kunpeng 920 vs 8163 under the MySQL workload

Overall comparison


| Metric | X86 (8163) | ARM (Kunpeng 920) | Change |
| IPC | 0.4979 | 0.495 | -0.6% |
| Branches | 237606414772 | 415979894985 | +75.1% |
| Branch-misses | 8104247620 | 28983836845 | +257.6% |
| Branch-miss rate | 0.034 | 0.070 | +104.3% |
| Memory read bandwidth (GB/s) | 25.0 | 25.0 | -0.2% |
| Memory write bandwidth (GB/s) | 24.6 | 67.8 | +175.5% |
| Memory read+write bandwidth (GB/s) | 49.7 | 92.8 | +86.8% |
| UNALIGNED_ACCESS | 1329146645 | 13686011901 | +929.7% |
| L1d_MISS_RATIO | 0.06055 | 0.04281 | -29.3% |
| L1d_MISS_RATE | 0.01645 | 0.01711 | +4.0% |
| L2_MISS_RATIO | 0.34824 | 0.47162 | +35.4% |
| L2_MISS_RATE | 0.00577 | 0.03493 | +504.8% |
| L1_ITLB_MISS_RATE | 0.0028 | 0.005 | +78.6% |
| L1_DTLB_MISS_RATE | 0.0025 | 0.0102 | +308.0% |
| context-switches | 8407195 | 11614981 | +38.2% |
| Pagefault | 228371 | 741189 | +224.6% |

 
