Symptom:
EDAC sbridge: Lost 1569 memory errors (memory ECC errors)
Troubleshooting:
Check the log: cat /var/log/messages
Feb 27 10:18:02 localhost kernel: EDAC MC1: 864 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x1a21 offset:0xc00 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0092 socket:0 channel_mask:1 rank:0)
Feb 27 10:18:02 localhost kernel: EDAC MC1: 864 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x1a21 offset:0xc00 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0092 socket:0 channel_mask:1 rank:0)
Feb 27 10:18:02 localhost kernel: EDAC MC1: 422 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x1a2e offset:0x400 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0092 socket:0 channel_mask:1 rank:0)
These are log entries from [EDAC (Error Detection And Correction)](https://www.kernel.org/doc/Documentation/edac.txt). CE is short for Correctable Error; there is also UE (Uncorrectable Error).
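When the log holds many such entries, totalling the CE counts per DIMM label makes the affected module easier to spot. The following is a minimal sketch, assuming the message layout shown above (the event count immediately before "CE", the DIMM label immediately after "on") and one log entry per line; the function name edac_ce_summary is made up for illustration.

```shell
# Sum the CE counts reported per DIMM label in EDAC kernel log lines.
# Reads log text on stdin and prints "<label> <total>" for each label.
edac_ce_summary() {
    awk '
        /EDAC MC[0-9]+:/ {
            # Locate the "CE" token; the event count is the field before it.
            for (i = 2; i <= NF; i++) if ($i == "CE") break
            if (i > NF) next
            count = $(i - 1)
            # The DIMM label is the field after the first "on" that follows.
            label = "unknown"
            for (j = i; j < NF; j++) if ($j == "on") { label = $(j + 1); break }
            total[label] += count
        }
        END { for (l in total) print l, total[l] }
    '
}

# Usage (log path as used above):
#   edac_ce_summary < /var/log/messages
```

Fed the three entries above, it reports 2150 (864 + 864 + 422) for CPU_SrcID#0_Channel#0_DIMM#0.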
Following the EDAC documentation linked above, locate the faulty DIMM:
[root@localhost ~]# grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
/sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch1_ce_count:2
/sys/devices/system/edac/mc/mc0/csrow0/ch2_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch3_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow1/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow1/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow1/ch2_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow1/ch3_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch2_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch3_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow1/ch0_ce_count:234324233
/sys/devices/system/edac/mc/mc1/csrow1/ch1_ce_count:545665534
/sys/devices/system/edac/mc/mc1/csrow1/ch2_ce_count:554836518
/sys/devices/system/edac/mc/mc1/csrow1/ch3_ce_count:0
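Note that grep "[0-9]" also matches the counters that read 0. A minimal sketch that keeps only the non-zero counters and sorts them largest first, assuming the csrow-style sysfs layout used in the command above; edac_nonzero_ce is a hypothetical helper name.

```shell
# List only the non-zero CE counters, largest first.
# $1: EDAC sysfs root (defaults to the path used above).
edac_nonzero_ce() {
    root=${1:-/sys/devices/system/edac/mc}
    for f in "$root"/mc*/csrow*/ch*_ce_count; do
        [ -r "$f" ] || continue
        n=$(cat "$f")
        # Skip zero or non-numeric values.
        if [ "$n" -gt 0 ] 2>/dev/null; then
            rel=${f#"$root"/}
            printf '%s %s\n' "$n" "$rel"
        fi
    done | sort -rn
}

# Usage:
#   edac_nonzero_ce
```

Each output line is the count followed by the counter path relative to the EDAC root, so the first line names the csrow and channel with the most corrected errors.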
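The highest count is at /mc1/csrow1/ch2 (ch0 and ch1 of the same csrow also show very large counts). According to the layout diagram: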
          Channel 0   Channel 1
===================================
csrow0  | DIMM_A0   | DIMM_B0   |
csrow1  | DIMM_A0   | DIMM_B0   |
===================================
===================================
csrow2  | DIMM_A1   | DIMM_B1   |
csrow3  | DIMM_A1   | DIMM_B1   |
===================================
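Besides reading the diagram, the csrow directories may also carry a label file per channel (ch*_dimm_label). A minimal sketch that prints them, assuming the running kernel exposes these files and the driver fills them in; on some systems they are empty or generic. edac_dimm_labels is a hypothetical helper name.

```shell
# Print the label EDAC associates with each csrow/channel pair.
# $1: EDAC sysfs root (defaults to the standard location).
edac_dimm_labels() {
    root=${1:-/sys/devices/system/edac/mc}
    for f in "$root"/mc*/csrow*/ch*_dimm_label; do
        [ -r "$f" ] || continue
        rel=${f#"$root"/}
        # Drop the file-name suffix so the output reads mcX/csrowY/chZ.
        printf '%s: %s\n' "${rel%_dimm_label}" "$(cat "$f")"
    done
}

# Usage:
#   edac_dimm_labels
```

If the label for the csrow and channel with the high count matches the label in the kernel log, that cross-checks which module to look at.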
Then check with dmidecode:
[root@localhost ~]# dmidecode -t memory | grep 'Locator: DIMM'
Locator: DIMM1
Locator: DIMM2
Locator: DIMM_1/3
Locator: DIMM3
Locator: DIMM4
Locator: DIMM_2/3
Locator: DIMM5
Locator: DIMM6
Locator: DIMM_3/3
Locator: DIMM7
Locator: DIMM8
Locator: DIMM_4/3
Locator: DIMM1
Locator: DIMM2
Locator: DIMM_1/3
Locator: DIMM3
Locator: DIMM4
Locator: DIMM_2/3
Locator: DIMM5
Locator: DIMM6
Locator: DIMM_3/3
Locator: DIMM7
Locator: DIMM8
Locator: DIMM_4/3
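The Locator strings alone do not show which slots are populated or which module sits in each. As a sketch, the full dmidecode -t memory output can be condensed to one line per memory device. This assumes the usual dmidecode field names (Locator, Bank Locator, Size, Serial Number) and that each record starts with a "Memory Device" line; dimm_inventory is a hypothetical name.

```shell
# Condense "dmidecode -t memory" output to one line per memory device:
#   <Locator> | <Bank Locator> | <Size> | <Serial Number>
# Reads dmidecode text on stdin.
dimm_inventory() {
    awk -F': *' '
        function emit() {
            if (loc != "") printf "%s | %s | %s | %s\n", loc, bank, size, serial
            loc = bank = size = serial = ""
        }
        { key = $1; sub(/^[ \t]+/, "", key) }   # strip leading indentation
        key == "Memory Device" { emit() }       # a new record begins
        key == "Size"          { size = $2 }
        key == "Locator"       { loc = $2 }
        key == "Bank Locator"  { bank = $2 }
        key == "Serial Number" { serial = $2 }
        END { emit() }
    '
}

# Usage (dmidecode normally requires root):
#   dmidecode -t memory | dimm_inventory
```

The serial number makes it possible to confirm on the physical module that the right DIMM was pulled.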
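View the memory through the server console (see the figure below; note that this figure is reposted from the web):
Layout of the memory slots on the motherboard:
Conclusion:
The error "[ 63.194496] EDAC sbridge: Lost 1569 memory errors" caused the server to reboot. The fix is to pull the memory module out of the faulty F1 slot or move it to a different memory slot; after that, the system boots without reporting the error.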