EDAC DIMM CE Errors Causing a Server Reboot

2023-11-23 (Thursday)


Symptom:

EDAC sbridge: Lost 1569 memory errors (memory check errors)

Troubleshooting:

Check the log: cat /var/log/messages

Feb 27 10:18:02 localhost kernel: EDAC MC1: 864 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x1a21 offset:0xc00 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0092 socket:0 channel_mask:1 rank:0)
Feb 27 10:18:02 localhost kernel: EDAC MC1: 864 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x1a21 offset:0xc00 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0092 socket:0 channel_mask:1 rank:0)
Feb 27 10:18:02 localhost kernel: EDAC MC1: 422 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x1a2e offset:0x400 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0092 socket:0 channel_mask:1 rank:0)
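When the log holds many such entries, the per-line counts can be totalled per memory controller and DIMM label to see which module dominates. The following is a minimal sketch, assuming log lines in the format shown above; the function name `sum_ce_by_dimm` is hypothetical and not part of any standard tool.

```shell
# Sum the CE counts reported in EDAC log lines, grouped by memory
# controller and DIMM label. Reads the files given as arguments, or stdin.
# sum_ce_by_dimm is a hypothetical helper name used for illustration.
sum_ce_by_dimm() {
    awk '
        / EDAC MC[0-9]+: [0-9]+ CE / {
            mc = ""; n = 0; label = ""
            for (i = 1; i < NF; i++) {
                # "EDAC MC1: 864 CE ..." -> controller and event count
                if ($i == "EDAC") { mc = $(i + 1); n = $(i + 2) }
                # "... error on <label> (" -> DIMM label
                if ($i == "on")   { label = $(i + 1); break }
            }
            sub(/:$/, "", mc)            # "MC1:" -> "MC1"
            total[mc " " label] += n
        }
        END { for (k in total) print total[k], k }
    ' "$@" | sort -rn
}
```

A typical call would be `sum_ce_by_dimm /var/log/messages`. For the three log lines above it prints `2150 MC1 CPU_SrcID#0_Channel#0_DIMM#0` (864 + 864 + 422).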

This is a log from [EDAC (Error Detection And Correction)](https://www.kernel.org/doc/Documentation/edac.txt). CE is short for Correctable Error; there is also UE (Uncorrectable Error).

Following the document above, find the failing DIMM:

[root@localhost ~]# grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
/sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch1_ce_count:2
/sys/devices/system/edac/mc/mc0/csrow0/ch2_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch3_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow1/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow1/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow1/ch2_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow1/ch3_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch2_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch3_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow1/ch0_ce_count:234324233
/sys/devices/system/edac/mc/mc1/csrow1/ch1_ce_count:545665534
/sys/devices/system/edac/mc/mc1/csrow1/ch2_ce_count:554836518
/sys/devices/system/edac/mc/mc1/csrow1/ch3_ce_count:0
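With many controllers and rows, picking out the largest counter by eye is error-prone. A small sketch that lists only the non-zero counters, highest first, is given here; the function name `rank_ce_counts` and its optional directory argument are assumptions made for illustration, and the default path is the one used in the command above.

```shell
# List non-zero EDAC correctable-error counters, highest count first.
# rank_ce_counts is a hypothetical helper; the optional first argument
# lets it read a copy of the sysfs tree instead of the live one.
rank_ce_counts() {
    base="${1:-/sys/devices/system/edac/mc}"
    for f in "$base"/mc*/csrow*/ch*_ce_count; do
        [ -r "$f" ] || continue                 # skip unmatched globs
        count=$(cat "$f")
        [ "$count" -gt 0 ] 2>/dev/null || continue
        # Print "<count> <path relative to base>"
        printf '%s %s\n' "$count" "${f#"$base"/}"
    done | sort -rn
}
```

Calling `rank_ce_counts` with no argument reads the live counters; on the machine in this post the first line would be the `mc1/csrow1/ch2_ce_count` entry.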

 

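The counters point to /mc1/csrow1/ch2, which has the highest count (ch0 and ch1 of the same csrow are also very high). According to the layout diagram: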

         Channel 0   Channel 1
===================================
csrow0 | DIMM_A0   | DIMM_B0 |
csrow1 | DIMM_A0   | DIMM_B0 |
===================================

===================================
csrow2 | DIMM_A1   | DIMM_B1 |
csrow3 | DIMM_A1   | DIMM_B1 |
===================================
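The legacy csrow interface in sysfs also provides a `chN_dimm_label` file next to each `chN_ce_count` file, which can save a manual lookup in the diagram. Whether the label is populated depends on the driver and platform, so treat the sketch here as an assumption-laden shortcut rather than a reliable mapping; `dimm_label_for` is a hypothetical helper name.

```shell
# Print the DIMM label that sits next to a given ce_count file, or
# "unknown" when the platform does not expose or populate a label.
# dimm_label_for is a hypothetical helper used for illustration.
dimm_label_for() {
    counter="$1"    # e.g. /sys/devices/system/edac/mc/mc1/csrow1/ch2_ce_count
    label_file="${counter%_ce_count}_dimm_label"
    if [ -r "$label_file" ] && [ -s "$label_file" ]; then
        cat "$label_file"
    else
        echo "unknown"
    fi
}
```

For example, `dimm_label_for /sys/devices/system/edac/mc/mc1/csrow1/ch2_ce_count` prints the label for the counter identified above, if the driver provides one.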

 

Then check with dmidecode:

[root@localhost ~]# dmidecode -t memory |grep 'Locator: DIMM'
Locator: DIMM1
Locator: DIMM2
Locator: DIMM_1/3
Locator: DIMM3
Locator: DIMM4
Locator: DIMM_2/3
Locator: DIMM5
Locator: DIMM6
Locator: DIMM_3/3
Locator: DIMM7
Locator: DIMM8
Locator: DIMM_4/3
Locator: DIMM1
Locator: DIMM2
Locator: DIMM_1/3
Locator: DIMM3
Locator: DIMM4
Locator: DIMM_2/3
Locator: DIMM5
Locator: DIMM6
Locator: DIMM_3/3
Locator: DIMM7
Locator: DIMM8
Locator: DIMM_4/3
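The grep above mixes slot and bank locators in one flat list. A sketch that keeps each memory device's fields together may be easier to read; it assumes the usual `dmidecode -t memory` layout, where every `Memory Device` block carries indented `Size:`, `Locator:` and `Bank Locator:` lines, and `list_dimms` is a hypothetical helper name.

```shell
# Print one line per memory device: locator, bank locator and size,
# separated by tabs. Reads dmidecode output from files or stdin.
# list_dimms is a hypothetical helper used for illustration.
list_dimms() {
    awk -F': *' '
        # A new "Memory Device" block starts: flush the previous one.
        /^Memory Device/ {
            if (loc != "") print loc "\t" bank "\t" size
            loc = bank = size = ""
        }
        /^[ \t]+Size:/         { size = $2 }
        /^[ \t]+Locator:/      { loc  = $2 }
        /^[ \t]+Bank Locator:/ { bank = $2 }
        END { if (loc != "") print loc "\t" bank "\t" size }
    ' "$@"
}
```

It would be run as `dmidecode -t memory | list_dimms`, which makes empty slots and the bank each slot belongs to visible at a glance.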

 

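Check the memory in the server management console (see the figure below; note that the figure was reposted from the web):

Layout of the memory slots on the motherboard:

Conclusion:

The error "[ 63.194496] EDAC sbridge: Lost 1569 memory errors" caused the server to reboot. The final fix was to pull the memory module out of the faulty F1 slot, or to move it to another memory slot; after that, the system booted without reporting the error again.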

 
