Analysis of the aerospike CLUSTER INTEGRITY FAULT problem

2023-12-02 (Saturday) / 0 comments / 0 likes / 92 views / 6285 characters


1.Symptoms (version 3.5.9)

Dec 29 2016 07:46:58 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb900007f14eb4b and self bb9ffe723270008
Dec 29 2016 07:46:58 GMT: INFO (paxos): (paxos.c::2412) CLUSTER INTEGRITY FAULT. [Phase 1 of 2] To fix, issue this command across all nodes:  dun:nodes=bb900007f14eb4b
Dec 29 2016 07:46:58 GMT: INFO (paxos): (paxos.c::2516) as_paxos_retransmit_check: principal bb9ffe723270008 retransmitting sync messages to nodes that have not responded yet ...
Dec 29 2016 07:46:58 GMT: INFO (paxos): (paxos.c::1439) sending sync message to bb900007f14eb4b
Dec 29 2016 07:46:58 GMT: INFO (paxos): (paxos.c::1448) SUCCESSION [9.0]@bb9ffe723270008: bb9ffe723270008 bb900007f14eb4b
Dec 29 2016 07:47:03 GMT: INFO (paxos): (paxos.c::2367) Cluster Integrity Check: Detected succession list discrepancy between node bb900007f14eb4b and self bb9ffe723270008
Dec 29 2016 07:47:03 GMT: INFO (paxos): (paxos.c::2412) CLUSTER INTEGRITY FAULT. [Phase 1 of 2] To fix, issue this command across all nodes:  dun:nodes=bb900007f14eb4b
Dec 29 2016 07:47:03 GMT: INFO (paxos): (paxos.c::2516) as_paxos_retransmit_check: principal bb9ffe723270008 retransmitting sync messages to nodes that have not responded yet ...
Dec 29 2016 07:47:03 GMT: INFO (paxos): (paxos.c::1439) sending sync message to bb900007f14eb4b
Dec 29 2016 07:47:03 GMT: INFO (paxos): (paxos.c::1448) SUCCESSION [9.0]@bb9ffe723270008: bb9ffe723270008 bb900007f14eb4b
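
When the fault repeats every few seconds, it can help to pull out just the node IDs that the log asks to dun. The following is a minimal sketch, not part of aerospike itself: the function name extract_dun_nodes is made up for illustration, and it assumes the log lines have the same shape as the ones above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the unique node IDs named in "dun:nodes=" hints.
# Reads log text on stdin. The leading space in the pattern keeps
# "undun:nodes=" (the Phase 2 hint) from matching.
extract_dun_nodes() {
    grep -o ' dun:nodes=[0-9a-f,]*' \
        | sed 's/.*dun:nodes=//' \
        | tr ',' '\n' \
        | sed '/^$/d' \
        | sort -u
}
```

A possible invocation, assuming the log lives at /var/log/aerospike/aerospike.log (adjust to the path configured in aerospike.conf): extract_dun_nodes < /var/log/aerospike/aerospike.log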

2.Cluster Integrity Check

	// for each node in the succession list
	// compare the node's succession list with this server's succession list
	bool cluster_integrity_fault = false;
	bool are_nodes_not_dunned = false;
	for (int i = 0; i < g_config.paxos_max_cluster_size; i++) {
		cf_debug(AS_PAXOS, "Cluster Integrity Check: %d, %"PRIx64"", i, succ_list_index[i]);
		if (succ_list_index[i] == (cf_node) 0) {
			break; // we are done
		}
		// ... (the excerpt ends here)

3.CLUSTER INTEGRITY FAULT

	switch (g_config.paxos_recovery_policy) {
		case AS_PAXOS_RECOVERY_POLICY_MANUAL:
		{
			if (are_nodes_not_dunned) {
				snprintf(sbuf, 97, "CLUSTER INTEGRITY FAULT. [Phase 1 of 2] To fix, issue this command across all nodes:  dun:nodes=");
			} else {
				snprintf(sbuf, 99, "CLUSTER INTEGRITY FAULT. [Phase 2 of 2] To fix, issue this command across all nodes:  undun:nodes=");
			}
			bool nodes_missing = false;
			for (int i = 0; i < g_config.paxos_max_cluster_size; i++) {
				if ((cf_node)0 == missing_nodes[i]) {
					break;
				}
				snprintf(sbuf + strlen(sbuf), 18, "%"PRIx64",", missing_nodes[i]);
				nodes_missing = true;
			}
			// ... (the excerpt ends here)

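4.Root cause analysis

Whenever two nodes cannot synchronize state with each other over port 3002, in both directions, the problem above appears.

There are many possible causes.

① Firewall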

# Verification method
telnet <ip> <port>
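
The same reachability test can be scripted for hosts where telnet is not installed. This is a sketch under assumptions: check_port is a made-up name, it relies on bash's /dev/tcp redirection and on the coreutils timeout command being available, and the 3 second limit is arbitrary.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) if a TCP connection to host:port
# can be opened within 3 seconds, fail otherwise.
# Inside the inner bash, $0 is the host and $1 is the port.
check_port() {
    local host=$1 port=$2
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}
```

For example: check_port 192.168.56.101 3002 && echo reachable || echo unreachable. The result only describes the direction from the host where it runs to the target, so the check has to be repeated from every node toward every other node.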

② The process has exhausted its file descriptors (fd), so it cannot create new sockets

# Default fd limit for as, set in aerospike.conf
proto-fd-max 15000
# Verification method
ll /proc/pid/fd | grep socket | wc -l
lsof -p asd-pid | grep "can't identify protocol" | wc -l
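
The fd count can also be compared against the limit directly. A minimal sketch, assuming a Linux /proc filesystem; the function names count_socket_fds and fd_usage_percent are invented for illustration, and 15000 is simply the proto-fd-max value quoted above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count the socket file descriptors held by a process
# by reading the symlinks under /proc/<pid>/fd.
count_socket_fds() {
    local pid=$1 n=0 fd
    for fd in /proc/"$pid"/fd/*; do
        case "$(readlink "$fd" 2>/dev/null)" in
            socket:*) n=$((n + 1)) ;;
        esac
    done
    echo "$n"
}

# Hypothetical helper: integer percentage of the limit that is in use.
# Arguments: used limit
fd_usage_percent() {
    echo $(( $1 * 100 / $2 ))
}
```

A possible use, assuming the daemon runs as a single asd process: fd_usage_percent "$(count_socket_fds "$(pidof asd)")" 15000. Reading another user's /proc/<pid>/fd normally requires root.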
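
node  node ID          IP
100   BB9FFE723270008  192.168.56.100
101   BB900007F80090B  192.168.56.101

101 can connect to 100, but 100 cannot connect to 101, and that is when the problem above appears. Checking the state of each node in the as cluster confirms it: node 101 can connect to port 3002 on 100, but node 100 cannot connect to port 3002 on 101.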

101 connecting to port 3002 on 100

[root@c101 ~]# telnet 192.168.56.100 3002
Trying 192.168.56.100...
Connected to 192.168.56.100.
Escape character is '^]'.
Mhc

100 connecting to port 3002 on 101

[root@c100 ~]# telnet 192.168.56.101 3002
Trying 192.168.56.101...
telnet: connect to address 192.168.56.101: No route to host
[root@c100 ~]# 

Network state on node 100

[root@c100 ~]# netstat -nat
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:3001            0.0.0.0:*               LISTEN
tcp        0      0 192.168.56.100:3002     0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:3003            0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:3000            0.0.0.0:*               LISTEN
tcp        0      0 192.168.56.100:3002     192.168.56.101:58930    ESTABLISHED
tcp        0     52 192.168.56.100:22       192.168.56.1:52622      ESTABLISHED
tcp        0      0 192.168.56.100:22       192.168.56.1:52188      ESTABLISHED
tcp        0      0 192.168.56.100:22       192.168.56.1:52031      ESTABLISHED
tcp6       0      0 :::3306                 :::*                    LISTEN
tcp6       0      0 :::22                   :::*                    LISTEN
[root@c100 ~]# 

Network state on node 101

[root@c101 ~]# netstat -nat
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:3001            0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:25            0.0.0.0:*               LISTEN
tcp        0      0 192.168.56.101:3002     0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:3003            0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:3000            0.0.0.0:*               LISTEN
tcp        0      0 192.168.56.101:22       192.168.56.1:55739      ESTABLISHED
tcp        0      0 192.168.56.101:58930    192.168.56.100:3002     ESTABLISHED
tcp        0     52 192.168.56.101:22       192.168.56.1:55723      ESTABLISHED
tcp        0      0 192.168.56.101:22       192.168.56.1:55135      ESTABLISHED
tcp6       0      0 ::1:25                  :::*                    LISTEN
tcp6       0      0 :::22                   :::*                    LISTEN
[root@c101 ~]# 
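
In the two listings, the only connection on port 3002 is the one opened by 101 toward 100; node 100 has no outbound connection to 192.168.56.101:3002. A sketch of how that comparison could be automated follows. The name has_outbound_conn is hypothetical, and the column positions assume the `netstat -nat` layout shown above, one connection per line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `netstat -nat` output on stdin and succeed
# (exit 0) only if this host has an ESTABLISHED TCP connection whose
# foreign address is peer:port, i.e. a connection it opened toward the peer.
has_outbound_conn() {
    local peer=$1 port=$2
    awk -v target="$peer:$port" '
        $1 ~ /^tcp/ && $5 == target && $6 == "ESTABLISHED" { found = 1 }
        END { exit !found }
    '
}
```

On node 100 this could be run as: netstat -nat | has_outbound_conn 192.168.56.101 3002 || echo "no outbound connection to 101". A missing outbound connection on one side while the other side has one points to a one-directional block.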

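5.How to reproduce the problem

The condition to create is that node A can connect to node B while node B cannot connect to node A. Enabling the firewall on node A is enough to get there.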
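
One way to set up that one-directional block is sketched here. It assumes node A filters traffic with iptables, which the article does not state, so adapt it for firewalld or nftables. The function print_block_rules is a made-up helper that only prints the commands so they can be reviewed before being run as root.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (do not execute) the iptables command that makes
# this node reject inbound TCP connections from peer_ip to the given port,
# followed by the command that removes the rule again.
# Replies to connections this node opens itself are not matched, because
# their destination port is an ephemeral port rather than the given port.
print_block_rules() {
    local peer_ip=$1 port=$2
    echo "iptables -I INPUT -p tcp -s $peer_ip --dport $port -j REJECT"
    echo "iptables -D INPUT -p tcp -s $peer_ip --dport $port -j REJECT"
}
```

With the addresses used earlier, running print_block_rules 192.168.56.100 3002 on node 101 prints the rule that would stop 100 from reaching 101 on port 3002 while leaving the 101 to 100 direction open.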
