
Troubleshooting and reproducing cluster mix-ups caused by instances left in the fail state

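Background: our company's cache management cloud platform has an instance clone feature (after a machine goes down, the stopped instances are started on another server, and the master-slave relationships with the original cluster are then re-established). After a clone, clusters became mixed together. The preliminary diagnosis is that the offline instance was never removed from its cluster with cluster forget: the cluster still holds the ip:port of the offline node, only in the fail state. When an instance is later started on that same ip:port and joined to a different cluster, the original cluster automatically reconnects to it and pulls it back in, which mixes the clusters together.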
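
As a rough illustration of the diagnosis above, the sketch below scans the text printed by cluster nodes and lists every node whose flags include fail, that is, the nodes that are candidates for cluster forget. The function name find_failed_nodes and the sample text are assumptions made for this example: the sample reuses the line format shown later in this article, with one node's flags changed by hand to master,fail.

```python
def find_failed_nodes(cluster_nodes_output):
    """Return (node_id, address) for every node flagged as fail."""
    failed = []
    for line in cluster_nodes_output.splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue  # skip blank or malformed lines
        node_id, address, flags = fields[0], fields[1], fields[2].split(",")
        if "fail" in flags:
            # Newer Redis versions append "@<cluster bus port>" to the address.
            failed.append((node_id, address.split("@")[0]))
    return failed


# Hypothetical sample: the second node's flags and link state were edited by
# hand for illustration; they are not output captured from the test clusters.
sample = """\
b850d083307af116b1f61401cb7d01e2d162ad38 10.4.7.221:8001 myself,master - 0 0 1 connected 0-5555
b03eb3e887c67a6305c7a205d5bfc4d6e222dd74 10.4.7.222:8002 master,fail - 0 1648189621790 3 disconnected 5556-11112
"""

for node_id, address in find_failed_nodes(sample):
    # Each hit is a candidate for "cluster forget <node_id>" on the other nodes.
    print(address, node_id)
```

Matching the exact flag fail (rather than a substring) keeps nodes that are only suspected of failure, which carry the fail? flag, out of the result.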

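Several questions need to be confirmed:

  1. Does the cluster end up with dirty data (that is, does the data of conflicting slots from different clusters get mixed together)?
  2. What causes the cluster mix-up?
  3. After the mix-up, what logic governs how slots and data are overwritten?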

I. Environment preparation

Prepare four CentOS virtual machines with the following IP addresses.

10.4.7.221
10.4.7.222
10.4.7.223
10.4.7.224

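The cluster plan is as follows. The three clusters are deployed across machines 221, 222 and 223, and machine 224 serves as the clone machine. Ports should follow a fixed rule, for example 8001 for a master and 8011 for its slave. This makes the startup and cluster-building scripts easier to write, and it makes the nodes easier to identify once the clusters are mixed.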

Cluster A: 3 masters, 3 slaves (each line lists a master, then its slave)

10.4.7.221:8001 10.4.7.222:8011
10.4.7.222:8002 10.4.7.223:8012
10.4.7.223:8003 10.4.7.221:8013

Cluster B: 4 masters, 4 slaves

10.4.7.221:9001 10.4.7.222:9011
10.4.7.222:9002 10.4.7.223:9012
10.4.7.223:9003 10.4.7.221:9013
10.4.7.221:9004 10.4.7.222:9014

Cluster C: 5 masters, 5 slaves

10.4.7.221:7001 10.4.7.222:7011
10.4.7.222:7002 10.4.7.223:7012
10.4.7.223:7003 10.4.7.221:7013
10.4.7.221:7004 10.4.7.222:7014
10.4.7.222:7005 10.4.7.223:7015
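
The layout above follows a regular pattern, so it can be generated rather than typed by hand. The sketch below is one way to express that pattern, inferred from the three lists: master number i uses port base+i on the hosts in rotation, and its slave uses port base+10+i on the next host. The names HOSTS and plan_cluster are made up for this example.

```python
HOSTS = ["10.4.7.221", "10.4.7.222", "10.4.7.223"]


def plan_cluster(base_port, master_count, hosts=HOSTS):
    """Return (master, slave) address pairs following the article's layout."""
    pairs = []
    for i in range(1, master_count + 1):
        master = "{}:{}".format(hosts[(i - 1) % len(hosts)], base_port + i)
        # The slave uses port master+10 and lives on the next host in rotation,
        # so a master and its slave never share a machine.
        slave = "{}:{}".format(hosts[i % len(hosts)], base_port + 10 + i)
        pairs.append((master, slave))
    return pairs


for name, base, count in [("A", 8000, 3), ("B", 9000, 4), ("C", 7000, 5)]:
    print("Cluster", name)
    for master, slave in plan_cluster(base, count):
        print("  master", master, "slave", slave)
```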

Directory layout

Log directory: /app/cachecloud/logs
Configuration directory: /app/cachecloud/conf
Data directory: /app/cachecloud/data

Configuration file template. Create a configuration file for every planned instance under /app/cachecloud/conf. The Redis build and installation steps are omitted here.

port 8001
cluster-enabled yes
cluster-node-timeout 15000
cluster-config-file "nodes-8001.conf"
bind 0.0.0.0
daemonize yes
logfile "/app/cachecloud/logs/redis-a-8001.log"
dir /app/cachecloud/data
dbfilename dump-8001.rdb
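
Since only the port and the cluster letter change between instances, the template can be filled in programmatically. The following is a minimal sketch that renders the text for one instance; the helper names and the file naming scheme redis-<cluster>-<port>.conf are assumptions for illustration (the article does not state the actual file names, only that the startup loop looks for names containing "redis").

```python
CONF_TEMPLATE = """\
port {port}
cluster-enabled yes
cluster-node-timeout 15000
cluster-config-file "nodes-{port}.conf"
bind 0.0.0.0
daemonize yes
logfile "/app/cachecloud/logs/redis-{cluster}-{port}.log"
dir /app/cachecloud/data
dbfilename dump-{port}.rdb
"""


def render_conf(cluster, port):
    """Fill the template for one instance of the given cluster (A, B or C)."""
    return CONF_TEMPLATE.format(cluster=cluster.lower(), port=port)


def conf_filename(cluster, port):
    # Hypothetical naming scheme, chosen only so the name contains "redis".
    return "redis-{}-{}.conf".format(cluster.lower(), port)


print(conf_filename("A", 8001))
print(render_conf("A", 8001))
```

Writing each rendered string to /app/cachecloud/conf on the matching machine is left out on purpose, so the sketch has no side effects.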

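Adjust the system settings on each machine

# Disable Linux transparent huge pages
echo never > /sys/kernel/mm/transparent_hugepage/enabled
# Let the kernel allow all memory allocations regardless of the current memory state
echo 1 > /proc/sys/vm/overcommit_memory
sysctl vm.overcommit_memory=1
# Stop the firewall (important: otherwise cluster communication fails)
systemctl stop firewalld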

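To rebuild the clusters from scratch, wipe the existing state first (kill the redis processes, then delete the cluster nodes configuration files, log files and data files)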

cd /app/cachecloud/conf;ps -ef | grep redis-server | grep -v grep | awk '{print $2}' | xargs kill;rm -f dump.rdb nodes-*.conf;rm -f /app/cachecloud/logs/*.log;rm -f /app/cachecloud/data/*.rdb 

Run the following on 221, 222 and 223 to start the instances

cd /app/cachecloud/conf;for i in `ls -l | grep redis | awk '{print $9}'`;do echo `redis-server $i`; done; 

Run the cluster meet commands on server 221

# Cluster A
redis-cli -h 10.4.7.221 -p 8001 cluster meet 10.4.7.221 8001
redis-cli -h 10.4.7.221 -p 8001 cluster meet 10.4.7.222 8011
redis-cli -h 10.4.7.221 -p 8001 cluster meet 10.4.7.222 8002
redis-cli -h 10.4.7.221 -p 8001 cluster meet 10.4.7.223 8012
redis-cli -h 10.4.7.221 -p 8001 cluster meet 10.4.7.223 8003
redis-cli -h 10.4.7.221 -p 8001 cluster meet 10.4.7.221 8013

# Cluster B
redis-cli -h 10.4.7.221 -p 9001 cluster meet 10.4.7.221 9001
redis-cli -h 10.4.7.221 -p 9001 cluster meet 10.4.7.222 9011
redis-cli -h 10.4.7.221 -p 9001 cluster meet 10.4.7.222 9002
redis-cli -h 10.4.7.221 -p 9001 cluster meet 10.4.7.223 9012
redis-cli -h 10.4.7.221 -p 9001 cluster meet 10.4.7.223 9003
redis-cli -h 10.4.7.221 -p 9001 cluster meet 10.4.7.221 9013
redis-cli -h 10.4.7.221 -p 9001 cluster meet 10.4.7.221 9004
redis-cli -h 10.4.7.221 -p 9001 cluster meet 10.4.7.222 9014

# Cluster C
redis-cli -h 10.4.7.221 -p 7001 cluster meet 10.4.7.221 7001
redis-cli -h 10.4.7.221 -p 7001 cluster meet 10.4.7.222 7011
redis-cli -h 10.4.7.221 -p 7001 cluster meet 10.4.7.222 7002
redis-cli -h 10.4.7.221 -p 7001 cluster meet 10.4.7.223 7012
redis-cli -h 10.4.7.221 -p 7001 cluster meet 10.4.7.223 7003
redis-cli -h 10.4.7.221 -p 7001 cluster meet 10.4.7.221 7013
redis-cli -h 10.4.7.221 -p 7001 cluster meet 10.4.7.221 7004
redis-cli -h 10.4.7.221 -p 7001 cluster meet 10.4.7.222 7014
redis-cli -h 10.4.7.221 -p 7001 cluster meet 10.4.7.222 7005
redis-cli -h 10.4.7.221 -p 7001 cluster meet 10.4.7.223 7015

Assign slots to the three clusters, deliberately keeping the ranges given to each instance somewhat fragmented

redis-cli -h 10.4.7.221 -p 8001 cluster addslots {0..5555}
redis-cli -h 10.4.7.222 -p 8002 cluster addslots {5556..11112}
redis-cli -h 10.4.7.223 -p 8003 cluster addslots {11113..16383}

redis-cli -h 10.4.7.221 -p 9001 cluster addslots {0..4096}
redis-cli -h 10.4.7.222 -p 9002 cluster addslots {4097..8192}
redis-cli -h 10.4.7.223 -p 9003 cluster addslots {8193..12288}
redis-cli -h 10.4.7.221 -p 9004 cluster addslots {12289..16383}

redis-cli -h 10.4.7.221 -p 7001 cluster addslots {0..3278}
redis-cli -h 10.4.7.222 -p 7002 cluster addslots {3279..6556}
redis-cli -h 10.4.7.223 -p 7003 cluster addslots {6557..9834}
redis-cli -h 10.4.7.221 -p 7004 cluster addslots {9835..13112}
redis-cli -h 10.4.7.222 -p 7005 cluster addslots {13113..16383}
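
The ranges above were chosen by hand. For comparison, the sketch below shows one way to split the 16384 slots of a Redis cluster into contiguous, near-equal ranges for any number of masters. It is an illustrative alternative, not the method used in this article, so its boundaries differ slightly from the hand-picked ones; the name split_slots is made up for this example.

```python
TOTAL_SLOTS = 16384  # fixed number of hash slots in a Redis cluster


def split_slots(master_count):
    """Return contiguous (first, last) slot ranges, one per master."""
    base, extra = divmod(TOTAL_SLOTS, master_count)
    ranges = []
    start = 0
    for i in range(master_count):
        # The first `extra` masters take one additional slot each.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


for first, last in split_slots(3):
    # Print the range in the bash brace-expansion form used above.
    print("cluster addslots {{{}..{}}}".format(first, last))
```

Whatever boundaries are used, the ranges must cover slots 0 to 16383 with no gaps and no overlaps, otherwise the cluster does not reach the ok state.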

On 221, attach each slave to its master

masternodeid=`redis-cli -h 10.4.7.221 -p 8001 cluster nodes | grep :8001 | awk '{print $1}'`;redis-cli -h 10.4.7.222 -p 8011 cluster replicate $masternodeid;
masternodeid=`redis-cli -h 10.4.7.221 -p 8001 cluster nodes | grep :8002 | awk '{print $1}'`;redis-cli -h 10.4.7.223 -p 8012 cluster replicate $masternodeid;
masternodeid=`redis-cli -h 10.4.7.221 -p 8001 cluster nodes | grep :8003 | awk '{print $1}'`;redis-cli -h 10.4.7.221 -p 8013 cluster replicate $masternodeid;

masternodeid=`redis-cli -h 10.4.7.221 -p 9001 cluster nodes | grep :9001 | awk '{print $1}'`;redis-cli -h 10.4.7.222 -p 9011 cluster replicate $masternodeid;
masternodeid=`redis-cli -h 10.4.7.221 -p 9001 cluster nodes | grep :9002 | awk '{print $1}'`;redis-cli -h 10.4.7.223 -p 9012 cluster replicate $masternodeid;
masternodeid=`redis-cli -h 10.4.7.221 -p 9001 cluster nodes | grep :9003 | awk '{print $1}'`;redis-cli -h 10.4.7.221 -p 9013 cluster replicate $masternodeid;
masternodeid=`redis-cli -h 10.4.7.221 -p 9001 cluster nodes | grep :9004 | awk '{print $1}'`;redis-cli -h 10.4.7.222 -p 9014 cluster replicate $masternodeid;

masternodeid=`redis-cli -h 10.4.7.221 -p 7001 cluster nodes | grep :7001 | awk '{print $1}'`;redis-cli -h 10.4.7.222 -p 7011 cluster replicate $masternodeid;
masternodeid=`redis-cli -h 10.4.7.221 -p 7001 cluster nodes | grep :7002 | awk '{print $1}'`;redis-cli -h 10.4.7.223 -p 7012 cluster replicate $masternodeid;
masternodeid=`redis-cli -h 10.4.7.221 -p 7001 cluster nodes | grep :7003 | awk '{print $1}'`;redis-cli -h 10.4.7.221 -p 7013 cluster replicate $masternodeid;
masternodeid=`redis-cli -h 10.4.7.221 -p 7001 cluster nodes | grep :7004 | awk '{print $1}'`;redis-cli -h 10.4.7.222 -p 7014 cluster replicate $masternodeid;
masternodeid=`redis-cli -h 10.4.7.221 -p 7001 cluster nodes | grep :7005 | awk '{print $1}'`;redis-cli -h 10.4.7.223 -p 7015 cluster replicate $masternodeid;

Once the clusters are ready, check their state. Record the output after every operation so that changes in instance state can be compared.

redis-cli -p 8001 cluster nodes 
redis-cli -p 9001 cluster nodes
redis-cli -p 7001 cluster nodes


[root@redis-7-221 conf]# redis-cli -p 8001 cluster nodes
b03eb3e887c67a6305c7a205d5bfc4d6e222dd74 10.4.7.222:8002 master - 0 1648189621790 3 connected 5556-11112
ee340787e8a39a59f9169386bef30d70d2f1fb62 10.4.7.222:8011 slave b850d083307af116b1f61401cb7d01e2d162ad38 0 1648189623797 1 connected
b850d083307af116b1f61401cb7d01e2d162ad38 10.4.7.221:8001 myself,master - 0 0 1 connected 0-5555
3e42b6080162017b33ae87caac67daad9eb77046 10.4.7.221:8013 slave e656d5a698046fcc4ccd341f0b9bdaf172f2d5ed 0 1648189624800 5 connected
0afd6dd77395042026c1e358af7dd535fd4a5f0d 10.4.7.223:8012 slave b03eb3e887c67a6305c7a205d5bfc4d6e222dd74 0 1648189623295 4 connected
e656d5a698046fcc4ccd341f0b9bdaf172f2d5ed 10.4.7.223:8003 master - 0 1648189622794 5 connected 11113-16383

[root@redis-7-221 conf]# redis-cli -p 9001 cluster nodes
cc4191deafddf42db21ee3cdab4f906931237ca9 10.4.7.223:9012 slave 5f201e00106a512ab3a3d73455ee1269b367b204 0 1648189622893 5 connected
37dc860d2cc031c7d0a052bb34c07ad1625cbe8f 10.4.7.222:9011 slave 335a0cb8d9d82a764a19bf71da6379b73c703e95 0 1648189619884 7 connected
5f201e00106a512ab3a3d73455ee1269b367b204 10.4.7.222:9002 master - 0 1648189620888 4 connected 4097-8192
577e5ea9c7f56c3767cfcfa23905742d97448b02 10.4.7.221:9013 slave ca7e61ba84d005dc32570377a7e57cdc2b1ea395 0 1648189617878 3 connected
097271ce6ad0d7c5b4e5b80a645058bd2bb0099f 10.4.7.221:9004 master - 0 1648189623898 6 connected 12289-16383
dcf4c5f00f922e67094ebd0bf0cfe61701c0c3fe 10.4.7.222:9014 slave 097271ce6ad0d7c5b4e5b80a645058bd2bb0099f 0 1648189621891 6 connected
ca7e61ba84d005dc32570377a7e57cdc2b1ea395 10.4.7.223:9003 master - 0 1648189622392 2 connected 8193-12288
335a0cb8d9d82a764a19bf71da6379b73c703e95 10.4.7.221:9001 myself,master - 0 0 1 connected 0-4096

[root@redis-7-221 conf]# redis-cli -p 7001 cluster nodes
87a0bf02cbc2e7eebee500840a346dc7d4fe2603 10.4.7.222:7002 master - 0 1648189621892 4 connected 3279-6556
e4aa6c4158b31d8d5f1aa048571f725522d521d7 10.4.7.221:7001 myself,master - 0 0 0 connected 0-3278
387564ccc40c9d38a335f9af6276fa4545456c3c 10.4.7.222:7014 slave 1f984fd46a8e343d9c421fbab04f1f5696e059d4 0 1648189623900 9 connected
94985c7576c2a0ab05e3cb3bd3b63c75b137ddf3 10.4.7.223:7003 master - 0 1648189624904 6 connected 6557-9834
955014596d1929366f655ad6a020195432da917e 10.4.7.221:7013 slave 94985c7576c2a0ab05e3cb3bd3b63c75b137ddf3 0 1648189620887 6 connected
033229faffaede7d873acdc6d7c20a580ee9aa66 10.4.7.223:7015 slave 3df4258d28c93acb80444e56a330c5331681d413 0 1648189620387 7 connected
3db9f8bb3bbadd6ddaacf61d9d2b0fd6cd883e54 10.4.7.223:7012 slave 87a0bf02cbc2e7eebee500840a346dc7d4fe2603 0 1648189622393 5 connected
1f984fd46a8e343d9c421fbab04f1f5696e059d4 10.4.7.221:7004 master - 0 1648189622894 9 connected 9835-13112
3df4258d28c93acb80444e56a330c5331681d413 10.4.7.222:7005 master - 0 1648189622394 3 connected 13113-16383
966e18cfe9f02771752bac772918ec872d00bfb9 10.4.7.222:7011 slave e4aa6c4158b31d8d5f1aa048571f725522d521d7 0 1648189619886 2 connected
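
Comparing these listings by eye after each operation is error-prone, because the line order and the timestamp fields change between runs. The sketch below is one possible way to compare two saved snapshots: it keeps only the address, flags, link state and slots of each node and reports the addresses whose entry changed. The function names are assumptions made for this example.

```python
def parse_nodes(cluster_nodes_output):
    """Map address -> (flags, link_state, slots) for each `cluster nodes` line."""
    nodes = {}
    for line in cluster_nodes_output.splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue  # skip blank or malformed lines
        address = fields[1].split("@")[0]
        # "myself" depends on which node was queried, so it is ignored.
        flags = tuple(f for f in fields[2].split(",") if f != "myself")
        nodes[address] = (flags, fields[7], tuple(fields[8:]))
    return nodes


def diff_snapshots(before, after):
    """Return (address, old_entry, new_entry) for every node that changed."""
    old, new = parse_nodes(before), parse_nodes(after)
    changes = []
    for address in sorted(set(old) | set(new)):
        if old.get(address) != new.get(address):
            changes.append((address, old.get(address), new.get(address)))
    return changes
```

A node that appears in only one snapshot is reported with None on the other side, which is how a foreign node pulled in from another cluster would show up.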

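Run the following python scripts to insert 1,000,000 records into each of the three clusters. Each cluster's data should follow its own naming rule, which makes it easy to see how the data is distributed after the clusters are mixed.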

# Script for cluster A
import random

# StrictRedisCluster is the client class of redis-py-cluster 1.x
# (2.x renamed it to RedisCluster)
from rediscluster import StrictRedisCluster


# Generate a random string
def generate_random_str(randomlength=16):
    random_str = ''
    base_str = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'
    length = len(base_str) - 1
    for i in range(randomlength):
        random_str += base_str[random.randint(0, length)]
    return random_str


def generate_random_num(randomlength=100):
    return random.randint(0, randomlength)


# Cluster A
startup_nodes = [
    {"host": "10.4.7.221", "port": 8001},
    {"host": "10.4.7.222", "port": 8011},
    {"host": "10.4.7.222", "port": 8002},
    {"host": "10.4.7.223", "port": 8012},
    {"host": "10.4.7.223", "port": 8003},
    {"host": "10.4.7.221", "port": 8013}
]

redis_conn = StrictRedisCluster(
    startup_nodes=startup_nodes, decode_responses=True, password='')
p = redis_conn.pipeline()
for i in range(0, 1000000):
    # String key tagged with the cluster name
    p.set('cluster:A:hello_'+str(i).zfill(8), generate_random_str(10))
    if i % 1000 == 0:
        p.execute()
        print("========>executed:{}".format(i))
# Flush the commands queued after the last multiple of 1000
p.execute()

# Script for cluster B
import random

# StrictRedisCluster is the client class of redis-py-cluster 1.x
# (2.x renamed it to RedisCluster)
from rediscluster import StrictRedisCluster


# Generate a random string
def generate_random_str(randomlength=16):
    random_str = ''
    base_str = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'
    length = len(base_str) - 1
    for i in range(randomlength):
        random_str += base_str[random.randint(0, length)]
    return random_str


def generate_random_num(randomlength=100):
    return random.randint(0, randomlength)


# Cluster B
startup_nodes = [
    {"host": "10.4.7.221", "port": 9001},
    {"host": "10.4.7.222", "port": 9011},
    {"host": "10.4.7.222", "port": 9002},
    {"host": "10.4.7.223", "port": 9012},
    {"host": "10.4.7.223", "port": 9003},
    {"host": "10.4.7.221", "port": 9013},
    {"host": "10.4.7.221", "port": 9004},
    {"host": "10.4.7.222", "port": 9014}
]

redis_conn = StrictRedisCluster(
    startup_nodes=startup_nodes, decode_responses=True, password='')

p = redis_conn.pipeline()
for i in range(0, 1000000):
    # String key tagged with the cluster name
    p.set('cluster:B:hello_'+str(i).zfill(8), generate_random_str(10))
    if i % 1000 == 0:
        p.execute()
        print("========>executed:{}".format(i))
# Flush the commands queued after the last multiple of 1000
p.execute()

# Script for cluster C
import random

# StrictRedisCluster is the client class of redis-py-cluster 1.x
# (2.x renamed it to RedisCluster)
from rediscluster import StrictRedisCluster


# Generate a random string
def generate_random_str(randomlength=16):
    random_str = ''
    base_str = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'
    length = len(base_str) - 1
    for i in range(randomlength):
        random_str += base_str[random.randint(0, length)]
    return random_str


def generate_random_num(randomlength=100):
    return random.randint(0, randomlength)


# Cluster C
startup_nodes = [
    {"host": "10.4.7.221", "port": 7001},
    {"host": "10.4.7.222", "port": 7011},
    {"host": "10.4.7.222", "port": 7002},
    {"host": "10.4.7.223", "port": 7012},
    {"host": "10.4.7.223", "port": 7003},
    {"host": "10.4.7.221", "port": 7013},
    {"host": "10.4.7.221", "port": 7004},
    {"host": "10.4.7.222", "port": 7014},
    {"host": "10.4.7.222", "port": 7005},
    {"host": "10.4.7.223", "port": 7015}
]

redis_conn = StrictRedisCluster(
    startup_nodes=startup_nodes, decode_responses=True, password='')

p = redis_conn.pipeline()
for i in range(0, 1000000):
    # String key tagged with the cluster name
    p.set('cluster:C:hello_'+str(i).zfill(8), generate_random_str(10))
    if i % 1000 == 0:
        p.execute()
        print("========>executed:{}".format(i))
# Flush the commands queued after the last multiple of 1000
p.execute()
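
To reason about which slots conflict between clusters, it helps to know which slot a given key lands in. Redis Cluster maps a key to slot CRC16(key) mod 16384, hashing only the hash tag when the key contains a non-empty {...} section. The sketch below reproduces that calculation offline with the Python standard library (binascii.crc_hqx with an initial value of 0 computes the same CRC16 variant) and looks the slot up in the ranges assigned to cluster A earlier in this article. The names key_slot, owner_of and CLUSTER_A_SLOTS are made up for this example.

```python
import binascii


def key_slot(key):
    """Compute the Redis Cluster hash slot of a key, honoring hash tags."""
    data = key.encode("utf-8")
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        if end > start + 1:  # only a non-empty hash tag is used
            data = data[start + 1:end]
    return binascii.crc_hqx(data, 0) % 16384


# Slot ranges given to cluster A's masters by the addslots commands above.
CLUSTER_A_SLOTS = {
    "10.4.7.221:8001": (0, 5555),
    "10.4.7.222:8002": (5556, 11112),
    "10.4.7.223:8003": (11113, 16383),
}


def owner_of(key, slot_ranges):
    """Return (master address, slot) for a key, or (None, slot) if unassigned."""
    slot = key_slot(key)
    for address, (first, last) in slot_ranges.items():
        if first <= slot <= last:
            return address, slot
    return None, slot


print(owner_of("cluster:A:hello_00000000", CLUSTER_A_SLOTS))
```

On a live node the same slot number can be cross-checked with the cluster keyslot command.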

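After inserting the data, record the key count and memory usage of every node, for comparison with the data distribution after the clusters are mixed.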

redis-cli -h 10.4.7.221 -p 8001 dbsize
redis-cli -h 10.4.7.222 -p 8002 dbsize
redis-cli -h 10.4.7.223 -p 8003 dbsize
redis-cli -h 10.4.7.221 -p 9001 dbsize
redis-cli -h 10.4.7.222 -p 9002 dbsize
redis-cli -h 10.4.7.223 -p 9003 dbsize
redis-cli -h 10.4.7.221 -p 9004 dbsize
redis-cli -h 10.4.7.221 -p 7001 dbsize
redis-cli -h 10.4.7.222 -p 7002 dbsize
redis-cli -h 10.4.7.223 -p 7003 dbsize
redis-cli -h 10.4.7.221 -p 7004 dbsize
redis-cli -h 10.4.7.222 -p 7005 dbsize

redis-cli -h 10.4.7.222 -p 8011 dbsize
redis-cli -h 10.4.7.223 -p 8012 dbsize
redis-cli -h 10.4.7.221 -p 8013 dbsize
redis-cli -h 10.4.7.222 -p 9011 dbsize
redis-cli -h 10.4.7.223 -p 9012 dbsize
redis-cli -h 10.4.7.221 -p 9013 dbsize
redis-cli -h 10.4.7.222 -p 9014 dbsize
redis-cli -h 10.4.7.222 -p 7011 dbsize
redis-cli -h 10.4.7.223 -p 7012 dbsize
redis-cli -h 10.4.7.221 -p 7013 dbsize
redis-cli -h 10.4.7.222 -p 7014 dbsize
redis-cli -h 10.4.7.223 -p 7015 dbsize

redis-cli -h 10.4.7.221 -p 8001 info memory | grep used_memory_human
redis-cli -h 10.4.7.222 -p 8002 info memory | grep used_memory_human
redis-cli -h 10.4.7.223 -p 8003 info memory | grep used_memory_human
redis-cli -h 10.4.7.221 -p 9001 info memory | grep used_memory_human
redis-cli -h 10.4.7.222 -p 9002 info memory | grep used_memory_human
redis-cli -h 10.4.7.223 -p 9003 info memory | grep used_memory_human
redis-cli -h 10.4.7.221 -p 9004 info memory | grep used_memory_human
redis-cli -h 10.4.7.221 -p 7001 info memory | grep used_memory_human
redis-cli -h 10.4.7.222 -p 7002 info memory | grep used_memory_human
redis-cli -h 10.4.7.223 -p 7003 info memory | grep used_memory_human
redis-cli -h 10.4.7.221 -p 7004 info memory | grep used_memory_human
redis-cli -h 10.4.7.222 -p 7005 info memory | grep used_memory_human

redis-cli -h 10.4.7.222 -p 8011 info memory | grep used_memory_human
redis-cli -h 10.4.7.223 -p 8012 info memory | grep used_memory_human
redis-cli -h 10.4.7.221 -p 8013 info memory | grep used_memory_human
redis-cli -h 10.4.7.222 -p 9011 info memory | grep used_memory_human
redis-cli -h 10.4.7.223 -p 9012 info memory | grep used_memory_human
redis-cli -h 10.4.7.221 -p 9013 info memory | grep used_memory_human
redis-cli -h 10.4.7.222 -p 9014 info memory | grep used_memory_human
redis-cli -h 10.4.7.222 -p 7011 info memory | grep used_memory_human
redis-cli -h 10.4.7.223 -p 7012 info memory | grep used_memory_human
redis-cli -h 10.4.7.221 -p 7013 info memory | grep used_memory_human
redis-cli -h  
