Klustron Shard 故障隔离测试
大约 4 分钟
Klustron Shard 故障隔离测试
注意:
如无特别说明,文中的版本号可以使用任何已发布版本的版本号代替。所有已发布版本详见:http://doc.klustron.com/zh/Release_notes.html。
本文内容:
Klustron的金融级高可靠性,由一系列能力和技术组成了一个完整的技术体系,这些能力和机制坚如磐石,组成了Klustron坚不可摧的金融级更高可靠性技术体系,这其中当然也包括了故障隔离能力。本文主要测试当集群中的某一个shard主备完全故障之后,Klustron集群仍旧能够部分正常工作。故障shard上面存储的数据无法操作,但其他shard的数据则可以进行正常的读写和提交。
1. 测试环境
测试集群相关机器的配置信息如下:
名称 | 节点类别 | IP | 端口 | Shard_ID |
---|---|---|---|---|
comp3 | 计算节点 | 192.168.40.152 | 47001 | N/A |
shard_1 | 存储主节点 | 192.168.40.151 | 57003 | 6 |
存储备节点 | 192.168.40.152 | |||
存储备节点 | 192.168.40.153 | |||
shard_2 | 存储主节点 | 192.168.40.152 | 57005 | 7 |
存储备节点 | 192.168.40.153 | |||
存储备节点 | 192.168.40.151 | |||
shard_3 | 存储主节点 | 192.168.40.153 | 57007 | 5 |
存储备节点 | 192.168.40.152 | |||
存储备节点 | 192.168.40.151 |
2. 环境准备
创建测试数据库和用户
psql -h 192.168.40.152 -p 47001 postgres
create user kunlun_test with password 'kunlun';
create database testdb owner kunlun_test;
grant all privileges on database testdb to kunlun_test;
准备测试表sales_order,其按月分区。2023年1月分区的数据放在了shard_3上,2月数据放在了shard_1上,3月的数据则放在了shard_2上。
psql -h 192.168.40.152 -p 47001 -U kunlun_test testdb
CREATE TABLE sales_order
(
order_number INT NOT NULL,
customer_number INT NOT NULL,
product_code INT NOT NULL,
order_date DATETIME NOT NULL,
entry_date DATETIME NOT NULL,
order_amount DECIMAL(18,2) NOT NULL,
PRIMARY KEY (order_number,order_date)
) partition by range(order_date);
CREATE TABLE sales_order_202301 PARTITION OF sales_order
FOR VALUES FROM ('2023-01-01') TO ('2023-02-01') with (Shard=5);
CREATE TABLE sales_order_202302 PARTITION OF sales_order
FOR VALUES FROM ('2023-02-01') TO ('2023-03-01') with (Shard=6);
CREATE TABLE sales_order_202303 PARTITION OF sales_order
FOR VALUES FROM ('2023-03-01') TO ('2023-04-01') with (Shard=7);
并往其中灌入10000条数据
create or replace procedure generate_order_data()
AS $$
DECLARE
v_customer_number integer;
v_product_code integer;
v_order_date date;
v_amount integer;
start_date date := to_date('2023-01-01','yyyy-mm-dd');
i integer :=1;
BEGIN
while i<=10000 loop
v_customer_number := FLOOR(1+RANDOM()*6);
v_product_code := FLOOR(1+RANDOM()*1000);
v_order_date := start_date + CAST(FLOOR(RANDOM()*90) AS INT);
v_amount := FLOOR(1000+RANDOM()*9000);
INSERT INTO sales_order VALUES (i,v_customer_number,v_product_code,v_order_date,v_order_date,v_amount);
commit;
i := i+1;
end loop;
END; $$
LANGUAGE plpgsql;
call generate_order_data();
analyze sales_order;
testdb=> select count(*) from sales_order;
count
-------
10000
(1 row)
testdb=> select count(*) from sales_order where order_date between '2023-01-01' and '2023-01-31';
count
-------
3417
(1 row)
testdb=> select count(*) from sales_order where order_date between '2023-02-01' and '2023-02-28';
count
-------
3086
(1 row)
testdb=> select count(*) from sales_order where order_date between '2023-03-01' and '2023-03-31';
count
-------
3497
(1 row)
3. 模拟整个 shard 故障
此时需要模拟shard_1所有的主备节点都发生故障。由于Klustron有自动拉起失败服务的功能,所以先需要将shard_1的数据文件目录移动位置,在shard_1的主备节点都执行
cd /kunlun/storage_datadir
mv 57003 57003_bak
然后再分别在shard_1的主备节点使用kill -9命令杀掉在57003端口上的mysqld和mysqld_safe进程。
4. 故障隔离测试
回到计算节点重新查询2023年2月的数据,则会抛出异常:
testdb=> select count(*) from sales_order where order_date between '2023-02-01' and '2023-02-28';
ERROR: Kunlun-db: Failed to connect to mysql storage node at (192.168.40.151, 57003): 2002, Can't connect to server on '192.168.40.151' (111)
查询1月和3月的数据则能正常返回结果
testdb=> select count(*) from sales_order where order_date between '2023-01-01' and '2023-01-31';
count
-------
3417
(1 row)
testdb=> select count(*) from sales_order where order_date between '2023-03-01' and '2023-03-31';
count
-------
3497
(1 row)
同样的,1月和3月数据对应的shard还能正常插入。
testdb=> begin;
BEGIN
testdb=> select count(*) from sales_order where order_date between '2023-01-01' and '2023-01-31';
count
-------
3417
(1 row)
testdb=> select count(*) from sales_order where order_date between '2023-03-01' and '2023-03-31';
count
-------
3497
(1 row)
testdb=> insert into sales_order values(10001,1,1000,'2023-01-31','2023-01-31',1800);
INSERT 0 1
testdb=> insert into sales_order values(10002,1,1000,'2023-03-31','2023-03-31',2000);
INSERT 0 1
testdb=> select count(*) from sales_order where order_date between '2023-01-01' and '2023-01-31';
count
-------
3418
(1 row)
testdb=> select count(*) from sales_order where order_date between '2023-03-01' and '2023-03-31';
count
-------
3498
(1 row)
testdb=> commit;
但是如果事务中涉及操作故障shard的数据,则会报错。
testdb=> begin;
BEGIN
testdb=> select count(*) from sales_order where order_date between '2023-03-01' and '2023-03-31';
count
-------
3498
(1 row)
testdb=> update sales_order set order_amount=1000 where order_date='2023-03-31';
UPDATE 113
testdb=> insert into sales_order values(10003,1,1000,'2023-02-14','2023-02-14',1800);
ERROR: Kunlun-db: Failed to connect to mysql storage node at (192.168.40.151, 57003): 2002, Can't connect to server on '192.168.40.151' (111)
testdb=> insert into sales_order values(10005,2,1000,'2023-03-30','2023-03-30',2000);
ERROR: current transaction is aborted, commands ignored until end of transaction block
testdb=> rollback;
ROLLBACK
通过上面的测试可以看到,故障shard上面存储的数据无法提供任何服务了,但其他正常shard的数据还可以进行正常的读写和事务。
【END】