跳至主要內容

Klustron Shard 故障隔离测试

Klustron大约 4 分钟

Klustron Shard 故障隔离测试

注意:

如无特别说明,文中的版本号可以使用任何已发布版本的版本号代替。所有已发布版本详见:http://doc.klustron.com/zh/Release_notes.html。

本文内容:

Klustron的金融级高可靠性,由一系列能力和技术组成了一个完整的技术体系,这些能力和机制坚如磐石,组成了Klustron坚不可摧的金融级更高可靠性技术体系,这其中当然也包括了故障隔离能力。本文主要测试当集群中的某一个shard主备完全故障之后,Klustron集群仍旧能够部分正常工作。故障shard上面存储的数据无法操作,但其他shard的数据则可以进行正常的读写和提交。

1. 测试环境

测试集群相关机器的配置信息如下:

名称节点类别IP端口Shard_ID
comp3计算节点192.168.40.15247001N/A
shard_1存储主节点192.168.40.151570036
存储备节点192.168.40.152
存储备节点192.168.40.153
shard_2存储主节点192.168.40.152570057
存储备节点192.168.40.153
存储备节点192.168.40.151
shard_3存储主节点192.168.40.153570075
存储备节点192.168.40.152
存储备节点192.168.40.151

2. 环境准备

创建测试数据库和用户

psql -h 192.168.40.152 -p 47001 postgres
create user kunlun_test with password 'kunlun';
create database testdb owner kunlun_test;
grant all privileges on database testdb to kunlun_test;

准备测试表sales_order,其按月分区。2023年1月分区的数据放在了shard_3上,2月数据放在了shard_1上,3月的数据则放在了shard_2上。

psql -h 192.168.40.152 -p 47001 -U kunlun_test testdb

CREATE TABLE sales_order
(
   order_number         INT NOT NULL,
   customer_number      INT NOT NULL,
   product_code         INT NOT NULL,
   order_date           DATETIME NOT NULL,
   entry_date           DATETIME NOT NULL,
   order_amount         DECIMAL(18,2) NOT NULL,
   PRIMARY KEY (order_number,order_date)
) partition by range(order_date);
CREATE TABLE sales_order_202301 PARTITION OF sales_order
FOR VALUES FROM ('2023-01-01') TO ('2023-02-01') with (Shard=5);
CREATE TABLE sales_order_202302 PARTITION OF sales_order
FOR VALUES FROM ('2023-02-01') TO ('2023-03-01') with (Shard=6);
CREATE TABLE sales_order_202303 PARTITION OF sales_order
FOR VALUES FROM ('2023-03-01') TO ('2023-04-01') with (Shard=7);

并往其中灌入10000条数据

create or replace procedure generate_order_data()
AS $$
DECLARE
  v_customer_number integer;
  v_product_code integer;
  v_order_date date;
  v_amount integer;
  start_date date := to_date('2023-01-01','yyyy-mm-dd');
  i integer :=1;
BEGIN
	while i<=10000 loop
		v_customer_number := FLOOR(1+RANDOM()*6);
		v_product_code := FLOOR(1+RANDOM()*1000);
		v_order_date := start_date + CAST(FLOOR(RANDOM()*90) AS INT);
		v_amount := FLOOR(1000+RANDOM()*9000);
		INSERT INTO sales_order VALUES (i,v_customer_number,v_product_code,v_order_date,v_order_date,v_amount);
		commit;
		i := i+1;
	end loop;
END; $$
LANGUAGE plpgsql;
call generate_order_data();
analyze sales_order;
testdb=> select count(*) from sales_order;
 count 
-------
 10000
(1 row)

testdb=> select count(*) from sales_order where order_date between '2023-01-01' and '2023-01-31';
 count 
-------
  3417
(1 row)

testdb=> select count(*) from sales_order where order_date between '2023-02-01' and '2023-02-28';
 count 
-------
  3086
(1 row)

testdb=> select count(*) from sales_order where order_date between '2023-03-01' and '2023-03-31';
 count 
-------
  3497
(1 row)

3. 模拟整个 shard 故障

此时需要模拟shard_1所有的主备节点都发生故障。由于Klustron有自动拉起失败服务的功能,所以先需要将shard_1的数据文件目录移动位置,在shard_1的主备节点都执行

cd /kunlun/storage_datadir
mv 57003 57003_bak

然后再分别在shard_1的主备节点使用kill -9命令杀掉在57003端口上的mysqld和mysqld_safe进程。

4. 故障隔离测试

回到计算节点重新查询2023年2月的数据,则会抛出异常:

testdb=> select count(*) from sales_order where order_date between '2023-02-01' and '2023-02-28';
ERROR:  Kunlun-db: Failed to connect to mysql storage node at (192.168.40.151, 57003): 2002, Can't connect to server on '192.168.40.151' (111)

查询1月和3月的数据则能正常返回结果

testdb=> select count(*) from sales_order where order_date between '2023-01-01' and '2023-01-31';
 count 
-------
  3417
(1 row)

testdb=> select count(*) from sales_order where order_date between '2023-03-01' and '2023-03-31';
 count 
-------
  3497
(1 row)

同样的,1月和3月数据对应的shard还能正常插入。

testdb=> begin;
BEGIN
testdb=> select count(*) from sales_order where order_date between '2023-01-01' and '2023-01-31';
 count 
-------
  3417
(1 row)

testdb=> select count(*) from sales_order where order_date between '2023-03-01' and '2023-03-31';
 count 
-------
  3497
(1 row)

testdb=> insert into sales_order values(10001,1,1000,'2023-01-31','2023-01-31',1800);
INSERT 0 1
testdb=> insert into sales_order values(10002,1,1000,'2023-03-31','2023-03-31',2000);
INSERT 0 1
testdb=> select count(*) from sales_order where order_date between '2023-01-01' and '2023-01-31';
 count 
-------
  3418
(1 row)

testdb=> select count(*) from sales_order where order_date between '2023-03-01' and '2023-03-31';
 count 
-------
  3498
(1 row)

testdb=> commit;

但是如果事务中涉及操作故障shard的数据,则会报错。

testdb=> begin;
BEGIN
testdb=> select count(*) from sales_order where order_date between '2023-03-01' and '2023-03-31';
 count 
-------
  3498
(1 row)

testdb=> update sales_order set order_amount=1000 where order_date='2023-03-31';
UPDATE 113
testdb=> insert into sales_order values(10003,1,1000,'2023-02-14','2023-02-14',1800);
ERROR:  Kunlun-db: Failed to connect to mysql storage node at (192.168.40.151, 57003): 2002, Can't connect to server on '192.168.40.151' (111)
testdb=> insert into sales_order values(10005,2,1000,'2023-03-30','2023-03-30',2000);
ERROR:  current transaction is aborted, commands ignored until end of transaction block
testdb=> rollback;
ROLLBACK

通过上面的测试可以看到,故障shard上面存储的数据无法提供任何服务了,但其他正常shard的数据还可以进行正常的读写和事务。

【END】