Skip to main content

Project and Filter Pushdown

KlustronAbout 3 min

Project and Filter Pushdown

Background

The previous article discussed the Query Optimization Process of Klustron, and this article will demonstrate the pushdown of Project and Filter operations.

1. Basic Information of the Test Table

1.1 Test Environment

This test demonstrates the pushdown of project and filter operations.

The database cluster in the test environment consists of four data nodes (DN) configured as two shards (shard1 and shard2), with each shard node composed of a master node and a slave node (shard1 has two nodes in data replication relationship, and shard2 also has two nodes in data replication relationship, with different sharded data tables stored in shard1 and shard2). As shown in the figure below: ![](Project 和 Filter 下推/1.png)

The node information of the cluster environment can be displayed by the following statement:

select t1.name, t2.shard_id, t2.hostaddr, t2.port, 
t2.user_name, t2.passwd from pg_shard t1, pg_shard_node t2  
where t2.shard_id=t1.id;

The results are shown in the following figure:

![](Project 和 Filter 下推/2.png)

1.2 Table Structure

The table structure for this test is as follows:

![](Project 和 Filter 下推/3.png)

1.3 Sharding Information

Table "Customer1" is partitioned by range according to the "c_id" field, and the corresponding partitioned tables are Customer1_1,Customer1_2,Customer1_3,Customer1_4。

The four sharded data are stored in two shards respectively:

select t1.nspname, t2.relname,t2.relshardid, t2.relkind
from pg_namespace t1 join pg_class t2 ont1.oid = t2.relnamespace 
where t2.relshardid != 0 and relkind='r' and relname like 
'%customer1%' order by t2.relname;

The results are shown in the following figure:

![](Project 和 Filter 下推/4.png)

1.4 Data Distribution

Data is stored in different sharded storage based on the sharding rules.

The data distribution of the four partitions is shown in the following figure:
![](Project 和 Filter 下推/5.png)

2. Query Process

2.1 Filter

Filtering is used to locate the scope of the query. By filtering, the number of data blocks read and written can be reduced.

Execution process: After the computing node interprets the SQL statement, RemoteScan pushes the query statement to the corresponding storage node for execution based on the shard information of the target table "customer1" (filtering out irrelevant shards). The storage node returns the query result to the computing node.

The computing node adopts an asynchronous operation mode for query pushdown, and multiple storage nodes can be executed in parallel.

The following statement is pushed down to the corresponding storage node (shard2) according to the range of the shard key value (this is the first filtering of the SQL execution engine).

explain select c_id,c_first,c_middle from customer1
where c_id=4;

The result is shown in the following figure:

![](Project 和 Filter 下推/6.png)

The marked place is the filtering operation in the execution plan.

If the query condition does not have the sharding key, RemoteScan will send the statement to all sharding nodes, and filter the sharding results according to the query conditions.

explain select c_id ,c_middle,c_data from customer1
where c_middle='OE';

The result is shown in the following figure:

![](Project 和 Filter 下推/7.png)

Klustron supports using functions and various complex conditions in the query conditions.

explain select count(*) from customer1 
where sqrt(c_discount)>0.5;

The result is shown in the following figure:

![](Project 和 Filter 下推/8.png)

2.2 Project

After the computing node interprets the SQL statement, RemoteScan pushes the query statement to the corresponding storage node for execution based on the shard information of the target table "customer1". The storage node returns part of the fields of the query result row (project) to the computing node.

The Project operation reduces the actual number of fields returned to the computing node (only the fields required by the query will be read into the memory of the computing node, and irrelevant fields will not be read in).

The execution process is shown in the following figure:

![](Project 和 Filter 下推/9.png)

explain select c_id,c_first,c_middle from customer1;

The result is shown below:

![](Project 和 Filter 下推/10.png)

In the execution process of the above statement, RemoteScan only returns the data of fields c_id, c_first, and c_middle to the computing node.

Klustron supports functions and various complex conditions on fields in Project.

explain selectc_balance + c_ytd_payment  from customer1
where c_id =20;

The result is shown below:

![](Project 和 Filter 下推/11.png)

The storage node only returns the result of c_balance + c_ytd_payment to the computing node.

2.3 Benefits of Sharding

Distributed databases transparently distribute the database across different storage nodes, resulting in the following benefits:

  • Application transparency, meaning that applications can access data without any modifications.

  • Storage nodes perform read and project operations on data, reducing the workload of computing nodes (lower CPU and memory consumption).

  • Data is distributed across different storage nodes, facilitating the system's ability to flexibly extend IO capacity and avoid IO hotspots.

  • Project and filter operations reduce the range of data block reads and writes, improving query efficiency.

3. Performance Comparison

To compare query efficiency, we created a new table in the same environment as customer1 with the exact same amount of data. We then executed the same query on both the sharded and non-sharded tables.

3.1 Query on Sharded Table

select now(); select count(*)  from customer1 
where c_id between 4 and 10;select now();

Result: ![](Project 和 Filter 下推/12.png)

Time Taken: 0.05s

3.2 Query on Non-Sharded Table

We created a new table using the same schema as customer1:

CREATE TABLEcustomer1temp  (like  customer1);
insert intocustomer1temp select * From  customer1

Result:

![](Project 和 Filter 下推/13.png)

We then ran the following query:

select now(); select count(*) from customer1temp  
where c_idbetween 4 and 10 ;select now();

Result: ![](Project 和 Filter 下推/14.png)

Time Taken: 0.69s

Conclusion

Using project and filter operations, sharding can improve data query efficiency.

END