Project and Filter Pushdown
Project and Filter Pushdown
Background
The previous article discussed the Query Optimization Process of Klustron, and this article will demonstrate the pushdown of Project and Filter operations.
1. Basic Information of the Test Table
1.1 Test Environment
This test demonstrates the pushdown of project and filter operations.
The database cluster in the test environment consists of four data nodes (DN) configured as two shards (shard1 and shard2), with each shard node composed of a master node and a slave node (shard1 has two nodes in data replication relationship, and shard2 also has two nodes in data replication relationship, with different sharded data tables stored in shard1 and shard2). As shown in the figure below: 
The node information of the cluster environment can be displayed by the following statement:
select t1.name, t2.shard_id, t2.hostaddr, t2.port,
t2.user_name, t2.passwd from pg_shard t1, pg_shard_node t2
where t2.shard_id=t1.id;
The results are shown in the following figure:

1.2 Table Structure
The table structure for this test is as follows:

1.3 Sharding Information
Table "Customer1" is partitioned by range according to the "c_id" field, and the corresponding partitioned tables are Customer1_1,Customer1_2,Customer1_3,Customer1_4。
The four sharded data are stored in two shards respectively:
select t1.nspname, t2.relname,t2.relshardid, t2.relkind
from pg_namespace t1 join pg_class t2 ont1.oid = t2.relnamespace
where t2.relshardid != 0 and relkind='r' and relname like
'%customer1%' order by t2.relname;
The results are shown in the following figure:

1.4 Data Distribution
Data is stored in different sharded storage based on the sharding rules.
The data distribution of the four partitions is shown in the following figure:

2. Query Process
2.1 Filter
Filtering is used to locate the scope of the query. By filtering, the number of data blocks read and written can be reduced.
Execution process: After the computing node interprets the SQL statement, RemoteScan pushes the query statement to the corresponding storage node for execution based on the shard information of the target table "customer1" (filtering out irrelevant shards). The storage node returns the query result to the computing node.
The computing node adopts an asynchronous operation mode for query pushdown, and multiple storage nodes can be executed in parallel.
The following statement is pushed down to the corresponding storage node (shard2) according to the range of the shard key value (this is the first filtering of the SQL execution engine).
explain select c_id,c_first,c_middle from customer1
where c_id=4;
The result is shown in the following figure:

The marked place is the filtering operation in the execution plan.
If the query condition does not have the sharding key, RemoteScan will send the statement to all sharding nodes, and filter the sharding results according to the query conditions.
explain select c_id ,c_middle,c_data from customer1
where c_middle='OE';
The result is shown in the following figure:

Klustron supports using functions and various complex conditions in the query conditions.
explain select count(*) from customer1
where sqrt(c_discount)>0.5;
The result is shown in the following figure:

2.2 Project
After the computing node interprets the SQL statement, RemoteScan pushes the query statement to the corresponding storage node for execution based on the shard information of the target table "customer1". The storage node returns part of the fields of the query result row (project) to the computing node.
The Project operation reduces the actual number of fields returned to the computing node (only the fields required by the query will be read into the memory of the computing node, and irrelevant fields will not be read in).
The execution process is shown in the following figure:

explain select c_id,c_first,c_middle from customer1;
The result is shown below:

In the execution process of the above statement, RemoteScan only returns the data of fields c_id, c_first, and c_middle to the computing node.
Klustron supports functions and various complex conditions on fields in Project.
explain selectc_balance + c_ytd_payment from customer1
where c_id =20;
The result is shown below:

The storage node only returns the result of c_balance + c_ytd_payment to the computing node.
2.3 Benefits of Sharding
Distributed databases transparently distribute the database across different storage nodes, resulting in the following benefits:
Application transparency, meaning that applications can access data without any modifications.
Storage nodes perform read and project operations on data, reducing the workload of computing nodes (lower CPU and memory consumption).
Data is distributed across different storage nodes, facilitating the system's ability to flexibly extend IO capacity and avoid IO hotspots.
Project and filter operations reduce the range of data block reads and writes, improving query efficiency.
3. Performance Comparison
To compare query efficiency, we created a new table in the same environment as customer1 with the exact same amount of data. We then executed the same query on both the sharded and non-sharded tables.
3.1 Query on Sharded Table
select now(); select count(*) from customer1
where c_id between 4 and 10;select now();
Result: 
Time Taken: 0.05s
3.2 Query on Non-Sharded Table
We created a new table using the same schema as customer1:
CREATE TABLEcustomer1temp (like customer1);
insert intocustomer1temp select * From customer1
Result:

We then ran the following query:
select now(); select count(*) from customer1temp
where c_idbetween 4 and 10 ;select now();
Result: 
Time Taken: 0.69s
Conclusion
Using project and filter operations, sharding can improve data query efficiency.