Introduction to Klustron Connection Snapshot Sharing
Background
While executing queries involving multi-table joins, a computing node may need to access several tables at once. When two of those tables reside on the same storage node, connection contention can arise.
The cause of connection contention is that, to guarantee data consistency, the computing node limits each user session to a single MySQL connection per storage node. This ensures that accesses to both tables use the same transaction snapshot on the storage node. However, a MySQL connection cannot handle multiple SQL requests at the same time: it must finish processing the current SQL request before starting the next one. Connection contention therefore arises whenever multiple operators in a session need to access the same storage node simultaneously.
A Common Solution
To deal with connection contention, the computing node materializes the results of the SQL currently executing on the connection into a local temporary file, and releases the connection to the operator that needs it most urgently, so that the overall execution flow can continue.
The figure above shows an example of connection contention. The user executes "select * from t1, t2 where t1.a=t2.a", and the optimizer chooses the MergeJoin algorithm (assuming the MergeJoin execution plan has the lowest cost).
During execution, the RemoteScan(t1) operator runs first, reads t1's data from storage node A, and returns the received rows to the upper-level MergeJoin operator. The MergeJoin operator then reads data from RemoteScan(t2) to join it with t1, which triggers RemoteScan(t2) to request t2's data from the same storage node A.
However, the connection to storage node A is still occupied by the RemoteScan(t1) operator. To free up a connection for RemoteScan(t2), the computing node creates a temporary file "sql1_tmp.dat" for RemoteScan(t1) to store the results of sql1 that are still being executed. When RemoteScan(t1) needs to read the next row of data from t1, it can directly read the contents of the temporary file.
After sql1 is completely executed, the connection to storage node A is handed over to RemoteScan(t2).
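The spooling step can be sketched in Python. This is an illustrative model only, not Klustron's actual code; the function names and the CSV spool format are assumptions made for the example:

```python
import csv
import tempfile

def materialize(result_stream):
    """Spool the remaining rows of an in-flight result set to a temporary
    file so the underlying connection can be handed to another operator."""
    tmp = tempfile.NamedTemporaryFile(
        mode="w+", newline="", suffix=".dat", delete=False)
    writer = csv.writer(tmp)
    for row in result_stream:      # drain the connection's result set
        writer.writerow(row)
    tmp.flush()
    tmp.seek(0)
    return tmp                     # the connection is now free for reuse

def read_back(spool):
    """RemoteScan continues reading from the spool file instead of the
    connection."""
    return [tuple(row) for row in csv.reader(spool)]

# Rows of sql1 still pending on the connection to storage node A.
pending_rows = iter([("1", "alice"), ("2", "bob")])
spool = materialize(pending_rows)   # frees the connection for RemoteScan(t2)
rows = read_back(spool)             # RemoteScan(t1) now reads from the file
```

The costs the article lists below follow directly from this sketch: every pending row is encoded, written to disk, and later decoded again.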
Obviously, although this design meets the requirement of transaction consistency, materializing a large result set brings a series of significant costs: I/O for writing temporary files, disk space consumption, and CPU time spent encoding the data.
Therefore, we urgently need a lighter solution.
A Lighter Solution
The previous design was shaped by the characteristics of the storage nodes (i.e., MySQL); it was a compromise made to accommodate MySQL.
If MySQL provided an interface allowing multiple connections to share a snapshot, the computing node would not have to worry about connection contention at all: it could simply give each operator accessing the storage node its own connection. Unfortunately, MySQL does not provide such an interface, because it is a single-machine database.
Fortunately, our team includes experienced MySQL kernel developers who are fully capable of making this modification themselves.
The modified MySQL provides the following interface:
start transaction read only from session $(other session thread id);
This interface is used to start a read-only transaction and create a snapshot identical to that of the specified other connection. Moreover, this operation is very lightweight.
Using this interface, the computing node does not have to materialize the results of the SQL due to connection contention.
In the previous example, RemoteScan(t1) executes first and occupies the connection to storage node A. When RemoteScan(t2) starts, "Connection 1" is already in use, so a new connection to storage node A is established, and a snapshot copy is obtained from "Connection 1" by calling the new interface. RemoteScan(t2) thereby gets a transaction snapshot identical to that of "Connection 1". This design is much lighter than the previous one.
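To make the semantics concrete, here is a toy Python model of snapshot copying. It is only an illustration of the MVCC idea behind the interface, not Klustron's or MySQL's implementation; all class and method names are invented:

```python
class Store:
    """A versioned key-value store standing in for a storage node."""
    def __init__(self):
        self.version = 0
        self.rows = {}            # key -> list of (commit_version, value)

    def write(self, key, value):
        self.version += 1         # each committed write bumps the version
        self.rows.setdefault(key, []).append((self.version, value))

    def read(self, key, snapshot):
        """Newest value visible at the given snapshot version."""
        visible = [v for ver, v in self.rows.get(key, []) if ver <= snapshot]
        return visible[-1] if visible else None

class Connection:
    def __init__(self, store):
        self.store = store
        self.snapshot = None

    def begin(self):
        self.snapshot = self.store.version   # ordinary transaction start

    def begin_from(self, other):
        self.snapshot = other.snapshot       # the new shared-snapshot path

    def read(self, key):
        return self.store.read(key, self.snapshot)

store = Store()
store.write("t1.a", 10)

conn1 = Connection(store)
conn1.begin()                     # RemoteScan(t1) starts its transaction

store.write("t1.a", 99)           # another session commits afterwards

conn2 = Connection(store)
conn2.begin_from(conn1)           # "start transaction read only from session ..."
```

Because `begin_from` only copies a version number, both connections see the pre-commit value 10, which is why the operation is so lightweight.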
Results
In addition to resolving the issue of connection contention, this interface also unlocks the multi-process parallel execution feature of the computing node, which had been disabled due to transaction consistency concerns.
abc=# explain select count(1) From t;
                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=7961.55..7961.56 rows=1 width=8)
   ->  Gather  (cost=7961.14..7961.54 rows=4 width=8)
         Workers Planned: 2
         ->  Parallel RemotePlan  (cost=7951.14..7951.14 rows=1 width=8)
               Shard: 1  Remote SQL:
                   SELECT count(1) FROM `t` WHERE ($PARTIAL-QUAL)
When scanning large tables, the optimizer splits the table into multiple non-overlapping intervals based on statistical information, creating single-table parallel scan operators. Based on these single-table parallel scan operators, higher-level parallel aggregation operators are generated.
The final execution plan is shown above. The computing node's session process forks multiple worker processes that scan different intervals of the table in parallel, thereby accelerating SQL execution. Before forking the workers, the session process ensures that a connection (say, "Connection 1") is established with storage node A and that a transaction has been started on it. When forking, the session process passes this connection's identifier to each worker. When a worker establishes its own new connection to storage node A, it obtains the same snapshot from "Connection 1".
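The interval-splitting and partial-aggregation idea can be sketched as follows. The splitting heuristic and the worker model here are illustrative assumptions, not Klustron's planner code:

```python
from concurrent.futures import ThreadPoolExecutor

def split_range(lo, hi, parts):
    """Split the key range [lo, hi) into `parts` non-overlapping,
    contiguous intervals, as the optimizer would from statistics."""
    step, rem = divmod(hi - lo, parts)
    bounds, start = [], lo
    for i in range(parts):
        end = start + step + (1 if i < rem else 0)
        bounds.append((start, end))
        start = end
    return bounds

def partial_count(table, lo, hi):
    """Worker: count rows whose key falls in [lo, hi) -- the role played
    by the $PARTIAL-QUAL predicate in the remote SQL above."""
    return sum(1 for key in table if lo <= key < hi)

table = list(range(1, 101))          # stand-in for table t's key column
intervals = split_range(1, 101, 2)   # Workers Planned: 2
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(lambda iv: partial_count(table, *iv), intervals))
total = sum(partials)                # Finalize Aggregate combines the partials
```

Each worker scans only its own interval against the shared snapshot, so the partial counts are disjoint and their sum equals the full-table count.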
Conclusion
In summary, this article explains how Klustron eliminates data materialization operations that the computing node performs during connection contention by sharing snapshots between connections, and how it ensures that the data read meets transaction consistency requirements when executing parallel plans.