Introduction to KlustronDB DDL Transaction Processing Principles and Technologies
Introduction to KlustronDB DDL Transaction Processing Principles and Technologies
In relational database systems, users define all types of database objects in the database system through DDL statements, such as database, schema, table, index, view, materialized view, role/user, trigger, stored procedure, and setting the access control permissions for these database objects. It is the same in KlustronDB, and KlustronDB supports all these DDL functions.
Reliably implementing DDL functionality in a distributed database system not only means that DDL statements must be executed atomically, and that if the process is interrupted due to various hardware or software failures or network failures at cluster nodes, no intermediate states or intermediate data should remain — this requirement also applies to standalone databases — but also that the propagation of metadata updates described by each DDL across all nodes in the cluster must be either completely strictly synchronized, or the distributed database system must be able to detect inconsistencies in cluster metadata caused by unsynchronized propagation, avoiding program crashes or failures, and at least informing the client of the error so that the client can re-execute the operation.
Introduction to KlustronDB DDL Functionality
In KlustronDB, users can configure and start any number of computing nodes, all of which are peers. Clients can connect to any computing node to execute DDL or DML SQL statements.
The compute nodes of KlustronDB locally store all metadata of user data, including the metadata of each database object supported by KlustronDB (database, schema, table, view, index, sequence, materialized view, column, operator, domain, user, permission, CHECK constraint, stored procedure, trigger, etc.), as well as information about the KlustronDB cluster topology (all compute nodes of the cluster, storage shards and their storage nodes, metadata nodes, detailed information about computer servers).
KlustronDB can ensure that cluster node failures during the execution of DDL statements do not cause inconsistencies in the cluster's data or metadata. At the same time, the eventual consistency achieved by DDL replication in KlustronDB ensures that the local metadata of all computing nodes in the cluster (including newly added computing nodes) is exactly the same, allowing clients to connect to any computing node and correctly execute SQL statements.
This brings great convenience to application software developers and also simplifies the maintenance and management work of DBAs. Achieving this with a distributed database system faces a series of technical challenges. This article introduces how KlustronDB accomplishes it.
KlustronDB DDL Transaction Processing Architecture
DDL Log and globally consistent execution order
In the metashard, each KlustronDB cluster has a metadata table called DDL_logs, which records all DDL statements executed by the cluster from clients, as well as the DDL statements sent to storage nodes corresponding to each DDL statement (if any), and other status information. Each piece of such information is called a ddl_log. KlustronDB uses the DDL_logs table as a queue, where the ddl_log of the most recently executed DDL transaction is appended to the end of the queue.
During each execution of a DDL statement by a compute node (CN), it initiates a distributed transaction called ddl_txn. The participants in ddl_txn include the CN itself, the metashard, and the storage shard. Within the transaction branch on the CN, the CN queries and modifies metadata related to the database objects for insertion in its local metadata tables, in order to perform the metadata updates involved in this DDL; if the executed DDL involves user data tables, the storage nodes need to participate, so the CN sends appropriate DDL statements (usually quite different from the original DDL statement sent by the client) to the storage nodes where the data tables reside to execute them. At the same time, ddl_txn also needs to write ddl_log in the metashard, so within the distributed transaction branch initiated in the metashard, the CN performs these three tasks sequentially:
Lock the tail of the DDL_logs queue, which can prevent other concurrent DDL transactions from appending ddl_log to DDL_logs and will wait for other parallel DDL transactions to complete.
Ensure that it has completed all existing statements in the DDL_logs table, including any DDL statements currently being executed. If not completed, wait continuously.
Append your own ddl_log to DDL_logs

DDL log and its queued append write
During the execution of the above operations in a DDL transaction ddl_txn, other concurrently executed DDLs will wait to continue until ddl_txn commits due to mutual exclusion from the DDL_logs lock. This ensures that all DDL statements in the DDL_logs table are executed in order across the entire cluster, and that all computing nodes in the cluster execute these DDLs in exactly the same order, thereby ensuring that the metadata of all computing nodes in the cluster is fully equivalent. The term 'equivalent' is used here instead of 'identical' because OIDs of database objects are allocated locally, so the same database object may not have the same OID on different computing node instances. However, its name, other attributes, and associated relationships are always the same, and therefore they are equivalent.
DDL Distributed Transaction Processing

(Example of DDL transaction execution process)
Every DDL statement executed in KlustronDB is a distributed transaction, referred to as ddl_txn for convenience of description. KlustronDB ensures its atomicity through a distributed transaction processing mechanism [1], coordinated by the DDL transaction manager (referred to as ddl_trx_mgr) to manage the execution and commit of ddl_txn transactions. In the example process shown above, if participants of this DDL transaction (compute nodes, storage nodes, metadata nodes) encounter various hardware or software failures at key nodes 1, 2, 3, or 4, KlustronDB can ensure the ACID properties of the DDL transaction, with very few cases requiring manual DBA intervention, as described below.
The DDL transaction coordinator (hereinafter referred to as ddl_trx_mgr) is located on the compute node, and the participants of ddl_txn include:
- Computing Node
Update local metadata and execution coordinates to the pg_ddl_log_progress metadata table in the transaction branch of the compute node.
- storage shard
If a DDL involves user data, such as statements like CREATE/DROP/ALTER DATABASE/SCHEMA/TABLE/INDEX/SEQUENCE/MATERIALIZED VIEW, it is necessary to execute the DDL statement on the storage node to perform operations like creating, updating, or deleting data tables. This requires the transaction branch of the storage node to participate in ddl_txn to execute the corresponding DDL statement on the target storage node. Most DDL does not involve the storage shard, such as view, stored procedure, trigger, policy, user, role, permissions, etc.
- metashard
Each DDL transaction needs to write to the DDL_logs table. In the previous section, we have already detailed the writing process to the DDL_logs table.
The previous text has introduced the execution process of a DDL statement. This section focuses on the failures of any related nodes during the execution process and their recovery methods.
KlustronDB DDL Disaster Recovery
During the execution of a DDL transaction in KlustronDB, if any of the computing nodes, storage nodes, or metadata nodes involved in the transaction fail (reboot, hardware failure, crash, power outage, network disconnection, etc.), KlustronDB can ensure the ACID properties of the DDL transaction.
The following uses the CREATE TABLE statement as an example. The client sends a CREATE TABLE statement to the compute node, and the DDL transaction coordinator ddl_trx_mgr on the compute node first starts a transaction branch trx_cn internally on the compute node, where it executes the CREATE TABLE statement and modifies relevant metadata tables (such as pg_class, etc.). Then, on the metadata node, an XA transaction branch trx_meta is started and performs the aforementioned operation of writing to ddl_log in the DDL_logs table.
Failure Scenario 1
If writing to ddl_log fails, rollback trx_cn and trx_meta and return an error to the client.

(Handling DDL_logs write failure)
Failure Scenario 2
After successfully writing the ddl_log in trx_meta, if this DDL requires participation from the storage node, the compute node ddl_trx_mgr will send the DDL statement for the storage node (referred to as ddl_storage, not the DDL statement sent from the client to KlustronDB) to the storage node. If ddl_storage fails, ddl_trx_mgr will roll back trx_meta and trx_cn, and then return an error to the client. The storage node inherits the capabilities of the community version of MySQL-8.0, ensuring that if ddl_storage fails, no intermediate states or data are left behind. As shown in the figure below.
If a DDL statement does not involve storage nodes, such as DDL statements for views, triggers, stored procedures, users, roles, or privileges, then there is no need to send ddl_storage, so KlustronDB can definitely cleanly roll back this ddl_txn without leaving any intermediate state.

(Handling when the storage node fails to execute DDL)
Failure Scenario 3
If ddl_storage succeeds but the computing node exits before submitting trx_meta (for example, due to server failure, power outage, network failure, etc.), trx_cn will automatically roll back, while trx_meta will remain. The cluster_mgr component of KlustronDB will periodically look for such trx_meta in the metashard and roll them back. As shown in the figure below. In this case, the table created on the storage node needs to be manually deleted by the DBA. The specific reason is explained in the next section.
If a DDL statement does not involve storage nodes, such as DDL statements for views, triggers, stored procedures, users, roles, or privileges, then there is no need to send ddl_storage, so KlustronDB can definitely cleanly roll back this ddl_txn without leaving any intermediate state.
As for statements that send ddl_storage, such as CREATE/DROP/ALTER DATABASE/SCHEMA/TABLE/INDEX/SEQUENCE/MATERIALIZED VIEW, etc., they currently require DBA to manually handle failures, as described in detail in the next section.

(Handling of Compute Node Failures)
Problems Caused by the Limitations of MySQL Atomic DDL
Since DDL in MySQL-8.0 is atomic, but cannot be executed within explicit transactions, it is impossible to initiate an XA transaction branch on the storage node to execute DDL statements sent to the storage node and roll back when needed to remove the effects caused by these DDL statements. Therefore, when a DDL transaction needs to be rolled back, if the storage node needs to participate and ddl_storage has been successfully executed, KlustronDB cannot roll back the effects of the DDL executed on the storage node, and manual handling by the DBA is required.
In some cases, a DBA can handle it easily, such as CREATE TABLE, where the rollback method is simply to delete the table created on the storage node; however, in some cases, a DBA cannot handle it, such as DROP TABLE, where the table has already been dropped on the storage node and no longer exists, making it difficult to recover.
To prevent irrecoverable issues that may be caused by a DDL failure, the DBA can use KlustronDB's logical data backup and restore function. Before executing a DDL, first logically back up the table. If the DDL execution fails and the table needs to be rebuilt, the table can be quickly restored using the logical backup data that was just made for the table.
Another approach is that users can use the Online DDL feature provided by KlustronDB to first create the target table that needs to be created. Its table structure and definition can be equivalent to the table formed after performing all the necessary DDL and repartition operations on that table. Then, users can call the cluster_mgr API or use the XPanel GUI to load the source table data into the target table and then switch to using the newly created table.
In KunlunBase-1.3 version, we will make the DDL statements of storage nodes capable of two-phase commit in XA transactions, so the second phase can either commit or roll back, which can completely solve this problem.
Failure Scenario 4
If the ddl_storage statement executes successfully, ddl_trx_mgr sends a commit command to trx_meta to commit it, and then commits trx_cn. If the computing node crashes after committing trx_meta but before committing trx_cn, then trx_cn will naturally roll back. After the computing node restarts, the ddl_applier thread, upon startup, will begin searching for the last executed DDL on this node from the position recorded in pg_ddl_log_progress and re-execute it, thereby performing the necessary updates to the local metadata from the last execution again. As shown in the figure below.

(Recovery after compute node restart)
DDL Replication and Its Fault Recovery Mechanism
Each compute node (CN) uses the DDL replication mechanism to copy the updates to its local metadata made by other compute nodes in the same KlustronDB cluster, so that the local metadata of all compute nodes in the cluster is equivalent. The figure below is a schematic diagram of the DDL replication function.

(DDL Copy Function Diagram)
The background process ddl_applier of a compute node CN continuously performs lock-free SELECT queries on the DDL log of the metashard (and thus is not blocked by any ongoing DDL transaction) to obtain the next DDL statement executed by other compute nodes that has not yet been executed on this node, along with its ddl_op_id in the DDL_logs as the replication coordinate. It then replicates this DDL to modify the metadata on this node in the local transaction ddl_repl_trx, thereby replicating updates to local metadata tables from DDLs executed by other compute nodes. When ddl_applier replicates and executes DDL statements, it does not perform modifications to storage nodes because this operation has already been executed on the target storage nodes when the DDL was originally executed.

(Interactive diagram copied from DDL)
After executing DDL statements in ddl_repl_trx, the replication position ddl_op_id must also be stored in a dedicated metadata table pg_ddl_log_progress before committing ddl_repl_trx. If the DDL replication transaction ddl_repl_trx is interrupted for any reason, it will automatically roll back. At the same time, Ddl_applier always starts replication from the replication position recorded in pg_ddl_log_progress each time it starts. Therefore, DDL replication can be interrupted at any time, and the computing nodes can always accurately resume replication without any omissions or duplicates.

(Example of pg_ddl_log_progress metadata)
Eventual Consistency of Cluster Metadata Implemented Through DDL Replication
Ensuring the consistency of DDL transactions requires the following three to be consistent:
Compute Node Metadata
DDL log of the metashard
Optional, data objects of the storage shard
The consistency achieved by DDL replication is a form of eventual consistency. This means that among multiple compute nodes in a KlustronDB cluster, as well as between compute nodes and storage nodes, due to the slight time difference in executing DDL and replicating DDL, there will inevitably be a brief time window in which the metadata of these nodes is different. Executing DML statements within this time window may result in errors due to inconsistent metadata among the relevant nodes. KlustronDB can detect metadata inconsistencies and report errors; the client can then retry a few times (rarely more than once, after the DDL replication has executed the latest DDL) and it will succeed. For the sake of cluster performance, KlustronDB does not implement global locking to achieve fully consistent metadata alignment and verification; otherwise, the performance of DML would be severely compromised.
Compute Nodes and Storage Nodes
Due to the slight delay in DDL replication on each compute node, it is theoretically possible that a compute node CNx executes a DML statement to read or write the table tx before its local metadata is updated, resulting in a discrepancy between the metadata of table tx on CNx and on the storage node, thereby causing errors.
For example, the client connection conn1 connects to computing node CN1 to create table t1. Then the client connection conn2 immediately sends a DML statement to computing node CN2 to read or write t1, but CN2 reports an error saying that t1 does not exist. If the client sends the same DML statement again in conn2 one or several times, it succeeds.
For example, if a column colx of table t1 is dropped in client connection conn1, and then a SELECT query for colx is immediately executed in conn2, CN2 will receive an error from the storage node saying that colx does not exist. If the same SELECT statement is sent again in conn2, CN2 has already replicated and executed the relevant DDL and knows that colx has been dropped, so it directly reports an error to the client saying that colx does not exist.
Between computing nodes
If a DDL statement is sent in conn1 to create a stored procedure, view, etc., and a DML is immediately executed in conn2 to use this stored procedure or view, the DML may fail because CN2 has not yet replicated the execution of the related CREATE PROCEDURE/CREATE VIEW statement, resulting in an error saying that the stored procedure or view does not exist. Trying again after a short wait will succeed.
Summary
KlustronDB's DDL transaction processing and DDL replication features ensure that a KlustronDB cluster can reliably execute DDL statements and that all computing nodes within the cluster have the same metadata. DDL transaction processing can automatically recover in the vast majority of failure cases. DDL replication can always automatically recover correctly and continue replication execution.
In extremely special cases, when a node failure occurs, DDL transactions may leave intermediate data on the storage node that requires DBA manual recovery. For this, KlustronDB's Online DDL and/or logical backup recovery can be used to perform reliable DDL execution. See the main text for details.
For performance considerations, the eventual consistency of DDL replication may cause metadata to be inconsistent between multiple compute nodes within the cluster and between compute nodes and storage nodes within a small time window. KlustronDB can detect these inconsistencies and will return an error to the client. The client needs to retry to correctly execute the related DDL or DML statements.
