Introduction to KunlunBase DDL Transaction Processing Principles and Techniques

KlustronAbout 10 min

Introduction to KunlunBase DDL Transaction Processing Principles and Techniques

In a relational database system, users define various types of database objects using DDL statements. These objects include databases, schemas, tables, indexes, views, materialized views, roles/users, triggers, and stored procedures, along with configuring access control permissions for these objects. This holds true for KunlunBase as well, and KunlunBase supports all of these DDL functionalities.

Reliably implementing DDL functionalities in a distributed database system entails not only ensuring atomic execution of DDL statements, but also preventing any residual intermediate states and data in the event of cluster nodes being interrupted due to hardware/software failures, network issues, or exceptions. This requirement applies to single-node databases as well. Moreover, the process of propagating metadata updates described by each DDL statement across all nodes of the cluster must either be strictly synchronized or, in the case of any inconsistencies arising from asynchronous metadata propagation, the distributed database system must be able to detect the discrepancies between cluster metadata and local data/metadata of nodes. This is crucial for avoiding program crashes or failures. At the very least, errors must be communicated to the client to enable them to re-execute the necessary actions.

Introduction to KunlunBase DDL Functionality

In KunlunBase, users have the capability to configure any number of computing nodes, all of which are peers. Clients can connect to any computing node to execute DDL or DML SQL statements.

Each computing node in KunlunBase maintains locally all the metadata for user data. This includes metadata for every type of supported database object within KunlunBase, such as databases, schemas, tables, views, indexes, sequences, materialized views, columns, operators, domains, users, permissions, CHECK constraints, stored procedures, triggers, and more. Additionally, the KunlunBase cluster topology information, which encompasses all computing nodes within the cluster, storage shards along with their storage nodes, metadata nodes, and detailed information about the computing servers, is also stored locally.

KunlunBase ensures that during the execution of DDL statements, any cluster node failures will not lead to inconsistencies in data or metadata across the cluster. By implementing final consistency through DDL replication, KunlunBase ensures that the local metadata of all computing nodes (including newly added nodes) are entirely identical. This consistency allows clients to connect to any computing node and execute SQL statements correctly.

This convenience greatly benefits application software developers and simplifies the maintenance and management tasks of DBAs. Achieving this in a distributed database system involves a series of technical challenges. This article describes how KunlunBase accomplishes this.

KunlunBase DDL Transaction Processing Architecture

DDL Logs and Globally Consistent Execution Order

Within the metadata cluster, each KunlunBase cluster contains a metadata table named "DDL_logs." This table records all the DDL statements received from clients that the cluster has executed. It also records the corresponding DDL statements sent to storage nodes if applicable, along with other status information. Each of these pieces of information is referred to as a "ddl_log." KunlunBase treats the DDL_logs table as a queue, with the ddl_log of the latest executed DDL transaction being appended to the queue's end.

During the execution of a DDL statement on a computing node (CN), a distributed transaction called "ddl_txn" is initiated. Participants in ddl_txn include the CN itself, the metadata cluster, and the storage cluster. Within the CN's internal transaction branch, the CN queries and modifies, or inserts metadata related to the relevant database objects into its local metadata table. This is done to facilitate the execution of metadata updates associated with the DDL statement. If the executed DDL involves user data tables, storage nodes are involved. As a result, the CN sends appropriate DDL statements (usually different from the original DDL statement sent by the client) to the storage node where the data table resides, for execution.

Additionally, within the metadata cluster, ddl_txn needs to write ddl_log. Consequently, within the distributed transaction branch initiated within the metadata cluster, the CN performs the following three tasks sequentially:

Locks the tail of the DDL_logs queue, thereby preventing other concurrent DDL transactions from appending ddl_log to DDL_logs. The CN waits for parallel DDL transactions to complete.
Ensures that all existing statements in the DDL_logs table have been executed, including the currently executing DDL statement. It continues to wait if execution isn't complete.
Appends its own ddl_log to DDL_logs.

ddl-log-write

Writing DDL Logs and Queuing

During the execution of ddl_txn within a DDL transaction, other concurrently executing DDL transactions wait due to the mutual exclusion imposed by the DDL_logs lock. This ensures that all DDL statements within the DDL_logs table are executed in a queued and sequential manner across the entire cluster. Consequently, all computing nodes within the cluster execute these DDL statements in the exact same order, ensuring the full equivalence of metadata across all cluster computing nodes. It's important to note that equivalence is used here instead of identity because the Object IDs (OIDs) of database objects are locally allocated. In different instances of computing nodes, the OID of the same database object might not be the same, but their names, attributes, relationships, and associations are guaranteed to be identical, making them equivalent.

Distributed Transaction Handling for DDL

ddl-procedures

(Example of DDL Transaction Execution Flow)

In KunlunBase, every executed DDL statement is treated as a distributed transaction, referred to as ddl_txn for simplicity. KunlunBase ensures atomicity of these transactions using a distributed transaction processing mechanism, with coordination performed by the DDL transaction manager (referred to as ddl_trx_mgr later). The example execution process shown in the figure illustrates key points 1, 2, 3, and 4. If any participating components (computing nodes, storage nodes, metadata nodes) of this DDL transaction encounter software or hardware failures during these points, KunlunBase guarantees the ACID properties of the DDL transaction. In extremely rare cases, manual intervention by a database administrator (DBA) might be required, as explained later.

The DDL transaction coordinator (referred to as ddl_trx_mgr) is located on the computing node and involves the following participants in ddl_txn:

Computing Node

Within the computing node's transaction branch, local metadata is updated, and execution coordinates are stored in the pg_ddl_log_progress metadata table.

Storage Cluster

For DDL statements affecting user data, such as CREATE/DROP/ALTER DATABASE/SCHEMA/TABLE/INDEX/SEQUENCE/MATERIALIZED VIEW, actions like table creation, updates, and deletions are performed on storage nodes. This necessitates involvement of storage nodes' transaction branches within ddl_txn, executing corresponding DDL statements on the target storage nodes. Most DDL statements do not involve the storage cluster, including view creation, stored procedures, triggers, policies, users, roles, and permissions.

Metadata Cluster

Every DDL transaction requires writing to the DDL_logs table. The writing process was explained in the preceding section.

The execution process of a DDL statement was previously discussed. This section focuses on handling failures at any relevant nodes during the execution process.

KunlunBase DDL Resilience and Recovery

During the execution of a DDL transaction in KunlunBase, if any participating components—computing nodes, storage nodes, or metadata nodes—experience failures (reboots, hardware malfunctions, crashes, power outages, network disruptions, etc.), KunlunBase ensures the ACID properties of the DDL transaction.

Let's illustrate this with the example of a CREATE TABLE statement. When a client sends a CREATE TABLE statement to a computing node, the DDL transaction coordinator, ddl_trx_mgr, initiates a transaction branch (trx_cn) within the computing node. It executes the CREATE TABLE statement and modifies relevant metadata tables (e.g., pg_class) within that branch. Subsequently, a transaction branch (trx_meta) is initiated on the metadata node via an XA transaction to write ddl_log into the DDL_logs table.

Failure Scenario 1

If writing ddl_log fails, trx_cn and trx_meta are rolled back, and an error is returned to the client.

ddl-proc-err

(Handling Failed DDL_logs Writing)

Failure Scenario 2

After successful writing of ddl_log in trx_meta, if the DDL requires participation of storage nodes, the computing node's ddl_trx_mgr sends DDL statements specific to storage nodes (referred to as ddl_storage, distinct from DDL statements sent by clients to KunlunBase) to the relevant storage nodes. If ddl_storage fails, ddl_trx_mgr rolls back trx_meta and trx_cn, then returns an error to the client. Storage nodes inherit the capabilities of MySQL 8.0 Community Edition, ensuring that any failures during ddl_storage do not leave behind intermediate states or data. This is illustrated in the following figure.

For DDL statements that do not involve storage nodes, such as those pertaining to views, triggers, stored procedures, users, roles, and privileges, ddl_storage is unnecessary. Therefore, KunlunBase can reliably rollback this ddl_txn, leaving no intermediate states.

ddl-proc-err2

(Handling Failed DDL Execution on Storage Nodes)

Failure Scenario 3

If ddl_storage succeeds but the computing node crashes before submitting trx_meta (due to server failures, power outages, network disruptions, etc.), trx_cn automatically rolls back, while trx_meta remains. KunlunBase's cluster_mgr component periodically seeks out such remaining trx_meta within the metadata cluster and rolls them back. This situation results in tables being created on storage nodes that need to be manually removed by a DBA. The specific reasons are discussed in the following section.

For DDL statements that do not involve storage nodes, such as those for views, triggers, stored procedures, users, roles, and privileges, ddl_storage is unnecessary. Therefore, KunlunBase can reliably roll back this ddl_txn, leaving no intermediate states.

However, for DDL statements requiring ddl_storage, such as CREATE/DROP/ALTER DATABASE/SCHEMA/TABLE/INDEX/SEQUENCE/MATERIALIZED VIEW, manual intervention by a DBA might be temporarily required, as explained in the following section.

ddl-proc-err3

(Handling Computing Node Crashes)

MySQL's Atomic DDL Limitation and Its Implications

Due to the atomic nature of MySQL 8.0's DDL, but the inability to execute it within explicit transactions, initiating an XA transaction branch on a storage node for executing DDL statements and subsequently rolling back the effects of such DDL becomes impractical. Consequently, when a rollback of a DDL transaction is required, and storage nodes have participated and ddl_storage has successfully executed, KunlunBase cannot completely undo the effects of DDL executed on storage nodes. Manual intervention by a DBA is necessary in such cases.

In certain scenarios, a DBA can handle the situation relatively easily. For instance, for CREATE TABLE, the table can be deleted directly from the storage node to roll back the DDL effect. However, in some cases, DBAs might face challenges. For example, in the case of DROP TABLE, where the table has already been dropped on the storage node and no longer exists, recovery might not be straightforward.

To mitigate the risk of irreversible failures caused by failed DDL, DBAs can utilize KunlunBase's logical data backup and restore featureopen in new window. Before executing a DDL, a logical backup of the table can be performed. If the execution of a DDL fails and the table needs to be reconstructed, the logical backup data for that table can be used for table-level logical recovery to swiftly restore the table.

Another approach is for users to leverage KunlunBase's Online DDL featureopen in new window. This involves creating the target table beforehand, with its structure and definition equivalent to performing all the required DDL and repartition operations on the table. The cluster_mgr API or the XPanel GUI can then be used to migrate the source table data into the target table and subsequently switch to using the newly created table.

In KunlunBase version 1.3, storage node DDL statements will be made compatible with two-phase commit within XA transactions, enabling the second phase to involve both commits and rollbacks. This comprehensive solution will address the mentioned challenges.

Failure Scenario 4

If ddl_storage statements are executed successfully, ddl_trx_mgr sends a commit command to trx_meta for its submission, followed by committing trx_cn. If, after committing trx_meta but before committing trx_cn, the computing node crashes, trx_cn naturally rolls back. Upon restarting the crashed computing node, the ddl_applier thread begins by locating the last executed DDL record in pg_ddl_log_progress and replicates the execution, ensuring that the local metadata updates missed during the last execution are effectively performed again. This process is illustrated in the figure below.

ddl-proc-err4

(Recovery After Computing Node Restart)

DDL Replication and its Fault Recovery Mechanism

Each computing node (CN) utilizes the DDL replication mechanism to propagate updates from DDL transactions executed on other computing nodes within the KunlunBase cluster. This ensures that the local metadata of all computing nodes in the cluster remains equivalent. The following figure illustrates the functionality of DDL replication.

DDL复制功能示意图

(DDL Replication Functionality)

A background process, ddl_applier, in a computing node continuously performs non-locking SELECT queries on the DDL log within the metadata cluster (thus not being blocked by ongoing DDL transactions). This retrieves the next unexecuted DDL statement from other computing nodes along with its DDL_logs sequence number (ddl_op_id). This information serves as the replication coordinate. Ddl_applier then applies this DDL statement's modifications to the local metadata within a local transaction, ddl_repl_trx. This effectively replicates updates from DDL statements executed on other computing nodes to the local metadata table. However, ddl_applier does not perform modifications on storage nodes, as these were already executed on the target storage node when the original DDL was executed.

ddl-apply

(DDL Replication Interaction)

After applying the DDL statement within ddl_repl_trx, the replication coordinate ddl_op_id is stored in a dedicated metadata table, pg_ddl_log_progress. Subsequently, ddl_repl_trx is committed. If the DDL replication transaction ddl_repl_trx is interrupted for any reason, it is automatically rolled back. Additionally, each time ddl_applier starts, it resumes replication from the coordinate recorded in the pg_ddl_log_progress table. As a result, DDL replication can be interrupted and resumed without missing or repeating any steps.

ddl-progress-coord

(Example of pg_ddl_log_progress Metadata)

Achieving Cluster Metadata Eventual Consistency through DDL Replication

Ensuring the consistency of DDL transactions involves maintaining alignment among the following:

Computing node metadata
Metadata cluster's DDL log
Optionally, storage cluster's data objects

The consistency achieved through DDL replication is eventual consistency. This implies that across multiple computing nodes within a KunlunBase cluster and between computing nodes and storage nodes, minor discrepancies in metadata may exist due to the slight timing difference between DDL execution and DDL replication. During this time window, executing DML statements might result in errors due to inconsistent metadata across relevant nodes. KunlunBase can detect metadata inconsistencies and report errors. The client can retry (usually no more than once) until success, after DDL replication has caught up with the latest DDL execution. To maintain cluster performance, KunlunBase does not implement global locking for complete metadata alignment and validation. Otherwise, DML performance would be significantly impacted.

Computing Nodes and Storage Nodes

Due to the slight delay in DDL replication on each computing node, it is theoretically possible for the local metadata on computing node CNx not to be updated in time before executing a DML statement to read or write data in table tx. This could result in inconsistencies between the metadata of data table tx on storage nodes and computing node CNx's tx, leading to errors.

For example, when client connection conn1 connects to computing node CN1 to create table t1, and then client connection conn2 immediately sends a DML statement to computing node CN2 to read or write t1, CN2 might report an error stating that t1 does not exist. Sending the same DML statement once or multiple times in conn2 would then succeed.

Another example involves dropping a column colx from t1 in client connection conn1, followed by executing a SELECT query for colx in conn2. In this case, CN2 could receive an error from the storage node, indicating that colx does not exist. Sending the same SELECT statement in conn2 again would result in a correct error response stating that colx does not exist, as CN2 has already replicated and executed the relevant DDL indicating the column's removal.

Among Computing Nodes

If a DDL statement is sent through conn1 to create a stored procedure or view, and conn2 immediately executes a DML to use this stored procedure or view, the DML might fail. This is because CN2 has not yet replicated and executed the relevant CREATE PROCEDURE or CREATE VIEW statement. An error would be raised, stating that the used stored procedure or view does not exist. Waiting briefly and trying again would then lead to success.

Conclusion

KunlunBase's DDL transaction handling and DDL replication functionality ensure reliable execution of DDL statements within the KunlunBase cluster and maintain consistent metadata across all computing nodes. DDL transaction handling can automatically recover in the vast majority of failure scenarios. DDL replication is designed to automatically recover and continue replication accurately.

In extremely rare cases, if a node failure occurs, DDL transactions may leave behind intermediate data on storage nodes, requiring manual recovery and intervention by DBAs. To address this, KunlunBase offers Online DDL and/or logical backup and restore capabilities for reliable DDL execution. Refer to the main text for more details.

For the sake of performance, the eventual consistency achieved through DDL replication might result in minor discrepancies within a small time window between multiple computing nodes in the cluster and between computing nodes and storage nodes. KunlunBase can detect these inconsistencies and return errors to clients, who can simply retry to execute the relevant DDL or DML statements correctly.