Global Deadlock Detection Technology in Klustron

Transaction deadlock is a prevalent issue in database systems that rely on transaction locking. Standalone databases like MySQL and Oracle possess mechanisms to handle such deadlocks. In distributed database systems like KunlunBase, a global deadlock may occur, which cannot be resolved by the deadlock handling mechanism of the storage node. Instead, a global deadlock handling mechanism at the level of the distributed database cluster is required to detect and resolve it. This article mainly discusses the underlying mechanisms leading to global deadlocks and the subsequent detection and resolution procedures employed in the Klustron cluster.

This article first introduces the overview and architecture of KunlunBase, then elaborates on why parallel transactions in standalone databases might culminate in a deadlock and explores their detection and resolution methods. It then discusses global deadlock in distributed cluster scenarios, and the global deadlock handling mechanism in the KunlunBase cluster.

The contents of this article were shared at an offline meetup held by the KunlunBase team in Shenzhen at the end of March 2023open in new window.

Fundamental Understanding, Occurrence Mechanism, and Hazards of Database Transaction Deadlocks

Database systems typically employ transaction locks for concurrency control to prevent logically conflicting operations from interfering with each other. For instance, if a user wants to drop table t1 in one connection (conn1), and another user in a different connection (conn2) initiates an insert into t1 operation, these two operations are in conflict. Likewise, if conn1 seeks to modify a specific row R1, and conn2 also intends to alter R1, the two operations conflict. To avoid such conflicts, the database system locks the table, page, and/or row where the data resides before reading or writing data. After acquiring the lock, the operations are executed, and the transaction locks are released upon transaction commit. Some database systems, like those using the InnoDB storage engine, utilize Multi-versioned concurrency control (MVCC) for concurrency control of read operations and transaction locks for write operations. MVCC does not acquire any transaction locks, so read operations using MVCC for concurrency control are not blocked by write operations nor do they form deadlocks, enabling the system to have higher concurrency performance.

The aforementioned examples illustrate that transaction locks are divided into multiple levels, including table-level, page-level, and/or row-level. Executing any DDL or DML statement requires first obtaining a table-level transaction lock. Some systems support page-level transaction locks, others support row-level, and a few accommodate both. The finer the granularity of the transaction lock, the higher its degree of concurrency. Typically, DML acquires table-level locks that are intention locks, so multiple DML table-level locks do not conflict and do not degrade concurrent performance. Conversely, DDL acquires non-intention table-level locks, thereby leading to potential conflicts with other DDL or DML statements operating on the same table.

The conflict matrix defines the conflict relationship of transaction locks against the same database object in the database, and the transaction lock system decides whether two transaction locks conflict based on the conflict matrix. The lock types and conflict definitions of different database systems may vary slightly, but certain elements of lock types and conflict relationships remain universally applicable to all transaction database systems.

T.X and T.S are table-level exclusive and shared locks. Generally, it requires acquiring T.X for a table before executing any DDL statement.

T.IX symbolizes a table-level intention write lock and T.IS indicates a table-level intention read lock. Table-level intention locks are usually obtained by DML statements, preventing the table from being truncated/dropped/renamed, etc. during execution of insert, delete, update, and select operations.

R.X is row-level write lock, which needs to be obtained before modifying, inserting, or deleting a row. If the read operation concurrency control uses MVCC, there is no need for row-level shared locks, otherwise, row-level and table-level shared locks R.S and T.S must be obtained.

In the matrix, 'X' represents that the two lock requests for the same table or the same row of a certain table are in conflict, and transaction lock requests from multiple transactions cannot be obtained simultaneously. '√' indicates that the two lock requests do not conflict. They are compatible, and multiple transactions can obtain the transaction locks they request concurrently.

conflict-matrix

Illustration of the Conflict Matrix

Deadlocks in a database occur naturally; their occurrence is neither an error nor a bug in the database or application system. It's mandatory for a database system to automatically detect and resolve deadlocks, and then return an error to the client. The client is responsible for handling the error returned when executing a statement. The usual treatment involves rolling back the transaction. However, for some applications, if a DML operation fails, it could be retried. Therefore, the application software may decide to re-execute the same DML operation or adjust it appropriately before executing it again after receiving a deadlock error. Occasionally, it may even directly ignore the failed statement (although this method is relatively rare).

The following three conditions are required for a database deadlock to occur:

If a new lock fails to be obtained, the transaction is blocked and unable to proceed. Otherwise, you might read "dirty data" - illogical and unreasonable data that is currently being overwritten.
Transactions do not release the locks they hold until they end. This is also known as Two Phase Locking (2PL), an effective method to reduce deadlocks and a requirement of the ACID properties of database transactions.
Circular wait: If two transactions, trx1 and trx2, each obtain the transaction lock needed by the other, a circular wait condition is formed. Below is an example of a circular wait condition.

cycular-wait

An Example of a Deadlock Circular Wait Condition

The Impact of Database Transaction Deadlocks and Their Resolution

If not addressed promptly, deadlocks can lead to query timeouts, statement failures, decreased system throughput, an increasing number of locked objects, and stalled transactions, gradually rendering the database system unable to continue operation.

Current transaction storage engines commonly support the detection and resolution of transaction deadlocks. This process is typically carried out when a transaction lock cannot be obtained or periodically by the deadlock processor completing a round of deadlock detection. This involves scanning the wait relationship of the lock system to identify lock waiting cycles. Then, one transaction within the cycle is chosen as the so-called "victim" and its lock request is denied. As a result, the DML statement currently being executed by the victim transaction will return an error, which the client needs to handle, such as by rolling back the transaction or re-executing the statement.

MySQL's InnoDB storage engine has its own deadlock detection mechanism, following this very procedure to detect and resolve deadlocks.

Occurrence Mechanism of Global Deadlocks in KunlunBase Distributed Database Cluster

A global deadlock is a novel type of database transaction deadlock that occurs at the global level of a distributed database cluster. This deadlock cannot be detected or resolved by a single storage node's deadlock processor. However, its occurrence mechanism and conditions are completely identical to those in a standalone database scenario.

The figure below illustrates an example of a global deadlock. In this scenario, GT1.T11, a transaction branch in the shard1's primary node SM1, updates R1. Meanwhile, GT2.T22, a transaction branch in the shard2's primary node SM2, updates R2. Then, when GT1's transaction branch, GT1.T12 on SM2, attempts to update R2, it is blocked by GT2.T22 when trying to acquire the row-level write transaction lock of R2, causing GT1 to wait for GT2 (noted as GT1 -> GT2).

At the same time, the transaction branch GT2.T21 of GT2 on SM1 tries to update R1 and is blocked by GT1.T11 when trying to acquire R1's row-level write transaction lock. This causes GT2 to wait for GT1 (notated as GT2 -> GT1).

This results in a circular wait condition, which is a global deadlock. From the perspectives of SM1 and SM2, no deadlock has occurred. As far as SM1 is concerned, there is only a lock wait relationship of T21->T11. From SM2's perspective, there is only a lock wait relationship of T12->T22. Therefore, neither SM1 nor SM2 can detect or resolve this global deadlock. At this point, the global deadlock processor of KunlunBase is required to detect and resolve the global deadlock.

gtxn-cycle

Global deadlocks have the following characteristics:

Local waits of storage node transaction branches lead to global transaction waits, with the global transactions forming a circular wait condition.
Deadlocks do not occur within each storage node (even if they do, they are automatically resolved and are not discussed in this section). Storage nodes cannot automatically detect or resolve global deadlocks.

3.The global transactions involved in the global deadlock cycle can be transactions initiated from multiple compute nodes or all from the same compute node.

Global Deadlock Resolution in the KunlunBase Distributed Database Cluster

Similar to deadlocks in a single-node database, global deadlocks naturally occur; they are neither bugs in the database system nor errors in the application software. KunlunBase can automatically detect and resolve global deadlocks, and it returns an error to the client in the connection where the victim resides. The client must also handle this error: they can rollback the transaction, re-execute (adjust the statement if needed) the statement, or ignore the error and continue executing the next statement (this approach is less common).

The detection and resolution of global deadlocks in KunlunBase are broken down into several steps: Firstly, it gathers the local transaction lock wait relationships from all the primary nodes of the storage clusters in the cluster, thereby constructing a global transaction lock wait relationship graph. Then, it traverses this graph to find waiting cycles. Upon finding a cycle, it chooses a 'victim' from the cycle to deny its lock request. Consequently, the client receives a statement execution error and handles the error as previously described.

Constructing a Global Transaction Wait Relationship Graph

KunlunBase's Global Deadlock Detector (GDD) obtains local transaction lock wait relationship graphs (g1, g2,... gn) from every primary shard node (M1, M2,... Mn) in the cluster. This is achieved by executing a SQL statement, as shown below, sent to every primary shard node in the KunlunBase cluster by the deadlock detection background process.

These graphs (g1, g2... gn) are then merged to form a single graph, G: the global transaction wait relationship graph.

The above query statement enables us to capture the local wait relationships (g) inside each storage node. This is because information_schema.innodb_trx is a system view in MySQL, providing basic information about each running transaction. We are interested in ongoing XA transactions because only they can act as global transaction branches to form a global deadlock. The view information_schema.data_lock_waits provides information about transaction lock wait relationships. This allows us to understand the wait relationship between any two InnoDB transactions. The SQL statement above allows us to connect these three tables and identify the wait relationships of active XA transactions within each storage node, thus providing gi.

In any local wait relationship graph gi, the wait relationship of a transaction branch in a single shard implies the wait relationship of its corresponding global transaction: ti -> tj => GTi -> GTj, where ti is a transaction branch of GTi in shard_i.

We construct a wait relationship graph, G, using GT1, GT2,... GTx as nodes and their wait relationships as edges. As KunlunBase supports parallel query execution between computing and storage nodes, a compute node can send INSERT/DELETE/UPDATE statements to multiple shard primary nodes asynchronously. These statements may form local wait relationships on multiple shards, leading each global transaction node in G to have multiple outgoing edges. In other words, a global transaction may depend on several other global transactions simultaneously. KunlunBase's GDD is well-equipped to handle these complex scenarios accurately.

Detection and Resolution of Global Deadlocks in KunlunBase

Once we have constructed the graph G, we can traverse this graph in order to detect (multiple) global transaction waiting cycles.

Upon detecting a cycle, we choose a victim from the cycle based on a particular elimination rule, and kill its statements on all shards, thus resolving a cycle in G. Currently, KunlunBase supports several strategies for eliminating global deadlocks, which can be selected by setting the configuration option global_deadlock_detector_victim_policy. The names of the strategies are self-explanatory, as listed below. Each strategy might have its own rationale, and if you indeed need precise adjustments, you can set them accordingly. Otherwise, you can use the system default setting.

KILL_OLDEST: Kills the longest running transaction.

KILL_YOUNGEST: Kills the shortest running transaction.

KILL_MOST_ROWS_CHANGED: Kills the transaction that has updated the most rows.

KILL_LEAST_ROWS_CHANGED: Kills the transaction that has updated the fewest rows.

KILL_MOST_ROWS_LOCKED: Kills the transaction that has locked the most rows.

KILL_MOST_WAITING_BRANCHES: Kills the transaction with the most transaction branches waiting for a transaction lock.

KILL_MOST_BLOCKING_BRANCHES: Kills the transaction with the most transaction branches blocking other transactions.

The code implementation for KunlunBase's global deadlock detection and resolution can be found at:

https://gitee.com/zettadb/kunlun/blob/main/src/backend/access/remote/remote_xact.c

Global deadlock test code: We have tested various scenarios https://gitee.com/zettadb/kunlun/blob/main/src/test/kunlun/gdd/gdd2.py

During the design and testing process of the global deadlock detection algorithm, we have considered all deadlock cycle situations shown in the figure below. Naturally, this algorithm is capable of handling all global deadlock situations, including more complex cases.

Triggering of KunlunBase's Global Deadlock Detection and Client Handling

After the compute node starts, it runs as a background process, called the Global Deadlock Detection Background Process (GDD). If a certain Create/Update/Delete (DML) statement takes longer than a defined time (start_global_deadlock_detection_wait_timeout) to return after sending a statement to the storage node, the compute node will notify the global_deadlock_detector process to trigger a round of global deadlock detection & resolution. In addition, the GDD also regularly runs in the background, set with a time interval of global_deadlock_detector.naptime=3s.

The global transaction chosen as the victim will return an error ER_QUERY_CANCELED to the client. By default, this transaction will be automatically rolled back within the compute node, and all subsequent statements in this transaction will be ignored until the transaction ends.

KunlunBase also supports MySQL's transaction handling mode. If enable_stmt_subxact = true is set before the start of the transaction, the execution error of the statement will not automatically roll back the transaction. Instead, the client code will decide how to handle the error. This mechanism applies to all errors, not just deadlock errors, and it is applicable to both MySQL and PostgreSQL connections. Upon receiving any statement execution error, the client can choose to ignore the error and continue executing subsequent SQL statements, then commit the transaction. Alternatively, they can re-execute this SQL statement, and the statements following it, then commit the transaction. Or they can directly roll back the transaction, and then they can re-execute the transaction if they wish.

The Klustron-server will record each globally rolled-back transaction in its running logs for each round of global deadlock detection, providing users with a trace and verification.

Conclusion

KunlunBase's global deadlock detection mechanism is designed for quick and effective detection and resolution of global deadlocks, ensuring efficient and smooth operation of the KunlunBase cluster. It is crucial for application software developers to address deadlock errors appropriately as described in this article. By doing so, the application system can effectively handle any deadlocks that might arise during the operation of the KunlunBase distributed database cluster, thus maintaining the system's optimal performance.