Research and Modification Summary of MySQL-8.0 Group Replication
Since late February 2020, I have been researching and improving several new features of Percona-MySQL-8.0.18-9, mainly MySQL Group Replication (MGR) and clone functionalities. I have also filled in some functional gaps in terms of distributed transaction resilience and fixed several bugs and defects, as well as developed other functionalities for Klustron. In this article, I will share some of my findings and thoughts based on Percona-MySQL-8.0.18-9. I do not intend to provide a complete introduction to any new features of MySQL 8.0, as other peers have already written several articles on this topic, and the most authoritative and complete introduction is always the official MySQL documentation. This article assumes that the reader is already familiar with these concepts and features. I will focus on the exploration and discoveries I made on the MGR and clone functionalities of Percona-MySQL-8.0.18-9 for the needs of Klustron, as well as the improvements I made to MGR in terms of distributed transaction resilience.
Despite many vulnerabilities and functional gaps in distributed transaction resilience and recovery, MGR single primary mode provides excellent data consistency guarantees and resilience while maintaining high performance, making it suitable for widespread use in production. MGR multi-primary mode is only suitable for scenarios with a small amount of writes and a heavy read load, such as storing a small amount of parameter configuration, similar to the use case of ZooKeeper, because data conflicts between nodes cause a large number of transaction rollbacks and therefore significant TPS and latency jitter. In multi-primary mode, the application layer could avoid write conflicts by carefully arranging the data sets written or modified on each master node so that they never intersect, but this significantly increases application development complexity and existing applications cannot adapt automatically, so few users will actually do this. Finally, regardless of single primary mode, multi-primary mode, or traditional asynchronous replication, a MySQL cluster can only handle a few terabytes of data: every node in the cluster stores an identical copy of the data, so the cluster's data processing capability is still bounded by that of a single node, which in turn is bounded by the node's computing resources and hardware cost and cannot grow without limit. Therefore, when the data size grows beyond a certain scale, a distributed database cluster is still necessary.
In Klustron, we will use MGR single primary mode, and each storage shard will be an MGR single primary cluster. All data is stored in multiple storage shards, and the computing node ensures that the data in different shards has no intersection.
MGR Automatic Cluster Management
One advantage of MGR is its automatic cluster management capability. This includes not only automatic switching of the master node, but also automated group membership management and provisioning based on distributed recovery technology, which allows for automatic addition of new slave nodes during runtime.
Let's start with switching: all nodes in the cluster periodically run the Paxos protocol to exchange node status, allowing each node to automatically discover new nodes and detect node failures. If the master node fails, all nodes will independently run an election algorithm to select a new master node, and then synchronize this decision throughout the cluster. If several nodes fail and cause a loss of quorum, the cluster will automatically switch to read-only mode to prevent split-brain scenarios. This eliminates the need for external Paxos clusters, reducing maintenance workload and minimizing potential errors. In fact, MGR combines Paxos and master-slave replication to achieve atomic broadcast of transactional binlogs, significantly enhancing the cluster's fault tolerance. We will cover this later.
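As a concrete illustration, the sketch below shows roughly how a single-primary group is configured and bootstrapped. The group name, host names, and ports are placeholders, and server-level prerequisites (GTIDs enabled, row-format binlog) are assumed to be configured already; this is a minimal sketch rather than a complete recipe.

    -- Minimal single-primary MGR setup sketch (placeholder group name and addresses).
    -- Assumes gtid_mode=ON, enforce_gtid_consistency=ON and binlog_format=ROW are already set.
    INSTALL PLUGIN group_replication SONAME 'group_replication.so';

    SET GLOBAL group_replication_group_name          = 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee';
    SET GLOBAL group_replication_local_address       = 'node1:33061';
    SET GLOBAL group_replication_group_seeds         = 'node1:33061,node2:33061,node3:33061';
    SET GLOBAL group_replication_single_primary_mode = ON;

    -- Only on the very first node, to create the group; later members just run START GROUP_REPLICATION.
    SET GLOBAL group_replication_bootstrap_group = ON;
    START GROUP_REPLICATION;
    SET GLOBAL group_replication_bootstrap_group = OFF;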
Through MGR's group membership management, the cluster can track node join and leave events, and synchronize these states across all nodes. When the master node leaves, a new master election is automatically triggered; when the cluster loses quorum due to network partition or node failure, the master node is automatically set to read-only to avoid split-brain scenarios. Each group of members resulting from a node join or leave event in MGR is called a "view", and each change in the view is a "view change" event, which represents a member joining or leaving. These events are recorded in the binlog, so changes to the group topology are persistent and globally consistent. Therefore, view change events are used as a synchronization stop point for the group_replication_recovery channel in MGR's distributed recovery.
With distributed recovery, we can easily add a new slave node with a single SQL statement or bring an old slave node up to date. This feature automates and simplifies two complex and important tasks.
Firstly, we can use it for full data backups, replacing tools like Percona Xtrabackup - simply create a new DB instance, add it as a slave node to the MGR cluster, execute the "start group_replication" statement, and once distributed recovery is complete, shut down the instance. The instance now has all the data from the current master node. Archiving its data directory will create a full backup.
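Assuming the group_replication plugin (and, if the clone path may be taken, the clone plugin) is already installed on the new instance, and with placeholder recovery credentials, the whole backup procedure reduces to roughly the following sketch:

    -- On the new, empty instance; 'rpl_user'/'rpl_pass' are placeholder recovery credentials.
    CHANGE MASTER TO MASTER_USER = 'rpl_user', MASTER_PASSWORD = 'rpl_pass'
        FOR CHANNEL 'group_replication_recovery';
    START GROUP_REPLICATION;

    -- Poll until this member reports ONLINE, i.e. distributed recovery has finished.
    SELECT MEMBER_HOST, MEMBER_STATE FROM performance_schema.replication_group_members;

    -- Once ONLINE, stop replication, shut the instance down and archive its data directory.
    STOP GROUP_REPLICATION;
    SHUTDOWN;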
Secondly, redoing replica nodes, which is a common operation for DBAs, can now be easily accomplished with a single command. However, in MGR's single primary mode, DBAs likely won't need to redo replica nodes as often as before. In the past, redoing replica nodes was the big gun and ultimate weapon for MySQL DBAs - when a replica thread was stuck for various reasons, when the master and replica nodes were disconnected and the replica node could not connect to the master to continue replication (such as when some binlogs required by the replica node were purged), when MyISAM tables were used and the replica node data was corrupted due to abnormal exit, or when the replica node replication was too slow and falling further behind the master node data version, DBAs needed to redo the replica nodes. For a DBA managing hundreds of MySQL clusters, redoing replica nodes is a daily task. If they didn't redo any replica nodes one day, that would be unusual, and they would wonder if something went wrong. In the past, DBAs used tools like Percona Xtrabackup and mysqlbinlog to trace (apply) binlogs when redoing replica nodes. In TDSQL, there's even a cooler way to trace binlogs - treat the binlogs replicated from the master node as relay logs, execute the "change master to" operation on the DB node created from full data, and simulate a replica node to trace (apply) these binlogs (this requires a specific function developed for MySQL server).
Now, all of these tricks and hacks are unnecessary - users only need to create a new empty DB instance, configure the MGR parameters, and run "start group_replication". MGR will perform distributed recovery: if the joining node is completely new, is too far behind the latest binlog on the master node, or needs binlogs that no group member still has, a full InnoDB data clone (similar to the full copy performed by Percona Xtrabackup) is taken from a donor node; then traditional asynchronous replication combined with MGR replays the transactional binlogs to finish bringing the instance up to date. This replay is similar to the old binlog tracing, but split into two steps: first, asynchronous replication catches up to the moment the node joined the group (marked by the view change event), and then the node applies the binlogs it has buffered from the group while the earlier steps were running.
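Whether the joiner takes the clone path or the binlog-only path is governed mainly by whether the clone plugin is available and by a threshold variable; the sketch below shows the relevant knobs (the threshold value is illustrative):

    -- Needed on both the donor and the joining node for the clone path to be usable.
    INSTALL PLUGIN clone SONAME 'mysql_clone.so';

    -- If the joiner's GTID gap to the group exceeds this many transactions, distributed
    -- recovery clones a full copy from a donor instead of replaying binlogs only.
    -- Illustrative value; the default effectively disables the threshold, and clone is
    -- also chosen when no donor still has the binlogs the joiner needs.
    SET GLOBAL group_replication_clone_threshold = 1000000;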
Unlike the traditional way of creating replica nodes in the industry, MySQL MGR's distributed recovery is more efficient and user-friendly. For example, both the binlog transfer and the clone data copy can run with multiple threads in parallel, and both can compress what they transmit (binlogs and InnoDB data files). There are also flow control settings to keep them from exhausting disk and network bandwidth and affecting the processing of business requests. During the clone process, DML statements are not blocked (DDL statements are blocked), so the impact on business request processing is small. The reason FTWRL (FLUSH TABLES WITH READ LOCK) can be avoided is MySQL 8.0's transactional data dictionary (stored in InnoDB tables) and the fact that all metadata tables now live in InnoDB. However, MySQL 8.0's clone function is inferior to Percona Xtrabackup in that it supports only InnoDB and is tightly bound to it, with no support for other storage engines.
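To make the flow control and compression just mentioned concrete, the clone plugin exposes them as ordinary system variables; a sketch with illustrative values (some of these take effect on the donor side, others on the recipient):

    -- Clone plugin throttling and compression (illustrative values):
    SET GLOBAL clone_max_concurrency       = 4;    -- number of threads used for the data copy
    SET GLOBAL clone_max_data_bandwidth    = 500;  -- cap on the data copy rate, in MiB per second
    SET GLOBAL clone_max_network_bandwidth = 500;  -- cap on the network transfer rate, in MiB per second
    SET GLOBAL clone_enable_compression    = ON;   -- compress data at the network layer during clone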
MGR Transaction Atomic Broadcast
Another major advantage of MGR is its atomic broadcast function based on the Paxos protocol. It ensures that when any transaction is committed on the master node, MGR first makes sure that a simple majority (quorum) of nodes has received the complete binlog of the transaction before each node begins its local commit. This avoids many master-slave disaster recovery issues of the earlier asynchronous replication mode. Those issues essentially stem from the fact that, at the moment the master node crashes, the binlogs of the transactions being committed may be inconsistent among the master and some or all of the replica nodes. For example, if transaction T is committing when the master node crashes, some replica nodes may have received T's complete binlog, others may not have received any of it, and still others may have received only part of it. If T is an XA transaction, then in MySQL 5.7, when the master and replica are disconnected while an XA transaction is incomplete, the replica's event distribution thread will under certain conditions keep waiting for the remaining binlog events of the transaction but cannot roll it back and re-execute it, so the replica's replication is completely stuck and the node becomes unusable. If that replica is then chosen to become the master, it cannot complete "stop slave" and cannot be promoted, leaving the storage shard unwritable. I had already solved such problems in TDSQL, and with MGR's atomic broadcast ability these issues are completely resolved in MySQL 8.0.
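For readers less familiar with XA, the phases look like this in SQL; because each phase produces its own binlog events, a crash or disconnect between phases is exactly what can leave a replica holding only a fragment of the transaction (the transaction identifier and table below are hypothetical):

    XA START 'trx1';
    UPDATE t1 SET c1 = c1 + 1 WHERE id = 10;
    XA END 'trx1';
    XA PREPARE 'trx1';      -- binlogged separately from the final decision
    -- ... the external coordinator later decides the outcome ...
    XA COMMIT 'trx1';       -- or XA ROLLBACK 'trx1'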
Improvements in replica node replication performance with MGR
MySQL 5.7's replica replication is based on the logical clock algorithm, which has a subtle drawback: the parallelism of replica replication is limited by the number of client connections on the master node. Under the logical clock, transactions can only be applied in parallel on the replica if they were committing concurrently on the master, and the number of concurrently committing transactions can never exceed the number of client connections executing transactions. For example, if only 20 connections are executing transactions on the master at any time, the replica's apply parallelism will not exceed 20, so replaying a large amount of binlog generated by a small number of connections is very slow. Additionally, if a table has no primary key or unique index, replica replication is slow for transactions that delete or update a large number of rows, which causes the replica to fall further and further behind the master; there is then no replica fresh enough for a quick master-slave switch. Once the master node fails, the cluster becomes unwritable for a relatively long time. At this point DBAs become nervous, and for critical businesses they sometimes create an extra replica just to be safe.
MGR can easily determine dependency relationships between transactions from their write sets, so that any two non-conflicting transactions can be applied in parallel. If transactions T1 and T2 modify the same row R in succession, then T2 depends on T1, that is, T2 can only be applied after T1. If the write sets of T1 and T2 do not intersect, T1 and T2 have no dependency and can be applied in parallel. MGR can even optionally apply non-conflicting transactions generated on the same connection in parallel (binlog_transaction_dependency_tracking = WRITESET). However, since this violates the session's temporal ordering, it is not appropriate when replicas serve reads, so binlog_transaction_dependency_tracking = WRITESET_SESSION is recommended. With WRITESET and WRITESET_SESSION, slave replication in MGR completely removes the limitation imposed by client concurrency and achieves very high replication performance. Furthermore, because MGR requires tables to have a primary key or unique index, the problem of a slave falling behind the master and never catching up is completely solved. In a production system the replication lag between slave and master can generally be kept very small, which allows a master-slave switch to be performed quickly in any situation; there will no longer be cases where the switch cannot complete quickly because no sufficiently up-to-date slave is available. Additionally, with transactional storage engines used throughout and MGR's atomic broadcast capability, DBAs should rarely need to redo slaves, and once everything is automated, DBAs will probably be very happy.
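In terms of configuration, the writeset-based parallel replication described above boils down to a handful of system variables; a sketch with illustrative values (the WRITESET_SESSION choice matches the recommendation above):

    -- On the primary: derive transaction dependencies from write sets, per session.
    SET GLOBAL transaction_write_set_extraction       = XXHASH64;
    SET GLOBAL binlog_transaction_dependency_tracking = WRITESET_SESSION;

    -- On the replica: multi-threaded applier that honors those dependencies
    -- (replica applier threads must be stopped before changing these).
    SET GLOBAL slave_parallel_type         = LOGICAL_CLOCK;
    SET GLOBAL slave_parallel_workers      = 16;   -- illustrative value
    SET GLOBAL slave_preserve_commit_order = ON;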
The Cost of MGR Single Primary
Of course, any good thing has a cost. Using MGR single primary mode incurs slightly higher performance overhead and latency compared to traditional binlog replication, and also has some functional constraints. These additional costs are mainly due to the certify process -- in addition to transmitting the transaction binlog, certify on the master node also needs to calculate and transmit the write set, and the slave needs to receive and store the write set (even in single primary mode). Additionally, there is extra network latency caused by the running of the Paxos protocol, which leads to slightly longer commit delays for each transaction. And to maintain MGR's high performance, half of the slave machines must be in the same data center as the master, otherwise the Paxos protocol will cause even greater delays -- though this requirement is not unreasonable.
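The extra certification work is observable: each member exposes its certification queue, conflict counters, and the size of the write-set cache in performance_schema, which gives a quick read on how much this overhead matters for a given workload:

    SELECT MEMBER_ID,
           COUNT_TRANSACTIONS_IN_QUEUE,        -- transactions waiting for certification
           COUNT_TRANSACTIONS_CHECKED,         -- transactions certified so far
           COUNT_CONFLICTS_DETECTED,           -- certification conflicts (rolled back)
           COUNT_TRANSACTIONS_ROWS_VALIDATING  -- rows in the write-set cache used for certification
    FROM performance_schema.replication_group_member_stats;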
The functional constraints of MGR are that tables must have a primary key or unique index, and that MGR in practice only supports the InnoDB storage engine because of the requirements of its distributed recovery and clone plugin. However, these constraints are completely acceptable -- even just from the perspective of replication performance, every table should have a primary key or unique index, and TDSQL has the same requirement. Since MySQL 8.0 the data dictionary has been stored in InnoDB, so InnoDB is fully integrated with MySQL. The purpose of other storage engines seems limited to temporary tables or log tables (such as the general log and slow log, which are stored in CSV tables). Other transactional engines that meet specific needs -- such as MyRocks, which is suitable for storing historical data with high compression ratios -- would need to be added to the clone plugin before they could be used in MGR. However, the working principle of clone is deeply bound to InnoDB (for example, InnoDB's undo log was modified to support cloning), so integrating other engines into the clone plugin remains an open problem requiring a considerable amount of work.
Finally, MGR has constraints on the amount of transaction binlog data because if the transmission time of the transaction binlog exceeds the Paxos protocol timeout, other nodes will find that they have not received any messages from the master node and mistakenly assume that the master node has crashed. For the same reason, even though TDSQL does not use the Paxos protocol, this constraint also exists in TDSQL -- if the transaction binlog being submitted is too large and the transmission times out, the agent of the DB instance will mistakenly trigger a master-slave switch by thinking that the master node is disconnected from the slave. Therefore, the size of transaction binlogs must be limited.
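MGR exposes this limit, along with message fragmentation and the expel suspicion window, as system variables; a sketch with illustrative values:

    -- Reject transactions whose binlog exceeds this many bytes (0 would mean no limit).
    SET GLOBAL group_replication_transaction_size_limit = 150000000;

    -- Fragment large group messages (8.0.16+) so a big transaction does not stall the
    -- Paxos exchange long enough to look like a failure.
    SET GLOBAL group_replication_communication_max_message_size = 10485760;

    -- Give a suspected member a few extra seconds before expelling it (illustrative value).
    SET GLOBAL group_replication_member_expel_timeout = 5;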
My Modifications to MySQL 8.0
Unfortunately, MySQL 8.0's MGR still has many flaws in XA transaction binlog disaster recovery, which I had already fixed in TDSQL-Percona-MySQL-5.7.17-11 and reported to the MySQL team together with the patches. Those fixes were thoroughly tested by the TDSQL team and verified by a large number of Tencent users both internally and externally, proving to be reliable. However, as of MySQL 8.0.18 the MySQL team has not applied any of these fixes; the bugs still exist in MySQL 8.0, and some new features have introduced new problems on top of them. I spent a lot of time fixing these bugs and passing MySQL's test suite and our disaster recovery tests. As a result, Klustron now has an unbreakable disaster recovery capability.
In addition, I also made some modifications to MySQL to meet the overall requirements of Klustron, but I won't go into detail here.
The single primary mode of MGR is not well suited for direct use by end users, because their applications would need to adapt to primary changes caused by master/slave switching, and tasks such as shard and cluster management (for example starting and stopping clusters) are complex, increase development difficulty, and are error-prone. The MySQL documentation therefore recommends using MySQL Router or InnoDB Cluster together with MGR. However, because of the XA transaction handling flaws in the official MySQL version (i.e., the lack of disaster recovery capability), MySQL Router or InnoDB Cluster cannot reliably write to multiple shards in the same transaction. I have not used these two tools, but I presume they do not support distributed transactions. It is unclear whether this is to avoid competition with Oracle's database products.
Klustron will use MGR single primary mode and automatically handle MGR's master/slave switching adaptation, distributed transaction disaster recovery and recovery, cluster management, and all general tasks to become a simple and easy-to-use distributed database system with complete disaster recovery capabilities for users and DBAs.
The Status and Contribution of MGR
I believe that MGR has further solidified the position of the binlog subsystem in the MySQL ecosystem and kept it from being marginalized. Note that as MySQL has shifted toward the InnoDB storage engine (the data dictionary uses InnoDB and the industry has abandoned MyISAM), the necessity of the binlog system has decreased. One could instead replicate the InnoDB redo log to attach replicas and achieve HA and replica reads, completely replacing traditional asynchronous replication. This approach also avoids the resource consumption of writing and storing binlogs, and in particular the significant delay binlogging adds to transaction commit: to produce the binlog, the master serializes transactions that could otherwise have been committed concurrently into a binlog transaction sequence, which must then pass through the binlog flush stage and the optional single-threaded engine-level commit stage. Although the InnoDB redo log occupies more storage space than the binlog (mainly because B-tree page splits generate page-level redo records), this disadvantage can be offset by compression. Considering the problems often encountered with traditional asynchronous replication, such as replicas stalling or falling too far behind the master, redo log replay is very fast because it only applies page-level changes and avoids the overhead of the upper layers, which makes InnoDB redo log replication quite attractive. During the planning of Klustron, I even considered using InnoDB alone without MySQL as an option; InnoDB also comes with a simple SQL processor that can execute single-table queries and simple joins. However, we ultimately chose MGR because of its many advantages.
I believe that MGR will continue to develop and further enhance the position and value of MySQL in the database industry. I also believe that Klustron will fully leverage the value of MySQL MGR and implement larger-scale, efficient, and automated data management based on MGR.