MySQL split practice in Square

Square is a financial services company serving the United States, Canada, the United Kingdom, Japan, and Australia. All transaction related information is now stored in MySQL. This kind of e-commerce website, the overall scale also has some places worth learning. Let’s look at Square’s experience with MySQL.

Square currently uses MySQL on the following scale:

  • Over 4000 instances, copied based on Percona Server 5.6 using GTID+Row format

  • More than 800 physical machines

  • The total amount of data is about 1PB

  • More than 300 services constitute

  • 99.95% availability all year round. Less than 43s per day, 4 hours and 23 minutes of downtime throughout the year

The biggest success: keeping each library under a certain size (less than 1T) and refusing to have large library instances.

In the face of such a large DB cluster, we can summarize the successful experience of Square:

  • Using replication technology, there is a slave library in each IDC, and switching technology is provided in each IDC. Other slave libraries in IDC are only used to provide read capability. Because Square has MySQL master-slave structure, it is more deployed between multiple IDCs. Considering the availability of IDCs between computer rooms, it does not excessively rely on other slave libraries for business, and puts all business as much as possible on the master node. Service requests are processed in one IDC. When the current IDC is not in use, service requests are switched to other IDCs.

Reject large libraries in design.

  • High availability switching uses a combination of domain name and VIP. The primary node does more of the read and write work. Each node is in a different IDC. GTID+Row format is used for replication. If a node fails, it will be rebuilt by backup first. If the backup is completed, it will be rebuilt by backup of the secondary node under the master library.

The overall structure is as follows:

Standardize each cluster, so that more IDCs are available. This availability, in essence, means that multiple IDCs provide disaster recovery capability, while other IDCs do not provide write capability.

  • Percona’s Xtrabackup backup is used on the backup. Perform full backup every day, encrypt the backup and divide it into 4 gb blocks for storage. Monitor the backup work and report backup failures. The backup is stored on the local RAID controller card or on the cloud disk. Every day, random samples are used to check whether the backup is available. Backup can also be used to restore faulty nodes, for example, nodes that fail to be backed up and rebuilt.

Square’s recommendations for using MySQL:

1. Consider that all nodes may fail

2. Management should be considered in every process

3. MySQL Replcaiton is a good architecture

4. Pay attention to the time control of automatic operation and maintenance (reject super large instances)

5. Keep each grouping unit independent

6. Alarm notifications can be sent to DBAs or related developers in a timely manner

7. Select a stable and trusted version (unified version)

8. Use MySQL as a queue or Cache

9. HA environment, regular testing should be carried out in production

10. Consider splitting from the application layer

11. Limit the split DB size of each application to 1TB

12. Keep clusters small and reject large instances

13. Automate every operation

Check whether MySQL is alive or dead, and whether the replication is normal

15. Monitoring indicators use PMM, VividCortex and other relevant tools as well as internal indicators

16. Periodically use pt-table-checksum to check whether the primary and secondary data are consistent.