With the development of information technology, the database has become an essential tool for the normal operation of an enterprise. Because all of an enterprise's data is stored in its databases, database reliability is directly tied to the survival of the business.
Data protection and backup are therefore the most important parts of the database business, and system availability and data reliability are critical to any database.
The availability of a database is its ability to maintain normal function when used under specified conditions. Its quantitative parameter, availability, is the probability that the database maintains its normal business function at a given moment when used under those conditions. A highly available database fails over to a backup quickly and automatically, so that users and applications can continue working without interruption.
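For reference, a standard reliability-engineering way to quantify this (not specific to TcaplusDB) is Availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR is the mean time to repair. For example, a database that fails on average once every 1,000 hours and recovers in 0.1 hours has an availability of 1000 / 1000.1 ≈ 99.99%.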
Data consistency refers to whether the logical relationships between associated data in the database are correct and complete. Data consistency ensures that users always get consistent results.
For TcaplusDB, the most important consideration when designing the availability and data consistency of the database system is meeting users' needs. Only by starting from those needs can we design the best database.
This article introduces TcaplusDB's high availability, data consistency and security guarantees, and disaster recovery mechanism.
High availability
TcaplusDB components are deployed in high availability mode by default:
- The management node Tcapcenter is deployed in Master/Slave mode. If the Master node fails, Tcapcenter automatically switches to the Slave node (see the failover sketch after this list).
- The management node Tcapdir is deployed with multiple processes.
- Tcaproxy at the access layer runs in redundancy mode. If a single access-layer node fails, user requests are still processed normally.
- Tcapsvr at the storage layer works in Master/Slave mode. The storage-layer Tcapsvr Master/Slave pairs and the access-layer Tcaproxy nodes are preferentially deployed across equipment rooms in the same city, or across racks, switches, and floors.
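To make the Master/Slave switching above concrete, here is a minimal failover sketch in Go. It is not TcaplusDB's implementation: the addresses, heartbeat interval, and miss threshold are all illustrative assumptions.

```go
package main

import (
	"fmt"
	"time"
)

// node is one member of a Master/Slave pair, probed by heartbeat.
// Addresses and timeouts are illustrative, not TcaplusDB internals.
type node struct {
	addr  string
	alive func() bool // heartbeat probe; a real system would use a network ping
}

// failover watches the master and returns the node that should serve:
// the master while it answers heartbeats, otherwise the promoted slave.
func failover(master, slave node, interval time.Duration, maxMisses int) node {
	misses := 0
	for range time.Tick(interval) {
		if master.alive() {
			misses = 0
			continue
		}
		misses++
		if misses >= maxMisses { // master considered down
			fmt.Printf("master %s down, promoting slave %s\n", master.addr, slave.addr)
			return slave
		}
	}
	return master // unreachable: the loop runs until a decision is made
}

func main() {
	deadline := time.Now().Add(2 * time.Second) // simulate the master dying after 2s
	master := node{addr: "10.0.0.1:9999", alive: func() bool { return time.Now().Before(deadline) }}
	slave := node{addr: "10.0.0.2:9999", alive: func() bool { return true }}

	active := failover(master, slave, 500*time.Millisecond, 3)
	fmt.Println("now serving from", active.addr)
}
```

Requiring several consecutive missed heartbeats before promoting the slave avoids flapping on a single dropped probe.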
Data consistency assurance
TcaplusDB has comprehensive measures to guarantee data consistency, as described below:
- In normal read and write scenarios, the master and standby nodes use the binlog to ensure data consistency. The standby node executes binlog entries in strictly the same order as the master, with a replication lag of about 10 ms; see the ordering sketch after this list. All service read and write requests are executed on the master node.
- Master/standby switchover: For a planned switchover, the system switches only after data synchronization is complete. For a failover triggered by the master process dying, about 10 ms of data may be lost while old requests are still connected to the original master, because TcaplusDB master-standby synchronization currently uses an asynchronous write mechanism. For the internal and external customers using TcaplusDB so far, this level of loss has been acceptable and has not significantly affected their business. The project team is nevertheless planning a strong synchronization mechanism that guarantees no data loss, at the cost of some throughput.
- Periodic full data consistency comparison: A full consistency comparison is performed during off-peak periods, based on user requirements. During the comparison, the system uses record modification times to automatically identify and re-verify records that changed due to concurrent front-end reads and writes, uncovering potential inconsistency risks in the system. To keep the comparison efficient, common practice is to spot-check selected data shards of core tables for full comparison.
- Cold backup data consistency: When the standby node performs a full cold backup, the full data file is static as of the backup start point, so it can be backed up by byte-for-byte copy, guaranteeing consistency. Front-end reads and writes are unaffected during the backup: new writes go to small change sets, and the full data and change sets are merged afterwards.
- On-disk data security: A CRC check is computed over service data when a storage node writes it to disk. If the data is later tampered with, the CRC check fails and the corrupted data is not returned to users; a CRC sketch also follows this list.
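The strict binlog ordering in the first bullet can be illustrated with a minimal sketch. The binlogEntry type and channel transport here are assumptions for illustration; TcaplusDB's real binlog format and replication transport are internal.

```go
package main

import "fmt"

// binlogEntry is a hypothetical replication record; the real
// TcaplusDB binlog format is internal and may differ.
type binlogEntry struct {
	Seq        uint64 // strictly increasing sequence number
	Key, Value string
}

// applyInOrder replays entries on the standby strictly by sequence
// number, buffering anything that arrives ahead of its turn so the
// standby executes exactly the same order as the master.
func applyInOrder(entries <-chan binlogEntry, store map[string]string) {
	next := uint64(1)
	pending := map[uint64]binlogEntry{}
	for e := range entries {
		pending[e.Seq] = e
		for {
			p, ok := pending[next]
			if !ok {
				break
			}
			store[p.Key] = p.Value // apply strictly in order
			delete(pending, next)
			next++
		}
	}
}

func main() {
	ch := make(chan binlogEntry, 3)
	ch <- binlogEntry{Seq: 2, Key: "hp", Value: "90"} // arrives early
	ch <- binlogEntry{Seq: 1, Key: "hp", Value: "100"}
	ch <- binlogEntry{Seq: 3, Key: "mp", Value: "50"}
	close(ch)

	store := map[string]string{}
	applyInOrder(ch, store)
	fmt.Println(store) // map[hp:90 mp:50]: applied as 1, then 2, then 3
}
```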
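The on-disk CRC check from the last bullet can be sketched with Go's standard hash/crc32 package. The record layout is an assumption; TcaplusDB's actual storage format is internal.

```go
package main

import (
	"errors"
	"fmt"
	"hash/crc32"
)

// record pairs service data with the checksum computed when it was
// written to disk. The layout is illustrative, not TcaplusDB's.
type record struct {
	data []byte
	crc  uint32
}

func writeRecord(data []byte) record {
	return record{data: data, crc: crc32.ChecksumIEEE(data)}
}

// readRecord re-checks the CRC before handing data back; tampered or
// corrupted data is rejected instead of being returned to the user.
func readRecord(r record) ([]byte, error) {
	if crc32.ChecksumIEEE(r.data) != r.crc {
		return nil, errors.New("crc mismatch: data corrupted, not returned")
	}
	return r.data, nil
}

func main() {
	rec := writeRecord([]byte("player:1001 gold=500"))
	if data, err := readRecord(rec); err == nil {
		fmt.Printf("ok: %s\n", data)
	}
	rec.data[0] = 'X' // simulate on-disk corruption
	if _, err := readRecord(rec); err != nil {
		fmt.Println(err)
	}
}
```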
Disaster recovery
The TcaplusDB API maintains a consistent hash ring. When access-layer nodes are added or removed, the API automatically adjusts its Tcaproxy routing information.
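A minimal sketch of how such a consistent hash ring can route keys to Tcaproxy nodes. The hash function (CRC32) and virtual-node count are assumptions for illustration; the API's real routing details are internal.

```go
package main

import (
	"fmt"
	"hash/crc32"
	"sort"
)

// ring is a minimal consistent-hash ring over Tcaproxy addresses.
type ring struct {
	points []uint32          // sorted hash points on the ring
	owner  map[uint32]string // hash point -> proxy address
}

func newRing(proxies []string, replicas int) *ring {
	r := &ring{owner: map[uint32]string{}}
	for _, p := range proxies {
		for i := 0; i < replicas; i++ { // virtual nodes smooth the key distribution
			h := crc32.ChecksumIEEE([]byte(fmt.Sprintf("%s#%d", p, i)))
			r.points = append(r.points, h)
			r.owner[h] = p
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// pick routes a key to the first hash point clockwise from its hash.
// Adding or removing a proxy only remaps the keys near its points.
func (r *ring) pick(key string) string {
	h := crc32.ChecksumIEEE([]byte(key))
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.points[i]]
}

func main() {
	r := newRing([]string{"proxy-a:9999", "proxy-b:9999", "proxy-c:9999"}, 100)
	fmt.Println(r.pick("player:1001")) // the same key always hits the same proxy
}
```

Because only the keys nearest a node's hash points move when that node joins or leaves, most requests keep their routing during access-layer changes.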
- Access layer exception: The TcaplusDB API sends a heartbeat to each access-layer Tcaproxy every second. If an access-layer node does not respond within 10 seconds, the API marks the node as unavailable and switches to another node.
- Storage layer exception: If a Tcapsvr Slave fails, it is replaced by a new Slave node. If the Tcapsvr Master fails, the Slave is switched to Master; user requests fail during the switchover, so you are advised to add retry logic to your code (a sketch follows the figure below). Master/Slave also supports sub-health switchover: when the read/write error rate reaches a threshold (80% by default), a Master/Slave switchover is performed, as shown in the following figure:
Disaster Recovery Diagram
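A minimal client-side retry sketch with exponential backoff, enough to ride out a short switchover window. The withRetry helper and the error it handles are hypothetical, not part of the TcaplusDB SDK.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errSwitching stands in for the transient error a client sees while
// the storage layer fails over; real SDK error codes differ.
var errSwitching = errors.New("tcapsvr switching master")

// withRetry retries a request a few times with exponential backoff,
// so a brief master/slave switchover does not surface as a failure.
func withRetry(attempts int, req func() error) error {
	backoff := 50 * time.Millisecond
	var err error
	for i := 0; i < attempts; i++ {
		if err = req(); err == nil {
			return nil
		}
		time.Sleep(backoff)
		backoff *= 2
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	err := withRetry(5, func() error {
		calls++
		if calls < 3 { // the first two calls hit the switchover window
			return errSwitching
		}
		return nil
	})
	fmt.Println(err, "calls:", calls) // <nil> calls: 3
}
```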
Both Tcaproxy at the access layer and Tcapsvr at the storage layer provide overload protection: if the number of read/write requests exceeds the reserved capacity, error codes are returned, as sketched below.
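Overload protection of this kind can be approximated with a counting semaphore that rejects requests beyond the reserved capacity. The capacity value and error here are illustrative, not TcaplusDB's actual thresholds or error codes.

```go
package main

import (
	"errors"
	"fmt"
)

var errOverload = errors.New("overload: request rejected, try later")

// limiter caps in-flight requests at the reserved capacity; requests
// beyond it get an error code instead of queueing indefinitely.
type limiter struct{ slots chan struct{} }

func newLimiter(capacity int) *limiter {
	return &limiter{slots: make(chan struct{}, capacity)}
}

func (l *limiter) handle(req func()) error {
	select {
	case l.slots <- struct{}{}: // a slot is free: admit the request
		defer func() { <-l.slots }()
		req()
		return nil
	default: // over reserved capacity: reject immediately
		return errOverload
	}
}

func main() {
	l := newLimiter(2)
	// occupy both slots directly to simulate two in-flight requests
	l.slots <- struct{}{}
	l.slots <- struct{}{}
	fmt.Println(l.handle(func() {})) // overload: request rejected, try later
	<-l.slots                        // one in-flight request finishes
	fmt.Println(l.handle(func() {})) // <nil>
}
```

Failing fast like this keeps the node responsive under load instead of letting queued requests snowball into timeouts.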
Conclusion
Now that we have looked at TcaplusDB's high availability, data consistency and security, and disaster recovery mechanisms, future articles will uncover more of the mysteries of TcaplusDB's design.