01 Overview
The convergence of new technologies such as 5G and IoT has brought rapid and dramatic changes to the telecom industry. These technologies are driving unprecedented growth in the volume, velocity and variety of data. Legacy systems are failing under this data growth, so telecom companies face a difficult choice: keep trying to make do with their existing legacy systems, or look for something new. And the sooner the better, because no telecom company can afford to become slower and slower as data grows while hackers become ever more numerous and more aggressive.
Cross data center replication (XDCR) was created to make up-to-date data available in a timely manner. The technology has been around for decades, but it has become expensive to operate on traditional technology and difficult to implement properly for 5G. Active-active XDCR on a traditional RDBMS becomes complex and error-prone. Companies have quickly realized that they must either replace their databases completely or risk being permanently disadvantaged, both technologically and economically.
In this article, we will discuss what XDCR is and why it is needed; what active-active XDCR is and why so many companies struggle with it; and finally, how VoltDB manages conflicts in active-active XDCR while remaining fully fault-tolerant, handling complex streaming data, and making intelligent decisions in under 10 milliseconds without compromising data accuracy.
First, let’s define our terms.
02 What is XDCR?
At the most basic level, XDCR means running multiple live database clusters in multiple geographic locations at once, with changes flowing in multiple directions: whenever a local database is updated, the change is automatically propagated to all other replicas. There are two main reasons to do this:
1. Business continuity: Disruptions are inevitable, but should not determine your fate
The initial motivation for having multiple live copies of the database in different locations is that the enterprise can continue to operate if an earthquake, fire, or other event causes the original copy to fail.
Earlier implementations had a master copy of the database and one or more remote read-only copies, which may or may not have been fully up to date. Promoting one of these copies to “live” was a manual process that could easily take 30 minutes. From a business perspective, this poses obvious problems. First, the amount of data lost can be large. Second, executives may be uncomfortable with any system that depends on a complex, carefully ordered chain of manual steps to work. They are especially nervous when the switch only ever happens at the most chaotic moments.
XDCR, by contrast, involves two interchangeable databases that are connected to each other, thus avoiding the whole switching process. This is known as an “active-active” architecture (see more below). But providing what the database industry calls an “active-active solution” is harder than it looks.
In practice, replication across data centers using traditional techniques has been around for more than a decade, but it can easily increase project costs tenfold while presenting enough operational challenges and pitfalls that net reliability ends up no better than that of a single system.
2. 5G and latency: Your data needs to be where your customers are
If your enterprise runs extremely latency-sensitive applications, the distance between the data center and the end user becomes important.
Data travels through fiber-optic cable at about 124 miles per millisecond. So if your data center is in New York and your customer is 2,900 miles away in San Francisco, any message will take at least 23 milliseconds one way and at least 46 milliseconds for the round trip. If you must decide whether to connect a call within 20 milliseconds, and the decision itself takes 10 milliseconds, the data center cannot be more than 5 milliseconds away (about 600 miles).
This means that you will need to maintain multiple copies of your operational data, kept up to date in real time, in different geographic locations in order to run your business. In fact, you may need more than two.
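The latency arithmetic above can be checked with a few lines of Python. This is only a sketch: the function names are illustrative, and the model assumes a straight fiber run with no routing, switching or queuing delay, so real figures would be worse.

```python
FIBER_MILES_PER_MS = 124  # propagation speed quoted in the text


def one_way_ms(distance_miles):
    """Minimum time for a message to cover the distance once."""
    return distance_miles / FIBER_MILES_PER_MS


def round_trip_ms(distance_miles):
    """Minimum time for a request and its response."""
    return 2 * one_way_ms(distance_miles)


def max_distance_miles(deadline_ms, decision_ms):
    """Farthest data center that still meets the deadline.

    Whatever time is left after the decision is split evenly between
    the outbound request and the returning response.
    """
    travel_each_way_ms = (deadline_ms - decision_ms) / 2
    return travel_each_way_ms * FIBER_MILES_PER_MS


print(round(one_way_ms(2900), 1))     # 23.4 (New York to San Francisco)
print(round(round_trip_ms(2900), 1))  # 46.8
print(max_distance_miles(20, 10))     # 620.0, i.e. "about 600 miles"
```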
03 What is active-active XDCR?
There is a lot of confusion about what XDCR is and what “active-active” means. Today, many database architects and administrators use the two terms interchangeably. This is not exactly right: active-active is not the same as XDCR; it is one way of doing XDCR. The confusion does, however, point to the ubiquity and urgency of active-active architectures.
So, first, we’ll define “active-active” and distinguish it from “active-passive.” Active-passive means that there are multiple copies of the database, but only one is modifiable (that is, active), while the others are merely informed of changes. This seems simple enough, but promoting a backup database to “active” often requires human intervention, because the backup cannot tell the difference between the active database’s data center burning to the ground and someone unplugging a network cable. Nor does active-passive help with our other goal, reducing system-wide latency.
Active-active means that you have two databases, both of which can be updated in real time, and which communicate with each other to synchronize updates. This avoids the “when to become active” decision discussed above, but it means that we now have to resolve conflicts (see more below).
Active-active-active means adding a third cluster to your deployment. This configuration is more common than you might think. If customers have strict requirements for geographic redundancy (physical isolation of data centers across multiple geographic locations) and must also perform major OS or hardware upgrades, the easiest approach is usually to bring in new equipment, in a configuration that can be tested before one of the older clusters is retired. The upshot is that most of our “active-active” deployments will become “active-active-active” at some point.
04 Why is active-active XDCR so difficult?
With enough “blood and treasure” (that is, developer time and additional third-party software), we can make almost any database support some form of active-active XDCR. That’s why so many vendors claim they can do it. So the question is not “Can Product X legitimately claim to support active-active?” but rather “Can I afford the human, technical, and financial costs of deploying Product X in active-active mode?”
Many enterprises either give up or agonize over compromise solutions because they attempt active-active XDCR without first understanding how to implement it without driving up operating costs.
The core of active-active XDCR is conflict management.
To illustrate the trouble conflicts can cause, let’s assume that we are talking about prepaid mobile phone credit and that the same database is replicated in three locations (location A, B, and C), each hundreds of miles apart. Let’s assume that the system tracks about 50 million end users, each with 50-100 records of various kinds.
In this case, it is almost impossible to avoid conflicts. The most obvious type of conflict arises when a user connects to the database at location A and spends the last of their credit, then somehow connects to location B and spends the same credit again before B has learned of the first transaction. When the record of B’s transaction then reaches location A, the two changes conflict.
This may sound unlikely, but it often happens when multiple users share a prepaid calling plan. It also happens when a user is on a boundary where traffic constantly switches between locations A and B. The bigger problem, however, is that you will not have a single conflict that you can take your time fixing before carrying on. You will have many conflicts, and by the time you find out about them, the second- and third-order consequences of bad decisions will have spread throughout your enterprise.
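The double-spend scenario can be reproduced with a toy model of two replicas. The `Site` class, its methods and the `run` helper below are hypothetical names invented for this illustration; they do not correspond to any real database API. Each site approves a debit against its own local copy of the balance and ships the change to its peer afterwards.

```python
class Site:
    """One replica holding its own copy of a user's prepaid balance."""

    def __init__(self, name, balance):
        self.name = name
        self.balance = balance
        self.outbox = []  # local debits not yet shipped to the peer

    def spend(self, amount):
        """Approve the debit only if the local copy shows enough credit."""
        if amount > self.balance:
            return False
        self.balance -= amount
        self.outbox.append(amount)
        return True

    def apply_remote(self, amount):
        """Apply a debit that was approved at the other site."""
        self.balance -= amount


def replicate(src, dst):
    """Ship every pending debit from src to dst (asynchronous replication)."""
    while src.outbox:
        dst.apply_remote(src.outbox.pop(0))


def run(replicate_between_spends):
    """Spend 10 credits at A, then at B; return approvals and final balances."""
    a, b = Site("A", 10), Site("B", 10)
    ok_a = a.spend(10)
    if replicate_between_spends:
        replicate(a, b)  # B hears about A's debit before the second spend
    ok_b = b.spend(10)
    replicate(a, b)
    replicate(b, a)
    return ok_a, ok_b, a.balance, b.balance


print(run(replicate_between_spends=False))  # (True, True, -10, -10)
print(run(replicate_between_spends=True))   # (True, False, 0, 0)
```

In the first run both sites approve the spend and the balance ends up negative everywhere; in the second, B sees A's debit in time and correctly declines.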
Why can’t we just use “two-phase commit” between the sites? Two-phase commit is a technique for coordinating changes across multiple sites. For each transaction, it takes the form of a lengthy conversation between the sites, culminating in an agreement that the data item has changed. While it does its core task, it is not practical for active-active XDCR, for two reasons. First, it doesn’t scale: we need to be able to complete thousands of transactions per second, and two-phase commit simply doesn’t scale to that level. Second, it assumes that everything is working all the time: in a two-phase commit system, all sites are assumed to be always up and visible to each other, so a site outage or network problem brings the entire system to a halt.
05 How to make active-active XDCR easier
How easy it is to handle conflicts depends on the database you use, but there are many things an enterprise can do to reduce them. Since we can’t engineer conflicts away entirely, we need to think about reducing their frequency and dealing with them effectively when they do occur.
You can take the following steps:
1. Use fast propagation
There is an inverse relationship between how quickly you propagate updates from site to site and how many conflicts you will encounter. If propagation takes 5 seconds instead of 500 milliseconds, the window in which conflicts can occur is 10 times longer, and your conflict count increases accordingly.
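A rough first-order model makes the scaling concrete. It assumes, purely for illustration, that updates to a given record arrive independently and at a steady average rate at each of two sites, and that two updates conflict when they fall within one propagation delay of each other. The function name and the example rates are made up for this sketch.

```python
def conflicting_pairs_per_hour(rate_a_per_hour, rate_b_per_hour, delay_seconds):
    """Expected number of update pairs on one record that overlap.

    An update at site A conflicts with any update at site B made within
    `delay_seconds` before or after it, so the window is twice the delay.
    """
    window_hours = 2 * delay_seconds / 3600.0
    return rate_a_per_hour * rate_b_per_hour * window_hours


slow = conflicting_pairs_per_hour(6, 6, delay_seconds=5.0)
fast = conflicting_pairs_per_hour(6, 6, delay_seconds=0.5)
print(slow, fast, slow / fast)  # the ratio is approximately 10
```

Under these assumptions the expected conflict count is linear in the propagation delay, which is why a tenfold slower link produces roughly ten times as many conflicts.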
Traditional implementations of XDCR are typically built on top of change data capture (CDC) applications originally designed to populate data warehouses. As a result, they may well be batch-based, and they will be slow. They were also designed with one-way replication in mind, which can cause problems when attempting two-way or multi-directional replication.
2. Minimize configuration and automate
Traditional cross data center replication products are often an awkward pairing of a database with a separate CDC product, and this has other negative consequences. One is the complexity of the underlying configuration: database objects and the way they are replicated are completely decoupled at the DDL level, which adds significant complexity.
Another is a general lack of automatic recovery when problems occur. Even under ideal conditions, XDCR poses operational challenges regardless of the technology being used. Customer experience with XDCR shows us that even with high automation and simple configuration, human error is the most likely cause of outages.
3. Handle conflicts programmatically
Given that conflicts are inevitable, we need to address their consequences in real systems. This must be done in an automated manner, because while the average number of conflicts per hour is small, network outages can lead to a large number of conflicts that overwhelm human decision makers.
This means having a standard format for conflict messages, suitable for automated analysis and processing. This is only the first step: because conflicts will surface at different times at each site in the deployment, we also need a mechanism to store them.
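As a sketch of what a standard conflict format plus a conflict store might look like, the following uses a small dataclass serialized as JSON and an in-memory store that groups the reports each site files for the same conflict. Every field and class name here is a hypothetical choice for illustration, not the format of any particular product.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ConflictRecord:
    """One site's report of a conflict (all field names are illustrative)."""
    table: str
    key: str
    reporting_site: str
    winner_site: str
    loser_site: str
    winner_ts_us: int
    loser_ts_us: int

    def conflict_id(self):
        """Identity of the conflict itself, independent of who reported it."""
        return (self.table, self.key, self.winner_ts_us, self.loser_ts_us)

    def to_json(self):
        return json.dumps(asdict(self), sort_keys=True)

    @staticmethod
    def from_json(text):
        return ConflictRecord(**json.loads(text))


class ConflictStore:
    """Collects reports that arrive from different sites at different times."""

    def __init__(self):
        self._by_id = {}

    def add(self, record):
        self._by_id.setdefault(record.conflict_id(), []).append(record)

    def sites_reporting(self, conflict_id):
        return sorted(r.reporting_site for r in self._by_id.get(conflict_id, []))

    def __len__(self):
        return len(self._by_id)


seen_at_a = ConflictRecord("balance", "user-42", "A", "A", "B", 2000, 1500)
seen_at_b = ConflictRecord("balance", "user-42", "B", "A", "B", 2000, 1500)

store = ConflictStore()
store.add(ConflictRecord.from_json(seen_at_a.to_json()))
store.add(ConflictRecord.from_json(seen_at_b.to_json()))
print(len(store), store.sites_reporting(seen_at_a.conflict_id()))  # 1 ['A', 'B']
```

Keying the store on the conflict rather than on the report lets downstream code wait until every site has reported before deciding how to compensate.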
4. Resolve conflicts quickly to avoid negative repercussions
Although most products use some form of “last write wins” to determine the final state of the data, we still have to deal with the downstream consequences of decisions made after a conflict occurred but before we became aware of it. In practice, this means that without a programmatic mechanism for resolving conflicts, you won’t be able to resolve the inevitable conflicts before the end user notices them. If you resolve conflicts quickly, you will be able to minimize the second- and third-order effects of inaccurate decisions that would otherwise lead to chaos.
5. Be rigorous
Once we accept that conflicts are inevitable and that we need to resolve them automatically, another requirement suddenly becomes obvious: any conflict we observe and attempt to resolve must be reported in full. It is useless to learn about only half of a conflict, because we will not be able to resolve the business problems it may have caused. So we need a conflict-reporting mechanism that meets stringent requirements, so that you can safely assume that once you’re told about a conflict, you know all about it, not just some of it.
6. Support automatic recovery
Anyone who has spent time working with distributed databases, whether on a LAN or across geographies, will tell you that failing over to a viable replacement when a node dies can be problematic, but that the real challenge is getting nodes to rejoin the cluster, or adding new nodes to a cluster that is short of capacity, without triggering a failure or a complete outage.
For many older products, the rejoin process looks, feels, and behaves like an afterthought. However, if nodes cannot be rejoined without unplanned outages and the drama that follows, you will find yourself in a world where it can take tens or even hundreds of hours just to keep your system up and running.
06 How VoltDB does active-active XDCR
In addition to being optimized for large transactional workloads, VoltDB is an ideal platform for active-active XDCR applications for a number of reasons.
VoltDB has long offered passive database replication, which provides parallel binary replication between primary and replica clusters. Version 6 of VoltDB introduced XDCR, otherwise known as active-active replication. XDCR supports bidirectional active replication across two database clusters. With this feature, VoltDB offers the ability to maintain separate, synchronized, writable databases in two different locations.
VoltDB routes incoming requests to deterministic queues, collectively known as “command logs”, each of which is consumed by a single-threaded processing engine called a partition. On each server, there is usually a 1:1 relationship between partitions and CPU cores, so each core is kept busy processing transactions quickly.
As partitions process transactions, the binary changes they make to the database are written to a so-called “Binlog.” A Binlog differs from the “redo logs” of traditional RDBMSs, which are limited to describing changes to rows: it also contains the metadata we need to resolve conflicts when they occur.
A key aspect of the Binlog is its strict ACID semantics: if a transaction updates two rows, the Binlog contains either both changes or neither. We see this as a key requirement for any XDCR system.
As a Binlog is written, it is streamed to its destination. Each partition has its own stream, which means we can scale. If the link between sites breaks, the changes are buffered until it is restored.
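The two properties described here, all-or-nothing transaction entries and buffering while the link is down, can be sketched in a few lines. `PartitionStream` is a hypothetical stand-in written for this article, not VoltDB code; the `delivered` list plays the role of the remote cluster.

```python
from collections import deque


class PartitionStream:
    """Toy model of one partition's change stream to a remote site."""

    def __init__(self):
        self.link_up = True
        self.buffer = deque()  # entries waiting to be sent
        self.delivered = []    # stands in for the remote cluster

    def publish(self, txn_changes):
        """Queue all row changes of one transaction as a single entry.

        Because the entry is one immutable tuple, the destination sees
        either every change made by the transaction or none of them.
        """
        self.buffer.append(tuple(txn_changes))
        self.flush()

    def flush(self):
        """Send buffered entries in order for as long as the link is up."""
        while self.link_up and self.buffer:
            self.delivered.append(self.buffer.popleft())


stream = PartitionStream()
stream.link_up = False                               # the link between sites breaks
stream.publish([("row1", "debit 5"), ("row2", "credit 5")])
stream.publish([("row3", "debit 1")])
print(len(stream.delivered), len(stream.buffer))     # 0 2

stream.link_up = True                                # the link is restored
stream.flush()
print(len(stream.delivered), len(stream.buffer))     # 2 0
```

Running one such stream per partition is what lets throughput grow with the number of partitions, since no stream waits on another.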
The destination cluster processes multiple Binlog streams and applies the changes they contain to its local tables (provided they do not cause conflicts). We can detect conflicts because, in XDCR mode, we store the last-modified timestamp for each row, so we can tell whether a conflict has occurred.
VoltDB does two things when a conflict occurs:
- It resolves the conflict automatically by comparing the timestamps of the changes and, if the timestamps are identical, the numeric IDs of the clusters involved.
- It reports the conflict to an export stream, which is often connected to Kafka. This means that you can write code that processes conflict events and works out how to mitigate the consequences within seconds of a conflict.
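The detection and resolution steps described above can be sketched as follows. This is an illustration of the general technique, not VoltDB's implementation: the `Row` and `RemoteChange` types, the `expected_us` field and the rule that the higher cluster ID wins a timestamp tie are all assumptions made for the example, and a plain list stands in for the export stream.

```python
from dataclasses import dataclass


@dataclass
class Row:
    value: int
    modified_us: int  # last-modified timestamp kept for each row
    cluster_id: int   # cluster that made the last change


@dataclass
class RemoteChange:
    expected_us: int  # timestamp of the row version the remote cluster changed
    value: int
    modified_us: int
    cluster_id: int


def apply_remote(row, change, conflict_log):
    """Apply a replicated change, detecting and resolving any conflict."""
    if row.modified_us == change.expected_us:
        remote_wins = True  # row is unchanged locally: no conflict
    else:
        # The row changed locally in the meantime: report, then resolve.
        # Later timestamp wins; on a tie the higher cluster ID wins
        # (the direction of the tie-break is an assumption of this sketch).
        remote_wins = (change.modified_us, change.cluster_id) > (
            row.modified_us, row.cluster_id)
        conflict_log.append({
            "local": (row.value, row.modified_us, row.cluster_id),
            "remote": (change.value, change.modified_us, change.cluster_id),
            "winner": "remote" if remote_wins else "local",
        })
    if remote_wins:
        row.value = change.value
        row.modified_us = change.modified_us
        row.cluster_id = change.cluster_id
    return row


log = []
row = Row(value=4, modified_us=200, cluster_id=1)
apply_remote(row, RemoteChange(expected_us=100, value=3,
                               modified_us=200, cluster_id=2), log)
print(row.value, log[0]["winner"])  # 3 remote
```

Because every cluster applies the same deterministic rule, all sites converge on the same row contents, while the logged event gives application code what it needs to deal with the business consequences.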
Both are needed to provide a reliable and manageable active-active XDCR solution. VoltDB not only meets the basic requirements for active-active XDCR, it also makes it both reliable and cost-effective by offering everything described above that makes active-active XDCR work well, including:
1. Minimal configuration and automation
Unlike other database products and services, VoltDB needs very little configuration. Active-active XDCR is an integrated part of VoltDB. Once our active-active XDCR is up and running, it usually stays that way.
2. Consistent high performance
Typically, changes propagate in about 400 milliseconds, plus network time. These 400 milliseconds are split roughly between ensuring that the changes are recorded on local disk in the source environment, streaming them to the destination, and applying them in the target environment.
3. Automatic recovery
If a node in a single VoltDB cluster fails, it will automatically resynchronize when it rejoins. Alternatively, you can manually add a new replacement node. VoltDB continues to serve client requests while this is going on. The resynchronizing node gets the data it needs from the surviving nodes and rejoins without significant impact on latency. With active-active XDCR, when the connection between clusters is restored, the clusters automatically resynchronize.
4. Support for programmatic conflict resolution
As mentioned above, VoltDB automatically identifies conflicting transactions when processing change streams from other sites. Conflicts are serialized as JSON and appear in the export stream at each site, making them suitable for rapid automated handling.
5. Strict ACID compliance
Strict ACID compliance is a critical feature of an active-active XDCR system. Resolving conflicts programmatically becomes nearly impossible if developers do not know whether they are looking at a complete conflicting transaction or only part of one.
07 Real-world use cases for active-active XDCR
1. XDCR for telecom billing
Telecom service providers use XDCR to manage account balances in near real time. For example, mobile operators in Europe run an application that checks a user’s account balance every time the user makes a call. Depending on the user’s location, the request is routed to the nearest data center, where the balance check is performed and a response is returned quickly. XDCR is responsible for copying the changes asynchronously to the remote database as a backup. The account balance check must complete within 200 milliseconds, so there is very little time to spare. VoltDB makes all of this possible.
2. XDCR for financial services
Financial services companies are also turning to XDCR to ensure transaction consistency and low latency. For example, a bank can implement XDCR between East Coast and West Coast data centers to support credit card transactions. If a user in California needs approval for a credit card transaction, and traffic to the Los Angeles data center is high at that moment, the bank can avoid delaying or needlessly declining the transaction by automatically rerouting it to its New York data center. Similarly, a trading center can implement XDCR to ensure that orders are still entered on time when a data center is overloaded.
VoltDB is the only data platform built for large-scale active-active XDCR without increasing costs or compromising data accuracy.
About VoltDB
VoltDB powers applications that need strong ACID guarantees and real-time intelligent decision making, enabling a connected world. No other database product can power applications that demand low latency, massive scale, high concurrency and accuracy all at the same time. Founded by 2014 Turing Award winner Dr. Mike Stonebraker, VoltDB has redesigned the relational database to address today’s growing real-time operations and machine learning challenges. Dr. Stonebraker has been researching database technology for more than 40 years and has brought many innovations to fast data, streaming data and in-memory databases. During VoltDB’s development, he realized the full potential of using in-memory transactional database technology to mine streaming data, not only to meet the latency and concurrency requirements of processing data, but also to provide real-time analysis and decision-making. VoltDB is a trusted name in the industry and has worked with leading organizations such as Nokia, Financial Times, Mitsubishi Electric, HPE, Barclays and Huawei.