I joined Uber two years ago as a mobile software engineer with some back-end experience. I worked on the app's payments feature and eventually helped rewrite the entire app (https://eng.uber.com/new-rider-app/). I then moved into engineering management (http://blog.pragmaticengineer.com/things-ive-learned-transitioning-from-engineer-to-engineering-manager/) and became responsible for the team itself. This meant much more exposure to the back end, since my team owned many of the back-end systems involved in payments.
Before joining Uber, I had almost no experience with distributed systems. I have a traditional computer science degree and more than a decade of full-stack development behind me, but while I was comfortable sketching architectures and discussing trade-offs, I knew little about distributed concepts such as consistency, availability, or idempotency.
In this article, I summarize some of the lessons I learned and applied while building a large-scale, highly available distributed system: the payment system used by Uber. The system needs to process thousands of requests per second and keep key payment functions working even when some components fail. Is this everything there is to know? Not necessarily, but knowing these concepts has at least made my job easier than ever. Let's take a look at SLAs, consistency, data durability, message durability, idempotency, and the other concepts that inevitably come up in this kind of work.
For large systems that need to handle millions of events every day, problems are almost inevitable. Before starting to plan the system, I found it more important to decide what counts as "healthy". "Healthy" should be something genuinely measurable, and a common way to measure it is with SLAs: service level agreements. Some of the most common SLAs I have used include the following.
Availability: the percentage of time the service is up and running. While everyone wants a system that is 100% available, this is difficult and extremely expensive to achieve. Even large, critical systems like the VISA card network, Gmail, or internet service providers do not stay 100% available over the course of a year; they go down for seconds, minutes, or hours. For many systems, four nines of availability (99.99%, or roughly 50 minutes of downtime per year, https://uptime.is/) is already high enough, and reaching it usually takes a lot of work behind the scenes.

Accuracy: is it acceptable for some data in the system to be inaccurate or missing? If so, what is the maximum acceptable ratio? The payment system I worked on had to be 100% accurate, meaning no data could ever be lost.

Capacity: what load is the system expected to support? This is usually measured in requests per second.

Latency: how long does the system take to respond? How quickly do 95% and 99% of requests get a response? Systems usually receive a lot of meaningless requests, so the p95 and p99 latencies (https://www.quora.com/What-is-p99-latency) better reflect the real-world situation.

Why are SLAs important for a large payment system? We were releasing a new system to replace an old one. To make the work worthwhile, the new system had to be "better" than its predecessor, and we used SLAs to define our expectations. Availability was one of the most important requirements, and once the goal is set, you have to consider the trade-offs in the architecture needed to meet it.
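To make the latency SLA above concrete, here is a minimal sketch, with made-up sample values, of how p95 and p99 latencies could be computed from observed request durations using a rough nearest-rank percentile. It is an illustration, not the measurement pipeline we actually used.

```python
# Rough nearest-rank percentile over a list of request durations (ms).
def percentile(samples, p):
    ordered = sorted(samples)
    index = min(len(ordered) - 1, int(len(ordered) * p / 100))
    return ordered[index]

# Hypothetical latency samples: mostly fast requests, a few slow outliers.
latencies_ms = [12, 15, 14, 13, 250, 16, 17, 15, 14, 900]
print("p95:", percentile(latencies_ms, 95), "ms")
print("p99:", percentile(latencies_ms, 99), "ms")
```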
As the business using the new system grows, the load will only increase. At some point the existing configuration will no longer be able to support the load and will need to be scaled. Vertical scaling and horizontal scaling are the two most commonly used approaches.
Horizontal scaling means adding more machines (nodes) to the system to gain more capacity. It is the most common way to scale distributed systems, not least because adding (virtual) machines to a cluster is often just a button click away.
Vertical scaling can be thought of as "buying a bigger/more powerful machine": switching to a (virtual) machine with more cores, more processing power, and more memory. In distributed systems, vertical scaling is usually not the first choice, because it tends to be more expensive than scaling horizontally. However, some large sites, such as Stack Overflow, have successfully scaled vertically and met their goals (https://www.slideshare.net/InfoQ/scaling-stack-overflow-keeping-it-vertical-by-obsessing-over-performance).
Why is scaling important for a large payment system? We decided early on to build a system that could scale horizontally. Although vertical scaling can work in some cases, our payment system was already running at production load, and we were pessimistic that even a single, extremely expensive mainframe could handle the current demand, let alone future demand. We also had engineers on the team who had worked at a large payment provider and had tried, and failed, to scale vertically on the most powerful machines money could buy.
Availability is important for any system. Distributed systems are typically built from machines whose individual availability is lower than the target. Say our goal is to build a system with 99.999% availability (roughly 5 minutes of downtime per year), but the machines/nodes we use only offer 99.9% availability on average (roughly 8 hours of downtime per year). The simplest way to reach the required availability is to add a large number of such machines/nodes to a cluster. Even if some nodes go down, the others keep serving, ensuring that the overall availability of the system is high enough, far higher than the availability of any individual component.
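As a back-of-the-envelope illustration of the reasoning above, the sketch below computes the availability of a group of redundant nodes, assuming node failures are independent (which real clusters only approximate). The 99.9% per-node figure matches the example in the text.

```python
# Each node is independently available 99.9% of the time.
node_availability = 0.999

def group_availability(n_nodes):
    # The group is down only when every node is down at the same time.
    return 1 - (1 - node_availability) ** n_nodes

for n in (1, 2, 3):
    print(f"{n} node(s): {group_availability(n):.9f}")
# 1 node(s): 0.999000000
# 2 node(s): 0.999999000
# 3 node(s): 0.999999999
```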
Consistency is important for highly available systems. A system is considered consistent if all nodes see and return the same data at the same time. As mentioned earlier, to achieve high enough availability we add a large number of nodes, so we inevitably have to think about the consistency of the system. To ensure every node holds the same information, nodes need to send messages to each other to stay in sync. However, those messages may fail to arrive or get lost, and some nodes may become unavailable.
I spent much of my time understanding and implementing consistency. There are many consistency models (https://en.wikipedia.org/wiki/Consistency_model); the ones most commonly used in distributed systems are strong consistency (https://www.cl.cam.ac.uk/teaching/0910/ConcDistS/11a-cons-tx.pdf), weak consistency (https://www.cl.cam.ac.uk/teaching/0910/ConcDistS/11a-cons-tx.pdf), and eventual consistency (http://sergeiturukin.com/2017/06/29/eventual-consistency.html). The Hackernoon article comparing eventual and strong consistency (https://hackernoon.com/eventual-vs-strong-consistency-in-distributed-databases-282fdad37cf7) gives a very clear and practical introduction to the trade-offs between these models. In general, the weaker the consistency requirement, the faster the system can be, but the more likely it is to return data that is not up to date.
Why is consistency important for a large payment system? Data in the system must be consistent. But how do we achieve that? For some parts of the system, only strongly consistent data will do; for example, to know whether a payment has been successfully initiated, that information must be stored in a strongly consistent way. For other parts, especially those that are not business-critical, eventual consistency is usually the more reasonable choice. For example, when displaying a rider's trip history, eventual consistency is good enough (that is, the latest trip may only show up in some components of the system for a short while, in exchange for lower latency or a smaller resource footprint).
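The following toy sketch (my own illustration, not part of Uber's system) shows the difference in behavior: a primary replicates asynchronously to a follower, so a read against the follower can briefly return stale data before the two converge.

```python
import threading
import time

class Primary:
    def __init__(self, follower, replication_delay=0.5):
        self.data = {}
        self.follower = follower
        self.delay = replication_delay

    def write(self, key, value):
        self.data[key] = value  # the primary's copy is always up to date
        # Replicate to the follower asynchronously, after a delay.
        threading.Timer(self.delay, self.follower.apply, args=(key, value)).start()

    def read(self, key):        # strongly consistent read
        return self.data.get(key)

class Follower:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value

    def read(self, key):        # eventually consistent read
        return self.data.get(key)

follower = Follower()
primary = Primary(follower)
primary.write("trip:42", "completed")
print(primary.read("trip:42"))    # 'completed' (strong)
print(follower.read("trip:42"))   # None (stale, not yet replicated)
time.sleep(1)
print(follower.read("trip:42"))   # 'completed' (converged)
```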
Durability (https://en.wikipedia.org/wiki/Durability_%28database_systems%29) means that once data has been successfully written to a data store, it remains available: even if nodes in the system go offline, crash, or suffer data corruption, the stored data should not be affected.
Different distributed systems offer different levels of durability. Some provide durability at the machine/node level, some at the cluster level, and some do not provide it at all. Some form of replication is usually used to improve durability: if the data is stored on multiple nodes, it remains available even if one or more nodes fail. There is a great article (https://drivescale.com/2017/03/whatever-happened-durability/) on why durability is so hard to achieve in distributed systems.
Why is data durability important for a payment system? For many components of a system like payments, no data can be lost; all of it is critical. To achieve durability at the cluster level, we need distributed data stores, so that completed transactions persist even if an individual instance crashes. Most distributed data stores today, such as Cassandra, MongoDB, HDFS, or DynamoDB, support multiple levels of durability and can be configured for cluster-level durability.
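As an illustration of cluster-level durability through replication, here is a hedged sketch using the DataStax Python driver for Cassandra. The node addresses, keyspace name, and data-center name are hypothetical, and the exact replication settings would depend on the cluster; this is not the configuration we used at Uber.

```python
from cassandra.cluster import Cluster

# Hypothetical contact points for a three-node cluster.
cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
session = cluster.connect()

# Each row is stored on 3 replicas, so losing a single node does not lose data.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS payments
    WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3}
""")
```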
Nodes in a distributed system need to perform computing operations, store data, and send messages between nodes. An important characteristic of the messages that are sent is how reliably they are transmitted. For business-critical systems, it is often necessary to ensure that no message is ever lost.
For distributed systems, it is common to use some kind of distributed messaging service to send messages, such as RabbitMQ, Kafka, etc. These messaging services can support (or can be configured to support) different levels of messaging reliability.
Message durability means that if a node processing a message fails, the message is still there to be processed once the failure is resolved. Message durability is usually implemented at the message queue level (https://en.wikipedia.org/wiki/Message_queue): with a durable message queue, if the queue (or a node) goes offline while a message is being sent, the message can still be delivered when it comes back online. On this subject, I recommend reading this article (https://developers.redhat.com/blog/2016/08/10/persistence-vs-durability-in-messaging/).
Why are message durability and persistence important for a large payment system? No one can afford to lose a message, such as the one generated when a rider initiates payment for a trip. This means the messaging system we use must be lossless: every message needs to be delivered. However, there is a huge difference in complexity between building a system that delivers every message exactly once and one that delivers every message at least once. We decided to implement a durable messaging system with at-least-once delivery guarantees and chose a message bus as the foundation for the payment system (we ultimately chose Kafka and set up a lossless cluster for it).
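To make at-least-once delivery concrete, here is a hedged sketch using the confluent-kafka Python client; the broker address, topic name, payload, and the process_payment_event handler are hypothetical, not our actual setup. The key ideas are acknowledged writes on the producer side and committing consumer offsets only after a message has been fully processed, which means a message can be redelivered after a crash and the handler must therefore be idempotent.

```python
from confluent_kafka import Producer, Consumer

producer = Producer({"bootstrap.servers": "broker:9092", "acks": "all"})
producer.produce("payment-events", value=b'{"trip_id": 42, "amount_cents": 1250}')
producer.flush()  # block until the broker has acknowledged the write

def process_payment_event(payload):
    # Hypothetical handler; must be idempotent, since messages may be redelivered.
    print("processing", payload)

consumer = Consumer({
    "bootstrap.servers": "broker:9092",
    "group.id": "payment-processor",
    "enable.auto.commit": False,      # commit manually, only after processing
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["payment-events"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    process_payment_event(msg.value())
    consumer.commit(message=msg)      # a crash before this line causes redelivery
```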
Distributed systems inevitably run into errors: connections drop halfway through, requests time out. Clients typically retry such requests. An idempotent system guarantees that no matter what happens, and no matter how many times a particular request is executed, its effect is applied only once. Payments are a good example. If a client initiates a payment request and the request is executed successfully but the client times out, the client may retry the same request. In an idempotent system the rider is not charged twice; in a non-idempotent one, they may well be.
Designing an idempotent distributed system requires some kind of distributed locking strategy, and this is where some of the earlier distributed systems concepts come into play. Suppose we want to implement idempotency using optimistic locking to avoid concurrent updates. For optimistic locking to work, the system needs to be strongly consistent, so that when we perform an operation we can use some kind of versioning to check whether another operation has already been initiated.
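Here is a minimal sketch, not Uber's actual implementation, of how versioning plus optimistic locking can make a charge operation idempotent. An in-memory dictionary stands in for a strongly consistent data store that supports conditional (compare-and-set) writes.

```python
class VersionConflict(Exception):
    pass

class PaymentRecords:
    def __init__(self):
        self._rows = {}   # payment_id -> (version, state)

    def read(self, payment_id):
        return self._rows.get(payment_id, (0, "NEW"))

    def compare_and_set(self, payment_id, expected_version, new_state):
        # Only succeeds if the record has not changed since it was read.
        version, _ = self._rows.get(payment_id, (0, "NEW"))
        if version != expected_version:
            raise VersionConflict(payment_id)
        self._rows[payment_id] = (version + 1, new_state)

def mark_charged(records, payment_id):
    version, state = records.read(payment_id)
    if state == "CHARGED":
        return "already charged"                     # a retry becomes a no-op
    records.compare_and_set(payment_id, version, "CHARGED")
    return "charged"

records = PaymentRecords()
print(mark_charged(records, "payment-42"))  # charged
print(mark_charged(records, "payment-42"))  # already charged
```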
Depending on the constraints of the system and the type of operation, there are many ways to implement idempotency. Designing idempotent operations is full of challenges; Ben Nadel's post (https://www.bennadel.com/blog/3390-considering-strategies-for-idempotency-without-distributed-locking-with-ben-darfler.htm) describes the different strategies he has used, all based on distributed locks or database constraints. Idempotency is perhaps one of the most overlooked issues when designing distributed systems. I have seen plenty of situations where the whole team was frustrated because some key operation was not correctly idempotent.
Why is idempotency important for a large payment system? Most importantly: to avoid double charges and double refunds. Given that our messaging system provides lossless, at-least-once delivery, we need to ensure that even if a message is delivered multiple times, the end result remains idempotent. We ultimately achieved the idempotent behavior the system needed through versioning and optimistic locking, backed by strongly consistent data sources.
Distributed systems often need to store far more data than a single node can hold. So how do we store a large dataset on a given number of machines? The most common approach is sharding (https://en.wikipedia.org/wiki/Shard_%28database_architecture%29).
The data is split horizontally and assigned to different partitions using some kind of hash. Although many distributed databases handle sharding for you, it is still an interesting topic to study in depth, especially resharding (https://medium.com/@jeeyoungk/how-sharding-works-b4dec46b3f6). In 2010, Foursquare suffered a 17-hour outage after hitting a sharding limit, and there is a good post-mortem (http://highscalability.com/blog/2010/10/15/troubles-with-sharding-what-can-we-learn-from-the-foursquare.html) that tells the whole story.
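As a small illustration of hash-based sharding, the sketch below assigns each key to one of a fixed number of shards; the key format is made up. It also hints at why resharding is painful: changing the shard count remaps most existing keys.

```python
import hashlib

NUM_SHARDS = 8

def shard_for(key: str) -> int:
    # Hash the key and map it to one of NUM_SHARDS partitions.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

print(shard_for("rider:12345"))   # always maps to the same shard
print(shard_for("rider:67890"))
```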
Many distributed systems need to replicate data or computation across multiple nodes. To ensure that operations complete consistently, a voting-based approach is used, in which an operation is considered successful only after a certain number of nodes have reached the same result. This is called a quorum.
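A minimal sketch of the quorum idea: with N replicas, an operation counts as successful only once a majority of replicas have acknowledged it.

```python
def quorum_size(num_replicas: int) -> int:
    # A majority: more than half of the replicas.
    return num_replicas // 2 + 1

def write_succeeded(acknowledgements: int, num_replicas: int) -> bool:
    return acknowledgements >= quorum_size(num_replicas)

print(quorum_size(3))           # 2
print(write_succeeded(2, 3))    # True: a majority acknowledged
print(write_succeeded(1, 3))    # False: not enough replicas responded
```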
Why are quorum and sharding important for Uber's payment system? Both are very common, fundamental concepts. I came across them myself while researching how to configure replication in Cassandra. Cassandra (like other distributed systems) uses quorum and local quorum (https://docs.datastax.com/en/archived/cassandra/3.x/cassandra/dml/dmlConfigConsistency.html#dmlConfigConsistency__about-the-quorum-level) to ensure consistency across the cluster. This also had an amusing side effect: at several of our meetings, once enough people had arrived in the room, someone would ask, "Can we start? Do we have a quorum?"
The common vocabulary used to describe programming practices, variables, interfaces, calling methods, and so on, all assumes a single machine. For distributed systems we need a different approach. One of the most common ways to describe such systems is the actor model (https://en.wikipedia.org/wiki/Actor_model), which lets us reason about the code in terms of communication. This model is popular and fits well with our mental models, for example of how people in an organization communicate with each other. Another popular way of describing distributed systems is CSP: Communicating Sequential Processes (https://en.wikipedia.org/wiki/Communicating_sequential_processes).
In the actor model, multiple actors send messages to each other and react to the messages they receive. Each actor can only do a limited set of things: create other actors, send messages to other actors, and decide how to handle the next message. With a few simple rules, complex distributed systems can be described well and can even heal themselves when an actor crashes. If you want to learn more about this topic, I recommend Brian Storti's (https://twitter.com/brianstorti) post "The actor model in 10 minutes" (https://www.brianstorti.com/the-actor-model/).
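To make the idea concrete, here is a toy actor sketch in plain Python (a mailbox queue plus a thread per actor). Real actor frameworks such as Akka add supervision, location transparency, and much more, so this is only an illustration of the message-passing style.

```python
import queue
import threading
import time

class Actor:
    def __init__(self):
        self._mailbox = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def send(self, message):
        self._mailbox.put(message)

    def _run(self):
        # Process one message at a time; actors never share state directly.
        while True:
            self.receive(self._mailbox.get())

    def receive(self, message):
        raise NotImplementedError

class Greeter(Actor):
    def receive(self, message):
        print(f"Hello, {message}!")

greeter = Greeter()
greeter.send("distributed world")
time.sleep(0.5)  # give the actor time to process before the program exits
```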
Many languages now have actor libraries or frameworks (https://en.wikipedia.org/wiki/Actor_model#Actor_libraries_and_frameworks); for example, Uber uses the Akka toolkit (https://doc.akka.io/docs/akka/2.4/intro/what-is-akka.html) in some of its systems.
Why is the actor model important for a large payment system? Many engineers were building this system together, many of them with deep experience in distributed computing. We decided to follow a standardized distributed model and the corresponding distributed concepts so that we could make the best use of existing, proven building blocks instead of reinventing the wheel.
When building large distributed systems, the goal is usually to make them resilient, elastic, and scalable. The patterns are similar whether it is a payment system or any other high-load system. Many in the industry have discovered and shared best practices for a wide range of situations, and reactive architecture is the most popular and most widely adopted approach in this area.
If you want to understand reactive architecture, I recommend reading the Reactive Manifesto (https://www.reactivemanifesto.org/) and watching this 12-minute video (https://www.lightbend.com/blog/understand-reactive-architecture-design-and-programming-in-less-than-12-minutes).
Why is reactive architecture important for a large payment system? The Akka toolkit we used to build the new payment system was heavily influenced by reactive architecture, and many of our engineers were already familiar with reactive best practices. Following reactive principles and building a responsive, resilient, elastic, message-driven system therefore came naturally. Having a model to step back and check progress against has proved very useful to me, and I will use it when building other systems in the future.
I consider myself fortunate to have been part of such a large-scale, distributed, business-critical rebuild of Uber's payment system. Working in this environment, I learned a lot about distributed concepts I previously knew nothing about. By sharing them in this article, I hope to help others get into, or keep learning about, distributed systems.
This article focused on the planning and architecture of such a system. There is still a lot to say about building, deploying, and migrating between heavily loaded systems, and about operating them reliably. I will write about that when I get the chance.