The Internet was done so well that most people think of it as a natural resource, like the ocean, rather than something man-made. When was the last time a technology with that kind of scale was so error-free?
– Alan Kay, in an interview with Dr. Dobb's Journal (2012)
[TOC]
Many applications today are data-intensive rather than compute-intensive. Raw CPU power is therefore rarely the bottleneck for such applications; the larger problems are usually the volume of data, the complexity of the data, and the speed at which the data changes.
Data-intensive applications are often built from standard building blocks that provide commonly needed functionality. For example, many applications need:
Database
Store data so that you or other applications can find it again later
Cache
Remember the results of expensive operations to speed up reads
Search indexes
Allow users to search data by keyword or filter it in a variety of ways
Stream processing
Send messages to other processes for asynchronous processing
Batch processing
Periodically crunch large amounts of accumulated data
If these features sound mundane, it’s because these **data systems** are such successful abstractions that we use them all the time without thinking and take them for granted. Most engineers wouldn’t dream of writing a storage engine from scratch, because the database is already a perfectly good tool for the job.
But reality is not that simple. Different applications have different requirements, so database systems come in many varieties, each with different characteristics. There are various approaches to caching, several ways of building search indexes, and so on. Therefore, before developing an application, it is important to figure out which tools and which approaches are best suited to the task at hand. And when a single tool cannot solve your problem on its own, you may need to combine several tools to get the job done.
This book is a journey through both the principles and the practicalities of data systems, and how to use them to design data-intensive applications. We will explore what different tools have in common, what distinguishes them, and how they achieve their characteristics.
This chapter starts with the basic goals we want to achieve: reliable, scalable, and maintainable data systems. We will clarify what these terms mean, outline some ways of thinking about these goals, and review some of the basics needed for later chapters. In the chapters that follow, we will go deeper into the design decisions that arise when building data-intensive applications.
Thinking about data systems
We generally think of tools like databases, message queues, caches, and so on as falling into several distinct categories. Although databases and message queues have some superficial similarities — they both store data for a period of time — they have very different access patterns, meaning very different performance characteristics and implementations.
So why are we lumping these things together under the umbrella of data systems?
In recent years, many new tools for data storage and processing have emerged. They are optimized for different application scenarios and no longer fit neatly into traditional categories [1]. The lines between categories are becoming blurred: for example, some data stores can be used as message queues (Redis), and some message queues have database-like durability guarantees (Apache Kafka).
Second, more and more applications have demanding and wide-ranging requirements that no single tool can satisfy on its own. Instead, the overall work is broken down into tasks that can be performed efficiently by individual tools, which are then stitched together by application code.
For example, if caching (an application-managed caching layer, using Memcached or similar) and full-text search (a full-text search server such as Elasticsearch or Solr) are kept separate from the main database, it is usually the application code’s responsibility to keep those caches and indexes in sync with the main database. Figure 1-1 shows what this architecture might look like (the details will be covered in later chapters).
Figure 1-1 A possible data system architecture using multiple components in combination
When you combine multiple tools to provide a service, the service’s Interface or Application Programming Interface (API) usually hides these implementation details from the client. At this point, you’ve basically created a new, specialized data system using smaller, generic components. This new composite data system may provide certain guarantees, such as that the cache will be invalidated or updated on writes so that external clients get consistent results. Now you are not only an application developer, but also a data system designer.
When designing a data system or service, many tricky questions arise: How do you ensure the correctness and integrity of the data when something goes wrong inside the system? How do you provide clients with consistently good performance when parts of the system are degraded? How do you scale to handle an increase in load? What does a good API for the service look like?
There are many factors that influence the design of a data system, including the skills and experience of the people involved, legacy systems and path dependence, delivery timescales, the company’s tolerance of risk, regulatory constraints, and so on; these have to be weighed on a case-by-case basis.
This book focuses on three issues that are important in most software systems:
Reliability
The system continues to work correctly (performing the correct function at the desired level of performance) even in the face of adversity (hardware faults, software faults, human error).
Scalability
There should be reasonable ways of dealing with the growth of the system (in data volume, traffic, or complexity). (See “Scalability”.)
Maintainability
Over time, many different people (engineering and operations) should be able to work on the system productively, both maintaining its current behavior and adapting it to new application scenarios. (See “Maintainability”.)
These words are often used without a clear understanding of what they mean. In the interest of engineering rigor, the rest of this chapter explores what reliability, scalability, and maintainability mean. The various techniques, architectures, and algorithms used to achieve these goals are examined in later chapters.
Reliability
People have an intuitive idea of whether something is reliable or not. Typical expectations for reliable software include:
- The application performs the function that the user expects.
- It can tolerate users making mistakes or using the software in unexpected ways.
- Its performance is good enough under the expected load and data volume.
- The system prevents unauthorized access and abuse.
If all this together means “working correctly,” then reliability can be loosely interpreted as “continuing to work correctly even when things go wrong.”
The things that can go wrong are called faults, and a system that can anticipate faults and cope with them is called fault-tolerant or resilient. The term “fault tolerance” can be misleading because it implies that a system could tolerate every possible kind of fault, which in practice is impossible. For example, if the entire earth (and every server on it) were swallowed by a black hole, tolerating that fault would require hosting the web in space — good luck getting that budget approved. So it only makes sense to talk about tolerating specific types of faults.
Note that a fault is not the same as a failure [2]. A fault is usually defined as one component of the system deviating from its specification, whereas a failure is when the system as a whole stops providing the required service to the user. The probability of a fault occurring cannot be reduced to zero, so it is usually best to design fault-tolerance mechanisms that prevent faults from causing failures. This book introduces several techniques for building reliable systems from unreliable components.
Counterintuitively, in such fault-tolerant systems it can make sense to increase the rate of faults by triggering them deliberately, for example by randomly killing individual processes without warning. Many high-risk vulnerabilities are actually caused by poor error handling [3]; by deliberately inducing faults, you ensure that the fault-tolerance machinery is continually exercised and tested, which increases confidence that the system will handle faults correctly when they occur naturally. Netflix’s Chaos Monkey [4] is an example of this approach.
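As a loose illustration of this idea (not Netflix's actual implementation), a chaos-style script might pick a random worker process from a hypothetical list and kill it without warning, forcing the recovery path to be exercised:

```python
import os
import random
import signal

# Hypothetical: PIDs of worker processes in a test or staging cluster.
# In a real deployment you would discover these from your orchestrator.
worker_pids = [4221, 4222, 4223, 4224]

def kill_random_worker(pids: list[int]) -> int:
    """Terminate one randomly chosen worker to exercise the fault-tolerance path."""
    victim = random.choice(pids)
    os.kill(victim, signal.SIGKILL)   # no warning, no graceful shutdown
    return victim
```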
Although we generally prefer tolerating faults over preventing them, there are cases where prevention is better than cure (for example, when no cure exists). This is the case with security matters: if an attacker has compromised the system and gained access to sensitive data, that cannot be undone. However, this book focuses on the kinds of faults that can be cured, as described in the following sections.
Hardware faults
When you think about the causes of system failures, hardware faults quickly come to mind: hard disks crash, memory fails, the data center loses power, someone unplugs the wrong network cable… Anyone who has worked with large data centers will tell you: once you have a lot of machines, these things happen all the time!
It has been reported that the **mean time to failure (MTTF)** of hard disks is approximately 10 to 50 years [5] [6]. So, mathematically speaking, in a storage cluster with 10,000 disks, on average one disk will fail every day.
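The back-of-envelope arithmetic behind that estimate, treating disk failures as independent and ignoring wear-out effects:

```python
# Rough back-of-envelope: expected disk failures per day in a large cluster.
mttf_years = 30            # somewhere in the reported 10-50 year range
disks = 10_000

mttf_days = mttf_years * 365
failures_per_day = disks / mttf_days
print(f"{failures_per_day:.2f} expected failures per day")  # ~0.91, i.e. about one a day
```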
To reduce the failure rate of the system, the first response is usually to add redundancy to individual hardware components. Disks can be set up in a RAID configuration, servers may have dual power supplies and hot-swappable CPUs, and data centers may have batteries and diesel generators as backup power, so that when one component dies, a redundant component can take over immediately. This approach cannot completely prevent failures caused by hardware problems, but it is well understood and is often enough to keep a machine running uninterrupted for years.
Until recently, hardware redundancy was sufficient for most applications, making it quite rare for a single machine to fail completely. As long as you can quickly restore a backup to a new machine, downtime isn’t catastrophic for most applications. Only a small number of applications where high availability is critical will require multiple sets of hardware redundancy.
However, as data volumes and applications’ computing demands have increased, more and more applications use large numbers of machines, which proportionally increases the rate of hardware faults. It is also common for virtual machine instances to become unavailable without warning on cloud platforms such as AWS [7], because these platforms are designed to prioritize flexibility and elasticity over single-machine reliability.
If software fault-tolerance techniques are used in addition to hardware redundancy, the system moves a step closer to tolerating the loss of entire machines. Such systems also have operational advantages: a single-server system requires planned downtime if the machine needs to be rebooted (for example, to apply an operating system security patch), whereas a system that can tolerate machine failure can be patched one node at a time, without shutting down the entire system.
Software errors
We often think of hardware faults as random and independent: one machine’s disk failing does not imply that a disk on another machine will fail. Unless there is some weak correlation (a common cause, such as the temperature in the server rack), it is unlikely that a large number of hardware components will fail at the same time.
Another class of fault is a systematic error within the system [8]. Such faults are harder to anticipate, and because they are correlated across nodes, they tend to cause many more system failures than uncorrelated hardware faults [5]. Examples include:
- A software bug that causes every application server instance to crash when given a particular bad input. For example, the leap second on June 30, 2012 crashed many applications simultaneously due to a bug in the Linux kernel [9].
- Runaway processes consume shared resources, including CPU time, memory, disk space, or network bandwidth.
- The services on which the system depends become slow, unresponsive, or start returning incorrect responses.
- Cascading faults, in which a small fault in one component triggers a fault in another component, which triggers more failures [10].
The bugs that cause such software faults often lie dormant for a long time until they are triggered by an unusual set of circumstances. This means that the software is making some assumption about its environment that is usually true, but which, for some reason, eventually stops being true [11].
While there is no quick fix for systematic faults in software, there are many small things that help: carefully considering the assumptions and interactions in the system; thorough testing; process isolation; allowing processes to crash and restart; and measuring, monitoring, and analyzing system behavior in production. If a system is expected to provide some guarantee (for example, in a message queue, that the number of incoming messages equals the number of outgoing messages), it can constantly check itself at runtime and raise an alert if a discrepancy is found [12].
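As a minimal sketch of that kind of runtime self-check (the counter names and the alerting hook are hypothetical, not taken from any particular system):

```python
import logging

def check_queue_invariant(messages_in: int, messages_out: int, in_flight: int) -> None:
    """Verify a simple conservation invariant: every message that entered the
    queue is either still in flight or has already been delivered."""
    if messages_in != messages_out + in_flight:
        # In a real system this would page an operator or emit a metric.
        logging.error(
            "queue invariant violated: in=%d out=%d in_flight=%d",
            messages_in, messages_out, in_flight,
        )
```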
Human error
The engineers who design and build software systems are human, and so are the operators who keep them running. Even with the best of intentions, humans are unreliable. For example, one study of large Internet services found that configuration errors by operators were the leading cause of outages, whereas hardware faults (server or network) caused only 10-25% of outages [13].
Even though humans are unreliable, what can be done to make systems reliable? The best systems use a combination of the following:
- Design systems in a way that minimizes opportunities for error. For example, well-designed abstractions, APIs, and admin interfaces make it easy to do the right thing and hard to do the wrong thing. However, if interfaces are too restrictive, people will work around them and negate their benefit, so this is a tricky balance to get right.
- Decouple the places where people make the most mistakes from the places where mistakes can cause failures. In particular, provide fully featured non-production sandbox environments where people can explore and experiment safely, using real data, without affecting real users.
- Test thoroughly at all levels, from unit tests to whole-system integration tests and manual tests [3]. Automated testing is widely used and well understood, and is especially valuable for covering corner cases that rarely arise in normal operation.
- Allow quick and easy recovery from human error, to minimize the impact when a failure does occur. For example, make it fast to roll back configuration changes, roll out new code gradually (so that any unexpected bugs affect only a small number of users), and provide tools to recompute data (in case it turns out that the old computation was incorrect).
- Configure detailed and unambiguous monitoring, such as performance metrics and error rates. In other engineering disciplines this refers to telemetry. (Once the rocket is off the ground, telemetry is crucial for tracking what’s happening and understanding failures.) Monitoring can give us early warning signals and allow us to check if there are any violations of assumptions and constraints. When problems occur, metrics data can be invaluable for problem diagnosis.
- Good management practices and adequate training — a complex and important aspect, but beyond the scope of this book.
How important is reliability?
Reliability is not just for nuclear power plants and air traffic control software; we expect many mundane applications to work reliably too. Errors in business applications cause lost productivity (and perhaps legal risks if figures are reported incorrectly), while outages of e-commerce sites can lead to significant losses in revenue and reputation.
Even in “non-critical” applications, we have a responsibility to our users. Imagine a parent storing all their photos and videos of their children in your photos app [15]. How would they feel if the database suddenly became corrupted? Are they likely to know how to recover from a backup?
In some cases, we may choose to sacrifice reliability to reduce development costs (such as prototyping products for an unproven market) or operating costs (such as extremely low-margin services), but we should be aware of what we are doing when we cut corners.
Scalability
Just because a system works reliably today doesn’t mean it will work reliably in the future. A common reason for the degradation of a service is increased load. For example, the system load has grown from 10,000 to 100,000 concurrent users, or from one million to ten million. Perhaps the magnitude of data being processed is much larger than in the past.
Scalability is the term used to describe a system’s ability to cope with increased load. Note, however, that it is not a one-dimensional label affixed to a system: saying “X is scalable” or “Y doesn’t scale” is meaningless. Instead, talking about scalability means considering questions like “If the system grows in a particular way, what are our options for coping with the growth?” and “How can we add computing resources to handle the additional load?”
Describe the load
Before we can discuss growth questions (what happens if the load doubles?), we must first be able to describe the current load on the system succinctly. Load can be described with a few numbers called load parameters. Depending on the system architecture, the best choice of parameters could be requests per second to a web server, the ratio of reads to writes in a database, the number of simultaneously active users in a chat room, the hit rate on a cache, or something else. Perhaps the average case is what matters to you, or perhaps your bottleneck is dominated by a small number of extreme cases.
To make this idea more concrete, let’s use data published by Twitter in November 2012 [16] as an example. Two of Twitter’s main operations are:
Post tweet
A user can publish a new message to their followers (4.6k requests/sec on average, over 12k requests/sec at peak).
Home page timeline
A user can view tweets posted by the people they follow (300k requests/sec).
Handling 12,000 writes per second (the peak rate for posting tweets) would be fairly straightforward. However, Twitter’s scaling challenge comes not primarily from tweet volume, but from fan-out[^ii]: each user follows many people, and each user is followed by many people.
[^ii]: Fan-out: a term borrowed from electronic engineering, where it describes the number of logic gate inputs that are attached to another gate’s output. The output needs to supply enough current to drive all the attached inputs. In transaction processing systems, we use it to describe the number of requests to other services that need to be made in order to serve one incoming request.
In general, this pair of operations can be implemented in two ways.
- To post a tweet, simply insert the new tweet into a global collection of tweets. When a user requests their home page timeline, look up all the people they follow, find the tweets from each of those users, and merge them in chronological order. In a relational database, as shown in Figure 1-2, you could write a query like this:
    SELECT tweets.*, users.*
      FROM tweets
      JOIN users ON tweets.sender_id = users.id
      JOIN follows ON follows.followee_id = users.id
     WHERE follows.follower_id = current_user
Figure 1-2 Simple relational schema for implementing a Twitter home page timeline
- Maintain a cache for each user’s home page timeline, like a mailbox of tweets for each recipient user (Figure 1-3). When a user posts a tweet, look up all the people who follow that user and insert the new tweet into each of their home page timeline caches. The request to read the home page timeline is then cheap, because its result has been computed ahead of time.
Figure 1-3 Twitter’s data pipeline for delivering tweets to followers, with load parameters as of November 2012 [16]
The first version of Twitter used method 1, but the system struggled to keep up with the load of home page timeline queries, so the company switched to method 2. This works better because the rate of published tweets is almost two orders of magnitude lower than the rate of home page timeline reads, and so in this case it is preferable to do more work at write time and less at read time.
However, the downside of method 2 is that posting a tweet now requires a lot of extra work. On average, a tweet is delivered to about 75 followers, so 4.6k tweets per second become 345k writes per second to the home page timeline caches. But this average hides the fact that the number of followers per user varies wildly: some users have more than 30 million followers, which means that a single tweet could result in over 30 million writes to home page timeline caches! Doing this in a timely manner is a huge challenge — Twitter tries to deliver tweets to followers within five seconds.
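To make the fan-out-on-write idea concrete, here is a minimal sketch using in-memory dictionaries as stand-ins for Twitter's real follower store and timeline caches (all names and data structures here are made up for illustration):

```python
from collections import defaultdict

# Hypothetical stand-ins for real storage: who follows whom, and each
# recipient user's precomputed home page timeline cache.
followers: dict[str, set[str]] = defaultdict(set)        # author -> set of followers
home_timelines: dict[str, list[dict]] = defaultdict(list)

def post_tweet(author: str, text: str) -> None:
    """Fan out on write: push the new tweet into every follower's cache."""
    tweet = {"author": author, "text": text}
    for follower in followers[author]:
        home_timelines[follower].append(tweet)

def read_home_timeline(user: str) -> list[dict]:
    """Reading is now cheap: the result was computed at write time."""
    return home_timelines[user]

followers["alice"] = {"bob", "carol"}
post_tweet("alice", "hello world")
print(read_home_timeline("bob"))   # [{'author': 'alice', 'text': 'hello world'}]
```

Note that the write cost is proportional to the author's follower count, which is exactly why heavily followed accounts are the problematic case.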
In the case of Twitter, the distribution of followers per user (possibly weighted by how often those users tweet) is a key load parameter to discuss scalability, as it determines the fan out load. Your application may have very different characteristics, but you can use similar principles to think about your load.
The final twist of the Twitter anecdote: now that method 2 is robustly implemented, Twitter has shifted to a hybrid of the two methods. Most users’ tweets are still written to their followers’ home page timeline caches at posting time, but a small number of users with very large numbers of followers (celebrities) are excluded from this fan-out. When a user reads their home page timeline, the tweets from any celebrities they follow are fetched separately and merged with that user’s timeline cache, as in method 1. This hybrid approach provides consistently good performance. We’ll revisit this example in Chapter 12, when we’ve covered more of the technical aspects.
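And a self-contained sketch of what the hybrid read path could roughly look like, under the same kind of toy data model (the names `celebrities`, `global_tweets`, and the timestamp-based merge are assumptions for illustration, not Twitter's actual design):

```python
from collections import defaultdict
import time

celebrities: set[str] = {"superstar"}                       # accounts excluded from fan-out
following: dict[str, set[str]] = defaultdict(set)           # user -> accounts they follow
global_tweets: dict[str, list[dict]] = defaultdict(list)    # author -> their tweets
home_timelines: dict[str, list[dict]] = defaultdict(list)   # precomputed caches (tweets carry "ts")

def read_home_timeline_hybrid(user: str) -> list[dict]:
    """Precomputed cache for ordinary accounts, plus a read-time pull of
    tweets from the celebrities this user follows, merged by timestamp."""
    timeline = list(home_timelines[user])
    for author in following[user] & celebrities:
        timeline.extend(global_tweets[author])               # fetched at read time
    return sorted(timeline, key=lambda t: t["ts"], reverse=True)

global_tweets["superstar"].append({"author": "superstar", "text": "hi", "ts": time.time()})
following["bob"] = {"alice", "superstar"}
print(read_home_timeline_hybrid("bob"))
```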
Describe the performance
Once the load on the system can be described, what happens when the load increases can be investigated. There are two ways to look at it:
- What happens to system performance when you increase load parameters and keep system resources (CPU, memory, network bandwidth, and so on) constant?
- How much do you need to increase the system resources if you increase a load parameter and want to keep performance unchanged?
Both of these issues require performance data, so let’s briefly look at how to describe system performance.
For a batch processing system such as Hadoop, the usual concern is throughput, i.e. the number of records that can be processed per second, or the total time it takes to run a job on a dataset of a particular size [^iii]. For online systems, what is usually more important is the service’s response time, i.e. the time between a client sending a request and receiving a response.
[^iii]: Ideally, the running time of a batch job is the size of the dataset divided by the throughput. In practice, running times tend to be longer due to data skew (data not being spread evenly across worker processes) and the need to wait for the slowest task to complete.
Latency and response time
Latency and response time are often used synonymously, but they are not the same. The response time is what the client sees: besides the actual time to process the request (the service time), it includes network delays and queueing delays. Latency is the duration that a request is waiting to be handled, during which it is latent, awaiting service [17].
Even if you send the same request over and over again, you get slightly different response times each time. Real-world systems handle a wide variety of requests, and response times can vary widely. So we need to think of response time as a measurable distribution, not as a single number.
In Figure 1-4, each gray bar represents a request to the service, and its height shows how long that request took. Most requests are fairly fast, but there are occasional outliers that take much longer. Perhaps the slow requests are intrinsically more expensive, for example because they process more data. But even if all requests should (you would think) take the same time, random variation can introduce additional latency: a context switch to a background process, the loss of a network packet and TCP retransmission, a garbage collection pause, a page fault forcing a read from disk, mechanical vibrations in the server rack [18], or many other causes.
Figure 1-4 Illustrating mean and percentiles: response times for a sample of 100 requests to a service
Typically, reports show the average response time of services. (The word “average” does not strictly refer to any particular formula, but in practice it is commonly understood as the arithmetic mean: given n values, add them up and divide by n). However, if you want to know the “typical” response time, the average is not a very good indicator because it does not tell you how many users actually experience this delay.
It is usually better to use percentiles. If you sort the list of response times from fastest to slowest, the **median** is the halfway point: for example, if your median response time is 200 milliseconds, that means half of the requests return in less than 200 milliseconds, and the other half take longer than that.
If you want to know how long users typically have to wait, the median is a good metric: half of user requests are served in less than the median response time, and the other half take longer than the median. The median is also known as the 50th percentile, sometimes abbreviated as p50. Note that the median refers to a single request; if a user makes several requests (during a session, or because a page contains multiple resources), the probability that at least one request is slower than the median is much greater than 50%.
To get a sense of how bad your outliers are, look at higher percentiles, such as the 95th, 99th, and 99.9th percentiles (abbreviated p95, p99, and p999). They are the response time thresholds at which 95%, 99%, or 99.9% of requests are faster than that particular threshold. For example, if the 95th percentile response time is 1.5 seconds, that means 95 out of 100 requests take less than 1.5 seconds, and 5 out of 100 requests take 1.5 seconds or more. This is illustrated in Figure 1-4.
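As a minimal sketch of how such percentiles can be computed from a list of measured response times (using the simple nearest-rank definition; production systems typically use the approximation algorithms mentioned later in this chapter):

```python
def percentile(response_times_ms: list[float], p: float) -> float:
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    ordered = sorted(response_times_ms)
    rank = max(1, round(p / 100 * len(ordered)))   # 1-based nearest rank
    return ordered[rank - 1]

samples = [35, 42, 48, 51, 60, 73, 88, 95, 120, 1500]  # made-up measurements in ms
print(percentile(samples, 50))   # median of the sample
print(percentile(samples, 95))   # p95: dominated by the single slow outlier
```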
High percentiles of response times (also known as tail latencies) are important because they directly affect users’ experience of the service. Amazon, for example, describes response time requirements for internal services in terms of the 99.9th percentile, even though it affects only one in a thousand requests. This is because the customers with the slowest requests are often those with the most data, arguably the most valuable customers, because they pay [19]. Keeping the site responsive for them is very important to maintaining customer satisfaction. Amazon observed that a 100 ms increase in response time reduces sales by 1% [20]; others report that a one-second slowdown reduces a customer satisfaction metric by 16% [21, 22].
On the other hand, optimizing the 99.99th percentile (the slowest 1 in 10,000 requests) was deemed too expensive to provide sufficient benefit for Amazon’s purposes. Reducing response times at very high percentiles is difficult because they are easily affected by random events outside your control, and the benefits are diminishing.
Percentiles are commonly used in service level objectives (SLOs) and service level agreements (SLAs), contracts that define the expected performance and availability of a service. An SLA may state that the service is considered to be working properly if the median response time is less than 200 milliseconds and the 99.9th percentile is under 1 second (if response times are longer, the service might as well be considered down). These metrics set expectations for clients and allow customers to demand a refund if the SLA is not met.
Queueing delays often account for a large part of the response time at high percentiles. Because a server can only process a small number of things in parallel (limited, for example, by its number of CPU cores), it only takes a small number of slow requests to hold up the processing of subsequent requests, an effect sometimes referred to as head-of-line blocking. Even if those subsequent requests are processed very quickly on the server, the client will see a slow overall response time due to the time spent waiting for the prior requests to complete. Because of this effect, it is important to measure response times on the client side.
When generating load artificially in order to test the scalability of a system, the load-generating client needs to keep sending requests independently of the response time. If the client waits for the previous request to complete before sending the next one, that behavior has the effect of artificially keeping the queues shorter in the test than they would be in reality, which skews the measurements [23].
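A minimal sketch of such an open-loop load generator, which sends requests on a fixed schedule instead of waiting for each response (the `send_request` body is a placeholder, not a real client):

```python
import asyncio
import time

async def send_request(i: int) -> float:
    """Stand-in for a real network call; returns the observed response time."""
    start = time.perf_counter()
    await asyncio.sleep(0.05)              # pretend the service took 50 ms
    return time.perf_counter() - start

async def open_loop_load(rate_per_sec: float, duration_sec: float) -> list[float]:
    interval = 1.0 / rate_per_sec
    start = time.perf_counter()
    tasks = []
    n = 0
    while time.perf_counter() - start < duration_sec:
        # Send on schedule, regardless of whether earlier requests have finished.
        tasks.append(asyncio.create_task(send_request(n)))
        n += 1
        next_send = start + n * interval
        await asyncio.sleep(max(0.0, next_send - time.perf_counter()))
    return list(await asyncio.gather(*tasks))

# response_times = asyncio.run(open_loop_load(rate_per_sec=100, duration_sec=5))
```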
Percentiles in practice
High percentiles become particularly important in backend services that are called multiple times as part of serving a single end-user request. Even if the calls are made in parallel, the end-user request still needs to wait for the slowest of the parallel calls to complete. As shown in Figure 1-5, it takes just one slow call to slow down the entire end-user request. Even if only a small percentage of backend calls are slow, the chance of getting a slow call increases if an end-user request requires multiple backend calls, and so a higher proportion of end-user requests end up being slow (an effect known as tail latency amplification [24]).
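A rough back-of-envelope illustration of why this amplification happens, assuming (simplistically) that backend calls are independent and each one is "slow" with probability equal to the tail fraction:

```python
# If 1% of backend calls are "slow" (worse than p99), an end-user request
# that fans out to n backends hits at least one slow call with probability
#   1 - (1 - 0.01) ** n
for n in (1, 10, 100):
    print(n, 1 - 0.99 ** n)   # 1 -> 1%, 10 -> ~9.6%, 100 -> ~63%
```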
If you want to add response time percentiles to the monitoring dashboards for your services, you need to calculate them efficiently on an ongoing basis. For example, you might want to keep a rolling window of the response times of requests in the last 10 minutes. Every minute, you calculate the median and various percentiles over the values in that window and plot those metrics on a graph.
The naive implementation is to keep a list of response times for all requests within the time window and to sort that list every minute. If that is too inefficient for you, there are algorithms that can calculate a good approximation of percentiles at minimal CPU and memory cost (such as forward decay [25], t-digest [26], or HdrHistogram [27]). Beware that averaging percentiles (e.g., to reduce the time resolution or to combine data from several machines) is mathematically meaningless; the correct way to aggregate response time data is to add the histograms [28].
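As a small sketch of that last point, aggregating across machines means summing per-bucket counts and only then reading percentiles off the merged histogram (the bucket boundaries below are hand-picked for illustration; real tools such as HdrHistogram choose them far more carefully):

```python
from bisect import bisect_left

BUCKETS_MS = [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, float("inf")]

def histogram(response_times_ms: list[float]) -> list[int]:
    """Count how many response times fall into each bucket (by upper bound)."""
    counts = [0] * len(BUCKETS_MS)
    for t in response_times_ms:
        counts[bisect_left(BUCKETS_MS, t)] += 1
    return counts

def merge(histograms: list[list[int]]) -> list[int]:
    """Aggregate across machines by summing bucket counts, never by averaging percentiles."""
    return [sum(col) for col in zip(*histograms)]

def approx_percentile(counts: list[int], p: float) -> float:
    """Upper bound of the bucket containing the p-th percentile."""
    target = p / 100 * sum(counts)
    running = 0
    for upper, c in zip(BUCKETS_MS, counts):
        running += c
        if running >= target:
            return upper
    return BUCKETS_MS[-1]

machine_a = histogram([3, 7, 12, 48, 900])
machine_b = histogram([4, 6, 9, 15, 30])
print(approx_percentile(merge([machine_a, machine_b]), 95))   # 1000 (the slow outlier's bucket)
```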
Figure 1-5 When several backend calls are needed to serve a request, it takes just a single slow backend request to slow down the entire end-user request
Approaches for coping with load
We have now discussed the parameters used to describe load and the metrics used to measure performance. It’s time to start talking seriously about scalability: How do you maintain good performance as load parameters increase?
An architecture that is appropriate for one level of load is unlikely to cope with 10 times that load. If you are developing a rapidly growing service, you may need to rethink your architecture every time the load increases by an order of magnitude — or even more often.
There is a lot of debate about the dichotomy between scaling up (vertical scaling, moving to a more powerful machine) and scaling out (horizontal scaling, distributing the load across many smaller machines). Distributing load across multiple machines is also known as a “shared-nothing” architecture. A system that can run on a single machine is generally simpler, but high-end machines can be very expensive, so very intensive workloads often cannot avoid scaling out. In practice, good architectures involve a pragmatic mixture of the two approaches: for example, using a few sufficiently powerful machines can be simpler and cheaper than a large number of small virtual machines.
Some systems are elastic, which means they can automatically increase computing resources when they detect an increase in load, while others scale manually (manually analyzing capacity and deciding to add more machines to the system). An elastic system can be useful if the load is highly unpredictable, but scaling up the system manually is simpler and may result in fewer unexpected operations (see “Rebalancing partitions”).
Deploying **stateless services** across multiple machines is fairly straightforward, but moving a stateful data system from a single node to a distributed setup can introduce a lot of additional complexity. For this reason, common wisdom has been to keep the database on a single node (scale up) until scaling cost or availability requirements force you to make it distributed.
As the tools and abstractions of distributed systems get better, this common sense may change, at least for some types of applications. You can expect distributed data systems to become the default in the future, even for scenarios that don’t handle large amounts of data or traffic. The rest of the book introduces a variety of distributed data systems, discussing not only their scalability, but also their ease of use and maintainability.
Large-scale system architectures are usually highly specific to the application — there is no one-size-fits-all scalable architecture (informally: magic scaling sauce). The problem may be the volume of reads, the volume of writes, the volume of data to store, the complexity of the data, the response time requirements, the access patterns, or some mixture of all of these.
For example, a system handling 100,000 requests per second (each 1 kB in size) will look very different from a system handling three requests per minute (each 2GB in size), even though both systems have the same data throughput.
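A quick sanity check that the two hypothetical workloads really do have the same data throughput:

```python
# 100,000 requests/sec * 1 kB each
high_rate = 100_000 * 1            # kB per second

# 3 requests/min * 2 GB each, expressed in kB per second (1 GB taken as 10**6 kB)
low_rate = 3 * 2 * 10**6 / 60      # = 100,000 kB per second

print(high_rate, low_rate)         # both 100,000 kB/s, i.e. ~100 MB/s
```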
An architecture that scales well for a particular application is built around assumptions about which operations will be common and which will be rare, that is, around its load parameters. If those assumptions turn out to be wrong, the engineering effort spent on scaling is at best wasted, and at worst counterproductive. In an early-stage startup or an unproven product, it is usually more important to be able to iterate quickly on product features than to scale to some hypothetical future load.
Although these architectures are application-specific, scalable architectures are usually built from general-purpose building blocks, arranged in familiar patterns. In this book, we will discuss these building blocks and patterns.
Maintainability
As we all know, most of the cost of software is not in the initial development phase, but in the ongoing maintenance phase, including fixing bugs, keeping the system up and running, investigating failures, adapting to new platforms, making changes for new scenarios, paying off technical debt, adding new features, and so on.
Unfortunately, many people working on software systems dislike maintaining so-called legacy systems — perhaps because it involves fixing other people’s bugs, working with outdated platforms, or with systems that were forced to do work they were never intended for. Every legacy system is unpleasant in its own way, so it is hard to give general recommendations for dealing with them.
But we can, and should, design software in such a way that we try to minimize the pain of maintenance from the beginning so that our software systems don’t become legacy systems. To do this, we will pay special attention to three design principles for software systems:
Operability
Facilitate the operation and maintenance team to keep the system running smoothly.
Simplicity
Remove as much complexity from the system as possible to make it easy for new engineers to understand. (Note that this is not the same as user interface simplicity.)
Evolability
Enables engineers to easily make changes to the system in the future and adapt to new application scenarios as requirements change. Also known as extensibility, modifiability or plasticity.
As with reliability and scalability mentioned earlier, there is no easy solution to achieving these goals. But we’re going to try to imagine a system that’s operable, simple and evolvable.
Operability: Making life easy for operations
It has been suggested that “good operations can often work around the limitations of bad (or incomplete) software, but good software cannot run reliably with bad operations”. While some aspects of operations can and should be automated, it is still up to humans to set up that automation in the first place and to make sure it is working correctly.
Operations teams are vital to keeping a software system running smoothly. A good operations team is typically responsible for the following, and more [29]:
- Monitor the health of the system and quickly restore service in case of poor service status
- Trace the cause of the problem, such as system failure or performance degradation
- Keep your software and platform up-to-date, such as security patches
- Understand the interactions between systems so that you can circumvent abnormal changes before they cause damage.
- Anticipate future problems and solve them before they arise (e.g., capacity planning)
- Establish deployment, configuration, and management good practices and write tools
- Perform complex maintenance tasks, such as migrating applications from one platform to another
- Maintains system security when configuration changes
- Define workflows to make operations predictable and maintain a stable production environment.
- Preserve the organization’s knowledge about the system, even as individual people come and go
Good operability means easier day-to-day work so that operations teams can focus on high-value things. Data systems can make everyday tasks easier in a variety of ways:
- Provide visibility into the internal state of the system and runtime behavior through good monitoring
- Provides good support for automation and integration of systems with standardized tools
- Avoid reliance on a single machine (allow machine downtime for maintenance while the entire system continues to operate uninterrupted)
- Provide good documentation and an easy-to-understand operational model (“If you do X, Y will happen”)
- Provides good default behavior, but also allows administrators the freedom to override defaults if needed
- Self-repair when available, but also allow administrators to manually control system state when needed
- Predictable behavior minimizes accidents
Simplicity: Manage complexity
Small software projects can have delightfully simple and expressive code, but as projects get larger, the code often becomes very complex and difficult to understand. This complexity slows down everyone who needs to work on the system, further increasing maintenance costs. A software project mired in complexity is sometimes described as a big ball of mud [30].
There has been much discussion of the topic of complexity and its many possible symptoms: explosion of the state space, tight coupling between modules, tangled dependencies, inconsistent naming and terminology, performance hacks, special cases to work around issues elsewhere, and more.
When maintenance is difficult due to complexity, budgets and schedules are often overrun. There is also a greater risk of introducing errors when making changes in complex software: hidden assumptions, unintended consequences, and unintended interactions are more likely to go unnoticed when developers struggle to understand the system. Conversely, reducing complexity greatly improves the maintainability of software, so simplicity should be a key goal in building systems.
Simplifying a system does not necessarily mean reducing its functionality; it can also mean removing accidental complexity. Moseley and Marks [32] define accidental complexity as complexity that arises from the specific implementation, rather than being inherent in the problem the system solves from the user’s perspective.
One of the best tools used to remove the extra complexity is abstraction. A good abstraction hides a lot of implementation details under a clean, easy-to-understand look and feel. A good abstraction can also be used for a wide variety of applications. Reusing abstractions is not only more efficient than reinventing many wheels, but also helps develop high-quality software. Improvements in the quality of abstract components will benefit all applications that use them.
For example, a high-level programming language is an abstraction that hides machine code, CPU registers, and system calls. SQL is also an abstraction that hides complex disk and in-memory data structures, concurrent requests from other clients, and inconsistencies after crashes. Of course, when programming in a high-level language, we are still using machine code; we are just not using it directly, because the programming language abstraction saves us from having to think about those implementation details.
Abstractions can help keep a system’s complexity at a manageable level, but finding good abstractions is very difficult. There are many good algorithms in distributed systems, but it is not clear what abstractions they should be packaged into.
This book will focus on good abstractions that allow us to extract parts of large systems into well-defined, reusable components.
Evolvability: Embrace change
It is extremely unlikely that a system’s requirements will remain unchanged forever. They are much more likely to be in a state of constant flux: you learn new facts, unanticipated application scenarios emerge, business priorities change, users demand new features, new platforms replace old ones, legal or regulatory requirements change, growth of the system forces architectural changes, and so on.
In terms of organizational processes, Agile work patterns provide a framework for adapting to change. The Agile community has also developed technical tools and patterns, such as Test-driven development (TDD) and Refactoring, that are useful for developing software in a frequently changing environment.
Most of the discussion of these agile techniques focuses on a fairly small scale (a few code files in the same application). This book will explore ways to improve agility at the level of larger data systems, which may consist of several different applications or services. For example, how would you “refactor” the architecture of Twitter in order to change the method of assembling the homepage timeline from method 1 to method 2?
The ease with which a data system can be modified and adapted to changing requirements is closely related to simplicity and abstraction: simple and understandable systems are often easier to modify than complex ones. But because this is such an important concept, we will use a different term to refer to agility at the data system level: evolvability [34].
Summary
This chapter explores some basic ways of thinking about data-intensive applications. These principles will guide us through the rest of the book, which will delve into technical details.
An application must satisfy a variety of requirements in order to be useful. There are functional requirements (what it should do, such as allowing data to be stored, retrieved, searched, and processed in various ways) and nonfunctional requirements (general properties such as security, reliability, compliance, scalability, compatibility, and maintainability). Reliability, scalability, and maintainability were discussed in detail in this chapter.
**Reliability** means making systems work correctly even when faults occur. Faults can occur in hardware (typically random and uncorrelated), software (bugs are typically systematic and hard to deal with), and humans (who inevitably make mistakes from time to time). Fault-tolerance techniques can hide certain types of faults from end users.
**Scalability** means having strategies for keeping performance good even when load increases. In order to discuss scalability, we first need ways of describing load and performance quantitatively. We briefly looked at Twitter’s home page timeline as an example of describing load, and used response time percentiles as a way of measuring performance. In a scalable system, processing capacity can be added in order to remain reliable under high load.
**Maintainability** has many facets, but in essence it is about quality of life for the engineering and operations teams who work with the system. Good abstractions can help reduce complexity and make the system easier to modify and adapt to new application scenarios. Good operability means having good visibility into the system’s health and effective ways of managing it.
Unfortunately, making applications reliable, scalable, or maintainable is not easy. However, certain patterns and techniques keep reappearing in different applications. In the following chapters, we will look at some examples of data systems and examine how they work toward these goals.
Later in the book, in Part 3, we will look at patterns for systems that consist of several components working together, like the one in Figure 1-1.
References
1. Michael Stonebraker and Uğur Çetintemel: “‘One Size Fits All’: An Idea Whose Time Has Come and Gone,” at 21st International Conference on Data Engineering (ICDE), April 2005.
2. Walter L. Heimerdinger and Charles B. Weinstock: “A Conceptual Framework for System Fault Tolerance,” Technical Report CMU/SEI-92-TR-033, Software Engineering Institute, Carnegie Mellon University, October 1992.
3. Ding Yuan, Yu Luo, Xin Zhuang, et al.: “Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems,” at 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI), October 2014.
4. Yury Izrailevsky and Ariel Tseitlin: “The Netflix Simian Army,” techblog.netflix.com, July 19, 2011.
5. Daniel Ford, François Labelle, Florentina I. Popovici, et al.: “Availability in Globally Distributed Storage Systems,” at 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI), October 2010.
6. Brian Beach: “Hard Drive Reliability Update — Sep 2014,” backblaze.com, September 23, 2014.
7. Laurie Voss: “AWS: The Good, the Bad and the Ugly,” blog.awe.sm, December 18, 2012.
8. Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, et al.: “What Bugs Live in the Cloud?,” at 5th ACM Symposium on Cloud Computing (SoCC), November 2014. doi:10.1145/2670979.2670986
9. Nelson Minar: “Leap Second Crashes Half the Internet,” somebits.com, July 3, 2012.
10. Amazon Web Services: “Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region,” aws.amazon.com, April 29, 2011.
11. Richard I. Cook: “How Complex Systems Fail,” Cognitive Technologies Laboratory, April 2000.
12. Jay Kreps: “Getting Real About Distributed System Reliability,” empathybox.com, March 19, 2012.
13. David Oppenheimer, Archana Ganapathi, and David A. Patterson: “Why Do Internet Services Fail, and What Can Be Done About It?,” at 4th USENIX Symposium on Internet Technologies and Systems (USITS), March 2003.
14. Nathan Marz: “Principles of Software Engineering, Part 1,” nathanmarz.com, April 2, 2013.
15. Michael Jurewitz: “The Human Impact of Bugs,” jury.me, March 15, 2013.
16. Raffi Krikorian: “Timelines at Scale,” at QCon San Francisco, November 2012.
17. Martin Fowler: Patterns of Enterprise Application Architecture. Addison Wesley, 2002. ISBN: 978-0-321-12742-6
18. Kelly Sommers: “After all that run around, what caused 500ms disk latency even when we replaced physical server?” twitter.com, November 13, 2014.
19. Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, et al.: “Dynamo: Amazon’s Highly Available Key-Value Store,” at 21st ACM Symposium on Operating Systems Principles (SOSP), October 2007.
20. Greg Linden: “Make Data Useful,” slides from presentation at Stanford University Data Mining class (CS345), December 2006.
21. Tammy Everts: “The Real Cost of Slow Time vs Downtime,” webperformancetoday.com, November 12, 2014.
22. Jake Brutlag: “Speed Matters for Google Web Search,” googleresearch.blogspot.co.uk, June 22, 2009.
23. Tyler Treat: “Everything You Know About Latency Is Wrong,” bravenewgeek.com, December 12, 2015.
24. Jeffrey Dean and Luiz André Barroso: “The Tail at Scale,” Communications of the ACM, volume 56, number 2, pages 74-80, February 2013. doi:10.1145/2408776.2408794
25. Graham Cormode, Vladislav Shkapenyuk, Divesh Srivastava, and Bojian Xu: “Forward Decay: A Practical Time Decay Model for Streaming Systems,” at 25th IEEE International Conference on Data Engineering (ICDE), March 2009.
26. Ted Dunning and Otmar Ertl: “Computing Extremely Accurate Quantiles Using t-Digests,” github.com, March 2014.
27. Gil Tene: “HdrHistogram,” hdrhistogram.org.
28. Baron Schwartz: “Why Percentiles Don’t Work the Way You Think,” vividcortex.com, December 7, 2015.
29. James Hamilton: “On Designing and Deploying Internet-Scale Services,” at 21st Large Installation System Administration Conference (LISA), November 2007.
30. Brian Foote and Joseph Yoder: “Big Ball of Mud,” at 4th Conference on Pattern Languages of Programs (PLoP), September 1997.
31. Frederick P. Brooks: “No Silver Bullet — Essence and Accident in Software Engineering,” in The Mythical Man-Month, Anniversary edition, Addison-Wesley, 1995. ISBN: 978-0-201-83595-3
32. Ben Moseley and Peter Marks: “Out of the Tar Pit,” at BCS Software Practice Advancement (SPA), 2006.
33. Rich Hickey: “Simple Made Easy,” at Strange Loop, September 2011.
34. Hongyu Pei Breivold, Ivica Crnkovic, and Peter J. Eriksson: “Analyzing Software Evolvability,” at 32nd Annual IEEE International Computer Software and Applications Conference (COMPSAC), July 2008. doi:10.1109/COMPSAC.2008.50