• Original address: github.com/donnemartin…
  • The Nuggets translation Project
  • Translators: XatMassacrE, L9m, Airmacho, Xiaoyusilen, Jifaxu
  • Please continue to follow the Chinese maintenance link for the latest content.

Introduction to System Design








Translations

Interested in translating? The following translations are in progress:

  • Brazilian Portuguese
  • Simplified Chinese (completed)
  • Turkish

purpose

Learn how to design large systems.

Prepare for the system design interview.

Learn how to design large systems

Learning how to design scalable systems will help you become a better engineer.

System design is a broad topic. Resources on system design principles are plentiful on the Internet.

This repository is an organized collection of resources that can help you learn how to build systems at scale.

Learn from the open source community

This is an early draft of a continually updated open source project.

Contributions are welcome!

Prepare for the system design interview

In addition to coding interviews, system design is a required component of the technical interview process at many tech companies.

Practice common system design interview questions and compare your results with sample solutions: discussions, code, and diagrams.

Other interview preparation topics:

  • Learning guidance
  • How to handle a system design interview question
  • System design interview questions, with solutions
  • Object-oriented design interview questions, with solutions
  • Additional system design interview questions

flashcards








The flashcard decks provided here use spaced repetition to help you retain key system design concepts.

  • System design deck
  • System design exercises deck
  • Object-oriented design exercises deck

Great for use anytime, anywhere.

Coding resource: interactive coding challenges

Are you looking for resources to prepare for a programming interview?








Check out the sister repository Interactive Coding Challenges, which includes an additional flashcard deck:

  • Coding deck

contribution

Learn from the community.

Feel free to submit pull requests to help:

  • Fix errors
  • Improve sections
  • Add new sections

Content that needs some polish is placed under development.

See the contribution guide.

Index of system design topics

Summaries of various system design topics, including pros and cons. Everything is a trade-off.

Each chapter contains links to more resources.








  • System design topics: start here
    • Step 1: Review the scalability video lecture
    • Step 2: Review the scalability article
    • Next steps
  • Performance and scalability
  • Latency and throughput
  • Availability and consistency
    • CAP theorem
      • CP – consistency and partition tolerance
      • AP – availability and partition tolerance
  • Consistency patterns
    • Weak consistency
    • Eventual consistency
    • Strong consistency
  • Availability patterns
    • Failover
    • Replication
  • Domain name system
  • Content delivery network (CDN)
    • CDN push
    • CDN pull
  • Load balancer
    • Active-passive failover
    • Active-active failover
    • Layer 4 load balancing
    • Layer 7 load balancing
    • Horizontal scaling
  • Reverse proxy (web server)
    • Load balancer vs reverse proxy
  • Application layer
    • Microservices
    • Service discovery
  • Database
    • Relational database management system (RDBMS)
      • Master-slave replication
      • Master-master replication
      • Federation
      • Sharding
      • Denormalization
      • SQL tuning
    • NoSQL
      • Key-value store
      • Document store
      • Wide column store
      • Graph database
    • SQL or NoSQL
  • Cache
    • Client cache
    • CDN cache
    • Web server cache
    • Database cache
    • Application cache
    • Database query-level caching
    • Object-level caching
    • When to update the cache
      • Cache-aside
      • Write-through
      • Write-behind (write-back)
      • Refresh-ahead
  • Asynchronism
    • Message queues
    • Task queues
    • Back pressure
  • Communication
    • Hypertext Transfer Protocol (HTTP)
    • Transmission Control Protocol (TCP)
    • User Datagram Protocol (UDP)
    • Remote procedure call (RPC)
    • Representational state transfer (REST)
  • Security
  • Appendix
    • Powers of two table
    • Latency numbers every programmer should know
    • Additional system design interview questions
    • Real-world architectures
    • Company architectures
    • Company engineering blogs
  • Under development
  • Credits
  • Contact info
  • License

Learning guidance

Review the recommended topics based on your interview timeline (short, medium, or long).


Q: Do I need to know everything here for an interview?

A: No, you don’t need to know everything just to prepare for the interview.

What you will be asked in an interview depends on the following factors:

  • Your experience
  • Your technical background
  • The position you are interviewing for
  • The company you are interviewing with
  • Luck

Experienced candidates are usually expected to know more about system design. The architect or team lead is expected to know more than just their individual contributions. Top tech companies also usually have one or more system design interviews.

Interviews will be broad in some areas and dive deep in others. Knowing a little about a variety of key system design topics will help. Adjust the following guide based on your timeline, experience, the positions you are interviewing for, and the companies you are interviewing with.

  • Short timeline – Aim for breadth with system design topics. Practice by solving some interview questions.
  • Medium timeline – Aim for breadth and some depth with system design topics. Practice by solving many interview questions.
  • Long timeline – Aim for breadth and more depth with system design topics. Practice by solving most interview questions.
                                                                     Short    Medium   Long
Read through the System design topics index for a broad
understanding of how systems work                                    :+1:     :+1:     :+1:
Read through a few articles in the Company engineering blogs
for the companies you are interviewing with                          :+1:     :+1:     :+1:
Read through a few Real-world architectures                          :+1:     :+1:     :+1:
Review How to handle a system design interview question              :+1:     :+1:     :+1:
Work through System design interview questions with solutions        Some     Many     Most
Work through Object-oriented design interview questions
with solutions                                                        Some     Many     Most
Review Additional system design interview questions                  Some     Many     Most

How to handle a system design interview question

The system design interview is an open-ended conversation. You are expected to lead it.

You can use the following steps to guide the discussion. To help solidify this process, work through the System design interview questions with solutions section using the following steps.

Step 1: Describe usage scenarios, constraints, and assumptions

Gather requirements and scope the problem. Ask questions to clarify use cases and constraints. Discuss assumptions.

  • Who will use it?
  • How will they use it?
  • How many users?
  • What is the role of the system?
  • What are the inputs and outputs of the system?
  • How much data do we want to process?
  • How many requests do we want to process per second?
  • What is the expected read-to-write ratio?

Step 2: Create a high-level design

Outline a high-level design with all important components.

  • Sketch the main components and connections
  • Justify your ideas

Step 3: Design the core components

Dive into the details of each core component. For example, if asked to design a URL shortening service, discuss the following (a minimal hashing sketch follows this list):

  • Generating and storing a hash of the full URL
    • MD5 and Base62
    • Hash collisions
    • SQL or NoSQL
    • Database schema
  • Translating a hashed URL to the full URL
    • Database lookup
  • API and object-oriented design
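To make the MD5 and Base62 points above concrete, here is a minimal sketch in Python. The Base62 alphabet, the 7-character key length, and the url_to_key helper are illustrative assumptions, not part of the original discussion; a real design would also handle collisions against the data store.

import hashlib

BASE62 = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62_encode(num):
    """Encode a non-negative integer using the Base62 alphabet."""
    if num == 0:
        return BASE62[0]
    digits = []
    while num > 0:
        num, remainder = divmod(num, 62)
        digits.append(BASE62[remainder])
    return "".join(reversed(digits))

def url_to_key(url, length=7):
    """Hash the full URL with MD5, then Base62-encode a prefix of the digest.

    Collisions are still possible, so a real system would check the data store
    and re-hash (for example, with a salt) on conflict.
    """
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return base62_encode(int(digest, 16))[:length]

print(url_to_key("https://example.com/some/very/long/path"))  # deterministic 7-character key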

Step 4: Scale the design

Identify and address bottlenecks, given the constraints. For example, do you need the following to address scalability issues?

  • Load balancer
  • Horizontal scaling
  • Caching
  • Database sharding

Discuss potential solutions and trade-offs. Everything is a trade-off. Address bottlenecks using principles of scalable system design.

Back-of-the-envelope calculations

You might be asked to do some estimates by hand (a worked sketch follows this list). See the Appendix for the following resources:

  • Use back-of-the-envelope calculations
  • Powers of two table
  • Latency numbers every programmer should know
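A sketch of the kind of arithmetic involved; all input numbers below are made-up assumptions purely for illustration.

# Back-of-the-envelope sketch: rough QPS and storage estimates for a
# hypothetical write-heavy service.  All input numbers are assumptions.
writes_per_month = 40_000_000          # assumed write volume
read_write_ratio = 100                  # assumed 100:1 reads to writes
bytes_per_record = 1_000                # assumed average record size
seconds_per_month = 30 * 24 * 3600      # roughly 2.5 million seconds

write_qps = writes_per_month / seconds_per_month
read_qps = write_qps * read_write_ratio
storage_per_year_gb = writes_per_month * 12 * bytes_per_record / 1e9

print(f"~{write_qps:.0f} write QPS, ~{read_qps:.0f} read QPS")
print(f"~{storage_per_year_gb:.0f} GB of new data per year")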

Related resources and further reading

Check out the following links to get a better idea of what to expect:

  • How to ace a systems design interview
  • The system design interview
  • Intro to architecture and systems design interviews

System design interview questions with solutions

Common system design interview questions with sample discussions, code, and diagrams.

Solutions are linked to content in the solutions/ folder.

  • Design Pastebin.com (or bit.ly) – Solution
  • Design the Twitter timeline and search (or Facebook feed and search) – Solution
  • Design a web crawler – Solution
  • Design Mint.com – Solution
  • Design the data structures for a social network – Solution
  • Design a key-value store for a search engine – Solution
  • Design Amazon's sales rankings by category feature – Solution
  • Design a million-user system on AWS – Solution
  • Add a system design question – Contribute

Design Pastebin.com (or bit.ly)

See practices and solutions


Designing Twitter timelines and searches (or Facebook feeds and searches)

See practices and solutions


Design a web crawler

See practices and solutions


Design: Mint.com

See practices and solutions


Design data structures for a social network

See practices and solutions


Design a key-value store for a search engine

See practices and solutions


Design Amazon sales rankings by category

See practices and solutions


Design a million-user system on AWS

See practices and solutions


Object-oriented design interview questions with solutions

Common object-oriented design interview questions with sample discussions, code, and diagrams.

Solutions are linked to content in the solutions/ folder.

Note: this section is under development

  • Design a hash map – Solution
  • Design an LRU cache – Solution
  • Design a call center – Solution
  • Design a deck of cards – Solution
  • Design a parking lot – Solution
  • Design a chat service – Solution
  • Design a circular array – Contribute
  • Add an object-oriented design question – Contribute

System design topics: start here

New to system design?

First, you'll need a basic understanding of common principles: what they are, how they are used, and their pros and cons.

Step 1: Review the scalability video lecture

Harvard University Lecture on Scalability

  • Topics include
    • Vertical Scaling
    • Horizontal scaling
    • The cache
    • Load balancing
    • Database replication
    • Database partition

Step 2: Review the scalability article

scalability

  • Topics covered:
    • Clones
    • The database
    • The cache
    • asynchronous

Next steps

Next, we’ll look at higher-order trade-offs:

  • Performance and scalability
  • Latency and throughput
  • Availability and consistency

Keep in mind that everything is a trade-off.

Then, we’ll dive into more specific topics such as DNS, CDN, and load balancers.

Performance and scalability

A service is scalable if it results in increased performance in a manner proportional to the resources added. Generally, increasing performance means serving more units of work, but it can also mean handling larger units of work, such as when data sets grow. [1]

Another way to look at performance and scalability:

  • If your system has performance issues, it’s slow for a single user.
  • If your system has scalability issues, individual users are fast but slow under high loads.

Sources and further reading

  • A few words about scalability
  • Scalability, availability, stability and patterns

Latency and throughput

Latency is the time required to perform an action or to produce a result.

Throughput is the number of such actions or results per unit of time.

In general, you should aim to maximize throughput with acceptable levels of latency.

Sources and further reading

  • Understand latency and throughput

Availability and consistency

CAP theorem








Source: CAP theory

In a distributed computer system, you can only support two of the following guarantees:

  • Consistency – Every read receives the most recent write or an error
  • Availability – Every request receives a response, without guarantee that it contains the most recent version of the information
  • Partition tolerance – The system continues to operate despite arbitrary partitioning due to network failures

Networks aren't reliable, so you'll need to support partition tolerance. You'll need to make a software trade-off between consistency and availability.

CP – consistency and partition tolerance

Waiting for a response from the partitioned node might result in a timeout error. CP is a good choice if your business needs require atomic reads and writes.

AP – availability and partition tolerance

Responses return the most readily available version of the data on the node, which might not be the latest. Writes might take some time to propagate when the partition is resolved.

AP is a good choice if the business needs to allow for eventual consistency, or when the system needs to continue working despite external errors.

Sources and further reading

  • CAP theorem revisited
  • A plain English introduction to CAP theorem
  • CAP FAQ

Consistency patterns

With multiple copies of the same data, we are faced with options on how to synchronize them so clients have a consistent view of the data. Recall the definition of consistency from the CAP theorem – every read receives the most recent write or an error.

Weak consistency

After a write, reads may or may not see it. A best-effort approach is taken.

This can be seen in systems such as Memcached. Weak consistency works well in real-world use cases such as VoIP, video chat, and real-time multiplayer games. For example, if you lose your signal for a few seconds during a call, you can’t hear those seconds when you reconnect.

Eventual consistency

After a write, reads will eventually see it (typically within milliseconds). Data is replicated asynchronously.

This approach is seen in systems such as DNS and email. Eventual consistency works well in highly available systems.

Strong consistency

After a write, reads will see it. Data is replicated synchronously.

This approach is seen in file systems and relational databases (RDBMS). Strong consistency works well in systems that need transactions.

Sources and further reading

  • Transactions across data centers

Availability patterns

There are two complementary patterns to support high availability: failover and replication.

Failover

Active-passive failover

With active-passive failover, the active server sends periodic heartbeats to the passive server on standby. If the heartbeat is interrupted, the passive server takes over the active server's IP address and resumes service.

The length of downtime is determined by whether the passive server is already running in "hot" standby or whether it needs to start up from "cold" standby. Only the active server handles traffic.

Active-passive failover can also be referred to as master-slave failover.

Active-active failover

In active-active failover, both servers manage traffic, spreading the load between them.

If the servers are public-facing, DNS would need to know about the public IPs of both servers. If the servers are internal-facing, application logic would need to know about both servers.

Active-active failover can also be referred to as master-master failover.

Disadvantages: failover

  • Failover adds more hardware and additional complexity.
  • There is a potential for loss of data if the active system fails before any newly written data can be replicated to the passive system.

Replication

Master-slave replication and master-master replication

This topic is further discussed in the Database section:

  • Master-slave replication
  • Master-master replication

Domain name system








Source: DNS security introduction

A domain name system (DNS) translates a domain name such as www.example.com to an IP address.

DNS is hierarchical, with a few authoritative servers at the top level. Your router or ISP provides information about which DNS server(s) to contact when doing a lookup. Lower-level DNS servers cache mappings, which could become stale due to DNS propagation delays. DNS results can also be cached by your browser or OS for a certain period of time, determined by the time to live (TTL).
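For instance, you can observe the result of DNS resolution from application code with Python's standard library; this is a minimal sketch, and the domain below is just an example:

import socket

# Resolve a hostname to its IP addresses using the system's DNS resolver.
# The OS and resolver may answer from cache, subject to the record's TTL.
infos = socket.getaddrinfo("www.example.com", 443, proto=socket.IPPROTO_TCP)
addresses = sorted({info[4][0] for info in infos})
print(addresses)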

  • NS record (name server) – Specifies the DNS servers for your domain/subdomain.
  • MX record (mail exchange) – Specifies the mail servers for accepting messages.
  • A record (address) – Points a name to an IP address.
  • CNAME (canonical) – Points a name to another name or CNAME (example.com to www.example.com) or to an A record.

Services such as CloudFlare and Route 53 provide managed DNS. Some DNS services can route traffic through various methods:

  • Weighted round robin
    • Prevent traffic from going to servers under maintenance
    • Balance between varying cluster sizes
    • A/B testing
  • Latency-based routing
  • Geolocation-based routing

Disadvantages: DNS

  • Accessing a DNS server introduces a slight delay, although this is mitigated by caching.
  • DNS server management can be complex; servers are generally managed by governments, ISPs, and large companies.
  • DNS services have recently come under DDoS attack, preventing users who did not know Twitter's IP address from accessing Twitter.

Sources and further reading

  • DNS architecture
  • Wikipedia
  • An article on DNS

Content Delivery Network (CDN)








Source: Why use CDN

A content delivery network (CDN) is a globally distributed network of proxy servers, serving content from locations closer to the user. Generally, static files such as HTML/CSS/JS, photos, and videos are served from the CDN, although some CDNs such as Amazon CloudFront also support dynamic content. The site's DNS resolution will tell clients which server to contact.

Serving content from CDNs can significantly improve performance in two ways:

  • Users receive content from data centers close to them
  • Your servers do not have to serve requests that the CDN fulfills

CDN Push

Push CDNs receive new content whenever changes occur on your server. You take full responsibility for providing content, uploading directly to the CDN and rewriting URLs to point to the CDN. You can configure when content expires and when it is updated. Content is uploaded only when it is new or changed, minimizing traffic but maximizing storage.

CDN Pull

Pull CDNs grab new content from your server when the first user requests the content. You leave the content on your server and rewrite URLs to point to the CDN. This results in a slower request until the content is cached on the CDN.

A time-to-live (TTL) determines how long content is cached. Pull CDNs minimize storage space on the CDN, but can create redundant traffic if files expire and are pulled before they have actually changed.

Sites with heavy traffic work well with pull CDNs, as traffic is spread out more evenly, with only recently-requested content remaining on the CDN.

Disadvantages: CDN

  • CDN costs could be significant depending on traffic, although this should be weighed against the costs you would incur not using a CDN.
  • Content might be stale if it is updated before the TTL expires.
  • CDNs require changing URLs for static content to point to the CDN.

Sources and further reading

  • Global content distribution network
  • CDN pull and CDN push difference
  • Wikipedia

Load balancer








Source: Extensible system design patterns

Load balancers distribute incoming client requests to computing resources such as application servers and databases. In each case, the load balancer returns the response from the computing resource to the appropriate client. Load balancers are effective at:

  • Preventing requests from going to unhealthy servers
  • Preventing overloading resources
  • Helping to eliminate single points of failure

Load balancers can be implemented with hardware (expensive) or with software such as HAProxy. Additional benefits include:

  • SSL termination – Decrypt incoming requests and encrypt server responses so backend servers do not have to perform these potentially expensive operations.
    • Removes the need to install X.509 certificates on each server.
  • Session persistence – Issue cookies and route a specific client's requests to the same instance if the web apps do not keep track of sessions.

To protect against failures, it's common to set up multiple load balancers, either in active-passive or active-active mode.

Load balancers can route traffic in a number of ways (a tiny round robin sketch follows this list):

  • Random
  • Least loaded
  • Session/cookies
  • Round robin or weighted round robin
  • Layer 4 load balancing
  • Layer 7 load balancing
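As a rough illustration of the round robin and weighted round robin strategies above, here is a minimal, self-contained sketch; the server addresses and weights are placeholders, and weighted round robin is approximated here by weighted random choice.

import itertools
import random

servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]   # placeholder backend addresses
weights = [5, 3, 2]                               # assumed relative capacities

round_robin = itertools.cycle(servers)

def pick_round_robin():
    """Plain round robin: rotate through the backends in order."""
    return next(round_robin)

def pick_weighted():
    """Weighted round robin approximated by weighted random choice."""
    return random.choices(servers, weights=weights, k=1)[0]

for _ in range(4):
    print(pick_round_robin(), pick_weighted())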

Layer 4 load balancing

Layer 4 load balancers look at info at the transport layer to decide how to distribute requests. Generally, this involves the source and destination IP addresses and ports in the header, but not the contents of the packet. Layer 4 load balancers forward network packets to and from the upstream server, performing network address translation (NAT).

Layer 7 load balancing

Layer 7 load balancers look at the application layer to decide how to distribute requests. This can involve the contents of the header, message, and cookies. A layer 7 load balancer terminates network traffic, reads the message, makes a load-balancing decision, then opens a connection to the selected server. For example, a layer 7 load balancer can direct video traffic to servers that host videos while directing more sensitive user billing traffic to security-hardened servers.

At the cost of flexibility, layer 4 load balancing requires less time and computing resources than layer 7, although the performance impact can be minimal on modern commodity hardware.

Horizontal scaling

Load balancers can also help with horizontal scaling, improving performance and availability. Scaling out using commodity machines is more cost-efficient and results in higher availability than scaling up a single server on more expensive hardware, called vertical scaling. It is also easier to hire for talent working on commodity hardware than it is for specialized enterprise systems.

Disadvantages: horizontal scaling

  • Scaling horizontally introduces complexity and involves cloning servers
    • Servers should be stateless: they should not contain any user-related data like sessions or profile pictures.
    • Sessions can be stored in a centralized data store such as a database (SQL, NoSQL) or a persistent cache (Redis, Memcached) – see the sketch after this list.
  • Downstream servers such as caches and databases need to handle more simultaneous connections as upstream servers scale out.
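A minimal sketch of externalizing session state, assuming the third-party redis client package and a locally running Redis instance; the key naming and TTL are illustrative choices, not a prescribed scheme.

import json
import uuid

import redis  # third-party package: pip install redis

r = redis.Redis(host="localhost", port=6379, db=0)

def create_session(user_id, ttl_seconds=3600):
    """Store session data centrally so any app server can handle the user."""
    session_id = uuid.uuid4().hex
    r.setex(f"session:{session_id}", ttl_seconds, json.dumps({"user_id": user_id}))
    return session_id

def load_session(session_id):
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw is not None else None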

Disadvantages: load balancer

  • The load balancer can become a performance bottleneck if it does not have enough resources or is not configured properly.
  • Introducing a load balancer to help eliminate single points of failure results in increased complexity.
  • A single load balancer is itself a single point of failure; configuring multiple load balancers further increases complexity.

Sources and further reading

  • NGINX architecture
  • HAProxy Architecture Guide
  • scalability
  • Wikipedia
  • Four layers of load balancing
  • Seven-layer load balancing
  • ELB listener configuration

Reverse proxy (Web server)








Source: Wikipedia





A reverse proxy is a web server that centralizes internal services and provides unified interfaces to the public. Requests from clients are forwarded to a server that can fulfill them before the reverse proxy returns the server's response to the client.

The benefits include:

  • Increased security – Hide information about backend servers, blacklist IPs, and limit the number of connections per client.
  • Increased scalability and flexibility – Clients only see the reverse proxy's IP, allowing you to scale servers or change their configuration.
  • SSL termination – Decrypt incoming requests and encrypt server responses so backend servers do not have to perform these potentially expensive operations.
    • Removes the need to install X.509 certificates on each server.
  • Compression – Compress server responses.
  • Caching – Return the response for cached requests directly.
  • Static content – Serve static content directly:
    • HTML/CSS/JS
    • Photos
    • Videos
    • Etc.

Load balancer vs reverse proxy

  • Deploying a load balancer is useful when you have multiple servers. Typically, a load balancer routes traffic to a set of servers that perform the same functions.
  • Reverse proxies are useful even when there is only one Web server or application server, as described in the previous section.
  • Solutions such as NGINX and HAProxy can support both Layer 7 reverse proxy and load balancing.

Disadvantages: reverse proxy

  • Introducing a reverse proxy results in increased complexity.
  • A single reverse proxy is a single point of failure; configuring multiple reverse proxies (i.e. a failover) further increases complexity.

Sources and further reading

  • Reverse proxy and load balancing
  • NGINX architecture
  • HAProxy Architecture Guide
  • Wikipedia

Application layer








Source: Introduction to scalable system architecture

Separating out the web layer from the application layer (also known as the platform layer) allows you to scale and configure both layers independently. Adding a new API results in adding application servers without necessarily adding additional web servers.

The single responsibility principle advocates for small and autonomous services that work together. Small teams with small services can plan more aggressively for rapid growth.

Workers in the application layer also help enable asynchronism.

Microservices

Related to this discussion are microservices, which can be described as a suite of independently deployable, small, modular services. Each service runs a unique process and communicates through a well-defined, lightweight mechanism to serve a business goal. [1]

For example, Pinterest might have these micro-services: user profiles, followers, Feed streams, search, photo uploads, and so on.

Service discovery

Systems such as Consul, Etcd, and Zookeeper can help services find each other by keeping track of registered names, addresses, and ports. Health checks help verify service integrity and are often done using an HTTP endpoint. Both Consul and Etcd have a built-in key-value store that can be useful for storing config values and other shared data.

Disadvantages: application layer

  • Adding an application layer with loosely coupled services requires a different approach from an architectural, operations, and process viewpoint (vs a monolithic system).
  • Microservices can add complexity in terms of deployments and operations.

Sources and further reading

  • Introduction to scalable system architecture
  • Crack the system design interview
  • Service-oriented Architecture
  • Introduction to Zookeeper
  • Everything you need to know to build microservices

Database








Source: Expand your user base to your first 10 million

Relational database Management System (RDBMS)

A relational database like SQL is a collection of data items organized in tables.

Note: the author may be using "SQL" to refer to MySQL.

ACID is a set of properties of relational database transactions.

  • Atomicity – Each transaction is all or nothing.
  • Consistency – Any transaction will bring the database from one valid state to another.
  • Isolation – Executing transactions concurrently has the same results as if the transactions were executed serially.
  • Durability – Once a transaction has been committed, it will remain so.

There are many techniques to scale a relational database: master-slave replication, master-master replication, federation, sharding, denormalization, and SQL tuning.








Sources: Scalability, availability, stability, patterns

Master-slave replication

The master serves reads and writes, replicating writes to one or more slaves, which serve only reads. Slaves can also replicate to additional slaves in a tree-like fashion. If the master goes offline, the system can continue to operate in read-only mode until a slave is promoted to a master or a new master is provisioned.

Disadvantages: master-slave replication
  • Additional logic is needed to promote a slave to a master.
  • See Disadvantages: replication for points related to both master-slave and master-master.








Sources: Scalability, availability, stability, patterns

Master-master replication

Both masters serve reads and writes and coordinate with each other on writes. If either master goes down, the system can continue to operate with both reads and writes.

Disadvantages: master-master replication
  • You'll need a load balancer or you'll need to make changes to your application logic to determine where to write.
  • Most master-master systems are either loosely consistent (violating ACID) or have increased write latency due to synchronization.
  • Conflict resolution comes more into play as more write nodes are added and as latency increases.
  • See Disadvantages: replication for points related to both master-slave and master-master.
Disadvantages: replication
  • There is a potential for loss of data if the master fails before any newly written data can be replicated to other nodes.
  • Writes are replayed to the read replicas. If there are a lot of writes, the read replicas can get bogged down with replaying writes and can't do as many reads.
  • The more read slaves, the more you have to replicate, which leads to greater replication lag.
  • On some systems, writing to the master can spawn multiple threads to write in parallel, whereas read replicas only support writing sequentially with a single thread.
  • Replication adds more hardware and additional complexity.
Sources and further reading: replication
  • Scalability, availability, stability, and patterns
  • Multi-master replication

Federation








Source: Expand your user base to your first 10 million

Federation (or functional partitioning) splits up databases by function. For example, instead of a single, monolithic database, you could have three databases: forums, users, and products, resulting in less read and write traffic to each database and therefore less replication lag. Smaller databases result in more data that can fit in memory, which in turn results in more cache hits due to improved cache locality. With no single central master serializing writes, you can write in parallel, increasing throughput.

Disadvantages: federation
  • Federation is not effective if your schema requires huge functions or tables.
  • You'll need to update your application logic to determine which database to read and write to.
  • Joining data from two databases is more complex with a server link.
  • Federation adds more hardware and additional complexity.
Sources and further reading: federation
  • Expand your user base to your first 10 million

Sharding








Sources: Scalability, availability, stability, patterns

Sharding distributes data across different databases such that each database can only manage a subset of the data. Taking a users database as an example, as the number of users increases, more shards are added to the cluster.

Similar to the advantages of federation, sharding results in less read and write traffic, less replication, and more cache hits. Index size is also reduced, which generally improves performance with faster queries. If one shard goes down, the other shards are still operational, although you'll want to add some form of replication to avoid data loss. Like federation, there is no single central master serializing writes, allowing you to write in parallel with increased throughput.

Common ways to shard a table of users are either by the user's last name initial or by the user's geographic location.
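A minimal sketch of one such sharding function, here hashing the user id rather than the last name; the shard count and the printed shard naming are assumptions for illustration only.

import hashlib

NUM_SHARDS = 4  # assumed number of user database shards

def shard_for_user(user_id):
    """Map a user id to a shard with a stable hash.

    Note: with plain modulo hashing, changing NUM_SHARDS remaps most keys;
    consistent hashing (mentioned below) reduces that movement.
    """
    digest = hashlib.sha256(str(user_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Route the query to the right connection pool (placeholder naming)
shard_id = shard_for_user(12345)
print(f"user 12345 -> users_shard_{shard_id}")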

Disadvantages: sharding
  • You'll need to update your application logic to work with shards, which could result in complex SQL queries.
  • Data distribution can become lopsided in a shard. For example, a set of power users on a shard could result in increased load to that shard compared to others.
    • Rebalancing adds additional complexity. A sharding function based on consistent hashing can reduce the amount of transferred data.
  • Joining data from multiple shards is more complex.
  • Sharding adds more hardware and additional complexity.

Sources and further reading: sharding

  • The coming of the shard
  • Shard database architecture
  • Consistent hashing

Denormalization

Denormalization attempts to improve read performance at the expense of some write performance. Redundant copies of the data are written in multiple tables to avoid expensive joins. Some RDBMS such as PostgreSQL and Oracle support materialized views, which handle the work of storing redundant information and keeping the redundant copies consistent.

Once data becomes distributed with techniques such as federation and sharding, managing joins across data centers further increases complexity. Denormalization might circumvent the need for such complex joins.

In most systems, reads can heavily outnumber writes 100:1 or even 1000:1. A read resulting in a complex database join can be very expensive, spending a significant amount of time on disk operations.

Disadvantages: denormalization
  • Data is duplicated.
  • Constraints can help redundant copies of information stay in sync, which increases complexity of the database design.
  • A denormalized database under heavy write load might perform worse than its normalized counterpart.
Sources and further reading: denormalization
  • Denormalization

SQL tuning

SQL tuning is a very broad topic, and there are many books on it.

It is important to use benchmarking and performance analysis to simulate and discover system bottlenecks.

  • Benchmark – Simulate high-load situations with tools such as ab.
  • Profile – Enable tools such as the slow query log to help track performance issues.

Benchmarking and performance analysis may lead you to the following optimizations.

Tighten up the schema
  • MySQL dumps to disk in contiguous blocks for fast access.
  • Use CHAR instead of VARCHAR for fixed-length fields.
    • CHAR effectively allows for fast, random access, whereas with VARCHAR you must find the end of a string before moving on to the next one.
  • Use TEXT for large blocks of text such as blog posts. TEXT also allows for boolean searches. Using a TEXT field results in storing a pointer on disk that is used to locate the text block.
  • Use INT for larger numbers up to 2^32 or 4 billion.
  • Use DECIMAL for currency to avoid floating point representation errors.
  • Avoid storing large BLOBS; store the location of where to get the object instead.
  • VARCHAR(255) is the largest number of characters that can be counted in an 8-bit number, often maximizing the use of a byte in some RDBMS.
  • Set the NOT NULL constraint where applicable to improve search performance.
Use good indices
  • Columns that you are querying (SELECT, GROUP BY, ORDER BY, JOIN) are faster with indices.
  • Indices are usually represented as self-balancing B-trees that keep data sorted and allow searches, sequential access, insertions, and deletions in logarithmic time.
  • Placing an index can keep the data in memory, requiring more space.
  • Writes can also be slower since the index needs to be updated.
  • When loading large amounts of data, it might be faster to disable indices, load the data, then rebuild the indices.
Avoid expensive joins
  • Denormalize where performance demands it.
Partition tables
  • Break up a table by putting hot spots in a separate table to help keep it in memory.
Tune the query cache
  • In some cases, the query cache could lead to performance issues.
Sources and further reading
  • MySQL query optimization Tips
  • Why is VARCHAR(255) common?
  • How do Null values affect database performance?
  • Slow Query logs

NoSQL

NoSQL is a collection of data items represented in a key-value store, document store, wide column store, or graph database. Data is denormalized, and joins are generally done in the application code. Most NoSQL stores lack true ACID transactions and favor eventual consistency.

BASE is often used to describe the properties of NoSQL databases. In comparison with the CAP theorem, BASE chooses availability over consistency.

  • Basically available – The system guarantees availability.
  • Soft state – The state of the system may change over time, even without input.
  • Eventual consistency – The system will become consistent over a period of time, given that the system doesn't receive input during that period.

In addition to choosing between SQL or NoSQL, it is helpful to understand which type of NoSQL database best fits your use case(s). We'll review key-value stores, document stores, wide column stores, and graph databases in the next section.

Key-value store

Abstraction: hash table

A key-value store generally allows for O(1) reads and writes and is often backed by memory or SSD. Data stores can maintain keys in lexicographic order, allowing efficient retrieval of key ranges. Key-value stores can allow for storing of metadata with a value.

Key-value stores provide high performance and are often used for simple data models or for rapidly changing data, such as an in-memory cache layer. Since they offer only a limited set of operations, complexity is shifted to the application layer if additional operations are needed.

A key-value store is the basis for more complex systems such as a document store, and in some cases, a graph database.
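To make the hash-table abstraction concrete, here is a toy in-memory key-value store with optional expiration (a sketch only; real stores add persistence, eviction, and replication):

import time

class KeyValueStore:
    """Toy key-value store backed by a dict: O(1) average reads and writes."""

    def __init__(self):
        self._data = {}     # key -> (value, expires_at or None)

    def set(self, key, value, ttl_seconds=None):
        expires_at = time.time() + ttl_seconds if ttl_seconds else None
        self._data[key] = (value, expires_at)

    def get(self, key, default=None):
        value, expires_at = self._data.get(key, (default, None))
        if expires_at is not None and time.time() > expires_at:
            del self._data[key]     # entry expired: drop it
            return default
        return value

store = KeyValueStore()
store.set("user:123:name", "alice", ttl_seconds=60)
print(store.get("user:123:name"))  # 'alice'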

Sources and further reading

  • Key-value database
  • Disadvantages of key-value storage
  • Redis architecture
  • Memcached architecture

Document store

Abstraction: key-value store with documents stored as values

A document store is centered around documents (XML, JSON, binary, etc), where a document stores all information for a given object. Document stores provide APIs or a query language to query based on the internal structure of the document itself. Note that many key-value stores include features for working with a value's metadata, blurring the lines between these two storage types.

Based on the underlying implementation, documents are organized by collections, tags, metadata, or directories. Although documents can be organized or grouped together, documents may have fields that are completely different from each other.

Some document stores like MongoDB and CouchDB also provide a SQL-like language to perform complex queries. DynamoDB supports both key-values and documents.

Document stores provide high flexibility and are often used for working with occasionally changing data.

Sources and further reading: document store

  • Document-oriented databases
  • Mongo architecture
  • CouchDB architecture
  • Elasticsearch architecture

Wide column store








Source: SQL and NoSQL, a short history

Abstraction: nested map ColumnFamily<RowKey, Columns<ColKey, Value, Timestamp>>

A wide column store's basic unit of data is a column (name/value pair). A column can be grouped in column families (analogous to a SQL table). Super column families further group column families. You can access each column independently with a row key, and columns with the same row key form a row. Each value contains a timestamp for versioning and for conflict resolution.

Google introduced Bigtable as the first wide column store, which influenced the open-source HBase often used in the Hadoop ecosystem, and Cassandra from Facebook. Stores such as BigTable, HBase, and Cassandra maintain keys in lexicographic order, allowing efficient retrieval of selective key ranges.

Wide column stores offer high availability and high scalability. They are often used for very large data sets.

Sources and further reading: wide column store
  • SQL & NoSQL, a brief history
  • BigTable architecture
  • Hbase architecture
  • Cassandra architecture

Graph database








Source: Graph database

Abstraction: graph

In a graph database, each node is a record and each arc is a relationship between two nodes. Graph databases are optimized to represent complex relationships with many foreign keys or many-to-many relationships.

Graph databases offer high performance for data models with complex relationships, such as a social network. They are relatively new and not yet widely used; it might be more difficult to find development tools and resources. Many graphs can only be accessed with REST APIs.

Sources and further reading: graph
  • Graph database
  • Neo4j
  • FlockDB

Sources and further reading: NoSQL

  • Database Terminology
  • NoSQL Database – Survey and Decision Guide
  • scalability
  • Introduction to NoSQL
  • NoSQL patterns

SQL or NoSQL








Source: Transition from RDBMS to NoSQL

Reasons for choosing SQL:

  • Structured data
  • Strict model
  • Relational data
  • Complex join operations are required
  • Transactions
  • Clear patterns for scaling
  • More resources available: developers, communities, code bases, tools, etc.
  • Lookups by index are very fast

Reasons for choosing NoSQL:

  • Semi-structured data
  • Dynamic or flexible patterns
  • Non-relational data
  • No complex join operations are required
  • Store many TB (or even PB) of data
  • Very data-intensive workloads
  • Very high throughput for IOPS

Sample data suitable for NoSQL:

  • Rapid ingest of clickstream and log data
  • Leaderboard or scoring data
  • Temporary data, such as a shopping cart
  • Frequently accessed ("hot") tables
  • Metadata/lookup tables
Source and further reading: SQL or NoSQL
  • Expand your user base to the first ten million
  • Difference between SQL and NoSQL

Cache








Source: Extensible system design patterns

Caching improves page load times and can reduce the load on your servers and databases. In this model, the dispatcher first looks to see if the request has been made before and tries to find the previous result to return, in order to save the actual execution.

Databases often benefit from a uniform distribution of reads and writes across their partitions. Popular items can skew the distribution, causing bottlenecks. Putting a cache in front of a database can help absorb uneven loads and spikes in traffic.

Client cache

The cache can be on the client side (operating system or browser), on the server side, or at a different cache layer.

CDN cache

CDN is also seen as a cache.

Web server cache

Reverse proxies and caches, such as Varnish, can serve both static and dynamic content directly. The Web server can also cache requests and return results without having to connect to the application server.

Database cache

The default configuration of a database usually includes a cache level, optimized for general use cases. Tuning the configuration to use different modes in different situations can further improve performance.

Application cache

In-memory caches such as Memcached and Redis are key-value stores between your application and your data storage. Since the data is held in RAM, it is much faster than typical databases where data is stored on disk. RAM is more limited than disk, so cache invalidation algorithms such as least recently used (LRU) can help invalidate "cold" entries and keep "hot" data in RAM.
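A minimal LRU eviction sketch using the standard library's OrderedDict, showing how "cold" entries get pushed out as "hot" entries are touched; the capacity is an arbitrary assumption.

from collections import OrderedDict

class LRUCache:
    """Tiny in-memory cache that evicts the least recently used entry."""

    def __init__(self, capacity=1024):
        self.capacity = capacity
        self._entries = OrderedDict()

    def get(self, key, default=None):
        if key not in self._entries:
            return default
        self._entries.move_to_end(key)         # mark as most recently used
        return self._entries[key]

    def set(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict the coldest entry

cache = LRUCache(capacity=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")         # touch 'a' so 'b' becomes the coldest entry
cache.set("c", 3)      # evicts 'b'
print(cache.get("b"))  # None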

Redis has the following additional features:

  • Persistence option
  • Built-in data structures such as ordered collections and lists

There are multiple levels you can cache, which fall into two general categories: database queries and objects:

  • Row level
  • Query level
  • Fully formed serializable objects
  • Fully rendered HTML

Generally, you should try to avoid file-based caching, as it makes cloning and auto-scaling more difficult.

Database query-level caching

Whenever you query the database, hash the query as a key and store the result in the cache (a minimal sketch follows this list). This approach suffers from expiration issues:

  • It is hard to delete a cached result for a complex query.
  • If one piece of data changes, such as a table cell, you need to delete all cached queries that might include the changed cell.
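A self-contained sketch of the hashed-query approach described above; the in-process dict and the run_query placeholder stand in for a real cache client and database client.

import hashlib
import json
import time

_cache = {}  # stand-in for Memcached/Redis: key -> (json payload, expires_at)

def run_query(sql):
    """Placeholder for the real database call."""
    return [{"sql": sql, "rows": []}]

def cached_query(sql, ttl_seconds=60):
    """Cache a query result under a hash of the SQL text."""
    key = "query:" + hashlib.sha256(sql.encode("utf-8")).hexdigest()
    entry = _cache.get(key)
    if entry is not None and entry[1] > time.time():
        return json.loads(entry[0])
    result = run_query(sql)
    _cache[key] = (json.dumps(result), time.time() + ttl_seconds)
    return result

print(cached_query("SELECT * FROM users WHERE city = 'NYC'"))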

Object-level caching

See your data as an object, similar to what you do with your application code. Have your application assemble the dataset from the database into a class instance or a data structure:

  • Remove the object from cache if its underlying data has changed.
  • Allows for asynchronous processing: workers assemble objects by consuming the latest cached object.

Recommended cache contents:

  • The user’s session
  • Fully rendered Web pages
  • Activity streams
  • User graph data

When to update the cache

Since you can only store a limited amount of data in the cache, you need to choose a cache update strategy that is appropriate for your use case.

Cache-aside








Sources: From cache to in-memory data grid

The application is responsible for reading and writing from storage. The cache does not interact with storage directly. The application does the following:

  • Look for the entry in the cache, resulting in a cache miss
  • Load the entry from the database
  • Add the entry to the cache
  • Return the entry
def get_user(self, user_id):
    # Look for the entry in the cache first
    key = "user.{0}".format(user_id)
    user = cache.get(key)
    if user is None:
        # Cache miss: load from the database, then populate the cache
        user = db.query("SELECT * FROM users WHERE user_id = {0}", user_id)
        if user is not None:
            cache.set(key, json.dumps(user))
    return user

Memcached is generally used in this manner.

Subsequent reads of data added to the cache are fast. Cache-aside is also referred to as lazy loading. Only requested data is cached, which avoids filling up the cache with data that isn't requested.

Disadvantages: cache-aside
  • Each cache miss results in three trips, which can cause a noticeable delay.
  • Data can become stale if it is updated in the database. This issue is mitigated by setting a time-to-live (TTL), which forces an update of the cache entry, or by using write-through.
  • When a node fails, it is replaced by a new, empty node, increasing latency.

Write-through








Sources: Scalability, availability, stability, patterns

The application uses the cache as the main data store, reading and writing data to it, while the cache is responsible for reading and writing to the database:

  • The application adds/updates the entry in the cache
  • The cache synchronously writes the entry to the data store
  • Return

Application code:

set_user(12345, {"foo":"bar"})

Cache code:

def set_user(user_id, values):
    # Write through the cache: the cache updates the database synchronously
    user = db.query("UPDATE Users WHERE id = {0}", user_id, values)
    cache.set(user_id, user)

Write-through is a slow overall operation due to the write, but subsequent reads of just-written data are fast. Users are generally more tolerant of latency when updating data than when reading it. Data in the cache is not stale.

Disadvantages: write-through
  • When a new node is created due to failure or scaling, the new node will not cache entries until the entries are updated in the database. Cache-aside in conjunction with write-through can mitigate this issue.
  • Most data written might never be read; this can be minimized with a TTL.

Write-behind (write-back)








Sources: Scalability, availability, stability, patterns

In write-behind, the application does the following (a minimal sketch follows this list):

  • Add/update the entry in the cache
  • Asynchronously write the entry to the data store, improving write performance
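A minimal sketch of that flow using an in-process queue and a background worker as stand-ins for a real event queue and data store; the join() call exists only so the demo prints a result.

import queue
import threading

_cache = {}                 # stand-in for the cache
_database = {}              # stand-in for the durable data store
_write_queue = queue.Queue()

def writer_worker():
    """Background worker that asynchronously flushes writes to the store."""
    while True:
        key, value = _write_queue.get()
        _database[key] = value          # the slow, durable write
        _write_queue.task_done()

threading.Thread(target=writer_worker, daemon=True).start()

def set_value(key, value):
    """Write-behind: update the cache now, persist asynchronously."""
    _cache[key] = value
    _write_queue.put((key, value))      # data is lost if the process dies here

set_value("user:1", {"name": "alice"})
_write_queue.join()                     # wait for the flush (demo only)
print(_database["user:1"])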
Disadvantages: write-behind
  • There could be data loss if the cache goes down before its contents hit the data store.
  • It is more complex to implement write-behind than it is to implement cache-aside or write-through.

Refresh-ahead








Sources: From cache to in-memory data grid

You can configure the cache to automatically refresh any recently accessed cache entry prior to its expiration.

Refresh-ahead can result in reduced latency vs read-through if the cache can accurately predict which items are likely to be needed in the future.
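A minimal sketch of refresh-ahead: entries that are read close to their expiration are refreshed before they expire. The TTL, refresh window, and loader function are illustrative assumptions, and a real system would refresh asynchronously.

import time

_cache = {}            # key -> (value, expires_at)
TTL = 60
REFRESH_WINDOW = 10    # refresh if the entry expires within 10 seconds

def load_from_source(key):
    """Placeholder for the expensive read against the real data source."""
    return f"value-for-{key}"

def get(key):
    value, expires_at = _cache.get(key, (None, 0))
    now = time.time()
    if value is None or now >= expires_at:
        value = load_from_source(key)               # cold or expired: load now
        _cache[key] = (value, now + TTL)
    elif expires_at - now < REFRESH_WINDOW:
        # Entry is about to expire and was just requested again:
        # refresh ahead of expiration (a real system would do this async).
        _cache[key] = (load_from_source(key), now + TTL)
    return value

print(get("user:1"))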

Disadvantages: refresh-ahead
  • Not accurately predicting which items are likely to be needed in the future can result in reduced performance compared to not using refresh-ahead.

Disadvantages: cache

  • Need to maintain consistency between caches and the source of truth, such as the database, through cache invalidation.
  • Need to make application changes, such as adding Redis or Memcached.
  • Cache invalidation is a difficult problem; there is additional complexity associated with when to update the cache.

Related resources and further reading

  • From cache to in-memory data
  • Extensible system design patterns
  • Introduction to scalable system architecture
  • Scalability, availability, stability and patterns
  • scalability
  • AWS ElastiCache strategy
  • Wikipedia

Asynchronism








Source: Introduction to scalable system architecture

Asynchronous workflows help reduce request times for expensive operations that would otherwise be performed in-line. They can also help by doing time-consuming work in advance, such as periodic aggregation of data.

Message queues

Message queues receive, hold, and deliver messages. If an operation is too slow to perform inline, you can use a message queue with the following workflow:

  • An application publishes a job to the queue, then notifies the user of the job status
  • A worker picks up the job from the queue, processes it, then signals that the job is complete

The user is not blocked and the job is processed in the background. During this time, the client might optionally do a small amount of processing to make it seem like the task has completed. For example, if posting a tweet, the tweet could be instantly posted to your timeline, but it could take some time before the tweet is actually delivered to all of your followers.
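A minimal sketch of that producer/worker flow using Redis lists as a simple broker, assuming the third-party redis client and a local Redis server; the queue name, job shape, and fan-out placeholder are illustrative.

import json

import redis  # third-party package: pip install redis

r = redis.Redis(host="localhost", port=6379, db=0)
QUEUE = "jobs:tweets"

def publish_job(user_id, text):
    """Application side: enqueue the job and return to the user immediately."""
    r.rpush(QUEUE, json.dumps({"user_id": user_id, "text": text}))

def deliver_to_followers(job):
    """Placeholder for the slow fan-out work."""
    print(f"delivering tweet from user {job['user_id']}")

def worker_loop():
    """Worker side: block until a job arrives, process it, repeat."""
    while True:
        _, payload = r.blpop(QUEUE)      # blocking pop from the queue
        deliver_to_followers(json.loads(payload))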

Redis is useful as a simple message broker, but messages can be lost.

RabbitMQ is popular but requires you to adapt to the AMQP protocol and manage your own nodes.

Amazon SQS is hosted but can have high latency and has the possibility of messages being delivered twice.

Task queues

Task queues receive tasks and their related data, run them, then deliver their results. They can support scheduling and can be used to run computationally intensive jobs in the background.

Celery supports scheduling and is primarily used with Python.

Back pressure

If queues start to grow significantly, the queue size can become larger than memory, resulting in cache misses, disk reads, and even slower performance. Back pressure can help by limiting the queue size, thereby maintaining a high throughput rate and good response times for jobs already in the queue. Once the queue fills up, clients get a server busy or HTTP 503 status code to try again later. Clients can retry the request at a later time, perhaps with exponential backoff.
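A minimal sketch of both halves of this idea: the server rejects work once a bounded queue is full, and the client retries with exponential backoff. The queue size, status codes, and backoff constants are assumptions for illustration.

import queue
import random
import time

job_queue = queue.Queue(maxsize=100)   # bounded queue enforces back pressure

def submit_job(job):
    """Server side: reject immediately instead of queuing without limit."""
    try:
        job_queue.put_nowait(job)
        return 202                     # accepted for background processing
    except queue.Full:
        return 503                     # server busy: ask the client to retry

def submit_with_backoff(job, max_attempts=5):
    """Client side: retry 503s with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        if submit_job(job) != 503:
            return True
        time.sleep((2 ** attempt) * 0.1 + random.random() * 0.1)
    return False

print(submit_with_backoff({"task": "resize-image"}))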

Disadvantages: asynchronism

  • Use cases such as simple calculations and real-time workflows may be better suited for synchronous operations because introducing queues can add latency and complexity.

Related resources and further reading

  • It’s a numbers game
  • Back pressure is applied when overloading
  • Little’s law
  • What is the difference between a message queue and a task queue?

Communication








Source: OSI 7 layer model

Hypertext Transfer Protocol (HTTP)

HTTP is a method for encoding and transporting data between a client and a server. It is a request/response protocol: clients issue requests and servers issue responses with relevant content and completion status info about the request. HTTP is self-contained, allowing requests and responses to flow through many intermediate routers and servers that perform load balancing, caching, encryption, and compression.

A basic HTTP request consists of a verb (method) and a resource (endpoint). Here are some common HTTP verbs:

Verb     Description                                            Idempotent*   Safe   Cacheable
GET      Reads a resource                                       Yes           Yes    Yes
POST     Creates a resource or triggers a process that
         handles data                                           No            No     Yes, if the response contains freshness info
PUT      Creates or replaces a resource                         Yes           No     No
PATCH    Partially updates a resource                           No            No     Yes, if the response contains freshness info
DELETE   Deletes a resource                                     Yes           No     No

* Idempotent: can be called many times without different outcomes.

HTTP is an application-layer protocol that relies on lower-level protocols such as TCP and UDP.
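For example, a couple of the verbs above issued from Python's standard library; this is a sketch only, and the host and paths are placeholders.

import http.client
import json

conn = http.client.HTTPSConnection("example.com")

# GET: read a resource (safe, idempotent, cacheable)
conn.request("GET", "/users/123")
resp = conn.getresponse()
print(resp.status)
resp.read()  # drain the body so the connection can be reused

# POST: create a resource or trigger a process that handles data
body = json.dumps({"name": "alice"})
conn.request("POST", "/users", body=body,
             headers={"Content-Type": "application/json"})
print(conn.getresponse().status)
conn.close()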

Source and further reading: HTTP

  • What is HTTP?
  • Differences between HTTP and TCP
  • Difference between PUT and PATCH

Transmission Control Protocol (TCP)








Source: How to make a multiplayer game

TCP is a connection-oriented protocol over an IP network. Connections are established and terminated using a handshake. All packets sent are guaranteed to reach the destination in the original order and without corruption through:

  • Sequence numbers and checksum fields for each packet
  • Acknowledgement packets and automatic retransmission

If the sender does not receive a correct response, it will resend the packets. If there are multiple timeouts, the connection is dropped. TCP also implements flow control and congestion control. These guarantees cause delays and generally result in less efficient transmission than UDP.

To ensure high throughput, web servers can keep a large number of TCP connections open, resulting in high memory usage. It can be expensive to have a large number of open connections between web server threads and, say, a memcached server. Connection pooling can help, in addition to switching to UDP where applicable.

TCP is useful for applications that require high reliability but are less time critical. Some examples include web servers, database info, SMTP, FTP, and SSH.

Use TCP over UDP when:

  • You need all of the data to arrive intact
  • You want to automatically make a best estimate use of the network throughput

User Datagram Protocol (UDP)








Source: How to make a multiplayer game

UDP is connectionless. Datagrams (analogous to packets) are guaranteed only at the datagram level. Datagrams might reach their destination out of order or not at all. UDP does not support congestion control. Without the guarantees that TCP provides, UDP is generally more efficient.

UDP can broadcast, sending datagrams to all devices on the subnet. This is useful with DHCP because devices within the subnet have not yet been allocated an IP address, which TCP requires.

UDP is less reliable but works well in real-time use cases such as VoIP, video chat, streaming, and realtime multiplayer games.

Use UDP over TCP when:

  • You need the lowest latency
  • Late data is worse than lost data
  • You want to implement your own error correction

Sources and further reading: TCP and UDP

  • Networking for game programming
  • Key differences between TCP and UDP
  • The difference between TCP and UDP
  • Transmission control protocol
  • User datagram protocol
  • Memcache expansion on Facebook

Remote procedure call (RPC)








Source: Crack the system design interview

In an RPC, a client causes a procedure to execute on a different address space, usually a remote server. The procedure is coded as if it were a local procedure call, abstracting away the details of how to communicate with the server from the client program. Remote calls are usually slower and less reliable than local calls, so it is helpful to distinguish RPC calls from local calls. Popular RPC frameworks include Protobuf, Thrift, and Avro.

RPC is a request-response protocol:

  • Client program – Calls the client stub procedure. The parameters are pushed onto the stack like a local procedure call.
  • Client stub procedure – Marshals (packs) the procedure id and arguments into a request message.
  • Client communication module – Sends the message from the client to the server.
  • Server communication module – Passes the incoming packets to the server stub procedure.
  • Server stub procedure – Unmarshals the message, then calls the server procedure matching the procedure id, passing the given arguments.

Examples of RPC calls:

GET /someoperation?data=anId

POST /anotheroperation
{
  "data": "anId",
  "anotherdata": "another value"
}
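As a rough sketch of the request-response flow above, Python's standard library xmlrpc module can play the role of both the stubs and the communication modules (the address and the add_item_to_list procedure name are made-up examples, not part of the original):

import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

# Server side: expose a procedure by id; the library acts as the server stub,
# unmarshalling requests and marshalling responses.
server = SimpleXMLRPCServer(("127.0.0.1", 8000), logRequests=False)
server.register_function(
    lambda person_id, item_id: {"personid": person_id, "itemid": item_id},
    "add_item_to_list",
)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: the proxy is the client stub; the call looks local, but the arguments
# are marshalled into a request message and sent over HTTP to the server.
client = ServerProxy("http://127.0.0.1:8000")
print(client.add_item_to_list("1234", "456"))   # {'personid': '1234', 'itemid': '456'}

server.shutdown()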

RPC focuses on exposing behaviors. RPCs are often used for performance reasons in internal communications, as you can hand-craft native calls to better fit your use cases.

Choose native libraries (i.e. an SDK) when:

  • You know your target platform.
  • You want to control how your “logic” is accessed.
  • You want to control how errors from your library are handled.
  • Performance and the end-user experience are your primary concerns.

HTTP APIs that follow REST tend to be better suited for public APIs.

Disadvantages: RPC

  • RPC clients become tightly coupled to the service implementation.
  • A new API must be defined for every new operation or use case.
  • RPC can be difficult to debug.
  • You might not be able to leverage existing technologies out of the box. For example, it might require additional effort to ensure RPC calls are properly cached on caching servers such as Squid.

Representational State Transfer (REST)

REST is an architectural style enforcing a client/server model in which the client acts on a set of resources managed by the server. The server provides a representation of resources and actions that can either manipulate or retrieve a new representation of those resources. All communication must be stateless and cacheable.

There are four qualities of a RESTful interface:

  • Identify resources (URI in HTTP) – use the same URI regardless of the operation.
  • Change with representations (verbs in HTTP) – use verbs, headers, and body.
  • Self-descriptive error messages (status codes in HTTP) – use status codes, don’t reinvent the wheel.
  • HATEOAS (HTML interface for HTTP) – your web service should be fully accessible in a browser.

Examples of REST requests:

GET /someresources/anId

PUT /someresources/anId
{"anotherdata": "another value"}Copy the code

REST focuses on exposing data. It minimizes the coupling between client and server and is often used for public HTTP APIs. REST uses a more generic and uniform method of exposing resources through URIs, representation through headers, and actions through verbs such as GET, POST, PUT, DELETE, and PATCH. Being stateless, REST is well suited to horizontal scaling and partitioning.
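A minimal sketch of this resource-oriented style, assuming the Flask library is installed (the /someresources URI mirrors the example requests above; the in-memory store is just for illustration):

from flask import Flask, jsonify, request

app = Flask(__name__)
resources = {"anId": {"anotherdata": "another value"}}   # toy in-memory store

# One URI identifies the resource; GET and PUT express the operation (read vs. create/replace).
@app.route("/someresources/<resource_id>", methods=["GET", "PUT"])
def someresource(resource_id):
    if request.method == "PUT":
        resources[resource_id] = request.get_json()       # full replacement: idempotent
        return jsonify(resources[resource_id]), 200
    if resource_id not in resources:
        return jsonify({"error": "not found"}), 404       # self-descriptive status code
    return jsonify(resources[resource_id]), 200

if __name__ == "__main__":
    app.run(port=5000)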

Disadvantages: REST

  • Because REST focuses on exposing data, it may not be a good fit when resources are not naturally organized or accessed in a simple hierarchy. For example, returning all updated records from the past hour that match a particular set of events is not easily expressed as a path. With REST, it is likely to be implemented with a combination of URI path, query parameters, and possibly the request body.
  • REST generally relies on a few actions (GET, POST, PUT, DELETE, and PATCH), but sometimes these alone don’t work for you. For example, moving an outdated document into an archive folder may not be easily expressed by these verbs.
  • To render a single page, retrieving complex resources nested within a hierarchy requires multiple round-trip communications between clients and servers. For example, get blog content and its associated comments. For mobile applications using an uncertain network environment, these multiple round trips can be cumbersome.
  • As time goes on, more fields may be added to the API response, and older clients will receive all new data fields, even those they don’t need, resulting in increased load and greater latency.

Compare RPC with REST

Operation                         RPC                                     REST
Sign up                           POST /signup                            POST /persons
Resign                            POST /resign                            DELETE /persons/1234
                                  {"personid": "1234"}
Read a person                     GET /readPerson?personid=1234           GET /persons/1234
Read a person's items list        GET /readUsersItemsList?personid=1234   GET /persons/1234/items
Add an item to a person's items   POST /addItemToUsersItemsList           POST /persons/1234/items
                                  {"personid": "1234", "itemid": "456"}   {"itemid": "456"}
Update an item                    POST /modifyItem                        PUT /items/456
                                  {"itemid": "456", "key": "value"}       {"key": "value"}
Delete an item                    POST /removeItem                        DELETE /items/456
                                  {"itemid": "456"}

Source: Do you really know why you prefer REST to RPC

Sources and further reading: REST and RPC

  • Do you really know why you prefer REST to RPC
  • When is RPC more appropriate than REST?
  • REST vs JSON-RPC
  • Demystify RPC and REST
  • What are the disadvantages of using REST
  • Crack the system design interview
  • Thrift
  • Why use REST internally instead of RPC

Security

This section could use more content. Consider contributing!

Security is a broad topic. Unless you have considerable experience, a background in security, or are applying for a position that requires security knowledge, you don’t need to know more than the basics of security:

  • Encrypt in transit and at rest.
  • Sanitize all user inputs and any parameters exposed to users to prevent XSS and SQL injection.
  • Use parameterized queries to prevent SQL injection (a minimal sketch follows this list).
  • Use the principle of least privilege.
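A minimal sketch of a parameterized query using Python's built-in sqlite3 module (the users table and the sample input are made up for illustration); the driver binds the user-supplied value separately, so it is treated as data rather than SQL:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", ("alice",))

user_input = "alice' OR '1'='1"   # a typical injection attempt

# Unsafe pattern (do not do this): splicing untrusted input into the SQL string.
# query = f"SELECT * FROM users WHERE name = '{user_input}'"

# Safe: the ? placeholder sends the value separately from the statement.
rows = conn.execute("SELECT * FROM users WHERE name = ?", (user_input,)).fetchall()
print(rows)   # [] -- the injection attempt matches nothing
conn.close()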

Sources and further reading

  • Security guide for developers
  • OWASP top ten

Appendix

You’ll sometimes be asked to do back-of-the-envelope estimates. For example, you might need to determine how long it will take to generate 100 image thumbnails from disk or how much memory a data structure requires. The powers of two table and the latency numbers that every programmer should know are handy references.

Powers of two table

Power           Exact Value         Approx Value        Bytes
---------------------------------------------------------------
7                             128
8                             256
10                           1024   1 thousand           1 KB
16                         65,536                       64 KB
20                      1,048,576   1 million            1 MB
30                  1,073,741,824   1 billion            1 GB
32                  4,294,967,296                        4 GB
40              1,099,511,627,776   1 trillion           1 TB

Sources and further reading

  • The power of 2

Latency numbers every programmer should know

Latency Comparison Numbers
--------------------------
L1 cache reference                           0.5 ns
Branch mispredict                            5   ns
L2 cache reference                           7   ns                      14x L1 cache
Mutex lock/unlock                          100   ns
Main memory reference                      100   ns                      20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy            10,000   ns       10 us
Send 1 KB bytes over 1 Gbps network     10,000   ns       10 us
Read 4 KB randomly from SSD*           150,000   ns      150 us          ~1GB/sec SSD
Read 1 MB sequentially from memory     250,000   ns      250 us
Round trip within same datacenter      500,000   ns      500 us
Read 1 MB sequentially from SSD*     1,000,000   ns    1,000 us    1 ms  ~1GB/sec SSD, 4X memory
Disk seek                           10,000,000   ns   10,000 us   10 ms  20x datacenter roundtrip
Read 1 MB sequentially from 1 Gbps  10,000,000   ns   10,000 us   10 ms  40x memory, 10X SSD
Read 1 MB sequentially from disk    30,000,000   ns   30,000 us   30 ms 120x memory, 30X SSD
Send packet CA->Netherlands->CA    150,000,000   ns  150,000 us  150 ms

Notes
-----
1 ns = 10^-9 seconds
1 us = 10^-6 seconds = 1,000 ns
1 ms = 10^-3 seconds = 1,000 us = 1,000,000 ns

Handy metrics based on the numbers above (a worked estimate follows this list):

  • Read sequentially from disk at 30 MB/s
  • Read sequentially from 1 Gbps Ethernet at 100 MB/s
  • Read sequentially from SSD at 1 GB/s
  • Read sequentially from main memory at 4 GB/s
  • 6-7 worldwide round trips per second
  • 2,000 round trips per second within a data center
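As a worked back-of-the-envelope example using the numbers above (the 2 MB average image size is an assumption for illustration), here is a rough estimate for the thumbnail-generation question mentioned earlier:

# Rough estimate: read 100 source images (~2 MB each, an assumed size) from spinning disk,
# using ~30 MB/s sequential disk throughput and ~10 ms per seek from the table above.
num_images = 100
image_size_mb = 2.0           # assumed average source image size
seek_time_s = 0.010           # 10 ms disk seek
sequential_mb_per_s = 30.0    # sequential disk read throughput

read_time_s = (num_images * image_size_mb) / sequential_mb_per_s   # ~6.7 s of transfer
seek_total_s = num_images * seek_time_s                            # ~1.0 s of seeks
print(f"~{read_time_s + seek_total_s:.1f} s to read the images")   # ~7.7 s, before resizing

# The same data served from SSD (~1 GB/s) would take well under a second,
# and from memory (~4 GB/s) only tens of milliseconds.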

Latency numbers visualized

Sources and further reading

  • Latency numbers every programmer should know – 1
  • Latency numbers every programmer should know – 2
  • Design proposals, courses, and recommendations for building large distributed systems
  • Software engineering consulting on building large scalable distributed systems

Other system design interview questions

Common system design interview questions, with links to resources on how to solve them.

Question                                                      Reference(s)
Design a file synchronization service similar to Dropbox youtube.com
Design a search engine similar to Google queue.acm.org

stackexchange.com

ardendertat.com

stanford.edu
Design a scalable web crawler like Google quora.com
Designing Google Docs code.google.com

neil.fraser.name
Design a key-value store like Redis slideshare.net
Design a memcached-like caching system slideshare.net
Design a recommendation system similar to Amazon’s hulu.com

ijcai13.org
Design a Bitly like short link system n00tc0d3r.blogspot.com
Design a WhatsApp chat app highscalability.com
Design an Instagram-like photo-sharing system highscalability.com

highscalability.com
Design the Facebook news feed function quora.com

quora.com

slideshare.net
Designing Facebook’s timeline system facebook.com

highscalability.com
Design Facebook chat erlang-factory.com

facebook.com
Design a Facebook-like graph search system facebook.com

facebook.com

facebook.com
Design content delivery networks like CloudFlare cmu.edu
Design a Twitter-like trending topic system michael-noll.com

snikolov .wordpress.com
Design a random ID generation system blog.twitter.com

github.com
Return the top k requests during a time interval ucsb.edu

wpi.edu
Design a system that serves data from multiple data centers highscalability.com
Design a multiplayer online card game indieflashblog.com

buildnewgames.com
Design a garbage collection system stuffwithstuff.com

washington.edu
Add a system design question Contribute

Real world architectures

Articles on how real-world systems are designed.








Source: Twitter timelines at scale

Don’t focus on the nitty-gritty details of the following articles. Instead:

  • Identify shared principles, common technologies, and patterns within these articles.
  • Study what problems are solved by each component, where it works, and where it doesn’t.
  • Review the lessons learned.
Type            System                                                    Reference
Data processing MapReduce– Google distributed data processing research.google.com
Data processing Spark– Databricks distributed data processing slideshare.net
Data processing Storm– Distributed data processing for Twitter slideshare.net
Data store Bigtable– Google’s column database harvard.edu
Data store HBase– Open source implementation of Bigtable slideshare.net
Data store Cassandra– Facebook’s column database slideshare.net
Data store DynamoDB– Amazon’s document database harvard.edu
Data store MongoDB– Document database slideshare.net
Data store Spanner– Google’s global distribution database research.google.com
Data store Memcached– Distributed memory caching system slideshare.net
Data store Redis– A distributed memory caching system that can persist and have value types slideshare.net
File system Google File System (GFS)– Distributed file system research.google.com
File system Hadoop File System (HDFS)– Open source implementation of GFS apache.org
Misc Chubby– Google’s low coupling lock service for distributed systems research.google.com
Misc Dapper– Distributed system tracking infrastructure research.google.com
Misc Kafka– LinkedIn’s publish-subscribe messaging system slideshare.net
Misc Zookeeper– Centralized infrastructure and coordination services slideshare.net
Add an architecture Contribute

Company architectures

Company Reference(s)
Amazon Amazon’s architecture
Cinchcast Produces 1,500 hours of audio per day
DataSift Mine 120,000 tweets per second in real time
DropBox How we’ve scaled Dropbox
ESPN 100,000 operations per second
Google Google’s architecture
Instagram 14 million users, terabytes of photos

What drives Instagram
Justin.tv Justin.Tv’s live broadcast architecture
Facebook Extensible Memcached for Facebook

TAO: Distributed data store for Facebook social graph

Facebook’s photo store
Flickr Flickr architecture
Mailbox Went from 0 to 1 million users in 6 weeks
Pinterest From zero to billions of page views per month

18 million visitors, a tenfold increase, 12 employees
Playfish It has 50 million monthly users and growing
PlentyOfFish The structure of the PlentyOfFish
Salesforce How do they process 1.3 billion transactions a day
Stack Overflow Stack Overflow architecture
TripAdvisor 40M visitors, 200M page views, 30TB data
Tumblr 15 billion page views a month
Twitter Making Twitter 10000 percent faster

MySQL is used to store 250 million tweets per day

150M active users, 300K QPS, a 22 MB/S firehose

Timelines at scale

Big and small data at Twitter

Operations at Twitter: scaling beyond 100 million users
Uber How Uber scales their real-time market platform
WhatsApp The WhatsApp architecture Facebook bought for $19 billion
YouTube YouTube scalability

YouTube architecture

Company Engineering Blog

Architectures for companies you are interviewing with.

Questions you encounter might be from the same domain.

  • Airbnb Engineering
  • Atlassian Developers
  • Autodesk Engineering
  • AWS Blog
  • Bitly Engineering Blog
  • Box Blogs
  • Cloudera Developer Blog
  • Dropbox Tech Blog
  • Engineering at Quora
  • Ebay Tech Blog
  • Evernote Tech Blog
  • Etsy Code as Craft
  • Facebook Engineering
  • Flickr Code
  • Foursquare Engineering Blog
  • GitHub Engineering Blog
  • Google Research Blog
  • Groupon Engineering Blog
  • Heroku Engineering Blog
  • Hubspot Engineering Blog
  • High Scalability
  • Instagram Engineering
  • Intel Software Blog
  • Jane Street Tech Blog
  • LinkedIn Engineering
  • Microsoft Engineering
  • Microsoft Python Engineering
  • Netflix Tech Blog
  • Paypal Developer Blog
  • Pinterest Engineering Blog
  • Quora Engineering
  • Reddit Blog
  • Salesforce Engineering Blog
  • Slack Engineering Blog
  • Spotify Labs
  • Twilio Engineering Blog
  • Twitter Engineering
  • Uber Engineering Blog
  • Yahoo Engineering Blog
  • Yelp Engineering Blog
  • Zynga Engineering Blog

Sources and further reading

  • kilimchoi/engineering-blogs

In progress

Interested in adding some parts or helping to improve some parts? Join in!

  • Use MapReduce for distributed computing
  • Consistent hashing
  • Direct memory access (DMA) controller
  • Contribute

Thank you

Credits and sources are provided throughout this repository.

Special thanks:

  • Hired in tech
  • Cracking the coding interview
  • High scalability
  • checkcheckzz/system-design-interview
  • shashank88/system_design
  • mmcgrana/services-engineering
  • System design cheat sheet
  • A distributed systems reading list
  • Cracking the system design interview

Contact

Feel free to contact me to discuss any problems, questions or comments.

You can find my contact information on my GitHub page

License

Creative Commons Attribution 4.0 International License (CC BY 4.0)

http://creativecommons.org/licenses/by/4.0/

The Nuggets Translation Project is a community that translates high-quality technical articles from English for sharing on Juejin (Nuggets), covering Android, iOS, React, front end, back end, product, design, and more. Follow the Nuggets Translation Project for more quality translations.