How do you build an application system that is highly available, high performance, easy to extend, scalable, and secure? This is a puzzle that troubles countless developers. Taking a website as an example, this article discusses how to approach the architecture design of large application systems.

Evolution of architecture

The technical challenges of large websites are mainly due to the large number of users, high concurrent access, and large amounts of data.

The initial stage

Large websites grow out of small ones. A small website has few visitors at the beginning, and a single server is more than enough. The website architecture at this stage is shown in the figure.

Application and data separation

As the business grows, one server can no longer keep up: more and more users access it, degrading performance, and more and more data exhausts its storage space. At this point the application needs to be separated from the data.

After the application and data are separated, the website uses three servers: an application server, a file server, and a database server, as shown in the figure.

The three servers have different hardware requirements. The application server processes a large amount of business logic, so it needs faster and more powerful CPUs. The database server needs fast disk retrieval and data caching, so it needs faster disks and more memory. The file server stores large numbers of files uploaded by users, so it needs larger disks.

Use the cache

As the number of users increased, the site faced another challenge: the database came under too much pressure, causing access delays that in turn degraded the performance of the entire site and hurt the user experience.

Website visits follow the 80/20 rule: 80% of business access is focused on 20% of the data. Since most access is concentrated on a small portion of the data, if that small portion is cached in memory, can't the database's access pressure be reduced, the site's overall data access speed improved, and even the database's write performance improved?

Websites use two kinds of cache: a local cache on the application server and a remote cache on dedicated distributed cache servers. The local cache is faster to access, but it is constrained by the application server's memory and competes with the application for it. A remote distributed cache can be deployed as a cluster of dedicated cache servers with large memory, so in theory the cache service is not limited by the memory of any single machine, as shown in the figure.
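A minimal sketch of this cache-aside read path, in Python: check a local in-process cache, then a remote cache, and only then the database. `RemoteCache` and `query_database` are simplified stand-ins for illustration, not any particular product's API.

```python
# Minimal cache-aside sketch: local dict -> remote cache -> database.

local_cache = {}          # per-application-server, fast but limited by its memory

class RemoteCache:
    """Stand-in for a dedicated distributed cache server (e.g. a Redis-like store)."""
    def __init__(self):
        self._store = {}
    def get(self, key):
        return self._store.get(key)
    def set(self, key, value):
        self._store[key] = value

remote_cache = RemoteCache()

def query_database(key):
    # Placeholder for a real database query.
    return f"value-for-{key}"

def read(key):
    # 1. Local cache: fastest, but competes with the application for memory.
    if key in local_cache:
        return local_cache[key]
    # 2. Remote distributed cache: shared by all application servers.
    value = remote_cache.get(key)
    if value is None:
        # 3. Cache miss: hit the database, then populate the cache.
        value = query_database(key)
        remote_cache.set(key, value)
    local_cache[key] = value
    return value

print(read("hot-product-42"))
```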

Use an application server cluster

With caching in place, data access pressure is effectively relieved, but a single application server can handle only a limited number of requests and connections, so during peak access times the application server becomes the bottleneck of the entire website.

Clustering is a common way to address high concurrency and massive data. When a server's processing capacity or storage space is insufficient, do not simply swap in a more powerful machine: for a large website, no single server, however powerful, can keep up with continuously growing business demand. It is more appropriate to add servers that share the access and storage load of the original one.

As long as adding one more server improves the system's load capacity, you can keep adding servers in the same way to improve performance, thus achieving scalability. An application server cluster is a relatively simple and mature scalable cluster architecture, as shown in the figure.

A load-balancing scheduler distributes requests from users' browsers across the servers in the application server cluster. As the number of users grows, more application servers can be added, so the application tier no longer becomes the bottleneck of the whole website.
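The scheduling idea can be sketched as a simple round-robin dispatcher. Real load balancers (hardware devices, Nginx, LVS, and so on) are far more sophisticated; the server addresses below are invented for illustration.

```python
import itertools

# Hypothetical application server pool; addresses are made up for illustration.
app_servers = ["10.0.0.11:8080", "10.0.0.12:8080", "10.0.0.13:8080"]
_rr = itertools.cycle(app_servers)

def dispatch(request_path):
    """Round-robin: each incoming request goes to the next server in the pool."""
    server = next(_rr)
    return f"forward {request_path} -> {server}"

for path in ["/home", "/cart", "/search", "/home"]:
    print(dispatch(path))
```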

Read/write separation

With caching in place, the vast majority of read operations can be served without touching the database, but some reads and all writes still must. Once the user base reaches a certain scale, the database again becomes the bottleneck because of its high load.

Most mainstream databases provide master/slave hot backup: by configuring a master/slave relationship between two databases, data updates on one server are synchronized to the other. Websites use this capability to separate database reads from writes and thereby relieve database load, as shown in the figure.

When writing data, the application server accesses the master database, which synchronizes the updates to the slave database through the master/slave replication mechanism; the application server can then read data from the slave. To make read/write separation transparent to applications, a dedicated data-access module is usually used on the application server.
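A minimal sketch of such a data-access module, assuming a crude SQL-prefix check to classify reads and writes and round-robin selection among replicas; the connection strings are illustrative, not a real driver API.

```python
# Sketch of a data-access module that hides read/write separation from the
# application.

class ReadWriteRouter:
    def __init__(self, master_dsn, slave_dsns):
        self.master_dsn = master_dsn       # all writes go here
        self.slave_dsns = slave_dsns       # reads are spread over the replicas
        self._next = 0

    def _pick_slave(self):
        dsn = self.slave_dsns[self._next % len(self.slave_dsns)]
        self._next += 1
        return dsn

    def execute(self, sql, params=()):
        # Very rough classification: SELECTs are reads, everything else writes.
        is_read = sql.lstrip().upper().startswith("SELECT")
        target = self._pick_slave() if is_read else self.master_dsn
        return f"run on {target}: {sql} {params}"

router = ReadWriteRouter("db-master:3306", ["db-slave1:3306", "db-slave2:3306"])
print(router.execute("SELECT * FROM orders WHERE id = %s", (42,)))
print(router.execute("INSERT INTO orders (user_id) VALUES (%s)", (7,)))
```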

Reverse proxy and CDN

As the business and user base continue to grow, access speed varies greatly for users in different regions. To provide a better user experience, the website needs to speed up access. The main means are CDN and reverse proxy, as shown in the figure.

The basic principle of both CDN and reverse proxy is caching. The difference is that a CDN is deployed in the network provider's equipment rooms, so users fetch data from the provider's equipment room nearest to them. A reverse proxy is deployed in the website's central equipment room: when a user's request reaches the central equipment room, the reverse proxy is the first server hit, and if it has cached the requested resource, it returns the resource to the user directly.
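A toy illustration of the reverse-proxy behaviour described above: serve a cached copy when one exists, otherwise fetch from the origin application server and remember the result. `fetch_from_origin` is a placeholder, not a real HTTP client.

```python
# Toy reverse proxy cache check.

proxy_cache = {}

def fetch_from_origin(path):
    # Placeholder for forwarding the request to the application servers.
    return f"<html>rendered page for {path}</html>"

def handle_request(path):
    if path in proxy_cache:
        return proxy_cache[path], "HIT"          # returned directly by the proxy
    body = fetch_from_origin(path)               # forwarded to the origin
    proxy_cache[path] = body
    return body, "MISS"

print(handle_request("/index.html"))   # MISS: goes to the origin
print(handle_request("/index.html"))   # HIT: served by the reverse proxy
```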

Use distributed file systems and distributed database systems

After read/write separation, the database is split from one server into two. But as the business keeps growing, even that cannot meet demand, and a distributed database becomes necessary. The same is true for the file system, which requires a distributed file system, as shown in the figure.

A distributed database is the last resort for splitting a website's database and should only be used when single-table data volumes are very large. Until then, the more common approach is to split by business: deploy the databases of different business domains on different physical servers.
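A sketch of this business-level split, assuming hypothetical host names: each business domain's data lives on its own database server, and the application picks the connection by domain.

```python
# Splitting databases by business domain: each business line gets its own
# database server. Host names are made up for illustration.

BUSINESS_DATABASES = {
    "users":    "db-users.internal:3306",
    "orders":   "db-orders.internal:3306",
    "products": "db-products.internal:3306",
}

def connection_for(business):
    """Return the database host that owns this business domain's tables."""
    return BUSINESS_DATABASES[business]

print(connection_for("orders"))   # queries about orders never touch the users DB
```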

Use NoSQL and search engines

As the website's business becomes more complex, so do its requirements for data storage and retrieval, and the site needs to adopt non-relational technologies such as NoSQL databases and non-database query technologies such as search engines, as shown in the figure.

Business splitting

Large sites respond to increasingly complex business scenarios with a divide-and-conquer approach: the entire web business is divided into different product lines. Technically, the website is broken down into many different applications, each deployed and maintained independently. Applications can relate to one another through hyperlinks (each navigation link on the home page points to a different application address), exchange data through message queues, or, of course, form a complete associated system by accessing the same data storage.

Distributed services

As business splitting produces smaller and smaller applications and storage systems grow larger, the overall complexity of the system increases exponentially, making deployment and maintenance more difficult.

Since every application needs to perform many of the same operations, such as user management and product management, these common services can be extracted and deployed independently. The reusable services connect to the database and provide shared functionality, while each application only needs to manage its own user interface and complete specific business operations by invoking the shared services through a distributed service framework, as shown in the figure.
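A sketch of the shared-service idea. In a real system the client stub would make a remote call through a distributed service framework; here the call is simulated in-process, and all class and method names are invented for illustration.

```python
# The application does not manage users itself: it calls a reusable user
# service through a thin client.

class UserServiceClient:
    """Stand-in for a remote stub generated by a service framework."""
    def get_user(self, user_id):
        # Imagine this crossing the network to the independently deployed
        # user service, which owns the user database.
        return {"id": user_id, "name": f"user-{user_id}"}

class CheckoutApplication:
    def __init__(self, user_service):
        self.user_service = user_service   # injected shared service

    def place_order(self, user_id, item):
        user = self.user_service.get_user(user_id)   # reuse, don't reimplement
        return f"order for {user['name']}: {item}"

app = CheckoutApplication(UserServiceClient())
print(app.place_order(42, "book"))
```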

Once the architecture has evolved to this point, most of the technical problems a large site faces have been addressed.

Architectural patterns

To solve the problems and challenges that application systems face, such as high concurrent access, massive data processing, and highly reliable operation, large Internet companies have devised many solutions in practice to achieve goals such as high performance, high availability, scalability, extensibility, and security. As these solutions are reused by more and more companies, they gradually become architectural patterns.

Layering

Layering is one of the most common architectural patterns in enterprise application systems: the system is divided into several parts along the horizontal dimension, each responsible for a relatively narrow set of duties, and the upper layers depend on and invoke the lower layers to form a complete system.

In the website architecture, the application system is usually divided into application layer, service layer and data layer, as shown in the figure below.

By layering, a huge software system can be better divided into different parts, facilitating development and maintenance. Each layer has a certain degree of independence. As long as the call interface remains unchanged, each layer can evolve independently according to specific problems without requiring other layers to make corresponding adjustments.
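A minimal sketch of the three layers, with invented class and method names: each layer depends only on the interface of the layer below it.

```python
# The application layer depends only on the service layer's interface, the
# service layer only on the data layer's.

class DataLayer:
    def load_account(self, account_id):
        return {"id": account_id, "balance": 100}     # pretend database access

class ServiceLayer:
    def __init__(self, data):
        self.data = data
    def get_balance(self, account_id):
        return self.data.load_account(account_id)["balance"]   # business rule lives here

class ApplicationLayer:
    def __init__(self, service):
        self.service = service
    def balance_page(self, account_id):
        return f"Your balance is {self.service.get_balance(account_id)}"   # presentation only

app = ApplicationLayer(ServiceLayer(DataLayer()))
print(app.balance_page(7))
# As long as ServiceLayer.get_balance keeps its signature, its internals (or
# the DataLayer below it) can evolve without touching the application layer.
```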

**However, a layered architecture has its own challenges: layer boundaries and interfaces must be planned properly, and during development the constraints of the layered architecture must be strictly followed, prohibiting cross-layer and reverse (lower-to-upper) invocation.** In practice, large layers can themselves be further layered internally.

Layering is a logical division: the three tiers can initially be deployed on the same physical machine. As the business grows, however, the layered modules must be deployed separately so that the website has more computing resources to cope with a growing number of users.

Segmentation

If layering slices the software horizontally, segmentation slices it vertically.

The larger the site, the more complex its functions and the more kinds of services and data it must handle. Separating these different functions and services into modules with high cohesion and low coupling helps software development and maintenance on the one hand, and on the other hand facilitates distributed deployment of the modules, improving the site's concurrent processing ability and its capacity to add functionality.

For a large site, segmentation can be quite fine-grained. In the application layer, for example, different businesses such as shopping, forum, search, and advertising are split into separate applications, each owned by an independent team and deployed on its own servers.

Distribution

For large sites, one of the main purposes of layering and segmentation is to enable distributed deployment of the resulting modules: different modules are deployed on different servers and work together through remote calls. Distribution means more resources can be used for the same function, handling more concurrent access and more data.

However, while solving the problem of high concurrency, distribution also brings other problems. Typical examples are the following:

  1. Service invocations must go over the network, which can seriously affect performance.

  2. The more servers there are, the higher the probability that one of them is down. A failed service can make many applications inaccessible and reduce the overall availability of the website.

  3. It is difficult to maintain data consistency in a distributed environment, and distributed transactions are difficult to guarantee.

  4. System dependencies become complex, making development, management, and maintenance difficult.

Therefore, distribution should be adopted only as the situation warrants. Common distributed solutions include distributed services, distributed databases, distributed computing, distributed configuration, distributed locks, and distributed file systems.

Clustering

Under the distributed model, the layered and segmented modules are deployed independently. For modules that attract concentrated user access, however, the independently deployed servers need to be clustered: multiple servers deploy the same application, form a cluster, and provide a combined service through a load-balancing device.

Because a cluster has multiple servers providing the same service, it offers better concurrency; when more users arrive, new machines can simply be added to the cluster. At the same time, when a server fails, the load-balancing device or the system's failover mechanism forwards requests to other servers in the cluster, improving availability.
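A sketch of the failover idea, with health status simulated by a hard-coded dict and invented addresses: the balancer simply skips servers that fail their health check.

```python
# Cluster failover sketch: requests are only dispatched to healthy servers.

servers = ["10.0.0.21:8080", "10.0.0.22:8080", "10.0.0.23:8080"]
healthy = {"10.0.0.21:8080": True, "10.0.0.22:8080": False, "10.0.0.23:8080": True}

def pick_server(request_id):
    candidates = [s for s in servers if healthy[s]]      # drop failed nodes
    if not candidates:
        raise RuntimeError("no healthy application server available")
    return candidates[request_id % len(candidates)]      # spread load over the rest

print(pick_server(0))   # requests keep flowing even though .22 is down
print(pick_server(1))
```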

Caching

Caching stores data in the location nearest to the computation in order to speed up processing. It is the first resort for improving software performance and is nearly ubiquitous in complex software design: common examples include reverse proxies, Redis (with persistence disabled), and CDNs.

There are two prerequisites for using a cache. First, data access must be unevenly distributed: some data is accessed much more frequently, and that data is worth keeping in the cache. Second, the data must remain valid for some period rather than expiring immediately; otherwise the cached data goes stale, producing dirty reads that affect the correctness of results.

In addition to speeding up data access, caching reduces the load on back-end applications and data storage. Almost all website databases are sized for load on the assumption that a cache sits in front of them.
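A sketch of the second prerequisite above: cache entries carry an expiry time so stale data is not served indefinitely. The TTL values and keys are arbitrary.

```python
import time

# Cache entries expire after a TTL so dirty data is not served forever.

cache = {}   # key -> (value, expires_at)

def cache_set(key, value, ttl_seconds=60):
    cache[key] = (value, time.time() + ttl_seconds)

def cache_get(key):
    entry = cache.get(key)
    if entry is None:
        return None
    value, expires_at = entry
    if time.time() > expires_at:
        del cache[key]        # expired: force the caller back to the database
        return None
    return value

cache_set("product:42", {"price": 9.99}, ttl_seconds=1)
print(cache_get("product:42"))   # fresh hit
time.sleep(1.1)
print(cache_get("product:42"))   # None: entry expired, must re-read the source
```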

Asynchrony

An important goal of application system design is to reduce coupling. Besides the layering, segmentation, and distribution already mentioned, another important means of decoupling is asynchrony: instead of synchronous calls between businesses, an operation is split into multiple stages that execute asynchronously and collaborate by sharing data.

**An asynchronous architecture is a typical producer-consumer model. The two sides never call each other directly; as long as the shared data structure stays the same, either side's implementation can change freely without affecting the other, which makes it very convenient to add new features to the website.** In addition, using asynchronous message queues has the following advantages:

  • **Improved system availability.** If a consumer server fails, data accumulates in the message queue server while producer servers keep handling business requests, and the system as a whole operates without apparent failure. Once the consumer server recovers, it resumes processing the queued data.

  • **Faster website response.** The producer server at the front of the business flow writes data to the message queue after handling a request and can return immediately, without waiting for the consumer server to finish, which reduces response latency.

  • **Smoothing of concurrent access peaks.** User traffic is random and bursty, with peaks and troughs. A message queue absorbs sudden surges of request data, which then waits to be processed by the consumer servers in turn, so spikes do not put excessive strain on the site's overall load.

However, note that processing business asynchronously may affect the user experience and the business flow, and it requires support from the website's product design.
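A minimal producer-consumer sketch using an in-process `queue.Queue` as a stand-in for a message queue server; the order payloads and timings are invented for illustration.

```python
import queue
import threading
import time

# Producer-consumer sketch: the producer returns immediately after enqueueing;
# the consumer drains the queue at its own pace.

message_queue = queue.Queue()

def producer():
    for i in range(3):
        message_queue.put({"order_id": i})     # respond to the user right after this
        print(f"producer: enqueued order {i}")

def consumer():
    while True:
        msg = message_queue.get()
        if msg is None:                         # sentinel: stop consuming
            break
        time.sleep(0.1)                         # pretend this is slow back-end work
        print(f"consumer: processed order {msg['order_id']}")
        message_queue.task_done()

t = threading.Thread(target=consumer)
t.start()
producer()
message_queue.join()          # wait until all queued work is done
message_queue.put(None)
t.join()
```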

Redundancy

A website must run 7×24, yet servers can fail at any time; at a certain scale, a server going down is practically inevitable. To keep serving and avoid losing data when a server fails, a degree of redundancy is required: redundant servers and redundant data backups, so that when one machine goes down, its services and data access can be shifted to other machines.

Even services with low traffic and load should be deployed as a cluster of at least two servers, achieving high availability through redundancy. Besides periodic archiving for cold backup, the database should also be split into master and slave with real-time synchronization to provide hot backup.

Automation and Security

At present, automation in application system architecture focuses mainly on release and operations. This includes automated release, automated code management, automated testing, automated security monitoring, automated deployment, automated monitoring, automated alerting, automated failover and recovery, automated degradation, and automated resource allocation.

In terms of security architecture, many patterns have also accumulated: identity authentication through passwords and mobile phone verification codes; encryption of network communication for login, transactions, and other sensitive operations, and encryption of sensitive data stored on website servers, such as user information; CAPTCHAs to identify bots that abuse network resources to attack the site; countermeasures such as encoding conversion against common attacks like XSS and SQL injection; filtering of spam and sensitive information; and risk control for transfers and other important operations based on transaction patterns and transaction information.
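As one concrete illustration of two of these defenses, the sketch below uses only the Python standard library: parameterized queries against SQL injection and output escaping against XSS. The table and data are made up.

```python
import html
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, bio TEXT)")
conn.execute("INSERT INTO users VALUES (?, ?)", ("alice", "<script>alert(1)</script>"))

user_input = "alice' OR '1'='1"   # a typical injection attempt

# Parameterized query: the driver treats user_input as data, never as SQL.
rows = conn.execute("SELECT name, bio FROM users WHERE name = ?", (user_input,)).fetchall()
print(rows)            # [] - the injection attempt matches nothing

# Escape untrusted content before rendering it into HTML to blunt XSS.
name, bio = conn.execute("SELECT name, bio FROM users WHERE name = ?", ("alice",)).fetchone()
print(f"<p>{html.escape(bio)}</p>")   # script tags are rendered inert
```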

Architecture Core Elements

Wikipedia defines architecture as “an abstract description of the overall structure and components of software used to guide the design of various aspects of a large software system”.

Generally speaking, in addition to functional requirements, software architecture needs to focus on performance, availability, scalability, extensibility, and security.

Performance

Performance is an important metric for any website, and architecture design must account for potential performance problems. Because such problems can appear almost anywhere, there are many ways to optimize site performance, which can be summarized as follows:

  • Browser: browser caching, page compression, sensible page layout, reduced cookie transfer, etc.

  • CDN and reverse proxy

  • Local and distributed caches

  • Asynchronous message queue

  • Application layer: server cluster

  • Code layer: multi-threading, improved memory management, etc

  • Data layer: indexing, caching, SQL optimization, and rational use of NoSQL databases

Availability

The main approach to high availability is redundancy: applications are deployed on multiple servers that serve traffic simultaneously, and data is stored on multiple servers that back each other up. The failure of any single server then neither affects overall application availability nor causes data loss.

For application servers, a load-balancing device combines multiple servers into a cluster that serves traffic together; when any server goes down, requests are simply switched to the other servers. The prerequisite is that application servers must not keep session state for requests locally.
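A sketch of what "stateless application servers" means in practice, with a plain dict standing in for a shared session store such as a distributed cache; identifiers and names are invented.

```python
import uuid

# Session data lives in a shared store, so any server in the cluster can
# handle any request and failover loses nothing.

shared_session_store = {}   # stand-in for a remote cache shared by the cluster

def login(user_id):
    session_id = str(uuid.uuid4())
    shared_session_store[session_id] = {"user_id": user_id}   # not kept on this server
    return session_id                                          # returned to the browser as a cookie

def handle_request(session_id):
    session = shared_session_store.get(session_id)
    # Any application server can run this code; after a failover the session
    # is still found in the shared store.
    return f"hello user {session['user_id']}" if session else "please log in"

sid = login(42)
print(handle_request(sid))
```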

For storage servers, data must be backed up in real time. When a server goes down, data access is redirected to an available server and the lost replicas are restored, so that data remains available even if other servers subsequently fail.

Beyond the runtime environment, high availability also depends on quality assurance in the software development process: pre-release verification, automated testing, automated release, gray release, and similar practices reduce the chance of introducing faults into the production environment.

Scalability

The main criteria for an architecture's scalability are whether a cluster can be built from multiple servers, whether it is easy to add new servers to the cluster, whether a newly added server can provide the same service as the existing ones, and whether there is a limit on the total number of servers the cluster can hold.

For application server clusters, servers can be added continuously behind an appropriate load-balancing device. For cache server clusters, an efficient cache routing algorithm is needed so that adding new servers does not invalidate existing cache routes. Relational databases are hard to scale into large clusters, so their cluster scalability must be implemented outside the database itself, combining multiple database servers into a cluster through routing, partitioning, and similar means. Most NoSQL databases, by contrast, are built for massive data from the start, so their scalability is usually very good.
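A minimal consistent-hashing sketch (one common cache routing algorithm): adding a server only remaps the keys that fall into its slice of the hash ring, instead of invalidating almost the whole cache as a naive `hash(key) % N` scheme would. The node names and replica count are arbitrary.

```python
import bisect
import hashlib

def _hash(value):
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes, replicas=100):
        self.replicas = replicas
        self._ring = []                     # sorted list of (point, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self.replicas):      # virtual nodes smooth the distribution
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def node_for(self, key):
        point = _hash(key)
        idx = bisect.bisect(self._ring, (point,)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["cache-1", "cache-2", "cache-3"])
before = {k: ring.node_for(k) for k in (f"key{i}" for i in range(1000))}
ring.add("cache-4")                          # scale out by one cache server
after = {k: ring.node_for(k) for k in before}
moved = sum(1 for k in before if before[k] != after[k])
print(f"{moved} of 1000 keys moved after adding a server")   # only a fraction remap
```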

Extensibility

The main measure of extensibility is how little coupling exists between different products: when a new business product is added to the website, can it be launched transparently, with no impact on existing products and with no or only minor changes to existing business functions?

The main means of building an extensible website architecture are event-driven architecture and distributed services.

Event-driven architecture is usually implemented on websites with message queues: user requests and other business events are packaged as messages and published to a queue, and message handlers, acting as consumers, take messages from the queue and process them. Because message production is separated from message processing, new message producers or new message consumers can be added transparently.

Distributed services separate business logic from reusable services, which are invoked through a distributed service framework. A new product can implement its own business logic by calling reusable services, with no impact on existing products. When a reusable service is upgraded, multiple versions can be offered so the upgrade is transparent and applications are not forced to change in lockstep.

Security

The security architecture of a website protects it from malicious access and attack and keeps its important data from being stolen. The measure of a website's security architecture is whether it has a credible strategy for dealing with both existing and potential attacks and theft.