The technical challenges of large websites stem mainly from huge numbers of users, high concurrent access, and massive volumes of data. Even a simple business becomes very difficult once it has to serve tens of millions of data records and hundreds of millions of users. Large website architecture exists to solve exactly this class of problems.
This is also one of the most common topics senior developers and architects face in interviews: not being familiar with the different stages of website architecture can easily affect the outcome of the interview.
1. The initial stage of website architecture
Large websites grow out of small ones, and so does their architecture, which evolves gradually from a small-site design. A small website has few visitors at first, so a single server is more than enough. The architecture at this stage looks like this:
Applications, databases, files, and all resources are on one server.
2. Application services and data services are separated
As the website's business grows, one server gradually becomes insufficient: performance degrades as more and more users access the site, and growing data volumes exhaust the storage space. At this point the application needs to be separated from the data.
After separating applications and data, the entire site uses three servers: application server, file server, and database server. The three servers have different requirements for hardware resources:
Application servers handle a lot of business logic, so they need faster, more powerful CPUs; database servers need fast disk retrieval and data caching, so they need faster disks and more memory; file servers store large numbers of user-uploaded files, so they need larger hard disks.
At this point, the architecture of the website system is shown in the figure below:
After applications and data are separated, servers with different characteristics take on different service roles, and the website's concurrent processing capacity and data storage space improve greatly, supporting further business growth. However, as the number of users keeps increasing, the site faces a new challenge: excessive database pressure causes access delays, which degrade the performance of the whole site and hurt the user experience. The architecture needs to be optimized further.
3. Use caching to improve site performance
Web access follows the same 80/20 rule as wealth distribution in the real world: 80% of business accesses concentrate on 20% of the data. Since most accesses concentrate on a small fraction of the data, caching that fraction in memory reduces the pressure on the database, speeds up data access across the whole site, and improves the database's write performance.
There are two types of caches used by web sites: local caches cached on application servers and remote caches cached on dedicated distributed cache servers.
A local cache is faster to access, but it is constrained by the application server's memory and competes with the application for it. A remote distributed cache can be deployed as a cluster, with large-memory machines serving as dedicated cache servers; in theory, such a cache service is not limited by the memory of any single machine.
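A local cache of the kind described here can be sketched in a few lines of Java with `LinkedHashMap` in access order. This is only an illustration (the class name is made up); a real site would typically use a caching library or a remote cache such as Redis.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal local LRU cache: evicts the least-recently-used entry
// once the capacity is exceeded.
public class LocalLruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LocalLruCache(int capacity) {
        super(16, 0.75f, true); // accessOrder = true gives LRU behavior
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }

    public static void main(String[] args) {
        LocalLruCache<String, String> cache = new LocalLruCache<>(2);
        cache.put("user:1", "Alice");
        cache.put("user:2", "Bob");
        cache.get("user:1");          // touch user:1 so it becomes most recent
        cache.put("user:3", "Carol"); // evicts user:2, the least recently used
        System.out.println(cache.containsKey("user:2")); // false
        System.out.println(cache.get("user:1"));         // Alice
    }
}
```

The same get/put interface could be backed by a remote cache client instead, which is exactly why sites often hide the choice behind a common caching API.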
Once caching is in place, data access pressure is effectively relieved, but a single application server can handle only a limited number of connection requests; during peak periods, the application server becomes the bottleneck of the entire website.
4. Use application server clusters to improve the concurrent processing capacity of websites
Clustering is a common way to tackle high concurrency and massive data.
When a server's processing capacity or storage space is insufficient, do not simply swap in a more powerful machine: for a large website, no single server, however powerful, can keep up with continuously growing business demand. It is more appropriate to add servers that share the original server's access and storage load.
As far as website architecture is concerned, as long as adding one server improves the load capacity, we can keep adding servers in the same way to improve system performance, thereby achieving scalability.
Clustering application servers is a relatively simple and mature design for a scalable website architecture, as shown in the figure below:
Through the load balancing scheduling server, the access requests from users’ browsers can be distributed to any server in the application server cluster. If there are more users, more application servers can be added to the cluster, so that the pressure of application servers will no longer become the bottleneck of the whole website.
5. Database read/write separation
After the website adopts caching, most read operations can be completed without touching the database, but some reads still reach it (cache misses, cache expiration), and all write operations must access it. Once the site reaches a certain scale, the database again becomes the bottleneck because of its heavy load.
Most mainstream databases provide the master/slave hot backup function. By configuring the master/slave relationship between two databases, data updates from one database server can be synchronized to the other server.
The website uses this database feature to implement read/write separation, thereby relieving the load on the database. As shown below:
When writing data, the application server accesses the master database. The master database synchronizes data updates to the slave database through the master/slave replication mechanism, so that the application server can obtain data from the slave database when reading data. In order to facilitate application programs to access the database after read/write separation, a special data access module is usually used on the application server to make the database read/write separation transparent to applications.
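The routing logic of such a data access module can be sketched as follows. The data source names are placeholders; a real implementation would wrap JDBC `DataSource` objects and handle replication lag, but the routing decision itself is this simple.

```java
// Sketch of the data access module described above: it hides
// read/write separation from application code by sending writes to
// the master and rotating reads across the slaves.
public class ReadWriteRouter {
    private final String master = "jdbc-master";
    private final String[] slaves = {"jdbc-slave-1", "jdbc-slave-2"};
    private int nextSlave = 0;

    // All writes go to the master so replication can fan them out.
    public String routeWrite() {
        return master;
    }

    // Reads rotate across slaves to spread query load.
    public String routeRead() {
        String slave = slaves[nextSlave];
        nextSlave = (nextSlave + 1) % slaves.length;
        return slave;
    }

    public static void main(String[] args) {
        ReadWriteRouter router = new ReadWriteRouter();
        System.out.println(router.routeWrite()); // jdbc-master
        System.out.println(router.routeRead());  // jdbc-slave-1
        System.out.println(router.routeRead());  // jdbc-slave-2
    }
}
```

Because the application only calls `routeRead`/`routeWrite` (or, in practice, a DAO built on top of them), read/write separation stays transparent to business code, as the section notes.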
6. Use reverse proxy and CDN to speed up website response
With the continuous development of the website business, the scale of users is increasing. Due to the complex network environment in China, the speed of accessing the website varies greatly among users in different regions. Studies have shown that website access delay is positively correlated with user turnover rate. The slower a website access is, the more likely users are to lose patience and leave. In order to provide a better user experience and retain users, websites need to speed up website access.
The main means are CDNs and reverse proxies. As shown below:
CDN and reverse proxy are based on caching.
A CDN is deployed in the network providers' machine rooms, so when users request web services they can fetch data from the machine room of their own provider. A reverse proxy is deployed in the website's central machine room: when a user request reaches the central machine room, the first server it hits is the reverse proxy, and if the requested resource is cached there, it is returned to the user directly.
The purpose of using CDN and reverse proxy is to return data to the user as soon as possible, on the one hand, to speed up the user access speed, on the other hand, to reduce the load on the back-end server.
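The cache-then-forward behavior of a reverse proxy can be sketched like this. It is a toy model (the "origin" is just a function and there is no expiration), but it shows why repeated requests never reach the back-end servers.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch of reverse-proxy caching: serve a cached copy when one
// exists, otherwise fetch from the origin server and remember it.
public class ReverseProxyCache {
    private final Map<String, String> cache = new HashMap<>();
    private final Function<String, String> origin;

    public ReverseProxyCache(Function<String, String> origin) {
        this.origin = origin;
    }

    public String handle(String path) {
        // Cache hit: return directly, never touching the back end.
        String cached = cache.get(path);
        if (cached != null) {
            return cached;
        }
        // Cache miss: forward to the origin and store the response.
        String response = origin.apply(path);
        cache.put(path, response);
        return response;
    }

    public static void main(String[] args) {
        int[] originHits = {0};
        ReverseProxyCache proxy = new ReverseProxyCache(path -> {
            originHits[0]++;
            return "<html>" + path + "</html>";
        });
        proxy.handle("/index.html");
        proxy.handle("/index.html"); // served from cache
        System.out.println(originHits[0]); // 1: origin contacted only once
    }
}
```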
7. Use distributed file systems and distributed database systems
No single powerful server can meet the growing business needs of large web sites. After read and write separation, the database is divided from one server into two servers. However, with the development of website business, it still cannot meet the demand, so it needs to use distributed database. The same goes for file systems, which require a distributed file system.
As shown below:
Distributed database is the last resort for website database splitting and is only used when single table data size is very large. A more common method of database splitting for web sites is to deploy data from different businesses on different physical servers.
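Splitting by business boils down to a small routing table from business name to physical database. The JDBC URLs below are invented for illustration; the point is that each product line's data lives on its own server.

```java
import java.util.Map;

// Sketch of business-based database splitting: each business's
// tables live on a dedicated physical server, and a router maps a
// business name to its database.
public class BusinessDbRouter {
    private static final Map<String, String> BUSINESS_TO_DB = Map.of(
            "user",  "jdbc:mysql://db-user:3306/user",
            "order", "jdbc:mysql://db-order:3306/order",
            "item",  "jdbc:mysql://db-item:3306/item");

    public static String dbFor(String business) {
        String url = BUSINESS_TO_DB.get(business);
        if (url == null) {
            throw new IllegalArgumentException("unknown business: " + business);
        }
        return url;
    }

    public static void main(String[] args) {
        System.out.println(dbFor("order")); // jdbc:mysql://db-order:3306/order
    }
}
```

Only when a single business's tables outgrow one server does the site need true distributed-database sharding, which is why the text calls that the last resort.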
8. Use NoSQL and search engines
As website business grows more complex, so do the requirements for data storage and retrieval, and websites need to adopt non-relational technologies such as NoSQL databases and non-database query technologies such as search engines.
As shown below:
Both NoSQL and search engines are technologies derived from the Internet and have better support for scalable distributed features. The application server accesses all kinds of data through a unified data access module, which relieves the application of managing many data sources.
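The unified data access module can be sketched as a facade that decides which backing store serves each request. Here two plain maps stand in for real clients (e.g. a Redis cache and a MySQL database); the cache-aside pattern shown is one common choice, not the only one.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a unified data access module: the application asks one
// facade for data, and the facade coordinates the backing stores.
public class UnifiedDataAccess {
    private final Map<String, String> cache = new HashMap<>();    // e.g. Redis
    private final Map<String, String> database = new HashMap<>(); // e.g. MySQL

    public String get(String key) {
        // Try the cache first, fall back to the database, then
        // populate the cache so subsequent reads are fast.
        String value = cache.get(key);
        if (value == null) {
            value = database.get(key);
            if (value != null) {
                cache.put(key, value);
            }
        }
        return value;
    }

    public void put(String key, String value) {
        database.put(key, value);
        cache.remove(key); // invalidate so stale data is not served
    }

    public static void main(String[] args) {
        UnifiedDataAccess dao = new UnifiedDataAccess();
        dao.put("item:42", "keyboard");
        System.out.println(dao.get("item:42")); // keyboard (loads into cache)
    }
}
```

Because the application only ever talks to this facade, swapping a backing store (say, adding a search engine for full-text queries) does not ripple through business code.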
9. Business split
Large sites respond to increasingly complex business scenarios by dividing their entire web business into different product lines using a divide-and-conquer approach. For example, large shopping and transaction websites will split their home page, shops, orders, buyers and sellers into different product lines, which are under the responsibility of different business teams.
Technically, the website is divided into many different applications along product lines, each deployed and maintained independently. Applications can relate to one another through hyperlinks (each navigation link on the home page points to a different application address), exchange data through message queues, and, most importantly, access the same data storage systems, together forming a complete, interconnected system.
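Message-queue decoupling between product lines can be illustrated with an in-process queue: the producing application publishes an event and the consuming application processes it later, so neither calls the other directly. A real site would use a broker such as Kafka or RabbitMQ; this sketch only shows the shape of the interaction.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of asynchronous data distribution between applications:
// the order application emits an event; the shop application
// consumes it on its own schedule.
public class OrderEventQueue {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(100);

        // Producer side: the order application publishes an event.
        queue.put("order-created:1001");

        // Consumer side: the shop application handles it asynchronously.
        String event = queue.take();
        System.out.println("shop app handled " + event);
    }
}
```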
As shown below:
10. Distributed microservices
As services are split ever finer and storage systems grow ever larger, the overall complexity of the application system increases exponentially, making deployment and maintenance harder and harder. Because every application needs to connect to every database system, on a website with tens of thousands of servers the number of connections grows with the square of the server count, exhausting database connection resources and causing denial of service.
Since every application system performs many of the same service operations, such as user management and product management, these common services can be extracted and deployed independently. These reusable services connect to the databases and provide shared business services, while the application systems only need to manage their own user interfaces and complete specific business operations by calling the shared services through a distributed-service framework.
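Extracting a shared service mainly means depending on an interface rather than on the database. In the sketch below, a local implementation stands in for the remote one; in production the call would go over a distributed-service framework such as Dubbo or gRPC, and only the service's own deployment would hold a database connection. All names here are illustrative.

```java
// Sketch of a shared user service: applications hold only the
// interface, while one independently deployed service owns the
// user database behind it.
import java.util.HashMap;
import java.util.Map;

interface UserService {
    String findName(long userId);
}

// The single deployment that actually connects to the user store
// (a map stands in for the database here).
class UserServiceImpl implements UserService {
    private final Map<Long, String> userDb = new HashMap<>();

    UserServiceImpl() {
        userDb.put(1L, "Alice");
    }

    @Override
    public String findName(long userId) {
        return userDb.get(userId);
    }
}

public class SharedServiceDemo {
    public static void main(String[] args) {
        // An application system completes its business by calling the
        // shared service; it never opens a database connection itself.
        UserService users = new UserServiceImpl();
        System.out.println(users.findName(1L)); // Alice
    }
}
```

This is what collapses the connection count from applications × databases to applications × services plus services × databases.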
As shown below: