This post is part of a series on large-site architecture. The topic this time is an e-commerce site architecture case: starting from the requirements of an e-commerce website and a single-machine architecture, we evolve step by step into a commonly used distributed architecture prototype that can serve as a reference. Besides functional requirements, the site also has non-functional quality requirements (architectural goals) such as high performance, high availability, scalability, and extensibility.

Adapted and extended according to actual needs, this prototype can support page views (PV) on the order of ten million without difficulty.

Outline of this series:

- Why an e-commerce case
- E-commerce website requirements
- Initial site architecture
- System capacity estimation
- Website architecture analysis
- Website architecture optimization
- Architecture summary

The e-commerce site case is covered in three parts. This part mainly explains the site requirements, the initial site architecture, and the system capacity estimation method.

1. Why an e-commerce case

Distributed large websites currently fall into a few main types:

(1) Large portals, such as NetEase and Sina;
(2) SNS websites, such as Xiaonei and Kaixin001;
(3) E-commerce websites, such as Alibaba, Jingdong Mall (JD.com), Gome Online, and Autohome.

Large portals mostly serve news content and can be optimized with CDN, static pages, and similar methods. SNS sites such as Kaixin001 are more interactive and may rely more on NoSQL, distributed caches, and high-performance communication frameworks. E-commerce websites combine the characteristics of both: product detail pages can use CDN and static pages, while highly interactive features need NoSQL and similar technologies. That is why we use an e-commerce website as the case for analysis.

2. E-commerce website requirements

Customer requirements:

- Build a full-category e-commerce website (B2C) where users can buy goods online and pay either online or by cash on delivery;
- Users can talk to customer service online while purchasing;
- After receiving a product, users can rate and review it;
- A mature purchase-sales-inventory system already exists and must be integrated with the website;
- The site should support 3 to 5 years of business growth;
- The number of users is expected to reach 10 million within 3 to 5 years;
- Promotions such as Double 11, Double 12, and the March 8 Men's Day are held regularly;
- For other features, refer to sites such as JD.com or Gome Online.

Customers are customers: they will not tell you exactly what is required, only what they want, so we often have to guide them and dig out the real requirements. Fortunately, this customer provided clear reference sites. The next step is therefore to do a lot of analysis, drawing on the industry and the reference sites, and offer the customer a solution.

(The remaining requirements are omitted here.)

Requirement function matrix

Requirements management traditionally describes requirements with use case diagrams or module diagrams (requirements lists). This often overlooks a very important class of requirements, the non-functional ones, so it is recommended to describe requirements with a requirements function matrix.

The requirements matrix of this e-commerce website is as follows:

| Site requirement | Functional requirements | Non-functional requirements |
|---|---|---|
| Build a full-category e-commerce website | Category management, product management | Convenient multi-category management (flexibility); fast site access (performance); image storage (massive numbers of small images) |
| Users can buy goods online | Member management, shopping cart, checkout | Good shopping experience (availability, performance) |
| Online payment or cash on delivery | Multiple online payment methods | Secure payment process with data encryption (security); flexible switching among payment interfaces (flexibility, extensibility) |
| Online communication with customer service | Online customer service | Reliability: instant messaging |
| Product rating and review | Product reviews | |
| A mature purchase-sales-inventory system exists | Integration with the purchase-sales-inventory system | A constraint; data consistency and robustness must be considered during integration |
| Support 3 to 5 years of business growth | | A constraint; scalability, extensibility |
| Users reach 10 million in 3 to 5 years | | A constraint |
| Promotions such as Double 11, Double 12, and March 8 Men's Day | Promotion management, flash sales | Traffic surges (scalability); real-time requirements (high performance) |
| Refer to Jingdong or Gome Online | | Reference condition |

The above is a simple example of e-commerce website requirements. It has two purposes: (1) to show that requirements analysis must be comprehensive, and that for large distributed systems the focus is on non-functional requirements; (2) to describe a simple e-commerce scenario so that the following analysis and design have a basis.

3. Initial site architecture

In the beginning, a typical site uses three servers: one for the application, one for the database, and one for the NFS file system.

This was the more traditional practice a few years ago. I once saw a vertical portal for clothing design with more than 100,000 members and a huge number of images that used a single server for the application, the database, and image storage. It had a lot of performance problems.

The diagram below:

Mainstream website architecture has since changed dramatically. Clusters are generally used for high availability, so the architecture looks at least like this:

(1) Use a cluster of redundant application servers to achieve high availability (the load balancer can be deployed together with the applications);

(2) Use database active/standby mode to achieve data backup and high availability.

4. System capacity estimation

Estimation procedure:

- Work from the number of registered users to daily UV, then daily PV, then daily concurrency;
- Peak estimate: 2 to 3 times the normal volume;
- Calculate the system capacity from the number of concurrent requests and the storage capacity.

Customer requirement: the number of registered users reaches 10 million within 3 to 5 years.

Estimated number of concurrent requests per second:

- Daily UV: 2 million (80/20 rule);
- Each visitor clicks and browses 30 pages a day;
- Daily PV: 2 million × 30 = 60 million;
- Concentrated traffic: 24 × 0.2 = 4.8 hours carry 60 million × 0.8 = 48 million PV (80/20 rule);
- Per minute: 4.8 × 60 = 288 minutes, and 48 million / 288 ≈ 167,000 requests per minute;
- Per second: 167,000 / 60 ≈ 2,780 requests per second;
- Peak: assuming the peak is three times the normal value, the rate can reach about 8,340 requests per second, or roughly 8.3 requests per millisecond.

Do you regret not studying math now? (I am not sure whether the calculation above contains mistakes, ha ha.)
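The arithmetic above can be replayed in a few lines of Python as a sanity check. This is only a sketch: the function and parameter names are my own, and the defaults are the assumptions stated in the estimate (20% of registered users visit per day, 30 page views each, 80% of the traffic falls in 20% of the day, peak is three times normal).

```python
def estimate_requests_per_second(registered_users,
                                 active_pct=20,
                                 pv_per_visitor=30,
                                 busy_traffic_pct=80,
                                 busy_time_pct=20,
                                 peak_factor=3):
    """Replay the capacity estimate; every default is an assumption from the text."""
    daily_uv = registered_users * active_pct // 100       # daily unique visitors
    daily_pv = daily_uv * pv_per_visitor                  # daily page views
    busy_pv = daily_pv * busy_traffic_pct // 100          # page views in the busy window
    busy_seconds = 24 * 3600 * busy_time_pct // 100       # length of the busy window
    normal_rps = busy_pv / busy_seconds                   # requests per second, normal
    peak_rps = normal_rps * peak_factor                   # requests per second, peak
    return {
        "daily_uv": daily_uv,
        "daily_pv": daily_pv,
        "normal_rps": normal_rps,
        "peak_rps": peak_rps,
    }


estimate = estimate_requests_per_second(10_000_000)
for name, value in estimate.items():
    print(f"{name}: {value:,.0f}")
```

Without intermediate rounding the result is slightly lower than the figures in the text, which round along the way, but the order of magnitude is the same.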

Server estimation (using a Tomcat server as an example):

- Assume each web server supports 300 concurrent requests per second [the default Tomcat configuration is 150];
- Normal load: about 10 servers are required;
- Peak period: about 30 servers are required.

Capacity estimation: the 70/90 principle

System CPU usage is generally kept at around 70% and reaches about 90% at peak. This neither wastes resources nor sacrifices stability. Memory and IO are treated similarly.

The above estimates are for reference only, because server configuration, business logic complexity, and other factors all have an impact. CPU, disk, network, and so on are not evaluated here.
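The server count follows from dividing the request rate by what one server can handle. A minimal sketch, assuming the 300 requests per second per server used above; the `target_utilization` parameter is my own addition to show how a 70% utilization target might be folded in, and is not part of the original estimate.

```python
import math


def servers_needed(requests_per_second, per_server_capacity=300,
                   target_utilization=1.0):
    """Number of servers so that each one stays at or below the target utilization."""
    usable_capacity = per_server_capacity * target_utilization
    return math.ceil(requests_per_second / usable_capacity)


normal_servers = servers_needed(2780)   # normal load from the estimate above
peak_servers = servers_needed(8340)     # peak load from the estimate above
with_headroom = servers_needed(2780, target_utilization=0.7)  # hypothetical 70% target
print(normal_servers, peak_servers, with_headroom)
```

The peak figure computed this way is a little below the 30 servers quoted above, which is a rounded-up estimate.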

5. Website architecture analysis

Based on the above projections, there are several problems:

- A large number of servers must be deployed, perhaps 30 web servers at peak load. These 30 servers are only fully used during flash sales and promotions, so a lot of capacity is wasted.
- All applications are deployed on the same servers, which couples the applications tightly. Vertical and horizontal splitting is required.
- A large amount of redundant code exists across applications.
- Server session synchronization consumes a large amount of memory and network bandwidth.

Large sites generally need the following architectural optimizations. (Optimization is considered during architecture design and usually happens at the architecture or code level; tuning mainly adjusts simple parameters, such as JVM tuning. If tuning involves a lot of code changes, it is not tuning but refactoring.)

- Service splitting
- Application cluster deployment (distributed deployment, cluster deployment, and load balancing)
- Multi-level cache
- Single sign-on (distributed Session)
- Database cluster (read/write separation, splitting databases and tables)
- Servitization
- Message queue
- Other technologies

6. Website architecture optimization

6.1 Service splitting

The system is split vertically by business attribute into a product subsystem, shopping subsystem, payment subsystem, comment subsystem, customer service subsystem, and interface subsystem (for integration with external systems such as purchase-sales-inventory and SMS).

According to the hierarchical definition of business subsystem, it can be divided into core system and non-core system. Core system: product subsystem, shopping subsystem, payment subsystem; Non-core: comment subsystem, customer service subsystem, interface subsystem.

Benefits of service splitting: dedicated teams and departments can take responsibility for individual subsystems, so that specialists do specialized work, which addresses coupling between modules and extensibility. Each subsystem is deployed separately, which avoids the situation in centralized deployment where one failed application makes all applications unavailable.

Purpose of the level definition: during traffic bursts, critical applications are protected from being affected and the system degrades gracefully.
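To make the idea of graceful degradation concrete, here is a minimal sketch. The subsystem names follow the core/non-core classification above; the load value, the threshold, and the function names are hypothetical and only illustrate the decision, not a real traffic-control component.

```python
CORE = {"product", "shopping", "payment"}
NON_CORE = {"comment", "customer_service", "interface"}


def allowed_subsystems(current_load, degrade_threshold):
    """Under heavy load only core subsystems stay enabled."""
    if current_load >= degrade_threshold:
        return set(CORE)
    return CORE | NON_CORE


def handle(subsystem, current_load, degrade_threshold=0.9):
    """Serve the request, or return a degraded response for switched-off subsystems."""
    if subsystem in allowed_subsystems(current_load, degrade_threshold):
        return "served"
    return "degraded"


print(handle("payment", current_load=0.95))   # core: still served under load
print(handle("comment", current_load=0.95))   # non-core: degraded under load
print(handle("comment", current_load=0.50))   # non-core: served when load is normal
```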

Split architecture diagram:

For details, see Deployment Plan 2

6.2 Application Cluster Deployment (Distributed, Clustered, and Load Balancing)

Distributed deployment: After services are split, applications are deployed separately. Applications directly communicate with each other through RPC.

Cluster deployment: E-commerce sites require high availability, so each application must be deployed on at least two servers as a cluster.

Load balancing: a high-availability system needs load balancing. Ordinary applications achieve high availability through a load balancer, distributed services through their built-in load balancing, and relational databases through active/standby mode.

Architecture diagram after cluster deployment:

6.3 Multi-level Cache

Caches are generally classified into two types by where the data is stored: local caches and distributed caches. This case uses a two-level cache design: level 1 is a local cache and level 2 is a distributed cache. (There are also page caches, fragment caches, and so on, which is a finer-grained classification.)

The level 1 cache holds the data dictionary and frequently used hot data, such as information that is basically immutable or changes on a regular schedule. The level 2 cache holds everything that needs to be cached. When the level 1 cache has expired or is unavailable, the level 2 cache is accessed. If the level 2 cache has no data either, the database is accessed.
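The lookup order just described (level 1, then level 2, then the database) can be sketched as follows. Plain dictionaries stand in for both caches, and `load_product` is a hypothetical database loader; a real level 1 cache would hold only the dictionary and hot data and would expire entries, which this sketch leaves out.

```python
class TwoLevelCache:
    """Level 1 is an in-process dict; level 2 stands in for a distributed cache."""

    def __init__(self, distributed_cache, load_from_db):
        self.local = {}
        self.distributed = distributed_cache
        self.load_from_db = load_from_db

    def get(self, key):
        if key in self.local:                 # level 1 hit
            return self.local[key]
        if key in self.distributed:           # level 2 hit
            value = self.distributed[key]
        else:                                 # miss in both: go to the database
            value = self.load_from_db(key)
            self.distributed[key] = value
        self.local[key] = value               # refill level 1 for the next read
        return value


db_reads = []


def load_product(key):
    """Hypothetical database read; records each call so misses are visible."""
    db_reads.append(key)
    return {"id": key, "name": f"product-{key}"}


cache = TwoLevelCache(distributed_cache={}, load_from_db=load_product)
first = cache.get(42)       # misses both levels, reads the database
second = cache.get(42)      # served from level 1
cache.local.clear()         # simulate level 1 expiring
third = cache.get(42)       # served from level 2, no database read
print(db_reads)
```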

As a rule of thumb, caching is worth considering once the ratio reaches 1:4. (In theory, 1:2 is enough.)

Depending on the business characteristics, the following cache expiration policies can be used:

- Automatic cache expiration;
- Triggered cache expiration.

6.4 Single Sign-on (Distributed Session)

The system is split into multiple subsystems, and once they are deployed independently, session management becomes an unavoidable problem. The usual options are Session synchronization, Cookies, and distributed Session. E-commerce websites generally use distributed Session.

Going further, a complete single sign-on or account management system can be built on top of the distributed Session.

The process:

- When a user logs in for the first time, the session information (user Id and user information) is written into the distributed Session, for example with the user Id as the key;
- When the user logs in again, the distributed Session is fetched and checked for session information; if there is none, the user is redirected to the login page;
- Cache middleware is generally used for the implementation. Redis is recommended because it supports persistence, which makes it easy to reload session information from persistent storage after the distributed Session goes down;
- When saving a session, a duration can be set, for example 15 minutes; once it is exceeded, the session times out automatically.

Combined with cache middleware, the distributed Session simulates an ordinary server-side Session well.
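The flow above can be sketched with an in-memory stand-in for the cache middleware. The class and function names here are hypothetical, and the dictionary only imitates what a store such as Redis would do with a key that carries an expiry time; the clock is injectable so the timeout can be demonstrated without waiting.

```python
import time


class SessionStore:
    """In-memory stand-in for a distributed session store with a per-session timeout."""

    def __init__(self, ttl_seconds=15 * 60, clock=time.time):
        self._data = {}
        self._ttl = ttl_seconds
        self._clock = clock

    def save(self, user_id, info):
        """Write session info keyed by user id, expiring ttl_seconds from now."""
        self._data[user_id] = (info, self._clock() + self._ttl)

    def load(self, user_id):
        """Return session info, or None when it is missing or has timed out."""
        entry = self._data.get(user_id)
        if entry is None:
            return None
        info, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[user_id]
            return None
        return info


def handle_request(store, user_id):
    """Continue when a session exists, otherwise send the user to the login page."""
    if store.load(user_id) is not None:
        return "ok"
    return "redirect:/login"


now = [0.0]                                   # fake clock for the demonstration
store = SessionStore(ttl_seconds=15 * 60, clock=lambda: now[0])
before_login = handle_request(store, "u1")
store.save("u1", {"name": "alice"})
after_login = handle_request(store, "u1")
now[0] += 16 * 60                             # sixteen minutes later
after_timeout = handle_request(store, "u1")
print(before_login, after_login, after_timeout)
```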

6.5 Database Cluster (Read/Write Separation, Separate Databases and Tables)

Large websites need to store massive amounts of data. To achieve massive storage, high availability, and high performance, system design generally relies on redundancy. There are generally two approaches: read/write separation, and splitting databases and tables.

Read/write separation: this addresses scenarios where reads far outnumber writes. It can use one primary with one standby, one primary with multiple standbys, or multiple primaries with multiple standbys.

On the basis of service separation, this case combines database separation and read and write separation. The diagram below:

After service splitting:

- Each subsystem needs its own database;
- If a single database is too large, it can be split further according to business characteristics, for example into a product category database and a product database;
- After the databases are split, a table that holds a large amount of data can be split into multiple tables, generally by Id, time, and so on (the advanced approach is consistent hashing);
- Read/write separation is applied on top of the split databases and tables.

For related middleware, see Cobar (Alibaba, no longer maintained), TDDL (Alibaba), Atlas (Qihoo 360), and MyCat (built on Cobar by many strong engineers in China, and billed as China's number one open-source project).
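Splitting by Id usually comes down to a small routing function. The sketch below uses simple modulo routing, not consistent hashing; the database and table naming scheme (`user_db_N`, `user_N`) and the counts are assumptions chosen for illustration.

```python
def route(user_id, db_count=2, tables_per_db=4):
    """Map a user id to a (database, table) pair by modulo; names are hypothetical."""
    slot = user_id % (db_count * tables_per_db)
    db_index = slot // tables_per_db
    table_index = slot % tables_per_db
    return f"user_db_{db_index}", f"user_{table_index}"


print(route(8))    # slot 0: first database, first table
print(route(13))   # slot 5: second database, second table
```

Modulo routing is easy to reason about, but changing `db_count` later moves most rows, which is the problem consistent hashing is meant to reduce.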

The sequence, JOIN, and transaction problems that arise after splitting databases and tables will be covered in a separate post on that topic.

6.6 Servitization

Extract functions/modules common to multiple subsystems and use them as public services. For example, the membership subsystem in this case can be extracted as a common service.

6.7 Message Queues

A message queue decouples subsystems and modules and makes an asynchronous, highly available, high-performance system possible. It is standard equipment for distributed systems. In this case, the message queue is mainly used in the shopping and distribution steps.

- After the user places an order, the order is written to the message queue and a response is returned to the client immediately;
- Inventory subsystem: reads the message from the queue and reduces the inventory;
- Distribution subsystem: reads the message from the queue and arranges distribution.
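The order flow above can be sketched with the standard library `queue` module standing in for a real message broker. Each subscriber gets its own queue so that both the inventory and the distribution subsystem see every order; the names `place_order`, `drain`, `reduce_stock`, and `ship` are hypothetical.

```python
import queue

# One queue per subscribing subsystem, so each of them receives every order.
subscribers = {"inventory": queue.Queue(), "distribution": queue.Queue()}

stock = {"sku-1": 10}
shipments = []


def place_order(order):
    """Publish the order and return at once; no subsystem is called directly."""
    for q in subscribers.values():
        q.put(order)
    return "accepted"


def drain(name, handler):
    """Consume all pending messages for one subsystem; returns how many were handled."""
    q = subscribers[name]
    handled = 0
    while True:
        try:
            order = q.get_nowait()
        except queue.Empty:
            return handled
        handler(order)
        handled += 1


def reduce_stock(order):
    stock[order["sku"]] -= order["qty"]


def ship(order):
    shipments.append(order["order_id"])


reply = place_order({"order_id": 1, "sku": "sku-1", "qty": 2})
inventory_handled = drain("inventory", reduce_stock)
distribution_handled = drain("distribution", ship)
print(reply, stock, shipments)
```

In a real deployment the consumers would run in separate processes and the broker would handle delivery and persistence; the point here is only that placing the order does not wait for inventory or distribution.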

ActiveMQ, RabbitMQ, ZeroMQ, and MSMQ are currently widely used. You need to choose among them based on the specific business scenario. I suggest looking into RabbitMQ.

6.8 Other Architectures (Technologies)

In addition to the service splitting, application clustering, multi-level caching, single sign-on, database clustering, servitization, and message queues mentioned above, there are also CDN, reverse proxies, distributed file systems, big data processing, and other systems.

I will not go into detail here; you can search Baidu or Google, and I will share more if there is an opportunity.

7. Architecture summary

The above summarizes the architecture covered in this series. For details, please refer to the previous posts. Many areas can still be optimized and refined; because this is a case study, it mainly introduces the important parts. In real work, the architecture must be designed for the specific business scenario.

That concludes the three-part e-commerce website architecture case: starting from the requirements of an e-commerce website and a single-machine architecture, it evolved step by step into a commonly used distributed architecture prototype that can serve as a reference. Besides functional requirements, the site also has non-functional quality requirements (architectural goals) such as high performance, high availability, scalability, and extensibility.

Recently, I have been reading two books about large-scale website architecture: "Large-scale Website Technical Architecture: Core Principles and Case Analysis" by Li Zhihui, and "Large-scale Website System and Java Middleware Practice" by Zeng Xianjie.

I hoped to learn from these books how large sites are architected and what problems they run into. After reading them, I distilled two big questions:

1. Why does the technical architecture of the website evolve? Another way to think about it is why do websites get bigger?

2. What problems come up in the process of evolution? Put differently, what problems have to be solved in order to evolve?

Why does website technical architecture evolve

I have personally identified two driving forces behind the evolution of a website's technical architecture:

1. Internal drive: We expect to do better in our current business and develop more new business

2. External driving force: growth in the number of users and diversification of user types

The two driving forces are not independent; more often they act in parallel. I think Taobao is the result of both acting in parallel.

The reason for evolution is simple. But when should we evolve the technical architecture of a website, and how? To be honest, I have no experience in dealing with these questions, and in reality every enterprise faces different problems at the time. It is therefore hard for me to summarize from experience when the right moment for evolution is.

But I can approach the question from another angle: look at the internal and external structure of the site and find out where those structures might go wrong. Once you know or anticipate the problem points, you know how to evolve. Similarly, if you understand the structure of a PC, you know when to add memory and when to add disk.

So let’s first look at the external structure of the site:

The external structure is composed of the following parts:

U: the user group. How does our site evolve as our user base changes? For analyzing user groups, I can think of the following dimensions: number, type, and geographical location (region).

N: network environment. The network environment is different in every region. You can imagine why we need CDN. How do we evolve our site when we expect users in every region to have a good experience?

S: Security. How safe do we want to be? This is related to the current stage of the site and the nature of your site.

C: our website itself, which belongs to the internal structure.

Internal structure of the site:

Composition of internal structure:

A: Application services.

D: Data service

To sum up, these components provide a baseline for thinking about whether or how the site should evolve.

So why don’t we just design the site “big” from the start? “Do not attempt to design a large website,” Li Zhihui wrote in an afterword. “The reason is that the Internet operates according to its own laws, and the short history of the Internet has repeatedly proved that such attempts do not work.” “Large websites are not designed, they evolve,” he said. One reminder about this last statement: “not designed” does not mean “no design needed”.

As for “large website design”, my personal view is this: now that we have the “cloud” and computing can be bought, can I start by designing a large website, as long as the design adapts to the “cloud”?

What are the problems encountered in the process of evolution

– The initial stage

Start with a small website. One server is enough.

– Data services are separated from application services

More and more users mean more and more data, and a single server can no longer cope. We separate the data service from the application service, giving the application server better CPU and memory and the data server better and larger disks.

– Use cache

Because 80 percent of business access is focused on 20 percent of data, if we can cache that data, performance improves immediately. There are two types of caches: local caches and remote distributed caches. Which one do you use? Or both? I don’t know yet.

Here’s a question the book doesn’t address: What data should be cached? There should be some rules.

– Use a server cluster

When this server reaches its maximum capacity, it becomes a bottleneck. You can buy more powerful hardware, but there is always a ceiling. At this point, we need a cluster of servers. At this point, you have to add something new: a load-balancing scheduling server.

However, there is one issue to consider when using server clusters: Session management. Session management can be done in the following ways:

Session Sticky: as an analogy, if we want to use our own dishes every time we eat, we can keep them at one restaurant and simply go to that same restaurant every time.

The problem with this approach:

1. When a server restarts, all sessions on it are lost

2. The load balancer becomes a stateful machine, which makes disaster recovery (DR) difficult to implement

Session Replication: like keeping a set of our own dishes at every restaurant. It is not suitable for large clusters, only for a small number of machines.

Problems with this scheme:

1. Bandwidth consumption between application servers

2. A large number of online users occupy too much memory

Cookie-based: similar to bringing your own dishes and chopsticks with you every time you eat

Problems with this scheme:

1. Cookie length limit

2. Security

3. External bandwidth consumption of the data center

4. Performance impact: The server has more content to handle each request

Session server: it can also be clustered. This approach suits cases with large numbers of sessions and web servers.

Considerations for such a scheme are:

1. Ensure the availability of the session server

2. We need to make adjustments when we write the application. I don’t know if the application server can make this part of the logic transparent

– Database read/write separation

Some reads (uncached data, expired cache entries) and all writes still have to go through the database. When the number of users reaches a certain level, the database becomes the bottleneck. Here we use the hot standby function provided by the database to direct all read operations to the slave server. Note: read/write separation addresses read pressure.

Because database reads and writes are separated, our application has to change accordingly. We implement a data access module so that the authors of upper-level code do not need to know about read/write separation. Here I would like to know: if I use an ORM, how is read/write separation achieved?
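One way such a data access module could hide the routing is sketched below. It only decides which connection a statement should use: SELECT statements rotate over the replicas and everything else goes to the primary. The class name, the `pick` method, and the connection labels are hypothetical, and the sketch ignores transactions and replication delay.

```python
import itertools


class DataAccess:
    """Chooses a data source per statement so callers never see the split."""

    def __init__(self, primary, replicas):
        self._primary = primary
        # Fall back to the primary when no replica is configured.
        self._replicas = itertools.cycle(replicas or [primary])

    def pick(self, sql):
        """Return the data source a statement should run on."""
        words = sql.split()
        verb = words[0].upper() if words else ""
        if verb == "SELECT":
            return next(self._replicas)      # reads rotate over the replicas
        return self._primary                 # writes always go to the primary


dao = DataAccess(primary="primary-db", replicas=["replica-1", "replica-2"])
targets = [
    dao.pick("SELECT * FROM orders WHERE id = 1"),
    dao.pick("select name FROM users"),
    dao.pick("UPDATE orders SET status = 'paid' WHERE id = 1"),
    dao.pick("SELECT 1"),
]
print(targets)
```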

Database read/write separation may encounter the following problems:

- Data replication: consider replication delay, what the database supports, and the conditions replication requires. Do not forget that with multiple data centers this becomes even more of a problem.
- Routing from the application to the data sources.

– Use reverse proxies and CDN to speed up website responses

CDN solves the problem of access speed across regions. A reverse proxy caches resources requested by users inside the data center.

– Use a distributed file system

– Dedicated databases: data is split vertically.

This can solve some data writing problems

Problems encountered when splitting a database vertically:

- Cross-business transactions;
- Applications have more configuration items.

There are two ways to handle the transaction problem:

- Use distributed transactions;
- Remove the transactions, or do not pursue strong transactional guarantees.

– When the data volume or update volume of a business table reaches the bottleneck of a single database: horizontal data splitting

Split the data of the same table into two databases

Problems encountered in horizontal data splitting:

- SQL routing: we need to know which database a given user is stored in;
- Primary keys need a different generation strategy;
- Query performance, for example paging.

Use a search engine to solve the data query problem. Use NoSQL to improve performance. Develop a unified data access module to solve the data-source problem for upper-layer application development.

– Service splitting and application splitting

As web sites become more and more complex, it becomes impractical to build a single large application to do it all. From the management point of view, it is not convenient to manage. However, it is difficult to find a general model for business separation, which is a mixture of enterprise management issues and technical issues. At the same time, it is related to the specific situation of each enterprise.

But according to both books, the architecture ultimately becomes service-oriented, that is, SOA. How to implement SOA is a big topic that is beyond the scope of this article.

I took a screenshot from Cheng Li’s 2008 talk to illustrate what a post-SOA architecture might look like:

– Non-functional issues

– Security and monitoring problems

– Release issue: A new architecture means a new release method

– Data centers

Neither book says anything about multiple data centers. I have no experience here, but I can guess that all of the questions above may have to be reconsidered once there are multiple data centers.

– Organizational structure changes

Changes in our technical architecture will inevitably lead to changes in our organizational architecture, and vice versa.

It may seem that this part is not our responsibility, but I think technical staff should also take part in designing the organizational structure. For example, organizational design involves performance appraisal, and appraisal rules sometimes resemble the laws of a country. What happens if a country’s laws are not sound? You know.

We also have to consider the cost of learning the new architecture.

I am currently reading relevant books on this part, but I do not have a systematic understanding.

Conclusion:

– About the order of evolution

In reality, the evolution of a technical architecture does not necessarily follow the order outlined above from beginning to end; it depends on the circumstances.

– About traditional evolution versus modern evolution in a cloud environment

Unfortunately, only Li Zhihui mentions the cloud, and only in passing: “More and more websites are now built, from the very beginning, on cloud computing services provided by large websites. All the resources they need (computing, storage, network) can be bought on demand and scaled linearly, so they no longer have to piece together resources bit by bit and combine various technical solutions to improve their architecture step by step.”

Because I have not used the “cloud” for long enough, I cannot yet conclude whether a cloud architecture and a traditional non-cloud architecture differ in how they evolve.

When it comes to traditional architectural evolution, my own conclusions and reflections are as follows:

There are two main dimensions to consider when adjusting the architecture of a website: data services and application services. During the adjustment, it is necessary to identify which point is the bottleneck and which point has the highest priority for optimization. At the same time, and most importantly: although we are technical staff, we should also learn the business, so that we can tell business problems from technical problems and apply the right remedy. You have to understand that some problems are solved less effectively by technical means than by business means. 12306’s staggered ticket sales, which release tickets in separate time slots, are a case in point.