This post was originally published on InfoQ under the title "Talking Architecture: From 0 to 200 Servers in Five Years, the Wild Growth of a Startup's Architecture," via the InfoQ public account.
Founded in 2013, Beitchat is a platform for the parent-facing work of kindergartens in China. Through Internet products and customized solutions, it helps kindergartens with pain points in parent work such as showcasing activities, sending notifications, and communication, promoting harmony between home and school. Beitchat is the only brand in its space jointly invested in by Vtron (the first A-share preschool-education listing), Tsinghua Enlightenment, and NetEase. Within just a few years the user base reached tens of millions, and DAU grew by multiples every year. Facing such rapid growth, the original technical architecture struggled to support increasingly complex business scenarios, and system availability and stability put great pressure on Beitchat's technical team. How to choose an architecture appropriate to current needs, and how to keep it evolving smoothly, is therefore worth thinking through carefully.
Beitchat's architecture evolution has passed through three important historical stages.
Birth – Technical architecture selection V1.0
- Monolithic architecture: simple and clearly layered;
- Fast development, meeting the demands of rapid product iteration;
- No complex technology, so learning and operations costs stayed low; no dedicated operations staff was needed, which saved money.
Growth stage – Technical architecture refactoring V2.0
- Distributed deployment architecture, making the systems easy to scale;
- System-level split: business functions were carved into independent subsystems, and the DB was split along the same lines;
- Preliminary servitization, using Hessian for RPC between systems;
- DBs were physically isolated so that a single database failure could not cascade across services; primary/secondary replication with read/write splitting was also introduced;
- An MQ message queue was introduced to make messages and tasks asynchronous, speeding up interface responses and improving user experience. Some message-push tasks also became asynchronous, replacing the earlier MySQL-polling mechanism, which reduced push latency and increased push speed;
- SLB was used for Nginx load balancing. In the V1.0 period Nginx was a single-point deployment: if that one server failed, many business systems were affected. With SLB balancing traffic across multiple Nginx instances, we achieved high availability and removed the single point of failure.
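The Hessian-style RPC mentioned above comes down to calling a shared Java interface through a client-side proxy, so callers depend on an interface rather than another system's code. Below is a minimal stdlib sketch of that proxy idea; the `GreetingService` interface and the in-memory "transport" are illustrative, not Beitchat's actual code (a real Hessian client would serialize the call and POST it to a service URL).

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

public class RpcSketch {
    // Shared interface: both systems depend on this, not on each other's code.
    interface GreetingService {
        String greet(String name);
    }

    // Server-side implementation living in another subsystem.
    static class GreetingServiceImpl implements GreetingService {
        public String greet(String name) { return "Hello, " + name; }
    }

    // Client-side proxy: Hessian builds one of these from the interface plus
    // a URL; here the "transport" just dispatches to a local target.
    static GreetingService clientProxy(GreetingService remoteTarget) {
        InvocationHandler handler = (Object p, Method m, Object[] args) -> {
            // A real Hessian proxy would serialize the call over HTTP;
            // we invoke the target directly to keep the sketch runnable.
            return m.invoke(remoteTarget, args);
        };
        return (GreetingService) Proxy.newProxyInstance(
            GreetingService.class.getClassLoader(),
            new Class<?>[] { GreetingService.class },
            handler);
    }

    public static void main(String[] args) {
        GreetingService svc = clientProxy(new GreetingServiceImpl());
        System.out.println(svc.greet("Beitchat")); // prints Hello, Beitchat
    }
}
```

The key design point is that swapping the in-memory handler for an HTTP transport changes nothing on the calling side.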
Explosive growth – microservices architecture V3.0
- Dubbo is a mature, high-performance distributed framework used by many companies; it has withstood performance testing in all respects and is relatively stable;
- It integrates seamlessly with the Spring framework, on which our architecture is built, so adopting Dubbo was non-intrusive and convenient;
- It provides service registration, discovery, routing, load balancing, service degradation, weight adjustment, and other capabilities;
- The code is open source, so we can customize it, extend its functionality, and develop on top of it ourselves as needed.
- Service-centered: everything is a service, and each service encapsulates a single business function, ensuring functional integrity and a single responsibility;
- Loose coupling: services are functionally independent and independently deployable, depending on one another only through their interfaces;
- High scalability: resources are distributed and teams work in parallel, scaling is virtually unlimited, and code reuse is higher.
- Independent functional logic is split into microservices, each deployed and maintained independently;
- All system functions are implemented by calling microservices; systems may not access the DB directly;
- The Dubbo long-connection protocol is used for small-payload, high-concurrency calls, while the Hessian protocol is used for large payloads such as files, pictures, and videos;
- Each microservice maintains a separate DB.
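The Dubbo/Hessian protocol split described above is expressed in a provider's configuration by registering both protocols and assigning one per service. The fragment below is a hypothetical sketch; the application name, interface names, ports, and registry address are illustrative, not Beitchat's actual configuration.

```xml
<!-- Illustrative Dubbo provider configuration -->
<dubbo:application name="class-feed-service" />
<dubbo:registry address="zookeeper://127.0.0.1:2181" />

<!-- Long-lived dubbo connections for small, high-concurrency calls -->
<dubbo:protocol name="dubbo" port="20880" />
<!-- HTTP-based hessian for large payloads (files, pictures, videos) -->
<dubbo:protocol name="hessian" port="8080" />

<dubbo:service interface="com.example.feed.FeedService"
               ref="feedService" protocol="dubbo" />
<dubbo:service interface="com.example.media.MediaService"
               ref="mediaService" protocol="hessian" />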
- The class-feed ("class dynamics") microservice is transparent to business callers, who simply call its interface without worrying about implementation details;
- Code reusability: class-feed business logic is isolated in an independent microservice component, so business systems no longer scatter class-feed code or copy it around;
- DRDS was used to implement database and table sharding, removing the data bottleneck and limited processing capacity of a single database. With a single database, the large data volume and high concurrency often caused performance problems and very slow interface responses. After sharding, class-feed interface performance improved severalfold, the overall user experience is good, and there are no performance problems during high-concurrency periods.
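DRDS does this routing transparently, but the underlying idea of sharding databases and tables can be sketched as a deterministic mapping from a sharding key (say, a class or feed ID) to a physical database and table. The shard counts and naming scheme below are made up for illustration.

```java
public class ShardRouter {
    private final int dbCount;
    private final int tablesPerDb;

    public ShardRouter(int dbCount, int tablesPerDb) {
        this.dbCount = dbCount;
        this.tablesPerDb = tablesPerDb;
    }

    /** Physical database index for a sharding key such as a class ID. */
    public int dbIndex(long shardKey) {
        long totalShards = (long) dbCount * tablesPerDb;
        return (int) (Math.floorMod(shardKey, totalShards) / tablesPerDb);
    }

    /** Physical table index inside that database. */
    public int tableIndex(long shardKey) {
        return (int) Math.floorMod(shardKey, (long) tablesPerDb);
    }

    /** Fully qualified physical location, e.g. feed_db_03.feed_0001 */
    public String locate(String logicalTable, long shardKey) {
        return String.format("%s_db_%02d.%s_%04d",
            logicalTable, dbIndex(shardKey), logicalTable, tableIndex(shardKey));
    }

    public static void main(String[] args) {
        // 4 databases x 8 tables each = 32 physical shards
        ShardRouter router = new ShardRouter(4, 8);
        System.out.println(router.locate("feed", 12345L));
    }
}
```

Because the mapping is a pure function of the key, every service instance routes a given row to the same shard without coordination.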
- Code reusability: previously almost every business system contained scattered, copy-pasted user logic; after splitting out the user-passport microservice, business systems simply call its interface;
- User data consistency: code that read and modified user data used to be scattered across business systems, which often produced dirty user data and made it hard to trace which system had changed it; with different developers maintaining different systems, keeping user data consistent was a real challenge. After the user-passport microservice was split out, all user-related functionality is provided by it, guaranteeing consistent interfaces for modifying and reading user data.
- User data decoupling: business systems used to join user tables directly to fetch user data, which made splitting difficult. With the microservice split, the user database is designed and deployed independently, making capacity expansion and performance optimization easier.
- For automated deployment, each project's configuration includes the project name, administrator, project members, SVN/Git address and account, the shell script for service startup, custom scripts, per-environment JVM configuration, web container configuration, and so on.
- Once a project is configured, a release request can be submitted; after approval, deployment takes one click.
- Grayscale releases are supported: specific servers can be selected for a canary rollout, keeping releases safe and stable;
- Logs generated during deployment are collected in real time, so problems during a rollout can be monitored visually.
- For release failures we have an exception-handling mechanism: across multiple servers we can either stop on first failure, meaning that if one server fails to deploy the remaining servers stop, or continue deploying regardless of failures.
- Fast rollback: on a failed release we support quickly rolling back to the last stable version.
- Development environment: used by R&D for development and debugging;
- Test environment: after all functions are developed and self-tested in the development environment, they are deployed here for QA acceptance;
- Pre-release environment: after acceptance in the test environment, functions are previewed here before going to production. It shares the same database, cache, and MQ message queues as production and is used to check a microservice for bugs before it goes live, without affecting production users, ultimately ensuring the production release succeeds;
- Production environment: the live, user-facing environment.
- The Disconf distributed configuration management platform provides unified configuration publishing. All configuration is stored in the platform, and users publish and update it there in a unified way. A configuration change requires no repackaging and no microservice restart; it is made directly in the management console. All configuration information is encrypted to prevent leaking sensitive data such as accounts and passwords.
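Disconf pushes configuration updates to running clients. The essence of "no restart needed" is that clients hold a configuration snapshot that can be swapped atomically at runtime; the stdlib sketch below shows that idea (the `HotConfig` class and the `push.batchSize` key are illustrative, not Disconf's API).

```java
import java.util.Properties;
import java.util.concurrent.atomic.AtomicReference;

public class HotConfig {
    // Current configuration snapshot; readers never block on updates.
    private final AtomicReference<Properties> current =
        new AtomicReference<>(new Properties());

    public String get(String key, String fallback) {
        return current.get().getProperty(key, fallback);
    }

    /** Called when the config platform pushes a new version. */
    public void apply(Properties next) {
        current.set(next); // atomic swap: no repackage, no restart
    }

    public static void main(String[] args) {
        HotConfig cfg = new HotConfig();

        Properties v1 = new Properties();
        v1.setProperty("push.batchSize", "100");
        cfg.apply(v1);
        System.out.println(cfg.get("push.batchSize", "50")); // 100

        Properties v2 = new Properties();
        v2.setProperty("push.batchSize", "200");
        cfg.apply(v2); // takes effect immediately for subsequent reads
        System.out.println(cfg.get("push.batchSize", "50")); // 200
    }
}
```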
- The Elastic-Job distributed task scheduling platform provides a scheduled-task registry, task sharding, elastic scale-out, failover, task stop/resume, and disabling of task servers, making scheduled tasks easy to manage in a distributed architecture.
- A fully distributed deployment architecture; systems and microservice components are very easy to scale;
- Service-centered, with a comprehensive set of microservice components;
- Systems, microservice components, caches, MQ message queues, and DBs all avoid single points of failure and are all deployed for HA (high availability).
Future – Beitchat architecture evolution V4.0
- Docker container deployment. Docker is lightweight, fast to deploy, isolates applications, and runs across platforms, so microservices pair naturally with it. Although we have a microservices architecture today, we have not yet achieved fast, elastic scaling; combining microservices with Docker containers would let us add servers quickly and automatically at business peaks and reclaim them automatically off-peak. Next we will containerize our microservice components with Docker.
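Packaging a microservice component as a container image could look like the minimal Dockerfile below; the base image tag, jar name, and port are illustrative assumptions, not Beitchat's actual build.

```dockerfile
# Illustrative packaging of one microservice component
FROM openjdk:8-jre-alpine
WORKDIR /app
COPY target/class-feed-service.jar app.jar
# Example Dubbo service port
EXPOSE 20880
ENTRYPOINT ["java", "-jar", "app.jar"]
```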
- A unified API gateway. Our current core API layer is only a unified proxy; it lacks gateway functions such as identity authentication, protection against packet replay and data tampering, service authentication, and traffic and concurrency control. An API gateway will enable front-end/back-end separation; convenient monitoring, alerting, and analysis; and strict permission and rate management, keeping the API secure and stable. Next we will implement unified API gateway control;
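One of the gateway functions mentioned, concurrency control, can be sketched with a semaphore guarding each API: calls beyond the limit are rejected immediately instead of queuing up and overwhelming the backend. The limits and return values below are illustrative, not a specific gateway's API.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

public class ConcurrencyLimiter {
    private final Semaphore permits;

    public ConcurrencyLimiter(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    /**
     * Run the call if a permit is free; otherwise reject fast
     * (a real gateway would answer with HTTP 429).
     */
    public String handle(Supplier<String> call) {
        if (!permits.tryAcquire()) {
            return "REJECTED";
        }
        try {
            return call.get();
        } finally {
            permits.release(); // free the slot for the next caller
        }
    }

    public static void main(String[] args) {
        ConcurrencyLimiter limiter = new ConcurrencyLimiter(2);
        System.out.println(limiter.handle(() -> "OK")); // under the limit
    }
}
```

Failing fast like this protects downstream microservices: excess load is shed at the edge rather than propagated inward.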
- Cross-IDC deployment. Our systems are currently deployed in a single data center, which has no redundancy or disaster recovery. We will first move to multiple rooms in the same region, gaining the ability to survive a single-room failure, and then to cross-region IDC deployment, achieving geographic redundancy, high availability, and nearest-point access for users.