About Aliware
Aliware is Alibaba's middleware technology brand. It covers five main middleware products: EDAS, DRDS, MQ, ARMS, and CSB. Since 2007, Aliware has been through more than eight years of Double 11 promotions, and each promotion has pushed the product system to a higher level. Its open source products, such as JStorm, Dubbo, and RocketMQ, are popular both on GitHub and among Apache projects.
The origin of servitization
In 2007, Alibaba's R&D team had about 500 people, and its main business was Taobao.com. Each application was deployed as a single WAR package, built on the traditional Java EE application architecture with an Oracle database and a JBoss server. At that time the whole of Taobao consisted of two WAR packages: one was the front end, the Taobao site itself; the other was a back-end CRM system for customer support staff.
At that stage, we were faced with a lot of problems:
The first problem is that the development cost of the system is very high.
First, hundreds of people maintained one core project, so source code conflicts were severe and collaboration costs were high. Taobao at that time was a single WAR package: at runtime it was one project with one codebase. Whether with SVN in the past or with Git and other tools today, code conflicts in such a setup are unavoidable.
Second, the project release cycle was too long. Taobao in those days was a chimney-style website. At the bottom was the database; above it was a DAO layer responsible for database access for all business logic, and above that the business layer. The logic of all modules lived in one system and was deployed together, so slow development in a few modules delayed the release of the entire site.
Third, errors were difficult to isolate. This was the most critical problem at the time. For example, a change to a time-handling module or to a single if statement could break an entire campaign page and make the whole site unavailable.
The second problem is that the database capacity has reached its limit.
In its early days, Taobao used an Oracle database. A single machine could not provide enough Oracle connections, and single-machine IOPS had hit a bottleneck. The database CPU ran at 90% load every day, and the machine went down at least once a year.
The third problem is data silos.
At that time, the data of Taobao, Tmall, Juhuasuan, Wanwang, and other business systems were completely isolated from one another. The data were inconsistent and could not be reused, accounts were not unified, and related recommendations and big data analysis could not be carried out.
Formation of microservices architecture
After these three problems appeared, Taobao began to explore servitization. Starting in 2007, it made a series of changes toward a microservices architecture.
RPC framework: The core foundation of microservices architecture
Two frameworks sit at the bottom and the core of Alibaba's servitization. The first is Dubbo, which was created in 2010 and open-sourced in 2011. Alibaba has since developed its RPC framework to a third generation, known internally as HSF. More than 90% of applications currently use it, and it carries the Singles' Day promotion every year.
Message queuing: Asynchronous calls decouple the system
The RPC framework described above addresses one problem: once a website has been split into services, the modules need to call each other synchronously, and the RPC framework provides those calls. Some scenarios, however, do not need a synchronous call and can be handled asynchronously.
For example, mobile phone top-up on Taobao.com looks like a serial process, but it can be built asynchronously. First, a synchronous call ensures that the user's order has been completed on the e-commerce platform. Then message components decouple the rest asynchronously, so that steps which take a long time but are not on the core path do not occupy the user's main flow on the website or app, which improves the user experience.
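The split between the synchronous order step and the asynchronous top-up step can be sketched as follows. This is a minimal in-process illustration, not Alibaba's implementation: a standard-library queue stands in for the message component, and the names `place_order`, `recharge_worker`, and the order fields are hypothetical.

```python
import queue
import threading

# Hypothetical in-process stand-in for a message component.
message_queue = queue.Queue()
recharged = []  # phone numbers topped up by the background consumer


def place_order(order_id, phone, amount):
    """Synchronous step: confirm the order, then hand the slow step to the queue."""
    order = {"id": order_id, "phone": phone, "amount": amount, "status": "PAID"}
    message_queue.put(order)  # returns immediately; the top-up happens later
    return order


def recharge_worker():
    """Asynchronous step: consume messages and perform the slow top-up."""
    while True:
        order = message_queue.get()
        if order is None:  # sentinel value: stop the worker
            message_queue.task_done()
            break
        recharged.append(order["phone"])  # stands in for the slow carrier call
        message_queue.task_done()


worker = threading.Thread(target=recharge_worker, daemon=True)
worker.start()

result = place_order("o-1", "13800000000", 50)  # the user sees this result at once
message_queue.put(None)
message_queue.join()  # demo only: wait until the worker has drained the queue
```

The caller gets its order confirmation without waiting for the top-up, which is the point of the decoupling. A real message component would also persist the message so that it survives a crash.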
Based on this, our message middleware addresses three main categories of problems.
The first is reliable synchronous sending, where messages are reliable and ordered; it is used wherever the transaction link needs higher stability. The second is reliable asynchronous sending: when both stability and throughput are required, asynchronous logic can be adopted, and with asynchronous feedback the message middleware can redeliver messages to ensure stability. The last is one-way sending, which focuses on throughput rather than stability.
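The difference between redelivery and one-way sending can be illustrated with a small sketch. It assumes a consumer is a plain callable that returns True to acknowledge a message; `deliver_with_retry`, `send_one_way`, and `FlakyConsumer` are hypothetical names for illustration and not the API of any real message product.

```python
def deliver_with_retry(message, consumer, max_attempts=3):
    """Redeliver until the consumer acknowledges or the attempts run out.

    Returns the number of attempts used, or None if delivery never succeeded.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            if consumer(message):  # True means "acknowledged"
                return attempt
        except Exception:
            pass  # treat an exception as a failed delivery and try again
    return None


def send_one_way(message, consumer):
    """One-way: deliver once and ignore the outcome, favouring throughput."""
    try:
        consumer(message)
    except Exception:
        pass


class FlakyConsumer:
    """Hypothetical consumer that fails its first `failures` deliveries."""

    def __init__(self, failures):
        self.failures = failures
        self.seen = []

    def __call__(self, message):
        self.seen.append(message)
        if self.failures > 0:
            self.failures -= 1
            raise RuntimeError("temporary failure")
        return True
```

Because a message may arrive more than once under redelivery, the consumer in such a design generally needs to be idempotent.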
Mass configuration push
After the servitization split, the configuration used by each service needs to be managed centrally. We therefore developed a reliable configuration push service that completes a configuration push within milliseconds and supports querying the change history and push trajectory.
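The idea of central configuration with push and change history can be sketched in a few lines. This is an in-memory illustration under our own assumptions, not the actual push service: the class `ConfigCenter` and its methods are hypothetical, and a real service would push over the network and persist its history.

```python
import time


class ConfigCenter:
    """Hypothetical in-memory configuration center with push and history."""

    def __init__(self):
        self._values = {}
        self._listeners = {}  # key -> list of callbacks
        self._history = {}    # key -> list of (timestamp, old_value, new_value)

    def subscribe(self, key, callback):
        """Register a callback that receives (key, value) on every change."""
        self._listeners.setdefault(key, []).append(callback)

    def publish(self, key, value):
        """Store the new value, record the change, and push it to subscribers."""
        old = self._values.get(key)
        self._values[key] = value
        self._history.setdefault(key, []).append((time.time(), old, value))
        for callback in self._listeners.get(key, []):
            callback(key, value)

    def history(self, key):
        """Return a copy of the change history for one key, oldest first."""
        return list(self._history.get(key, []))


center = ConfigCenter()
received = []
center.subscribe("order.timeout.ms", lambda k, v: received.append((k, v)))
center.publish("order.timeout.ms", 3000)
center.publish("order.timeout.ms", 1500)
```

Each service subscribes to the keys it uses, so a change made once in the center reaches every subscriber without a redeploy.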
Three-dimensional monitoring
Monitoring is essential to the overall performance of the system. We therefore collect information at different levels to achieve three-dimensional monitoring of applications, covering the following three aspects:
System resources: load, CPU, memory, disk, network
Containers: heap memory, class loading, thread pools, connectors
Services: response time, throughput, critical link analysis
Service monitoring
In the original centralized architecture, each page passed through many modules that were all coupled in one system. Monitoring could only see the surface symptoms, so when a page opened slowly it was impossible to know which module or function was responsible. Now we can monitor the real-time invocation of each service interface and method, and monitor in detail the life cycle and runtime metrics of each service. We can also collect statistics on call QPS and response time and quickly detect changes in system traffic.
Taobao carried out a complete servitization transformation around the EDAS technology system. During the transformation, the data with the highest degree of reuse was split out first: the user center was stripped out as a shared service layer that carries the user-related logic of all upper-layer businesses. This was followed by the Thousand Island Lake project and the Multicolored Stones project, each of which produced a series of spun-off service centers. After six to seven years of servitization, the number of service centers has reached more than 50.
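Per-method statistics of this kind can be sketched as follows. The sketch assumes each call is reported with a timestamp in seconds and an elapsed time in milliseconds; `ServiceStats` and its method names are hypothetical and only illustrate the bookkeeping.

```python
from collections import defaultdict


class ServiceStats:
    """Hypothetical collector of per-method call counts and response times."""

    def __init__(self):
        # (service, method) -> list of (whole_second, elapsed_ms)
        self._calls = defaultdict(list)

    def record(self, service, method, timestamp, elapsed_ms):
        """Record one finished call; the timestamp is truncated to its second."""
        self._calls[(service, method)].append((int(timestamp), elapsed_ms))

    def qps(self, service, method, second):
        """Number of calls to the method that fell within the given second."""
        return sum(1 for s, _ in self._calls[(service, method)] if s == second)

    def avg_response_ms(self, service, method):
        """Mean response time over all recorded calls, or 0.0 if there are none."""
        calls = self._calls[(service, method)]
        if not calls:
            return 0.0
        return sum(ms for _, ms in calls) / len(calls)


stats = ServiceStats()
stats.record("UserService", "getUser", 100.2, 10)
stats.record("UserService", "getUser", 100.7, 30)
stats.record("UserService", "getUser", 101.1, 20)
```

Keeping the key at the level of service and method is what makes it possible to say which call is slow, rather than only that the page is slow.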
The picture shows Alibaba's core service architecture. Independent innovation led the way out of the technical dilemma and produced a large body of mature middleware technology. The bottom layer consists of shared middleware and components, which have also been turned into Alibaba Cloud technical support products. The shared service system breaks the “chimney” style of application construction and supports rapid business innovation. Cloud infrastructure supports business growth efficiently, and its elastic scalability brings large cost savings.
Large-scale servitization challenges
As the servitization split proceeds, the number of systems keeps growing. In the figure, the arrows point to the service centers at the bottom, and the upper layer is the front-end business systems that call them. Many systems call many service centers, and no architect can sort out the service dependencies and the architecture by hand.
EDAS Hawk-eye monitoring system
When investigating online problems, we did not need a system that solved the problems for us quickly and intelligently; it was enough to have one that helped us locate them quickly. So Alibaba developed the EDAS Hawk-eye monitoring system.
As can be seen from top to bottom in the figure, the Hawk-eye monitoring system locates a fault very quickly and, through visualization, shows which log on which machine caused the problem. This is the first thing Hawk-eye does.
What is the second thing Hawk-eye does? When we put all similar request call links together for analysis, we can collect data in a very short time and put that data to work. Peak QPS refers to the maximum QPS, measured at minute-level granularity, that a service reached during today's business peak. As the markup in the figure shows, a page that is exposed at the front is not necessarily the one under the most pressure; this is the value of data visualization. We also need data to help make decisions: its biggest value is that it tells us accurately where the maximum pressure point is.
When a page is opened through a series of system calls, some point along the way will eventually have a problem; this is called the point of failure. We can see visually which component had the highest error rate across all requests in the past day and fix it accordingly.
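Locating a fault from a call chain can be sketched as below. The sketch assumes every call is recorded as a span carrying a trace id, its depth in the call chain, the service, the host, and a log reference; these field names and the function `locate_fault` are hypothetical. It also assumes the deepest failed span is the likeliest origin, since callers usually fail because a callee failed.

```python
def locate_fault(spans, trace_id):
    """Return (service, host, log) of the deepest failed span in one trace.

    Returns None when the trace has no failed span.
    """
    failed = [s for s in spans if s["trace_id"] == trace_id and not s["ok"]]
    if not failed:
        return None
    deepest = max(failed, key=lambda s: s["depth"])
    return deepest["service"], deepest["host"], deepest["log"]


spans = [
    {"trace_id": "t1", "depth": 0, "service": "detail-page", "host": "web-01",
     "log": "web.log", "ok": False},
    {"trace_id": "t1", "depth": 1, "service": "item-center", "host": "item-07",
     "log": "item.log", "ok": False},
    {"trace_id": "t1", "depth": 2, "service": "user-center", "host": "user-03",
     "log": "user.log", "ok": True},
    {"trace_id": "t2", "depth": 0, "service": "detail-page", "host": "web-02",
     "log": "web.log", "ok": True},
]
```

In this sample data the page and the item service both failed in trace t1, and the sketch points at the item service on its host as the place to look first.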
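Both statistics can be derived from the same call records, as in this sketch. It assumes each record has a service name, a timestamp `ts` in seconds, and an `ok` flag, and it reads minute-level peak QPS as the call count of the busiest minute divided by 60; the record layout and both function names are hypothetical.

```python
from collections import Counter


def peak_qps(records, service):
    """Average QPS of the busiest minute for one service (0.0 if no calls)."""
    per_minute = Counter(r["ts"] // 60 for r in records if r["service"] == service)
    if not per_minute:
        return 0.0
    return max(per_minute.values()) / 60


def worst_component(records):
    """Service with the highest error rate, or None if there are no records."""
    total = Counter(r["service"] for r in records)
    if not total:
        return None
    errors = Counter(r["service"] for r in records if not r["ok"])
    return max(total, key=lambda s: errors[s] / total[s])


records = [
    {"service": "item", "ts": 0, "ok": True},
    {"service": "item", "ts": 30, "ok": True},
    {"service": "item", "ts": 59, "ok": False},
    {"service": "item", "ts": 61, "ok": True},
    {"service": "cart", "ts": 5, "ok": False},
    {"service": "cart", "ts": 65, "ok": True},
]
```

Ranking by error rate rather than by raw error count keeps a low-traffic but badly broken component from hiding behind a busy healthy one.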
EDAS capacity planning
How does Alibaba do capacity planning internally? First, we generate traffic and use real traffic to stress-test the performance of a single machine. Then, based on the configured operating water level, we calculate the maximum capacity the system can bear, so that machines can finally be brought online and offline as needed. Together, these systems provide overall capacity planning. Every single-machine stress test has target indicators. When we drain the traffic of half of the machines in a cluster onto the other half, the QPS on each remaining machine doubles; if a single machine has still not reached the operating water level, we continue to drain traffic onto it until the target is reached.
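The arithmetic behind this procedure can be sketched as follows. The sketch assumes traffic is spread evenly across serving machines and that the operating water level is expressed as a target QPS per machine; the function names and the numbers in the usage lines are made up for illustration.

```python
import math


def machines_needed(peak_qps, qps_per_machine_at_water_level):
    """Machines required so that no machine exceeds its water-level QPS."""
    return math.ceil(peak_qps / qps_per_machine_at_water_level)


def drain_until_water_level(total_qps, machines, target_qps_per_machine):
    """Halve the serving machines until per-machine QPS reaches the target.

    Each halving roughly doubles the QPS on the machines that keep serving.
    Returns (machines_still_serving, qps_per_machine).
    """
    serving = machines
    while serving > 1 and total_qps / serving < target_qps_per_machine:
        serving = math.ceil(serving / 2)
    return serving, total_qps / serving


# 16 machines share 8000 QPS (500 each); the test target is 1800 per machine.
serving, per_machine = drain_until_water_level(8000, 16, 1800)
# With 1800 QPS per machine measured, size the cluster for a 100000 QPS peak.
required = machines_needed(100000, 1800)
```

The first function is the measurement step on live traffic, and the second turns the measured single-machine figure into a machine count for an expected peak.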
EDAS traffic limiting degradation
Throughout the Double 11 period, different services are core and non-core at different points in time. For example, at midnight on Double 11, the traffic peak comes mostly from the payment links, so at that stage all resources should lean toward transactions and payment. By the time users wake up the next morning, logistics is central. So the core and non-core parts of a website need to be viewed from a business perspective. EDAS has a visual configuration interface for specifying which service is the core service at a given stage, so that the core service can use more of the underlying resources, while at other points in time it is restricted.
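Stage-dependent limiting of this kind can be sketched as a quota that depends on whether a service is core at the moment. This is a simplified illustration and not the EDAS feature itself: the class `StageLimiter`, its quotas, and the service names are hypothetical, and the counters stand for a single time window.

```python
class StageLimiter:
    """Hypothetical limiter: core services of the current stage get more quota."""

    def __init__(self, core_quota, non_core_quota):
        self.core_quota = core_quota
        self.non_core_quota = non_core_quota
        self.core_services = set()
        self._used = {}  # service -> requests admitted in the current window

    def set_stage(self, core_services):
        """Switch stage: declare which services are core and reset the counters."""
        self.core_services = set(core_services)
        self._used = {}

    def try_acquire(self, service):
        """Admit the request if the service is still under its quota."""
        if service in self.core_services:
            quota = self.core_quota
        else:
            quota = self.non_core_quota
        used = self._used.get(service, 0)
        if used >= quota:
            return False  # over quota: the caller should degrade or reject
        self._used[service] = used + 1
        return True


limiter = StageLimiter(core_quota=3, non_core_quota=1)
limiter.set_stage({"payment"})  # midnight: payment is the core service
midnight_payment = [limiter.try_acquire("payment") for _ in range(4)]
midnight_logistics = [limiter.try_acquire("logistics") for _ in range(2)]
limiter.set_stage({"logistics"})  # next morning: logistics becomes core
morning_logistics = [limiter.try_acquire("logistics") for _ in range(3)]
```

Changing the set of core services is the only switch needed between stages, which mirrors configuring the core service per stage instead of redeploying.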