Author: LAN Jiangang

Editor: Qi Lin Gao Yang


Most websites probably start out as an idea: a quick model of a business, produced as fast as possible. "Fast" is the first priority, and you don't need to spend much energy on architecture.


As the site grows, more effort has to go into the architecture so it can carry the traffic when the business takes off. Ele.me, founded eight years ago, now handles more than 9 million orders per day, and we have built a relatively complete website architecture along the way.


Website infrastructure

Early on, we adopted an SOA framework that made it easier for us to scale. We use the SOA framework to solve two problems:

1. Division of labor and collaboration

In the beginning, a site may have only one to five programmers, and it is fine for everyone to work on the same thing. We all understand each other's work and often solve problems just by shouting across the room.


But as the team grows, it is no longer feasible for one person to update the code and redeploy everyone else's code along with it. So we have to think seriously about division of labor and collaboration.


2. Scale quickly

Early on, order volume might grow from 1,000 to 10,000 per day. That is a 10x increase, but the absolute numbers are still small and the pressure on the site is modest. The real challenge comes when order volume goes from 100,000 to 1 million, and from 1 million to 2 million: the multiplier is no larger, but it puts enormous pressure on the architecture of the whole site.


For context, our daily order volume has grown from 1 million in 2014 to 9 million today, and the technical team has grown from just over 30 people to more than 900. At that scale, division of labor becomes a huge challenge. Integrating services and integrating teams both need to be supported by a framework, and that is one of the roles of the SOA framework.


Looking at our architecture diagram, the middle shows the overall architecture of the site, and the right side shows the infrastructure related to servitization, including the underlying components and services.



Let's start with language. Our original site was written in PHP, and we gradually migrated away from it.


The founders were college students, so Python was a natural choice. Python is still a good choice today, so why expand to Java and Go?


Many people can write Python, but few can write it really well. As the business grows, more developers are needed. Given the maturity of the Java ecosystem and the rise of Go, we settled on an ecosystem where Python, Java, and Go coexist.


The WebAPI layer mainly performs general operations unrelated to business logic, such as HTTPS offloading (SSL termination), rate limiting, and security checks.
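To make "business-agnostic work at the edge" concrete, here is a minimal token-bucket rate limiter sketch in Python. The class, parameters, and handler are my own illustration of the kind of check such a layer can apply, not a description of Ele.me's actual WebAPI implementation.

```python
import time
import threading


class TokenBucket:
    """Minimal token-bucket rate limiter, the kind of check a WebAPI
    gateway can apply before any business logic runs."""

    def __init__(self, rate, capacity):
        self.rate = rate            # tokens refilled per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def allow(self):
        """Return True if the request may pass, False if it should be rejected."""
        with self.lock:
            now = time.monotonic()
            # Refill according to elapsed time, capped at capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False


limiter = TokenBucket(rate=100, capacity=200)   # ~100 req/s, bursts of up to 200

def handle_request(request):
    if not limiter.allow():
        return 429, "Too Many Requests"   # rejected at the gateway; the service never sees it
    # ... forward to the backend service here ...
    return 200, "OK"
```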


The Service Orchestrator is a service orchestration layer that, driven by configuration, handles protocol conversion between the internal and external networks, as well as service aggregation and response trimming.
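As a toy illustration of "aggregation and trimming by configuration": one external endpoint fans out to several internal services and the merged result is trimmed to an allowed field list. The endpoint names, RPC names, and fields below are hypothetical; the real orchestration layer is far richer.

```python
# Toy, configuration-driven orchestrator. All names are made up for illustration.
ORCHESTRATION_CONFIG = {
    "/api/order_detail": {
        "calls": ["order.get", "user.get", "shop.get"],                  # internal RPCs to aggregate
        "expose": ["order_id", "status", "user_name", "shop_name"],      # fields the caller may see
    }
}


def call_internal(rpc_name, params):
    """Stand-in for an internal RPC call; returns fake data for the sketch."""
    fake = {
        "order.get": {"order_id": params["order_id"], "status": "DELIVERED", "cost": 12.5},
        "user.get": {"user_name": "alice", "phone": "138****0000"},
        "shop.get": {"shop_name": "Noodle House", "settlement_rate": 0.18},
    }
    return fake[rpc_name]


def handle(endpoint, params):
    conf = ORCHESTRATION_CONFIG[endpoint]
    merged = {}
    for rpc in conf["calls"]:               # aggregate several internal services
        merged.update(call_internal(rpc, params))
    # Trim internal-only fields before the response leaves the intranet.
    return {k: v for k, v in merged.items() if k in conf["expose"]}


print(handle("/api/order_detail", {"order_id": "42"}))
```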


On the right side of the architecture diagram are the supporting systems that surround the servitization framework, such as the Job system for running scheduled tasks. We have close to 1,000 services; how are they all monitored? You need a monitoring system. In the beginning, when there were only 30 of us, it was fine to log into a machine and grep the logs. With more than 900 people, you can no longer search logs machine by machine, so you need a centralized logging system. I won't go into the other systems here.


Rome wasn’t built in a day. Infrastructure is a process of evolution. We have limited energy, so what do we do first?


Service split

As the site got bigger, the original architecture couldn't keep up. The first thing we needed to do was:

Split the big repo into small repos, split the big services into small services, and split our centralized base services out onto separate physical machines.

The service split alone took more than a year to complete, which is a relatively long process.


The process starts with defining the API well, because once your API is live, the cost of changing it is high. Many people depend on your API, and often you don't even know who they are, which is a big problem.

Then you abstract out the underlying services. Many of them were originally coupled inside the business code. Take payment as an example: when the business is simple, tightly coupled code is fine, but as more and more businesses need payment, do you really want to implement payment again for each one? So we extract those basic services, such as the payment service, the SMS service, the push service, and so on.
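A minimal sketch of what the extraction looks like from the business side: instead of carrying its own inline payment logic, each business depends on a thin client for a shared payment service. The interface and names below are hypothetical, purely to illustrate the shape of the change.

```python
class PaymentClient:
    """Thin client for a shared payment base service exposed through the service framework."""

    def __init__(self, rpc_call):
        self._rpc = rpc_call   # injected transport: rpc_call(method, **kwargs) -> dict

    def charge(self, order_id, amount_cents, channel):
        return self._rpc("payment.charge",
                         order_id=order_id,
                         amount_cents=amount_cents,
                         channel=channel)

    def refund(self, order_id, amount_cents):
        return self._rpc("payment.refund", order_id=order_id, amount_cents=amount_cents)


# A takeaway order, a membership purchase, and any future business all reuse the
# same client instead of each keeping its own copy of the payment code.
def fake_rpc(method, **kwargs):
    return {"method": method, "ok": True, **kwargs}

payments = PaymentClient(fake_rpc)
print(payments.charge(order_id="42", amount_cents=2500, channel="wallet"))
```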

Splitting services may look simple and unglamorous, but it is exactly what we did first. During that period, almost everything else in the architecture could be deferred: not doing the architectural fine-tuning won't kill you, but not splitting the services will.


Splitting services is bound to be a long and painful process. It is a piece of systems engineering that requires many supporting systems.


Release system

Releases are the biggest source of instability. Many companies have strict release windows, for example:

  • Releases are allowed on only two days a week;

  • No releases on weekends;

  • No releases during business peaks;

  • And so on…


We found that the biggest problem with releases was that there was no easy way to roll back afterwards. Who performs the rollback? The person who released, or someone dedicated to it? If it's the releaser, they aren't online 24 hours a day, so what happens when something breaks and nobody can be reached? If a dedicated person does it, but there is no simple, uniform rollback procedure, that person would have to be familiar with the releaser's code, which is practically impossible.


Therefore, we need a release system that defines a uniform rollback operation, and all services must follow that definition.

At Ele.me, connecting to the release system is mandatory: every system must go through it. The release system is critically important, and it belongs at the front of the company's priority queue.
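Here is a minimal sketch of the idea "every service obeys the same rollback contract": the release system only ever calls deploy() and rollback(), never service-specific scripts, so any on-call engineer can roll any service back. The class, methods, and behavior are illustrative, not a description of Ele.me's actual release system.

```python
class ReleaseUnit:
    def __init__(self, service_name, artifacts):
        self.service_name = service_name
        self.artifacts = artifacts       # version -> artifact tag/path
        self.history = []                # successfully deployed versions, newest last

    def deploy(self, version):
        # In reality: push the artifact, restart instances gradually, run health checks.
        print(f"deploying {self.service_name} {version} ({self.artifacts[version]})")
        self.history.append(version)

    def rollback(self):
        if len(self.history) < 2:
            raise RuntimeError("nothing to roll back to")
        bad = self.history.pop()         # drop the faulty release
        previous = self.history[-1]
        print(f"rolling {self.service_name} back from {bad} to {previous}")


unit = ReleaseUnit("order-service", {"1.0.0": "order:1.0.0", "1.1.0": "order:1.1.0"})
unit.deploy("1.0.0")
unit.deploy("1.1.0")
unit.rollback()   # one uniform operation, whoever happens to be on call
```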


Service framework

Next came Ele.me's service framework. Splitting a large repo into small repos and a large service into small services, so that each service is as independent as possible, requires a distributed service framework to support it.


The distributed service framework covers service registration, discovery, load balancing, routing, flow control, circuit breaking, degradation, and other functions, which I won't expand on here. As mentioned earlier, Ele.me is a multilingual ecosystem with Python, Java, and Go, so our servitization framework is also multilingual. This influenced our later choices of middleware, such as the DAL layer.
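As a sketch of the registration/discovery/load-balancing trio, here is a toy in-memory registry. A real framework would back this with a shared registry (ZooKeeper, etcd, or similar) plus health checks; the class and addresses below are illustrative only.

```python
import itertools
from collections import defaultdict


class Registry:
    def __init__(self):
        self._instances = defaultdict(list)   # service name -> list of "host:port"
        self._cursors = {}                    # service name -> round-robin iterator

    def register(self, service, address):
        self._instances[service].append(address)
        self._cursors[service] = itertools.cycle(self._instances[service])

    def deregister(self, service, address):
        self._instances[service].remove(address)
        self._cursors[service] = itertools.cycle(self._instances[service])

    def discover(self, service):
        """Pick one instance, round-robin; a real client would also filter by health."""
        return next(self._cursors[service])


registry = Registry()
registry.register("payment", "10.0.0.1:8000")
registry.register("payment", "10.0.0.2:8000")
print(registry.discover("payment"))   # 10.0.0.1:8000
print(registry.discover("payment"))   # 10.0.0.2:8000
```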


DAL data access layer

As the volume of business grows, the database becomes a bottleneck.


In the early stage, database performance can be improved by upgrading the hardware, for example:

  • Upgrade to a machine with more CPUs;

  • Replace the hard drives with SSDs or something better.

But hardware upgrades eventually hit a ceiling. On top of that, many business developers operate the database directly in their code, and more than once a newly launched service has hammered the database to death. Once the database is down, there is nothing you can do to restore the service except wait for the database to recover.


As long as the data in the database is intact, the business can usually be compensated afterwards. So when we built the DAL service layer, the first thing we did was rate limiting; everything else could wait. The next thing was connection reuse, because our Python framework uses a multi-process, single-thread-plus-coroutine model.


Separate processes cannot share a connection. For example, if 10 Python processes are deployed on one machine and each holds 10 database connections, then with 10 machines you already have 1,000 database connections. Connections are expensive for a database, so our DAL layer has to do connection multiplexing.


This multiplexing happens in the DAL layer, not in the service itself: the services may hold 1,000 connections to the DAL, but after multiplexing the DAL may maintain only a dozen or so connections to the database. When the DAL detects that a request is part of a transaction, it pins that client to a database connection for you; when the transaction ends, the connection is returned to the pool for others to use.
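A simplified sketch of that idea: many client connections share a small pool of real database connections, and a client is pinned to one connection only while a transaction is open. Everything here (class, SQL detection, the fake connection objects) is illustrative, not the DAL's actual implementation.

```python
import queue


class DALConnectionPool:
    def __init__(self, db_connections):
        self._pool = queue.Queue()
        for conn in db_connections:            # e.g. a dozen real DB connections
            self._pool.put(conn)
        self._pinned = {}                      # client_id -> connection held by an open txn

    def execute(self, client_id, sql):
        """Return the database connection this statement would run on."""
        if client_id in self._pinned:          # inside a transaction: reuse the pinned conn
            return self._pinned[client_id]

        conn = self._pool.get()                # outside a transaction: borrow briefly
        try:
            if sql.strip().upper().startswith("BEGIN"):
                self._pinned[client_id] = conn  # pin until COMMIT/ROLLBACK
            return conn                         # in reality, the SQL runs here
        finally:
            if client_id not in self._pinned:
                self._pool.put(conn)            # give it straight back if not in a txn

    def end_transaction(self, client_id):
        conn = self._pinned.pop(client_id)      # COMMIT/ROLLBACK seen: release the pin
        self._pool.put(conn)


pool = DALConnectionPool([f"db-conn-{i}" for i in range(12)])
print(pool.execute("client-1", "BEGIN"))              # pins a connection to client-1
print(pool.execute("client-1", "UPDATE orders ..."))  # same pinned connection
pool.end_transaction("client-1")                      # connection returns to the pool
```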


Then we added circuit breaking. Databases can be circuit-broken too: when the database starts to "smoke" (show signs of overload), we drop some of the database requests to keep the database from going down completely.
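A minimal circuit-breaker sketch for database access: when too many recent calls fail or time out, further requests are shed for a cooling-off period so the database can recover. The thresholds and class names are illustrative only.

```python
import time
from collections import deque


class DBCircuitBreaker:
    def __init__(self, max_failures=20, window=60, cool_down=30):
        self.failures = deque()       # timestamps of recent failures
        self.max_failures = max_failures
        self.window = window          # seconds over which failures are counted
        self.cool_down = cool_down    # seconds to reject requests once open
        self.opened_at = None

    def allow(self):
        now = time.monotonic()
        if self.opened_at is not None:
            if now - self.opened_at < self.cool_down:
                return False          # breaker open: shed this database request
            self.opened_at = None     # cool-down over: let traffic through again
            self.failures.clear()
        return True

    def record_failure(self):
        now = time.monotonic()
        self.failures.append(now)
        while self.failures and now - self.failures[0] > self.window:
            self.failures.popleft()
        if len(self.failures) >= self.max_failures:
            self.opened_at = now      # too many recent failures: open the breaker


breaker = DBCircuitBreaker()
if breaker.allow():
    try:
        pass                          # run the database query here
    except Exception:
        breaker.record_failure()
```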


Service governance

After the service framework comes service governance, which is a big topic in itself. The first part is instrumentation ("burying points"): you have to add many, many monitoring points.


For example: for each request, did it succeed or fail, and what was its response time? All of these metrics go into the monitoring system. We have a big monitoring wall with a large number of indicators, and a team watching it around the clock; if a curve fluctuates, someone is called in to fix it. The other part is the alerting system. A monitoring wall can only show so much, just the key indicators, and that is where an alerting system comes in.
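A small sketch of what "burying points" looks like in code: wrap each request so that success/failure counts and response times are recorded and can later be shipped to the monitoring system. The metric names and the in-memory store are illustrative only.

```python
import time
from collections import defaultdict

COUNTERS = defaultdict(int)       # e.g. "order.create.success" -> count
LATENCIES = defaultdict(list)     # e.g. "order.create" -> list of response times in ms


def instrumented(name, func, *args, **kwargs):
    """Run func, recording success/failure counts and latency under the given metric name."""
    start = time.monotonic()
    try:
        result = func(*args, **kwargs)
        COUNTERS[f"{name}.success"] += 1
        return result
    except Exception:
        COUNTERS[f"{name}.failure"] += 1
        raise
    finally:
        LATENCIES[name].append((time.monotonic() - start) * 1000)


def create_order(order_id):
    return {"order_id": order_id, "status": "CREATED"}


instrumented("order.create", create_order, "42")
print(dict(COUNTERS), {k: len(v) for k, v in LATENCIES.items()})
```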


Rome wasn't built in a day; infrastructure is a process of evolution.


We have limited resources and time. As architects and CTOs, how do we deliver the most important things with those limited resources?


We have built a lot of systems and thought we were doing great, but we weren't. It sometimes feels like we are still in the Stone Age: there are more and more problems and more and more requirements, there is always something missing from your system, and there are always more features you want to build.


Take the flow control system, for example. Today the user still has to configure a concurrency limit by hand. Does that number really need manual configuration? Could we control concurrency automatically based on the state of the service itself?
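One possible direction for removing the hand-configured number, sketched here purely as an illustration (this is not Ele.me's implementation): adjust the limit automatically from observed latency, AIMD-style, backing off when the service is slow and probing upward when it is healthy.

```python
class AdaptiveConcurrencyLimit:
    def __init__(self, initial=50, floor=10, ceiling=1000, target_ms=200):
        self.limit = initial
        self.floor = floor
        self.ceiling = ceiling
        self.target_ms = target_ms     # latency the service considers healthy
        self.in_flight = 0

    def try_acquire(self):
        if self.in_flight >= self.limit:
            return False               # over the current limit: reject or queue the request
        self.in_flight += 1
        return True

    def release(self, latency_ms):
        self.in_flight -= 1
        if latency_ms > self.target_ms:
            self.limit = max(self.floor, int(self.limit * 0.9))   # back off under stress
        else:
            self.limit = min(self.ceiling, self.limit + 1)        # probe for more headroom
```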


Then there is the question of upgrades. SDK upgrades are painful: for example, our service framework 2.0 was released last December, yet some teams are still on 1.0. Can we make SDK upgrades lossless, so that we can control the timing and pace of upgrades ourselves?


In addition, our current monitoring only aggregates at the service level, regardless of cluster or machine. Could metrics be broken down by cluster and by machine in the future? The simplest example: a service runs on 10 machines and only one of them has a problem, but its metrics get averaged with the other nine. All you see is an overall increase in service latency, when in fact a single machine is dragging down the whole cluster. Right now we simply don't have that extra dimension.
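To show what the missing dimension would buy us, here is a tiny sketch in which every sample is tagged with its cluster and host, so latency can be sliced per machine instead of only per service. The tag names, hosts, and numbers are made up for illustration.

```python
from collections import defaultdict

samples = []   # each sample: (metric, value_ms, tags)

def record(metric, value_ms, cluster, host):
    samples.append((metric, value_ms, {"cluster": cluster, "host": host}))

# Nine healthy machines and one slow one.
for i in range(9):
    record("order.latency", 40, cluster="c1", host=f"10.0.0.{i + 1}")
record("order.latency", 900, cluster="c1", host="10.0.0.10")

per_host = defaultdict(list)
for metric, value, tags in samples:
    per_host[tags["host"]].append(value)

# The per-host view exposes the single bad machine that a service-level average hides.
worst = max(per_host, key=lambda h: sum(per_host[h]) / len(per_host[h]))
print(worst, sum(per_host[worst]) / len(per_host[worst]))   # 10.0.0.10 900.0
```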


Then there is intelligent alerting. Alerts should be fast, complete, and accurate. We are already fairly fast and fairly complete; how do we become more accurate? At peak, more than a thousand alerts fire per minute. Are all of them useful? If you cry wolf too often, people stop responding; they get tired of the noise and stop investigating. How can we make alerts more precise? And what about smarter link (trace) analysis? In the future, monitoring should not just present raw metrics but link analysis, so that for a given problem we can see clearly which node is at fault.


These problems come back to one of our principles: build just enough for today's needs, but leave room to plan ahead.


