Author | Xu Lin zhuang

It took just 7 days from the moment we first learned about SAE to the point where the whole product was running on it. The core application, the API gateway, went online within 3 days. On day 5, after verification, 100% of the traffic was switched to SAE. On days 6 and 7, the other 30 systems were quickly migrated to SAE. After adopting SAE, operation and maintenance efficiency increased by 70%, cost fell by more than 40%, and capacity expansion efficiency improved by more than 10 times. For us, these changes are plain to see. - Chuang Hsu Lin, CTO of Pumpkin Movies

Pumpkin Movies, founded in 2015, is a streaming media platform that has grown rapidly in China over the past two years. With an ad-free, subscription-only business model, it built a following among film fans. Building on strong community interaction (AI-powered recommendations, film review interaction, online “cloud viewing” in screening rooms, and so on), it rapidly grew its membership and established its position in the streaming market. Next, it will gradually expand into a diversified video platform covering documentaries, original programs, and more.

In the Internet industry, traffic and product life cycles shift as the market changes direction, which places higher demands on an enterprise's ability to innovate and to experiment at low cost. As the business has grown rapidly, the overall application architecture of Pumpkin Movies has kept evolving. Today, I would like to share that journey in three parts:

Pain points: A review of the business, architecture, and pain points of Pumpkin Movies at the time.

Selection: Our thinking and decisions during technology selection, and why we chose SAE.

In practice: How we rolled out step by step and, in just 7 days, moved the entire platform (hundreds of servers and more than 30 systems) fully to Serverless.

Pain points

Since its inception, the overall application architecture of Pumpkin Movies has been built on Alibaba Cloud, making it a typical “born on the cloud, growing on the cloud” enterprise. The bottom layer runs on Alibaba Cloud ECS, and the infrastructure, middleware, databases, big data services, and cloud security all use Alibaba Cloud products to get the most value from the cloud. On top of these basic services sits our self-developed capability center, which provides membership, adaptive bitrate, search, film review, screening room, and other services based on algorithms and video enhancement capabilities. Services reach users through SLB global scheduling and secure access via WAF. The top layer serves multiple terminals, covering essentially every terminal type on the market, including mobile phones, tablets, web pages, various clients, and in-vehicle devices.

Initial application architecture of Pumpkin Movies

However, as the business kept growing, the ECS-based operation and maintenance architecture gradually exposed many problems, including:

1) Slow elastic capacity expansion: During traffic peaks, new machines had to be purchased on the spot and deployed one by one, which was very time-consuming and could not guarantee the system SLA.

2) Slow and error-prone releases: Frequent releases are normal for an Internet business, but deploying a version to hundreds of servers each time was very slow, and a moment of carelessness led to errors. We also tried scripted deployment, and it did run smoothly at first, but with many server groups and constantly changing scripts it became very hard to locate the problem when a deployment got stuck midway.

3) High system maintenance cost: Traditional cluster operation and maintenance is cumbersome and demands broad skills: not only Lua/Ansible scripting, but also cloud product network configuration, monitoring, and operations. In the early days the company had no full-time operation and maintenance staff, so this work consumed a great deal of developer energy, which was very painful.

4) Difficult capacity planning and low resource utilization: In the streaming media industry, peaks usually fall around noon or in the evening and traffic is relatively low at other times, which makes capacity hard to plan. We tended to keep servers provisioned for peak load over the long term, so resource utilization was relatively low.

5) Cumbersome permission allocation: In a multi-tenant enterprise, permission isolation is often a headache. In particular, when new employees joined or teams needed to collaborate, configuring user groups, RAM permissions, and login and connection methods for new machines was very tedious, and the account administrators often became the bottleneck.

A hit movie accelerated our thinking about upgrading Pumpkin Movies' technology

I believe many enterprises face the same problems we did, and that these problems also restrict their development. But developers have a certain inertia: as long as nothing goes wrong, they carry on as before. What really made us decide to upgrade our technology was a hit movie in 2019.

In the morning, I received a call from a colleague saying that the business was under great pressure. I said, “That's impossible, there is usually less traffic in the morning.” He said, “I don't know, all kinds of services have started raising alarms.” Later, I learned that more than 800,000 new users had registered within one hour (more than 5 times the usual peak), which was both a great challenge and a great opportunity for Pumpkin Movies. Soon the servers crashed: the API gateway failed, followed by the backend services and the database.

Everyone was on edge, and we started emergency capacity expansion across the whole link: buying ECS instances, uploading scripts to the new machines, running the scripts, expanding the DB… Throughout the process users were affected intermittently, and some could not access the service at all; it took four hours before everything was fully restored.

Because every user on the platform is a paying customer, our customer service line was busy from morning to evening, with a steady stream of complaints from users who had been unable to use the service that morning and demanded compensation.

A surprise attack like this was a great exercise for the team but a great loss for the company. We compensated every user who opened the app that day by making it free, at a cost to the business. Still, thanks to the movie, daily new registrations at Pumpkin Movies soared and the business grew significantly. Looking back, though, the whole recovery took four hours and was so nerve-racking that we never wanted to go through it again.

Selection

Given these problems, we started thinking about how to transform the architecture. At the time there were two internal proposals, but both had drawbacks:

Solution 1: Deep optimization of our scripts. This could solve some repetitive operation and maintenance problems, but the maintenance cost was too high, and operations engineers who can write really good scripts are hard to recruit. We had been using scripts all along, yet we could never fully automate: emergency capacity expansion still required purchasing ECS instances manually.

Solution 2: Self-built K8s. This could deliver high-density deployment, greatly reduce cost, and automatically scale application instances, but its blast radius is larger than that of ECS, which still worried us a little. Most importantly, the learning cost of K8s is very high: setting up an environment and getting it running is easy, but running it seriously in production requires a dedicated professional team, which we clearly could not build in the short term.

Later, after an introduction from our colleagues at Alibaba Cloud, we soon arrived at a third option, SAE, which is the one we finally implemented.

Solution 3: Alibaba Cloud Serverless App Engine (SAE for short). Our first impression of SAE was that it is easy to learn and saves time and effort. A WAR/JAR package can be uploaded and deployed directly without any modification, and there are no machines to buy or maintain, which saves a great deal of development time. What's more, SAE is a huge elastic resource pool that gives us as much capacity as we want whenever we want it, a perfect fit for the Pumpkin Movies business scenario.

First impressions of SAE

In practice

ROUND 1: CI/CD Pipeline — Accelerate iteration efficiency

The first thing we did before migrating the business was to build a streamlined CI/CD pipeline based on Travis CI + SAE to improve release efficiency. Previously, when we pushed code to GitHub, Travis CI automatically ran integration and unit tests and, once the tests passed, uploaded the artifact to a private OSS bucket for deployment to ECS. With SAE, we simply changed the “deploy to ECS” step to “deploy to SAE”, which was very simple and had no impact on developers. In addition, when deploying an application we can choose from several release strategies, such as single batch, multiple batches, and canary. If anything goes wrong, the release can be stopped and rolled back immediately, which is very efficient.
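To make the release strategies more concrete, here is a minimal sketch of a canary batch followed by equal batches, with stop-and-roll-back on the first unhealthy batch. It is not SAE's implementation or API: `plan_batches`, `rollout`, and the `deploy`, `healthy`, and `rollback` callbacks are hypothetical names used only for illustration.

```python
from typing import Callable, List, Sequence


def plan_batches(instances: Sequence[str], batch_count: int, canary: int = 0) -> List[List[str]]:
    """Split instances into an optional canary batch plus up to `batch_count` batches."""
    if batch_count < 1:
        raise ValueError("batch_count must be at least 1")
    if not 0 <= canary <= len(instances):
        raise ValueError("canary must be between 0 and the number of instances")
    batches: List[List[str]] = []
    if canary:
        batches.append(list(instances[:canary]))
    rest = list(instances[canary:])
    # Spread the remaining instances as evenly as possible.
    size, extra = divmod(len(rest), batch_count)
    start = 0
    for index in range(batch_count):
        end = start + size + (1 if index < extra else 0)
        if end > start:  # skip empty batches when there are few instances
            batches.append(rest[start:end])
        start = end
    return batches


def rollout(
    batches: List[List[str]],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Release batch by batch; stop and roll back everything on the first unhealthy batch."""
    updated: List[str] = []
    for batch in batches:
        for instance in batch:
            deploy(instance)
            updated.append(instance)
        if not all(healthy(instance) for instance in batch):
            # Undo in reverse order so the newest changes are reverted first.
            for instance in reversed(updated):
                rollback(instance)
            return False
    return True
```

With 10 instances, `plan_batches(names, batch_count=3, canary=1)` yields one canary instance followed by three batches of three, so a bad version is caught after touching a single instance rather than the whole fleet.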

ROUND 2: The first application, the API gateway, goes online

Now it was time to pick the first application. We made a bold decision to migrate the API gateway first. The API gateway is one of our core internal applications and one of those under the most pressure. Why start there?

First, it is deployed all over the country. Second, it runs on a large number of ECS clusters: as long as we use the scheduling system to send only part of the traffic to SAE, the traffic can be switched back to ECS instantly if SAE proves unstable, with almost no impact on users. Third, as the entry point for all traffic, the API gateway sees more traffic bursts, which matches SAE's elasticity advantage and tests as thoroughly as possible whether SAE suits our business.

At first we were also worried about the production environment. To guard against accidents, we decided to run the original ECS instances alongside the SAE instances. If a problem appeared on either side, we would switch traffic immediately, with the ECS instances serving as the disaster recovery link.
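The idea of running both sides in parallel and switching traffic on failure can be sketched as a weighted router with an instant fail-over. This is only an illustration under assumptions: the `TrafficSwitch` class and the backend names `"ecs"` and `"sae"` are hypothetical, and the real scheduling system is not described in this article.

```python
import random
from typing import Dict, Optional


class TrafficSwitch:
    """Weighted routing between exactly two backends, with instant fail-over."""

    def __init__(self, weights: Dict[str, int]) -> None:
        if len(weights) != 2 or sum(weights.values()) != 100:
            raise ValueError("expected two backends whose weights sum to 100")
        if any(weight < 0 for weight in weights.values()):
            raise ValueError("weights must not be negative")
        self.weights = dict(weights)

    def shift(self, backend: str, percent: int) -> None:
        """Send `percent` of the traffic to `backend` and the rest to the other one."""
        if backend not in self.weights:
            raise KeyError(backend)
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        other = next(name for name in self.weights if name != backend)
        self.weights[backend] = percent
        self.weights[other] = 100 - percent

    def fail_over(self, unhealthy: str) -> None:
        """Move all traffic away from the unhealthy backend."""
        self.shift(unhealthy, 0)

    def route(self, rng: Optional[random.Random] = None) -> str:
        """Pick a backend for one request according to the current weights."""
        picker = rng if rng is not None else random
        names = list(self.weights)
        return picker.choices(names, weights=[self.weights[n] for n in names], k=1)[0]
```

A gradual migration would start from `TrafficSwitch({"ecs": 100, "sae": 0})`, call `shift("sae", 10)` and keep raising the share while monitoring stays green, and call `fail_over("sae")` to return everything to ECS if the new side misbehaves.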

ROUND 3: The API gateway automatically scales out and in to cope with sudden traffic surges

Stability under normal traffic alone does not prove that SAE is reliable. So we focused, in both the test and production environments, on verifying SAE's elasticity when traffic surges.

We ran a systematic stress test at 5 times the traffic of the last hit movie, set the CPU, memory, QPS, and RT thresholds measured in the stress test as SAE elastic rules, and then watched the application monitoring metrics on the SAE console in real time; everything was normal. SAE really can scale out automatically within seconds at peak times and scale in on demand during off-peak times. As shown in the chart below, this saves about 40% of hardware cost compared with keeping ECS instances provisioned.
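How a rule over several metrics turns into an instance count can be illustrated with a proportional, target-tracking calculation. This is a sketch under assumptions, not SAE's actual algorithm: the `ScalingRule` type, the metric names, and every number used with it are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ScalingRule:
    """Hypothetical elastic rule: a per-instance target for each metric plus bounds."""
    targets: Dict[str, float]
    min_instances: int
    max_instances: int


def desired_instances(rule: ScalingRule, current: int, metrics: Dict[str, float]) -> int:
    """Return the instance count that brings the most loaded metric back to its target."""
    if current < 1:
        raise ValueError("current must be at least 1")
    # Ratio > 1 means the metric is above its target, < 1 means below it.
    ratios = [
        metrics[name] / target
        for name, target in rule.targets.items()
        if name in metrics
    ]
    wanted = math.ceil(current * max(ratios)) if ratios else current
    # Never go below the floor or above the ceiling configured in the rule.
    return max(rule.min_instances, min(rule.max_instances, wanted))
```

For example, with targets of 60% CPU and 500 QPS per instance, 10 instances observing 90% CPU and 1000 QPS each would be scaled to 20, because QPS is the most overloaded metric; when every metric drops well below its target, the same formula scales the application back in.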

With that, our first application, the API gateway, was successfully migrated and the old ECS instances were taken offline. With its stable and efficient performance, Alibaba Cloud SAE proved that our earlier worries were unnecessary, so we went on to migrate the other lines of business.

ROUND 4: Full-link monitoring and diagnosis out of the box

During the migration, application state occasionally became abnormal. SAE's built-in ARMS monitoring gave us excellent support in analyzing, troubleshooting, and resolving online problems, and saved a lot of troubleshooting time. In SAE we can see an application's call topology and locate slow SQL, slow services, and method call stacks, drilling down to code-level problems.

SAE also adopted our suggestions and now provides TopN application reports along several dimensions, so one person can easily operate and maintain hundreds of applications and see which applications have the most problems and most need attention.

ROUND 5: [Enterprise-level features] Permission Isolation & Approval

SAE also helped us solve a long-standing problem: permission isolation and approval.

Take a look at this comparison diagram: in the old ECS model, giving people access to applications across teams meant configuring user groups and granting RAM permissions to different people at machine granularity. If operation and maintenance deployment was involved, the scripts also had to be modified, and the user name, password, and operation logging for each new machine had to be configured on the jump server. With many people and many machines, permission configuration became very tedious. Moreover, operations were not subject to approval, so risk was uncontrolled: developers held the user names and passwords of the machines, and releases were fairly arbitrary.

With SAE, everything is simpler. Permissions are granted at application granularity, and each application only needs to be added once, which saves worry and effort. SAE also provides an operation and maintenance approval process based on master accounts and sub-accounts: after a sub-account initiates an operation on a resource, the operation can continue only once the master account approves it; otherwise SAE suspends the task. This effectively reduces the quality risk caused by arbitrary releases.
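The approval flow described above is essentially a small state machine: a request starts as pending, and only the master account can move it to approved or suspended. The sketch below models that idea in plain Python; `ApprovalFlow`, `ChangeRequest`, and the account names are hypothetical and do not represent SAE's or RAM's API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    SUSPENDED = "suspended"


@dataclass
class ChangeRequest:
    app: str
    action: str
    requested_by: str
    status: Status = Status.PENDING


class ApprovalFlow:
    """Sub-accounts submit operations; only the master account can approve them."""

    def __init__(self, master_account: str) -> None:
        self.master_account = master_account
        self.requests: List[ChangeRequest] = []

    def submit(self, sub_account: str, app: str, action: str) -> ChangeRequest:
        request = ChangeRequest(app=app, action=action, requested_by=sub_account)
        self.requests.append(request)
        return request

    def review(self, reviewer: str, request: ChangeRequest, approve: bool) -> None:
        if reviewer != self.master_account:
            raise PermissionError("only the master account may review requests")
        if request.status is not Status.PENDING:
            raise ValueError("request has already been reviewed")
        request.status = Status.APPROVED if approve else Status.SUSPENDED

    def execute(self, request: ChangeRequest, run: Callable[[], None]) -> bool:
        """Run the operation only if it was approved; otherwise do nothing."""
        if request.status is not Status.APPROVED:
            return False
        run()
        return True
```

Because `execute` refuses anything that is not approved, a release started by a sub-account cannot proceed on its own, which is the property the approval process is meant to guarantee.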

ROUND 6: Rollout complete

By day 7, all of our applications were Serverless and all running on SAE. The whole migration went smoothly, with no modification cost and zero failures, and involved only 1 or 2 R&D engineers.

We analyzed the overall value SAE has brought to Pumpkin Movies, which can be summarized as follows:

1) Faster expansion: SAE automatically scales the number of instances according to the configured rules, so we no longer worry about insufficient capacity at peak times or wasted capacity during quiet periods.

2) Faster release: The CI/CD pipeline improves release efficiency, and the Cloud Toolkit plug-in enables one-click deployment from a local machine to SAE in the cloud, which is convenient for development and debugging.

3) Easier operation and maintenance: Operations-free does not mean there is no operation and maintenance at all. For us, by the time you receive an alarm, log in to the console, and start to fix the issue, the operation and maintenance work is basically already done, and the whole process is faster than doing it manually.

4) Faster problem detection: SAE's built-in monitoring saves us a lot of time in troubleshooting problems.

By our calculation, compared with our previous traditional server model, development efficiency has increased by 70%, cost has decreased by more than 40%, and capacity expansion efficiency has improved by more than 10 times.

Summary & Expectations

Finally, we would like to share some lessons learned and pitfalls we ran into along the way.

1) Deploy across multiple availability zones: Previously, all of our applications were deployed in a single availability zone, and we suffered a loss because of it. Later, at the suggestion of the SAE team, we deployed all of our applications across multiple availability zones. We strongly recommend paying attention to this point.

2) Batch/grayscale release strategy: Applications with multiple instances must be released in batches or with a grayscale release so that an abnormal version does not affect the whole business, and every release must be fully tested.

3) Health checks: Custom health check scripts must be verified in advance so that a faulty script does not cause application startup to fail.

4) Set capacity expansion thresholds sensibly: Thresholds must be tested repeatedly and determined from system stress tests. Adjust them when necessary, and prefer scaling out a few extra instances to risking an online failure.

5) Configure SLS logs and ARMS alarms: We recommend configuring SLS logging and ARMS alarms, which are a great help in locating problems later.
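As a rough illustration of point 4, a scale-out threshold can be derived from stress test data by finding the highest load at which response time still meets the target, then applying a safety margin so that scaling starts early. The function name, the sample format, and the 0.75 headroom factor below are assumptions for illustration, not values from our tests.

```python
from typing import Iterable, Tuple


def scale_out_threshold(
    samples: Iterable[Tuple[float, float]],
    rt_limit_ms: float,
    headroom: float = 0.75,
) -> float:
    """Derive a per-instance QPS threshold from (qps, rt_ms) stress test samples."""
    if not 0 < headroom <= 1:
        raise ValueError("headroom must be in the range (0, 1]")
    capacity = None
    # Walk the load levels upward and stop at the first one that breaks the RT limit.
    for qps, rt_ms in sorted(samples):
        if rt_ms > rt_limit_ms:
            break
        capacity = qps
    if capacity is None:
        raise ValueError("no load level met the RT limit")
    # Trigger scaling before the measured capacity is reached.
    return capacity * headroom
```

A lower headroom factor starts scale-out earlier and costs a few extra instances, which follows the advice above to prefer extra instances over an online failure.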

We also have many expectations for SAE. For example, we hoped it would optimize Java cold start times, since some of our applications took 1 to 2 minutes just to start (SAE later addressed this). We also hope SAE will go a step further and offer users a complete Serverless architecture, covering not only the application layer but also the database, network, and so on, so that we can focus entirely on business development. This may be difficult and take time to implement, but we have confidence in SAE.

Finally, I would like to express my heartfelt thanks to Alibaba Cloud SAE for its cooperation and support in the development of Pumpkin Movies. Since we adopted SAE, there has not been a single large-scale failure. Along the way, we gained a lot of experience that lets us deliver services to our users quickly.

As always, Pumpkin Movies will continue to bring the best films and the finest viewing experience to its fans and to create more positive energy for society. We also wish Alibaba Cloud the courage to dream and to innovate, new achievements, and the chance to serve more enterprises around the world!