Author: Feng Yibo
Batch jobs are a common workload in Internet services. In scenarios such as AI training, live streaming (video transcoding), data cleaning (ETL), and scheduled inspection, the core pain points are whether the task platform can start tasks quickly and at high concurrency, keep the utilization of offline computing resources high, and connect to a rich upstream and downstream ecosystem. Function Compute is an event-driven, fully managed computing service whose execution model is a natural fit for job scenarios. It addresses all of the pain points above and smooths the path to running "tasks" serverlessly in the cloud.
Function Compute and Serverless Jobs
What capabilities should the “Job” system have?
In the "Job" scenarios above, a task-processing system should provide the following capabilities:
- Task triggering: flexible ways to start tasks, such as manual triggering from a client, triggering from an event source, and triggering on a schedule;
- Task orchestration: the ability to arrange complex task flows and manage the relationships between sub-tasks, including branch, parallel, and loop logic;
- Task scheduling and state management: scheduling by task priority, multi-tenant isolation, concurrency control, and rate limiting; the ability to track task state and control task execution;
- Resource scheduling: solving the problem of where tasks run, including support for multiple runtimes, control of cold-start latency, and co-locating online and offline tasks, with the ultimate goal of high resource utilization;
- Task observability: the ability to view and audit a task's execution history and execution logs;
- Upstream and downstream ecosystem: natural integration with upstream and downstream systems, such as the Kafka/ETL ecosystem and messaging ecosystems.
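To make the triggering capability concrete, the sketch below shows a minimal Function Compute-style Python handler that distinguishes a timer-triggered invocation from a manual client invocation. The event field names (`triggerName`, `triggerTime`, `payload`) follow the documented timer-trigger format, but treat the exact event shape as an assumption to verify against the current Function Compute documentation.

```python
import json

def handler(event, context=None):
    """Minimal task entry point that dispatches on trigger type.

    In Function Compute, `event` arrives as bytes or a JSON string; a
    timer trigger delivers JSON with `triggerName`/`triggerTime`/`payload`
    fields (field names assumed from the FC timer-trigger docs).
    """
    body = json.loads(event) if isinstance(event, (bytes, str)) else event
    if isinstance(body, dict) and "triggerName" in body:
        # Scheduled (timer) invocation: the cron payload carries task params.
        task_params = json.loads(body.get("payload") or "{}")
        source = "timer:" + body["triggerName"]
    else:
        # Manual / client invocation: the event itself holds the task params.
        task_params = body if isinstance(body, dict) else {}
        source = "manual"
    return {"source": source, "params": task_params}
```

The same entry point can then be wired to a timer trigger, an event-source trigger, or direct SDK invocation without changing the task logic.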
Alibaba Cloud Function Compute Serverless Job
The full picture of Function Compute's Jobs capabilities is shown below:
Figure 1: Panoramic view of Function Compute Jobs capabilities
Comparison of Job capabilities of common task scheduling systems
Table 1: Comparison of common task scheduling system capabilities
In general, task scheduling systems such as the batch computing products of some cloud vendors and open-source Kubernetes Jobs scale at instance granularity at the finest and lack the ability to manage large numbers of tasks, so they suit low-concurrency, heavy, very long-running workloads (such as genomic computing and large-scale machine learning training). Open-source process execution engines and big data processing systems, on the other hand, often lack elasticity, multi-tenant isolation, high-concurrency management, and visualization. Function Compute, as an operations-free serverless platform, combines the strengths of these different systems, and the inherent elasticity of serverless is well suited to the pronounced peak-and-trough traffic patterns common in task workloads.
Recommended best practices & customer cases
AI training & inference
Core requirements of this scenario:
- Real-time inference and offline training must be supported at the same time, which places high demands on cold-start performance;
- Traffic has obvious peaks and troughs and the amount of computation is large, requiring high concurrency, while there is little need for coordination between computing instances;
- Container images are generally required in order to run custom training libraries.
Case 1: NetEase Cloud Music – audio and video processing platform
The "Discover" and "Share" features of NetEase Cloud Music rely on analyzing and extracting basic features from music. Running these recommendation algorithms and data analyses requires very large amounts of computing power to process the original music files. NetEase Cloud Music's audio and video offline processing platform evolved through a series of stages: an asynchronous processing mode with a priority-queue optimization algorithm, virtualization of the algorithm cluster, moving to the cloud, and packaging the original algorithm frameworks as container images. It ultimately chose Function Compute as the infrastructure of its audio and video platform, effectively solving hard problems such as the continually expanding computing scale, difficult operations, and limited elasticity.
Case 2: Autonomous database service – database inspection platform
Alibaba Cloud's internal database inspection platform is mainly used for optimization analysis of SQL queries and logs. The platform's work falls into two main kinds of tasks: offline training and online analysis. The online analysis business runs at a scale of tens of thousands of cores, and the offline business accumulates millions of hours of execution time per day. Because the timing of online analysis and offline training is uncertain, it is hard to raise the overall resource utilization of a fixed cluster, and elastic computing power is needed to carry the cluster through peak hours. The team ultimately built the database inspection platform on Function Compute to serve the model's daily AI online inference and offline training tasks.
Case 3: Focus Media – serverless image processing business
In the advertising business, it is common to run deep learning algorithms for image processing, comparison, and recognition. Such workloads typically have diverse data sources, uncertain per-instance processing times, obvious peaks and troughs, and high observability requirements. Running them on self-purchased machines not only requires handling machine operations and resource utilization, but also makes it hard to adapt to the variety of image sources and to bring services online quickly.
Function Compute's support for multiple event-source triggers greatly simplifies this kind of business. Focus Media uses OSS and MNS triggers to solve the problem of diverse data sources: image data uploaded to OSS or MNS directly triggers Function Compute to carry out the image processing task. Function Compute's elasticity and pay-per-use model remove the worries about resource utilization and machine operations. For observability, the task-processing instances use the stateful asynchronous invocation mode, which makes every triggered task traceable and makes it easy for the business to troubleshoot and retry failed tasks.
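As an illustration of the OSS-trigger path described above, the sketch below parses an OSS trigger event into (bucket, object key) pairs that an image-processing task could act on. The event shape (`events[].oss.bucket.name` / `events[].oss.object.key`) follows the documented Function Compute OSS-trigger format, but should be treated as an assumption to verify against the current docs; the bucket and key names in the usage example are hypothetical.

```python
import json

def parse_oss_event(event_bytes):
    """Extract (bucket, object key) pairs from an OSS trigger event.

    Each uploaded image produces one entry; downstream code would fetch
    the object and run the image-processing task on it.
    """
    evt = json.loads(event_bytes)
    return [
        (e["oss"]["bucket"]["name"], e["oss"]["object"]["key"])
        for e in evt.get("events", [])
    ]

# Hypothetical event, shaped like an FC OSS "ObjectCreated" notification.
sample_event = json.dumps({
    "events": [{
        "eventName": "ObjectCreated:PutObject",
        "oss": {"bucket": {"name": "ad-images"},
                "object": {"key": "incoming/a.png"}},
    }]
})
```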
Video transcoding & live streaming & converting recordings to live streams
Live transcoding, recording, and recording-to-live businesses are usually real-time, irregular, and bursty at the same time:
- Processing instances must be able to start, and transcoding instances to stop, at any moment;
- Peak hours are concentrated in the daytime, with few requests at night, so resource utilization and cost are major considerations.
For video transcoding scenarios, in addition to general elasticity requirements, flexible CPU specifications are needed to achieve higher resource utilization. For example:
- Resource specifications: because transcoding output bit rates differ, it is desirable for cost reasons to elastically provision resources of different specifications;
- Unpredictable running times: to improve transcoding efficiency, videos are often split into fragments, so the moment a task arrives it may require a large number of instances;
- To improve transcoding efficiency, fragments may be processed separately, which involves sharing data between multiple functions;
- A container image mode is needed to run custom libraries, and fast startup is preferred;
- Because transcoding is an offline service, task records need to be kept after a task completes, for auditing and troubleshooting.
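To make the fragmentation point above concrete, the sketch below splits a video's duration into fixed-length segments that independent function instances could transcode in parallel, with the outputs merged afterwards (e.g. via shared NAS storage). The function name and the 30-second default segment length are illustrative choices, not taken from the cases in this article.

```python
def plan_segments(duration_s: float, segment_s: float = 30.0):
    """Split a video of `duration_s` seconds into (start, end) slices.

    Each slice can be transcoded by a separate function instance; the
    final slice is shorter when the duration is not an exact multiple.
    """
    if duration_s <= 0 or segment_s <= 0:
        return []
    segments, start = [], 0.0
    while start < duration_s:
        end = min(start + segment_s, duration_s)
        segments.append((start, end))
        start = end
    return segments
```

A 95-second video with 30-second segments yields four slices, the last covering only 5 seconds, so the instance count needed at task arrival scales with video length divided by segment length.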
Case 1: New Oriental cloud classroom Serverless video processing platform
The New Oriental cloud classroom system supports live video, transcoding, video-on-demand, and other New Oriental online education scenarios. As business volume grew, the low resource utilization of the self-built data center became the core pain point, because the live-transcoding and video-transcoding task platform has pronounced peaks and valleys. To raise overall resource utilization, the cloud classroom system moved these functions to Function Compute, which can flexibly choose computing resource specifications according to business characteristics. Millisecond-level cold starts and the pay-as-you-go billing model keep overall computing resource utilization very high, so the system meets peak computing demand at the lowest cost.
In moving the business to serverless, the cloud classroom system uses Function Compute's stateful invocation mode. This mode is designed for job scenarios, allowing historical records to be queried and tasks to be stopped gracefully. For storage, temporary video files use the Function Compute + NAS scheme: New Oriental's video-platform function scheduler polls multiple function services for load balancing, and each service mounts a different NAS. This achieves file sharing while improving the utilization of NAS temporary storage inside functions, further reducing resource costs.
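The stateful asynchronous invocation mode mentioned in these cases is driven by per-request headers on the Function Compute Invoke API. The sketch below only assembles such a request locally (nothing is sent): a caller-chosen invocation ID makes the task traceable for later status queries, graceful stops, and retries. The header names (`x-fc-invocation-type`, `x-fc-stateful-async-invocation-id`) and the path shape are assumptions based on the FC async-task documentation; verify them before use.

```python
import json
import uuid

def build_stateful_async_request(service, function, task_args, invocation_id=None):
    """Assemble an FC-style stateful async invocation request (not sent).

    The caller-supplied invocation ID is what lets the business later
    trace, audit, or stop this specific task.
    """
    invocation_id = invocation_id or uuid.uuid4().hex
    return {
        "path": f"/services/{service}/functions/{function}/invocations",
        "headers": {
            "x-fc-invocation-type": "Async",                     # async, not sync
            "x-fc-stateful-async-invocation-id": invocation_id,  # traceable task ID
        },
        "payload": json.dumps(task_args),
    }
```

In practice the returned pieces would be passed to an HTTP client or the official SDK's invoke call; keeping the ID in a business database is what enables the "query history and gracefully stop tasks" behavior described above.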
Case 2: Milian – real-time compliance audit platform for live video
The main video processing task in Milian's live dating business is video frame extraction: frames are extracted while the stream is being pulled and are uploaded to the target storage. Because of their peak-and-trough characteristics, such live scenarios require not only high resource utilization but also a degree of real-time performance and long-running execution. The audit platform ultimately relies on Function Compute's highly elastic, long-running computing power, which effectively supports the business.
Data processing & ETL
Core requirements of this scenario:
- Elastic, highly concurrent support: on-demand, varied, highly utilized resources, free of operations burden;
- Orchestration support for complex processes;
- Task observability.
Case: TuSimple – an automated data processing platform that makes everything simple and reliable
TuSimple's research and development of self-driving technology relies on accumulating large amounts of road-test data; efficient road testing and fast processing of the resulting data to guide model updates and iteration are the core requirements of this scenario. However, the irregular schedule of road tests, the long process of loading data into the database, the interaction of multiple systems, and uncertain computing power requirements pose great challenges to process scheduling on the data processing platform.
Facing this situation, TuSimple set out to automate its data processing platform. The platform uses Serverless Workflow to orchestrate the whole process, and solves data access between on-premises systems and the cloud through the natively supported message service MNS.
Beyond orchestration, TuSimple uses the input/output mapping and status-reporting mechanisms of tasks to efficiently manage the life cycle of each task in the process and the data passed between them, maintaining task state and data updates throughout execution and meeting the data processing needs of long, uncertain processes.
Conclusion
Combining the cases and analysis above, the elasticity, observability, queue isolation, and complete event ecosystem of Function Compute support task scenarios very well. In brief, this is reflected in the following aspects:
- Task triggering: Function Compute supports timer triggers, OSS triggers, and various message-queue triggers, providing rich capabilities for EDA applications and data processing scenarios with diverse data sources;
- Task orchestration & scheduling: Function Compute integrates seamlessly with the Alibaba Cloud Serverless Workflow service, which schedules distributed tasks sequentially, in branches, in parallel, and in other patterns, tracks each task's state transitions, and applies predefined retry logic when necessary. The combination of Serverless Workflow and Function Compute supports complex, long-running processes well;
- At the resource level, serverless shows its core strengths: operations-free development, high elasticity, and high availability. Compared with self-built infrastructure, a serverless architecture means paying only for actual task usage, saving cost and sparing operations effort. Function Compute supports multiple runtime languages as well as custom container images, greatly easing development and debugging;
- For observability, Serverless Workflow and Function Compute provide rich metrics and query methods for both multi-task processes and single tasks, making it easy to search execution history and inspect the metrics and logs of in-flight tasks, which helps debugging and problem tracking.
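As a rough sketch of the orchestration patterns listed above, the YAML below outlines a Serverless Workflow-style flow definition: a sequential flow containing a parallel fan-out step with a retry policy. This is an illustrative fragment only; the step names and resource ARNs are hypothetical placeholders, and the exact flow-definition schema and retry fields should be checked against the Serverless Workflow documentation before use.

```yaml
# Illustrative flow sketch: split a video, transcode two renditions in
# parallel (with retry on throttling), then merge the outputs.
version: v1
type: flow
steps:
  - type: task
    name: split-video
    resourceArn: acs:fc:::services/video/functions/split
  - type: parallel
    name: transcode-fanout
    branches:
      - steps:
          - type: task
            name: transcode-720p
            resourceArn: acs:fc:::services/video/functions/transcode
            retry:
              - errors: [FC.ResourceThrottled]
                intervalSeconds: 2
                maxAttempts: 3
                multiplier: 2
      - steps:
          - type: task
            name: transcode-1080p
            resourceArn: acs:fc:::services/video/functions/transcode
  - type: task
    name: merge-outputs
    resourceArn: acs:fc:::services/video/functions/merge
```

The engine, not the functions, carries the sequencing, fan-out, state tracking, and retry responsibilities, which is what keeps individual task functions stateless and simple.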
Going forward, Function Compute Serverless Jobs will focus on vertical task processing scenarios, including longer instance execution times, richer observability metrics, more powerful task scheduling strategies, and end-to-end integration capabilities, aiming to provide the "shortest path" for vertical scenarios and help your business take off.