MaxCompute is dedicated to batch storage and computation of structured data, providing massive-scale data warehouse solutions as well as analysis and modeling services. In this section, LogView is used to diagnose the causes of slow tasks in MaxCompute.

Slow tasks generally fall into the following categories:

  1. Queuing caused by insufficient resources (usually in subscription, i.e. prepaid, projects)
  2. Data skew and data bloat
  3. Low operating efficiency caused by user logic

1. Insufficient resources

SQL tasks consume CPU and memory resources; for the basics of reading a LogView, see the reference link.

1.1 Viewing the job's elapsed time and execution stages

1.2 Tasks queued at submission

If "Job Queueing…" is displayed after the task is submitted, other users' tasks may be occupying the resource group's resources, causing your own task to queue.

In SubStatusHistory, the "Waiting for scheduling" entry shows how long the task waited.

1.3 Insufficient resources after task submission

In another case, the task is submitted successfully, but it requires so many resources that the current resource group cannot start all of its instances at once; the task makes progress, just not quickly. This can be observed with the Latency Chart in LogView, which is shown when you click the corresponding task in the Detail tab.

The figure above shows the running status of a task with sufficient resources: the bottom edge of the blue area is flat, indicating that all instances started at about the same time.

If the lower edge of the chart instead climbs like a staircase, instances are being scheduled a few at a time and the task does not have enough resources. If the task is of high importance, consider purchasing more resources or raising the task's priority.
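If raising the priority is the chosen remedy, a minimal sketch follows, assuming the project allows setting the instance priority flag odps.instance.priority at the session level (lower values mean higher priority):

```sql
-- Assumption: the project permits session-level priority changes.
-- 0 is the highest priority; my_table is a hypothetical table name.
set odps.instance.priority=0;
select count(*) from my_table;
```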

1.4 Causes of insufficient resources

1. Use CU Manager to check whether CU usage is full: click the corresponding task point and inspect the jobs submitted at that time.

Sort the jobs by CPU usage:

(1) If a single task occupies a large share of CUs, open that task's LogView to find the cause (too many small files, or a data volume that genuinely needs that many resources).

(2) If CU usage is spread evenly across tasks, multiple large tasks were submitted at the same time and together filled up the CU resources.

2. CU resources consumed by excessive small files

The parallelism of the Map stage is determined by the split size of the input files (256 MB by default), which indirectly controls the number of workers in each Map stage. In the Map stage M1 of this example, the I/O bytes of each instance are only about 1 MB, or even tens of KB, yet the stage runs at a parallelism of more than 2,500 and instantly fills up the resources. This indicates that the table contains too many small files, which need to be merged.

Merge small files: help.aliyun.com/knowledge\_…
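As a concrete illustration, MaxCompute provides a DDL statement for merging small files. A minimal sketch, where my_table and the partition spec are hypothetical:

```sql
-- Merge small files in a non-partitioned table (hypothetical name).
ALTER TABLE my_table MERGE SMALLFILES;

-- Merge small files in one partition of a partitioned table.
ALTER TABLE my_table PARTITION (ds='20240101') MERGE SMALLFILES;
```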

3. Resources occupied by a genuinely large data volume

You can purchase more resources. If it is a one-off job, you can instead add set odps.task.quota.preference.tag=payasyougo; a parameter that allows the specified job to temporarily run in the larger pay-as-you-go resource pool.
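A minimal usage sketch of that flag (the flag comes from the text above; my_table is a hypothetical table):

```sql
-- Route this one-off job to the pay-as-you-go resource pool.
set odps.task.quota.preference.tag=payasyougo;
select count(*) from my_table;
```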

1.5 How to adjust the task parallelism

MaxCompute calculates parallelism automatically based on the input data and task complexity, so in general no adjustment is needed. Ideally, the higher the parallelism, the faster the processing. The main knobs are listed below, followed by a usage sketch.

Parallelism of the Map stage

**odps.stage.mapper.split.size:** modifies the input data size of each Map worker, i.e. the split size of the input files, thereby indirectly controlling the number of workers in each Map stage. Unit: MB; default: 256 MB.

Parallelism of the Reduce stage

**odps.stage.reducer.num:** modifies the number of workers in each Reduce stage.

**odps.stage.num:** modifies the worker concurrency of all stages of the specified MaxCompute task; its priority is lower than the odps.stage.mapper.split.size, odps.stage.reducer.num, and odps.stage.joiner.num attributes.

**odps.stage.joiner.num:** modifies the number of workers in each Join stage.
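A minimal sketch combining these parameters at the session level (values are illustrative, not recommendations; my_table is a hypothetical table):

```sql
-- Smaller splits -> more Map workers (unit: MB).
set odps.stage.mapper.split.size=128;
-- Fix the Reduce stage at 200 workers.
set odps.stage.reducer.num=200;
select id, count(*) as cnt
from my_table
group by id;
```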

2. Data skew


【Feature】 Most instances of a task have finished, but a few have not (the long tail). In the figure below, most instances (358) have ended, but 18 are still in the Running state. These instances run slowly either because they have a large amount of data to process or because certain records are slow to process.

Solution: help.aliyun.com/document\_d…
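One common case worth illustrating: when the skew comes from joining a large fact table against a small dimension table on a hot key, a MAPJOIN hint broadcasts the small table to every worker and avoids the skewed shuffle. A minimal sketch with hypothetical table and column names:

```sql
-- small_dim (aliased b) is broadcast to all workers, so the hot
-- join key no longer funnels into a single long-tail instance.
select /*+ MAPJOIN(b) */ a.id, a.val, b.name
from big_table a
join small_dim b on a.id = b.id;
```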

3. Logical problems

This refers to the user's SQL or UDF logic being inefficient, or to suboptimal parameter settings. The symptom is a task that runs for a long time while the running time of its instances is fairly uniform. The situations here are more varied: some logic is genuinely complex, while some leaves plenty of room for optimization.

Data bloat

【Feature】 The amount of data a task outputs is much larger than the amount it reads.

For example, 1 GB of data may be processed into 1 TB, and efficiency drops drastically when 1 TB is processed by a single instance. The input and output volumes are shown in the task's I/O Records and I/O Bytes fields:

Solution: confirm that the business logic really requires this expansion, then increase the parallelism of the corresponding stages, for example as sketched below.
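A minimal sketch, assuming the bloat happens in a Join stage (the value is illustrative only):

```sql
-- Spread the expanded rows over more Join workers.
set odps.stage.joiner.num=500;
```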

Low UDF execution efficiency

【Feature】 The task executes inefficiently and contains user-defined extensions (UDFs). The UDF may even time out with the error: "Fuxi job failed - WorkerRestart errCode: 252, errMsg: kInstanceMonitorTimeout, usually caused by bad udf performance".

First locate the UDF: click the slow Fuxi task and check whether its Operator Graph contains a UDF. For example, the figure below shows a Java UDF.

You can check this operator's processing speed in the StdOut of a Fuxi instance in LogView. Normally, Speed(records/s) is on the order of hundreds of thousands to millions.

Solution: check the UDF logic and use built-in functions whenever possible.
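As an illustration, a hot-path UDF can often be replaced by a built-in. A minimal sketch, where my_parse_date is a hypothetical Java UDF and TO_DATE is the MaxCompute built-in covering the same case:

```sql
-- Before (hypothetical UDF): select my_parse_date(dt_str) from my_table;
-- After, using the built-in:
select to_date(dt_str, 'yyyy-mm-dd') from my_table;
```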

Original link

This article is the original content of Aliyun and shall not be reproduced without permission.