This article is intended for those with basic Java knowledge

Author: HelloGitHub – Salieri

This is the HelloGitHub series that explains open source projects. The PowerJob series is coming to an end. Are you still enjoying it? Feel free to leave a comment with your thoughts and what you would like to see next.

Project Address:

Github.com/KFCFans/Pow…

1. Introduction to MapReduce

MapReduce is a programming model for parallel computation over large data sets (typically larger than 1 TB). Its core concepts, Map and Reduce, are borrowed from functional programming languages, along with features taken from vector programming languages. It makes it far easier for programmers to run their code on distributed systems without having to master distributed parallel programming. A typical implementation has the user specify a Map function that turns a set of key-value pairs into intermediate key-value pairs, and a Reduce function that merges all intermediate values sharing the same key.

MapReduce is designed for big-data processing and is a classic application of divide and conquer, as the keywords "large data sets" suggest. In a nutshell, Map splits a large chunk of data into smaller blocks that a single machine can process, and once the processing is complete, Reduce aggregates the results. The overall flow is shown in the following figure.
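To make the idea concrete, here is a tiny, framework-free sketch in plain Java (not Hadoop or PowerJob code): the "map" step processes each line independently and emits words, and the "reduce" step merges all emissions that share the same word.

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class WordCountSketch {
    public static void main(String[] args) {
        // Imagine each line living on a different machine: the "map" step can run on each independently.
        List<String> lines = Arrays.asList("a b a", "b c");

        Map<String, Long> counts = lines.stream()
                // map: break every line into individual word emissions
                .flatMap(line -> Arrays.stream(line.split(" ")))
                // reduce: merge all emissions that share the same key (the word) into one count
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));

        System.out.println(counts); // prints {a=2, b=2, c=1}
    }
}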

2. Background

As task scheduling middleware, PowerJob is responsible for scheduling tasks; as a big-data processing model, MapReduce's core job is to process large-scale data in parallel. On the surface, PowerJob and MapReduce are completely unrelated, and many people's first reaction on seeing the two keywords together is probably:

"Why would a task scheduling framework want to touch such a high-end concept as MapReduce? Is it just riding the hype?"

In fact, if you look at the question from another angle, the answer becomes clear.

Generally speaking, offline data synchronization tasks need to be scheduled. For services with a certain amount of traffic, the offline data can be so large that a single machine cannot handle the computation well. To solve this problem, the scheduling frameworks on the market today generally support static sharding, a relatively simple way to achieve distributed computing: the console specifies the number of shards, and a fixed number of machines are mobilized, each processing a fixed slice of the work. Obviously, this approach is inflexible and very limited.
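To illustrate what static sharding means, here is a toy sketch in plain Java (the class and method names are made up for illustration and belong to no particular framework): the console fixes the number of shards, each machine is told its own shard index, and records are split among shards by a simple modulo rule.

public class StaticShardingSketch {

    // Each machine runs this with its own shardIndex; shardTotal is fixed in the console.
    static void runShard(int shardIndex, int shardTotal, long[] recordIds) {
        for (long id : recordIds) {
            if (id % shardTotal == shardIndex) {
                // the business logic for this record would go here
                System.out.println("shard " + shardIndex + " handles record " + id);
            }
        }
    }

    public static void main(String[] args) {
        long[] ids = {1, 2, 3, 4, 5, 6};
        int shardTotal = 3;
        // In a real deployment each shard runs on a different machine; here we just loop locally.
        for (int shardIndex = 0; shardIndex < shardTotal; shardIndex++) {
            runShard(shardIndex, shardTotal, ids);
        }
    }
}

Because both the shard count and the split rule are fixed up front, adding capacity or rebalancing the work requires reconfiguration, which is exactly the inflexibility described above.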

So how do you implement distributed computing for complex, large tasks? Alibaba's SchedulerX team gave an answer: MapReduce. Developers implement a Map method in their own code to split the task, and then aggregate the subtask results in Reduce, achieving highly customizable distributed computing.

The MapReduce implementation in PowerJob draws on this idea. Thanks again to the SchedulerX team.

3. Usage Example

In PowerJob, MapReduce is no longer a lofty, untouchable concept. Thanks to a powerful underlying implementation and an elegant API design, developers can implement distributed computing for large tasks with only a few lines of code, as shown in the example below.

For tasks with distributed computing requirements, we need to extend the abstract class MapReduceProcessor to enable distributed computing. It requires developers to implement two methods: process and reduce. The former is responsible for actually executing the task, while the latter aggregates all subtask results into the final result. The abstract class also provides two methods out of the box: isRootTask and map. isRootTask tells you whether the current task is the root task; if it is, you typically split the work there (PowerJob actually supports map at any level, not only at the root task) and then call map to distribute the subtasks.

Here is a simple code example to help you understand. The following code simulates the "static sharding" approach that is currently mainstream: the console specifies the number of shards and the parameters (for example, 3 shards with parameters 1=a&2=b&3=c) to control how many machines participate in the computation and what parameters each starts with. This is a good demonstration of how flexible the PowerJob MapReduce processor is.

First, we obtain the parameters configured in the console through the context's getJobParams method, namely the shard parameters 1=a&2=b&3=c. They indicate that three machines should participate in the execution, with the subtasks on those machines starting with parameters a, b, and c respectively. So we create a SubTask object for each entry according to this rule, passing in the shard index and the shard parameter params.

After the subtasks have been created, the map method can be called to distribute them.

Once distributed, a subtask enters the process method again, this time as a SubTask rather than the root task. Because context.getSubTask() returns Object, we need the Java instanceof keyword to check its type (if you do not use multi-level maps, you can of course cast it directly). If the object is of type SubTask, we are in the subtask processing stage and can write the subtask processing logic.

After all subtasks are executed, PowerJob invokes the reduce method and passes in the results of all subtasks, so that the developer can construct the final result of the job.

// PowerJob SDK types (MapReduceProcessor, TaskContext, ProcessResult, TaskResult, OmsLogger)
// come from the powerjob-worker dependency; their imports are omitted here.
import com.google.common.base.Splitter;
import com.google.common.collect.Lists;
import lombok.AllArgsConstructor;
import lombok.Getter;
import lombok.NoArgsConstructor;
import org.springframework.stereotype.Component;

import java.util.List;
import java.util.Map;

@Component
public class StaticSliceProcessor extends MapReduceProcessor {

    @Override
    public ProcessResult process(TaskContext context) throws Exception {
        OmsLogger omsLogger = context.getOmsLogger();

        // The root task is responsible for splitting the job and distributing subtasks
        if (isRootTask()) {
            // Shard parameters from the console, assumed to be in KV form: 1=a&2=b&3=c
            String jobParams = context.getJobParams();
            Map<String, String> paramsMap = Splitter.on("&").withKeyValueSeparator("=").split(jobParams);

            List<SubTask> subTasks = Lists.newLinkedList();
            paramsMap.forEach((k, v) -> subTasks.add(new SubTask(Integer.parseInt(k), v)));
            return map(subTasks, "SLICE_TASK");
        }

        Object subTask = context.getSubTask();
        if (subTask instanceof SubTask) {
            // Actual subtask processing logic goes here
            // If the subtask is still too large, it can be split and distributed again
            return new ProcessResult(true, "subTask:" + ((SubTask) subTask).getIndex() + " process successfully");
        }
        return new ProcessResult(false, "UNKNOWN BUG");
    }

    @Override
    public ProcessResult reduce(TaskContext context, List<TaskResult> taskResults) {
        // Aggregate statistics here as required... if you don't need this, use a plain Map processor
        return new ProcessResult(true, "xxxx");
    }

    @Getter
    @NoArgsConstructor
    @AllArgsConstructor
    private static class SubTask {
        private int index;
        private String params;
    }
}

4. Implementation Principles

The MapReduce idea in PowerJob is mainly derived from the article "SchedulerX 2.0 Distributed Computing Principles & Best Practices".

Because of the division of responsibilities (powerjob-server is only responsible for task scheduling and operations), MapReduce computation is carried out entirely by powerjob-worker.

To facilitate model design and function partitioning, PowerJob assigns three roles to the executor node: TaskTracker, ProcessorTracker, and Processor.

  • TaskTracker is the master node for each task, acting as the master in the cluster, so each task produces only one TaskTracker at a time. It is responsible for distributing subtasks, monitoring their status, checking the health of executing nodes in the cluster, and periodically reporting the running information of tasks to the server.
  • ProcessorTracker is the role responsible for executor management on each executor node; each task produces one ProcessorTracker on every executor node (JVM instance). It manages task execution on that node, including accepting subtasks dispatched by the TaskTracker and reporting their local execution status back.
  • The Processor is the role that actually executes tasks within each executor node; it is the real execution unit. Each task spawns several Processors on a node (yes, exactly the number of concurrent instances configured in the console). A Processor accepts the subtasks dispatched by its ProcessorTracker and performs the computation. A simplified toy sketch of how these three roles cooperate follows this list.
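The following toy sketch is not PowerJob's real code; it is only meant to illustrate how the three roles relate under the description above: one TaskTracker per task fans subtasks out across worker JVMs, each worker's ProcessorTracker queues them into a local thread pool, and each pooled thread plays the part of a Processor.

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class RolesToySketch {

    // Stand-in for a ProcessorTracker: one per worker JVM, owns the local thread pool.
    static class ToyProcessorTracker {
        // The pool size plays the part of the console-configured concurrency.
        private final ExecutorService processors = Executors.newFixedThreadPool(2);

        void accept(String subTask) {
            // Each submitted Runnable plays the role of a Processor executing one subtask.
            processors.submit(() -> System.out.println(Thread.currentThread().getName() + " runs " + subTask));
        }

        void shutdown() throws InterruptedException {
            processors.shutdown();
            processors.awaitTermination(5, TimeUnit.SECONDS);
        }
    }

    // Stand-in for a TaskTracker: one per task, dispatches subtasks across the workers.
    static class ToyTaskTracker {
        void dispatch(List<String> subTasks, List<ToyProcessorTracker> workers) {
            // Round-robin the subtasks; the real system also tracks their status and reports progress.
            for (int i = 0; i < subTasks.size(); i++) {
                workers.get(i % workers.size()).accept(subTasks.get(i));
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        List<ToyProcessorTracker> workers = Arrays.asList(new ToyProcessorTracker(), new ToyProcessorTracker());
        new ToyTaskTracker().dispatch(Arrays.asList("subTask-1", "subTask-2", "subTask-3", "subTask-4"), workers);
        for (ToyProcessorTracker w : workers) {
            w.shutdown();
        }
    }
}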

When a distributed task needs to be executed, powerjob-server scores the health of every worker node in the cluster based on its memory usage, CPU usage, and disk usage; the node with the highest score becomes the master node of the task and takes on the TaskTracker responsibility. The TaskTracker is created when it receives the task execution request from the server, and it completes three phases of initialization:

  • First, it initializes the embedded H2 database used to store the dispatch and execution status of all subtasks (a minimal sketch of this idea appears after the list).
  • Once the store is in place, TaskTracker builds the root task based on the task content dispatched by the server and persists it to an embedded database.
  • Finally, TaskTracker creates a series of scheduled tasks, including scheduled subtask dispatch, subtask execution status check, worker health check, and overall task execution status report.
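As a rough illustration of that first phase, the snippet below uses an in-memory H2 database over plain JDBC to keep subtask bookkeeping. The table layout is invented for this example and is not PowerJob's actual schema; it only shows the idea of persisting dispatch and execution status locally (the H2 driver, com.h2database:h2, must be on the classpath).

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class EmbeddedStoreSketch {
    public static void main(String[] args) throws Exception {
        // An in-memory H2 database playing the part of the TaskTracker's local store.
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:task_tracker");
             Statement st = conn.createStatement()) {
            // Minimal, made-up table for subtask bookkeeping.
            st.execute("CREATE TABLE sub_task (id INT PRIMARY KEY, status VARCHAR(16), result VARCHAR(255))");
            st.execute("INSERT INTO sub_task VALUES (1, 'DISPATCHED', NULL)");
            st.execute("UPDATE sub_task SET status = 'SUCCESS', result = 'ok' WHERE id = 1");

            try (ResultSet rs = st.executeQuery("SELECT id, status, result FROM sub_task")) {
                while (rs.next()) {
                    System.out.println(rs.getInt("id") + " -> " + rs.getString("status") + " (" + rs.getString("result") + ")");
                }
            }
        }
    }
}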

ProcessorTracker is created when it receives a sub-task execution request from TaskTracker, and builds the thread pool and corresponding processor required for execution based on the task information contained in the request. When the running status of a subtask changes, ProcessorTracker needs to report the latest status back to TaskTracker in a timely manner.

As for the Processor, it is essentially a thread that encapsulates the context information of each subtask, which is submitted by the ProcessorTracker to the execution thread pool for execution and reports its execution status to its superiors.

The diagram above shows how PowerJob's MapReduce works overall. Because the actual implementation is complex and sophisticated, a single article cannot cover every detail, so this piece focuses on the big picture: what the core components are, how responsibilities are divided, and what each is mainly responsible for. If you are interested in the details, the source code is the best reference; with this article as a guide, I personally think it can be read through in less than a day.

5. Final Words

Well, that's the end of this article, and also of the PowerJob technical column. I had planned to write a farewell covering my journey, but the more I wrote, the longer it got, almost catching up with the body of the article… so I'll sneak it into the next issue instead.