Interviewer: What do you know about the MPP architecture?

Many people hit this question in interviews and get stuck because they know little about the concept of MPP. Yet many of the big data computing engines we use every day, such as Impala, ClickHouse, Druid, and Doris, are built on the MPP architecture.

This article is published in the public account [5 Minutes to Learn Big Data]. Follow the public account to get the latest big data technology articles

Many OLAP engines built on the MPP architecture claim to answer queries over hundreds of millions of rows in about a second.

This article is divided into three parts. The first part explains the MPP architecture in detail, the second part analyzes the similarities and differences between the MPP architecture and the batch architecture, and the third part introduces the OLAP engine using the MPP architecture.

1. MPP architecture

MPP describes one category of server, classified from the perspective of system architecture.

At present, there are generally three types of commercial servers:

  1. SMP (Symmetric Multi-Processing)
  2. NUMA (Non-Uniform Memory Access)
  3. MPP (Massively Parallel Processing)

Our protagonist today is MPP: as distributed and parallel-processing technology has matured, MPP engines have come to offer powerful high-throughput, low-latency computing, and many engines built on the MPP architecture can answer queries over hundreds of millions of rows in seconds.

Consider these three structures:

1. SMP

SMP stands for symmetric multi-processing: all CPUs in the server work symmetrically, with no primary/secondary or master/slave relationship. The defining characteristic of an SMP server is sharing: all resources in the system (CPU, memory, I/O, and so on) are shared. This same characteristic is also the main problem with SMP servers: their scalability is very limited.

2. NUMA

NUMA stands for non-uniform memory access. This structure was designed to solve SMP's scalability problem; with NUMA technology, dozens of CPUs can be combined in a single server. The basic feature of NUMA is that it has multiple CPU modules, and these modules (nodes) can connect and exchange information with each other through an interconnect, so that every CPU can access the entire system's memory (an important difference from an MPP system). However, access speed is not uniform: a CPU accesses its local memory much faster than memory on any other node in the system, which is exactly why the architecture is called non-uniform memory access.

This structure also has a drawback: since the latency of accessing remote memory is much greater than that of accessing local memory, system performance does not scale linearly as the number of CPUs increases.

3. MPP

MPP stands for massively parallel processing. MPP scales differently from NUMA: an MPP system is composed of multiple SMP servers connected through a node interconnect network, working together to complete the same task. From the user's perspective, the whole cluster looks like one server system. Each node accesses only its own local resources, so MPP is a shared-nothing architecture.

The MPP structure is the most scalable of the three and can, in theory, scale without limit. Since an MPP system is composed of multiple SMP servers, the CPUs of one node cannot access the memory of another node, so there is no remote-memory-access problem.

MPP Architecture Diagram:

The CPUs in one node cannot access the memory of another node; information exchange between nodes happens over the node interconnect network. This process is called data redistribution.
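To make data redistribution concrete, here is a minimal Python sketch (not from any real engine; node count and rows are made up for illustration) of hash-based redistribution: each row is sent to the node that owns its key, so all rows sharing a key land on the same node.

```python
# Toy sketch of data redistribution: rows are re-partitioned across
# nodes by hashing a key column, so each node receives all rows for
# the keys it owns and can then join/aggregate them locally.

def redistribute(rows, key_index, num_nodes):
    """Assign each row to a target node by hashing its key column."""
    partitions = {n: [] for n in range(num_nodes)}
    for row in rows:
        target = hash(row[key_index]) % num_nodes
        partitions[target].append(row)
    return partitions

rows = [("user1", 10), ("user2", 20), ("user1", 30), ("user3", 5)]
parts = redistribute(rows, key_index=0, num_nodes=3)

# All rows with the same key end up on the same node.
for node, local_rows in parts.items():
    print(node, local_rows)
```

A real MPP engine performs this exchange over the node interconnect network; the hash-by-key idea is the same.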

However, an MPP server requires a sophisticated mechanism to schedule work, balance the load across nodes, and coordinate parallel processing. Currently, MPP-based systems tend to hide this complexity behind system-level software such as a database. Teradata, for example, is relational database software built on MPP (it was the first database to use the MPP architecture). When developing applications on such a database, no matter how many nodes the backend has, developers face a single database system and never need to think about which node a given piece of work is scheduled on.

MPP architecture features:

  • Tasks execute in parallel;
  • Data is stored distributed across nodes (data locality);
  • Computation is distributed;
  • High concurrency: a single node can serve more than 300 concurrent users;
  • Capacity is expanded by scaling out, adding cluster nodes;
  • A shared-nothing architecture.
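The distributed-storage plus distributed-computing features above follow a scatter-gather pattern. Here is a minimal sketch (shard contents and the count query are made up for illustration) of how a shared-nothing aggregation works: each node scans only its own shard, and a coordinator merges the partial results.

```python
# Toy shared-nothing aggregation: each node works only on its own
# local shard; a coordinator merges the partial results at the end.

def local_count(shard):
    """Each node counts rows in its own shard; no shared memory."""
    return len(shard)

def coordinator(shards):
    """Gather partial counts from all nodes and combine them."""
    partials = [local_count(s) for s in shards]  # parallel in a real MPP
    return sum(partials)

shards = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]  # one list per node
print(coordinator(shards))  # 9
```

In a real engine the per-shard work runs concurrently on separate servers; the merge step is the only cross-node communication this query needs.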

Differences between NUMA and MPP:

There are many similarities. First, both NUMA and MPP are composed of multiple nodes. Second, each node has its own CPU, memory, I/O, and so on. Third, in both, nodes can exchange information through an interconnect mechanism.

What are the differences between them? First, the node interconnection mechanisms differ: NUMA nodes are interconnected inside a single physical server, while MPP nodes are interconnected across different SMP servers through I/O over a network.

Second, the memory access mechanisms differ. Within a NUMA server, any one CPU can access the entire system's memory, but remote memory access performs far worse than local access, so remote access should be avoided when developing applications. In an MPP server, each node accesses only its local memory, so there is no remote-memory-access problem at all.

2. Batch architecture vs. MPP architecture

What are the similarities and differences between batch architectures such as MapReduce and MPP architectures, and what are the pros and cons of each?

Similarities:

First of all, both the batch architecture and the MPP architecture do distributed parallel processing: a job is distributed across multiple servers and nodes that run in parallel, and after each node finishes its part, the partial results are combined into the final result.

Differences:

Consider a query that is split into multiple tasks. In MapReduce, these tasks are dynamically assigned to whatever executors are idle. In an MPP engine, each task that processes a slice of data is bound to the specific executor that holds that slice.
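The contrast between the two scheduling models can be sketched in a few lines of Python (a deliberately simplified toy; the shard-to-executor mapping and names are made up for illustration):

```python
# Toy contrast of the two scheduling models. MPP: a task for shard i
# can only run on the executor that owns shard i. Batch: any idle
# executor may pick up any task.

shards = {"shard0": "exec0", "shard1": "exec1", "shard2": "exec2"}

def mpp_assign(task_shard):
    """MPP: the task is bound to the executor holding its data slice."""
    return shards[task_shard]

def batch_assign(task_shard, idle_executors):
    """Batch: the task goes to whichever executor is free first."""
    return idle_executors.pop(0)

print(mpp_assign("shard1"))                        # always exec1
print(batch_assign("shard1", ["exec2", "exec0"]))  # whoever is idle
```

This binding is what gives MPP data locality without a shuffle-to-disk step, and it is also the root of the straggler problem discussed below.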

Because of the above differences, the two architectures have their own advantages and disadvantages:

  • Advantages of batch processing:

In a batch architecture, if an executor is too slow it will gradually be assigned fewer tasks. Batch engines also have a speculative-execution strategy: when an executor is detected to be slow or faulty, its tasks can be given to other executors, so the slow node receives fewer tasks or none at all.

  • MPP’s flaws:

For MPP, because tasks and executors are bound together, if an executor is too slow or fails, the performance of the whole cluster is limited by that executor's speed (the so-called barrel effect: capacity is set by the shortest plank). This straggler effect is MPP's biggest drawback. Moreover, the more nodes a cluster has, the greater the probability that some node has a problem, and once one does, the whole MPP cluster's performance suffers. That is why MPP clusters in real production rarely have very many nodes.
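A quick back-of-the-envelope calculation shows why node count matters so much here. Assuming (purely for illustration) a 1% chance that any single node is slow at a given moment, the probability that *at least one* node in the cluster is slow grows rapidly with cluster size:

```python
# Shortboard-effect arithmetic: the query finishes only when the
# slowest node finishes, and P(at least one slow node) = 1-(1-p)^n
# grows with cluster size n. The 1% figure is an arbitrary example.

p_slow = 0.01  # assumed probability that any single node is slow

for n in (10, 50, 100, 500):
    p_any_slow = 1 - (1 - p_slow) ** n
    print(f"{n:4d} nodes -> P(some node is slow) = {p_any_slow:.2%}")
```

With 100 nodes, some node is slow more often than not, which matches the article's point that MPP clusters are usually kept modest in size.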

  • Defects in batch processing:

Everything comes at a cost: in batch processing, intermediate results are written to disk, and this severely limits processing performance.

  • Advantages of MPP:

The MPP architecture does not require intermediate data to be written to disk, because a single Executor works on a single task, and therefore can simply stream the data directly to the next stage of execution. This process is called Pipelining and it provides a significant performance boost.

For example, consider a join of two large tables. A batch engine such as Spark writes to disk three times (first write: table 1 is shuffled by the join key; second write: table 2 is shuffled by the join key; third write: the hash table is written to disk), whereas MPP writes only once (the hash table). This is because MPP runs the mappers and reducers at the same time, while MapReduce splits them into dependent stages (a DAG) that execute asynchronously, so the data dependency between stages must be resolved by writing intermediate data to shared storage.
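Pipelining maps naturally onto Python generators, which gives a compact way to illustrate the idea (a toy sketch; the stage names and table are invented, and a real engine streams batches across the network rather than rows in one process):

```python
# Toy pipeline: each stage is a generator that streams rows straight
# to the next stage, so nothing is materialized on disk in between.

def scan(table):
    for row in table:
        yield row  # stream rows one at a time

def filter_stage(rows, min_value):
    for row in rows:
        if row[1] >= min_value:
            yield row

def project(rows):
    for row in rows:
        yield row[0]

table = [("a", 5), ("b", 1), ("c", 9)]
# The three stages run as one pipeline; no intermediate files.
result = list(project(filter_stage(scan(table), min_value=4)))
print(result)  # ['a', 'c']
```

Each row flows through all three stages before the next row is read, which is exactly the property that lets an MPP engine skip the intermediate disk writes a batch engine needs.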

Batch and MPP architecture fusion:

The advantages and disadvantages of the two architectures are obvious, and they are complementary. If we could combine them, we would get the best of both. Batch processing and MPP are indeed moving toward convergence, and some designs already exist; once the technology matures, it may become popular across the big data field. Let's wait and see!

3. OLAP engines built on the MPP architecture

There are many OLAP engines that use MPP architecture. The following is a comparison of several common engines, which can provide a reference for the company’s technology selection.

OLAP engines using the MPP architecture can be divided into two categories: engines that do not store data themselves and are only responsible for computing, and engines that both store data and compute on it.

1) An engine that is only responsible for computing, not for storage

1. Impala

Apache Impala is a query engine built on the MPP architecture. It does not store any data itself and computes directly in memory, offering advantages for data warehousing, real-time and batch processing, and high concurrency.

It provides SQL-like (Hive-SQL-like) syntax and can deliver high response speed and throughput even in multi-user scenarios. It is implemented in Java and C++: Java provides the query interaction interface, and C++ implements the query engine.

Impala can share the Hive Metastore, but it does not use slow Hive-on-MapReduce batch processing. Instead, it uses a distributed query engine (consisting of a Query Planner, Query Coordinator, and Query Exec Engine) similar to those found in commercial parallel relational databases. It can query data directly from HDFS or HBase with SELECT, JOIN, and aggregate functions, which greatly reduces latency.

Impala is often deployed together with the storage engine Kudu. The main advantages of this combination are fast queries and support for updating and deleting data.

2. Presto

Presto is a distributed MPP query engine that does not store data itself but can access multiple data sources and supports federated queries across them. Presto is an OLAP tool that excels at complex analysis over large amounts of data; it is not suited to OLTP scenarios, so do not use Presto as a database.

Presto is a low-latency, high-concurrency in-memory computing engine. It can connect to many data sources, including Hive, relational databases (MySQL, Oracle, TiDB, etc.), Kafka, MongoDB, Redis, and more.

2) An engine that is responsible for both computing and storage

1. ClickHouse

ClickHouse is an open-source columnar database that has been gaining attention in recent years, mainly for data analysis (OLAP) scenarios.

It provides its own storage and computing capabilities, implements high availability entirely on its own, and supports full SQL syntax including JOIN, which gives it clear technical advantages. Compared with the Hadoop ecosystem, handling big data the database way is easier, with a lower learning cost and higher flexibility. The community is still developing rapidly, is very active in China, and large companies have adopted it at scale.

ClickHouse works very carefully at the computing layer to squeeze out as much hardware performance as possible and speed up queries. It implements many important techniques such as single-machine multi-core parallelism, distributed computing, vectorized execution with SIMD instructions, and runtime code generation.

Based on the requirements of OLAP scenarios, ClickHouse built a new, highly efficient columnar storage engine and implemented a rich set of features: ordered data storage, primary-key indexes, sparse indexes, data sharding, data partitioning, TTL, and master/slave replication. Together, these features lay the foundation for ClickHouse's extremely fast analytical performance.
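The sparse index idea can be sketched in a few lines (a toy model in the spirit of ClickHouse marks, not its actual implementation; the key column and granule size are made up): over a sorted column, the index stores only the first key of every granule, so a lookup binary-searches the marks and then scans a single small granule instead of the whole part.

```python
# Toy sparse primary index over a sorted column: one index entry
# ("mark") per `GRANULE` rows; a lookup scans only one granule.
import bisect

keys = list(range(0, 1000, 3))   # sorted primary-key column (made up)
GRANULE = 64                     # rows per granule (illustrative)

# Sparse index: first key of every granule.
marks = [keys[i] for i in range(0, len(keys), GRANULE)]

def lookup(key):
    """Find the granule that may contain `key`, then scan just it."""
    g = bisect.bisect_right(marks, key) - 1
    if g < 0:
        return False
    start = g * GRANULE
    return key in keys[start:start + GRANULE]

print(lookup(300), lookup(301))  # True False
```

The index itself stays tiny (one entry per granule rather than per row), which is what lets it live in memory even for very large tables.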

2. Doris

Doris is a big data analysis engine led by Baidu, drawing on Google's Mesa paper and the Impala project. It originated as a massive distributed KV storage system designed to support medium-scale, highly available, and scalable KV storage clusters.

Doris offers massive storage with linear, smooth scale-out, automatic fault tolerance and failover, high concurrency, and low operations cost. The recommended deployment scale is 4 to 100 servers.

The main architecture of Doris3: the Data Transfer (DT) module is responsible for data import, the Data Searcher (DS) module for data query, and the Data Master (DM) module for cluster metadata management; the data itself is stored in the Armor distributed key-value engine. Doris3 keeps metadata in ZooKeeper, and the other modules stay stateless by relying on ZooKeeper, so the whole system has no single point of failure.

3. Druid

Druid is an open source, distributed, column-oriented, real-time analytical data storage system.

The key features of Druid are as follows:

  • Subsecond level OLAP query analysis: using column storage, inverted index, bitmap index and other key technologies;
  • Filtering, aggregation and multi-dimensional analysis of massive data in sub-second level;
  • Real-time streaming data analysis: Druid provides real-time streaming data analysis, as well as efficient real-time writing;
  • Subsecond visualization of real-time data;
  • Rich data analysis capabilities: Druid provides a user-friendly visual interface;
  • SQL query language;
  • High availability and high scalability:
    • Druid worker nodes each serve a single function and do not depend on one another;
    • Druid clusters are easy to manage, fault-tolerant, disaster-recoverable, and easy to expand.

4. TiDB

TiDB is an open-source distributed relational database independently designed and developed by PingCAP. It is a converged distributed database product that supports both OLTP and OLAP.

TiDB is compatible with the MySQL 5.7 protocol and important parts of the MySQL ecosystem. The goal is to provide users with a one-stop OLTP, OLAP, and HTAP solution. TiDB suits scenarios that require high availability, strong consistency, and large data scale.

5. Greenplum

Greenplum is a very powerful relational distributed database built on open-source PostgreSQL. To integrate with the Hadoop ecosystem, HAWQ was also launched: its analysis engine retains Greenplum's high-performance engine, while the underlying storage uses HDFS instead of local disks, avoiding the poor reliability of local disks and fitting into the Hadoop ecosystem.

3) Common engine comparison

Here is a summary of common OLAP engine comparisons:

