Parallel computing model
A parallel computing model is an abstract model of computation obtained by abstracting the basic characteristics of various classes of parallel computers (or at least one class), used for the design and analysis of parallel algorithms.
More broadly, a parallel computing model provides a hardware/software interface for parallel computing: under this interface, the hardware and software designers of a parallel system can develop supporting mechanisms for parallelism to improve system performance.
Common parallel computing models include the BSP, PRAM, LogP, C3, and BDM models.
Why do we need a parallel computing model?
Spark and Hadoop follow a general-purpose iterative model, which suits general computing. In computation-heavy fields such as machine learning, the traditional iterative model no longer fits well; parallel computing models exist to solve the computational problems of such specific scenarios.
BSP model
Bulk Synchronous Parallel (BSP) is a parallel computing model developed by the British computer scientist Leslie Valiant in the 1980s.
The Google paper Pregel: A System for Large-scale Graph Processing made the concept much more widely known.
As with MapReduce, Google did not open-source Pregel; following Pregel's ideas, Apache provides a similar framework, Hama.
Basic principles of BSP model
The BSP model is an asynchronous MIMD-DM model (DM: Distributed Memory, as opposed to SM: Shared Memory). It supports message passing, asynchronous parallelism within a block, and coordinated synchronization between blocks.
The model is coordinated by a Master: all Workers synchronize in lock-step and read data from the input queue.
The BSP model is not only an architecture model but also a method for designing parallel programs.
The design principle of a BSP program is bulk synchrony; its distinctive feature is the introduction of the superstep (Super Step).
A BSP program has both a horizontal and a vertical structure. Vertically, a BSP program is a series of serial supersteps, as shown in the figure, which resembles the structure of a serial program.
Horizontally, within one superstep all processes perform local computation in parallel. A superstep divides into three phases, as shown in the figure and sketched in code after the list below.
- Local computation phase: each processor computes only on data stored in its local memory.
- Global communication phase: operations on any non-local data take place here.
- Barrier synchronization phase: each processor waits until all communication has completed.
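To make the three phases concrete, here is a minimal sketch in Python. It is an assumption-laden toy, not a real BSP runtime: the "network" is just a shared list of outboxes, and a threading barrier plays the role of the synchronization phase.

```python
import threading

# A toy in-memory "network": outboxes[i] collects messages for worker i,
# which become visible only after the barrier (i.e., the next superstep).
P = 4
barrier = threading.Barrier(P)
outboxes = [[] for _ in range(P)]

def superstep(pid, local_data):
    # 1. Local computation phase: use only data in local memory.
    partial = sum(local_data)
    # 2. Global communication phase: send the partial result to worker 0.
    outboxes[0].append(partial)
    # 3. Barrier synchronization phase: wait for all communication to end.
    barrier.wait()
    if pid == 0:  # next superstep: worker 0 now sees all messages
        print("global sum:", sum(outboxes[0]))

threads = [threading.Thread(target=superstep, args=(i, [i, i + 1]))
           for i in range(P)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each worker computes a local partial sum, "sends" it to worker 0, and only after the barrier does worker 0 read its inbox, mirroring the rule that messages become visible in the next superstep.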
The BSP parallel computing model can be described by four parameters: p, s, g, and l.
- p is the number of processors (each with local memory).
- s is the computing speed of each processor.
- g is the ratio of local computation operations performed per second to the number of bytes delivered per second by the communication network; it is regarded as a bandwidth factor.
- l is the cost of a global synchronization, known as the barrier synchronization time.
Assuming p processors each transmit h bytes simultaneously, gh is taken as the cost of the communication. Both the synchronization cost and the communication cost are normalized to the given number of processors.
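Putting the parameters together: the BSP literature conventionally charges a superstep that performs at most w local operations per processor and routes an h-relation a cost of w + g·h + l. A small illustrative calculation (the concrete numbers are made up):

```python
# Standard BSP cost convention: a superstep in which each processor does
# at most w local operations and sends/receives at most h units of data
# costs T = w + g*h + l.
def superstep_cost(w, h, g, l):
    return w + g * h + l

# Illustrative values: 10^6 local ops, h = 10^3 words,
# g = 4 ops per word, barrier cost l = 10^4 ops.
print(superstep_cost(w=1_000_000, h=1_000, g=4, l=10_000))  # -> 1014000
```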
Characteristics of BSP model
- The BSP model divides computation into a sequence of supersteps, which effectively avoids deadlock.
- It separates the processor from the router; the router performs only point-to-point message passing, with no combining, replication, or broadcast, which masks the specific topology of the interconnection network and simplifies the communication protocol.
- Barrier synchronization is a global synchronization implemented in hardware; it is controllable and coarse-grained, and thus provides an efficient way to execute tightly coupled synchronous parallel algorithms without undue programmer burden.
- When analyzing the performance of a BSP program, local operations are assumed to complete within one time step, and in each superstep a processor sends or receives at most h messages (called an h-relation).
- Algorithms designed for the PRAM model can be implemented by simulating several PRAM processors on each BSP processor.
Evaluation of BSP model
- In parallel computing, Valiant tried to build a bridge between software and hardware analogous to the von Neumann model, and the BSP model can play exactly that role; for this reason, the BSP model is also known as a bridging model.
- MIMD models with distributed storage generally offer poor programmability, but in the BSP model, if computation and communication are balanced properly, programmability improves considerably.
- Some important algorithms, such as matrix multiplication, parallel prefix, FFT, and sorting, can be implemented directly in the BSP model, avoiding the extra overhead of automatic storage management.
- The BSP model can be implemented efficiently on hypercube networks and optical crossbar switches, which shows that the model is independent of any particular implementation technology and only requires a certain router throughput.
- In the BSP model, the length of a superstep must be able to accommodate an arbitrary h-relation.
- In the BSP model, a message sent at the start of a superstep can only be used in the next superstep, even if the network latency is shorter than the length of the superstep.
- The global barrier synchronization assumed by the BSP model relies on special hardware support that many parallel machines do not have.
Implementations of BSP model
The BSP computing framework has many implementations, the most famous being Google’s large-scale graph computing framework Pregel, which first proposed applying the BSP model to graph computing.
Apache Giraph, contributed by Yahoo!, focuses on graph computation (PageRank, shortest paths, etc.); each Giraph job runs as a Hadoop job without a Reduce phase.
Apache Hama is an ASF Incubator project. Unlike Giraph, it is a pure Java implementation of the BSP model, intended not only for graph computation but also as a general application framework for the BSP model.
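To show what applying BSP to graph computing looks like, here is a toy vertex-centric loop in Python written in the spirit of Pregel. It is not the Pregel/Giraph/Hama API, just a sketch: in each superstep, every vertex with a non-empty inbox takes the smallest label it received, updates itself, and messages its neighbors; messages only become visible after the barrier, i.e., in the next superstep.

```python
# Propagate the minimum vertex id through each connected component.
graph = {0: [1], 1: [0, 2], 2: [1], 3: []}    # adjacency lists
value = {v: v for v in graph}                  # each vertex starts as its own label
inbox = {v: [v] for v in graph}                # kick-start with a self-message

superstep = 0
while any(inbox.values()):                     # halt when no messages remain
    outbox = {v: [] for v in graph}
    for v in graph:                            # conceptually parallel across vertices
        if not inbox[v]:
            continue                           # vertex is inactive this superstep
        m = min(inbox[v])
        if m < value[v] or superstep == 0:
            value[v] = min(m, value[v])
            for u in graph[v]:                 # communication phase
                outbox[u].append(value[v])
    inbox = outbox                             # barrier: delivered next superstep
    superstep += 1

print(value)                                   # -> {0: 0, 1: 0, 2: 0, 3: 3}
```

As in Pregel, the computation stops once no messages remain in flight; each vertex ends up labeled with the smallest vertex id in its connected component.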
PRAM model
The Parallel Random Access Machine (PRAM) model, also known as the shared-memory SIMD model, is an abstract parallel computing model developed directly from the serial RAM model. The model assumes a shared memory of unlimited capacity and a finite or unbounded number of functionally identical processors, each capable of simple arithmetic and logical operations; at any time, the processors can interact with one another through the shared memory cells.
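As a flavor of algorithm design on the PRAM, here is a parallel sum that takes O(log n) synchronous steps. The Python below only simulates the model: the inner loop stands in for additions that, on a PRAM, all processors would perform simultaneously in a single time step.

```python
# Each pass of the while-loop is one synchronous PRAM time step; the inner
# for-loop stands in for additions performed by all processors at once.
def pram_parallel_sum(a):
    a = list(a)
    n = len(a)
    stride = 1
    while stride < n:                       # O(log n) synchronous steps
        for i in range(0, n - stride, 2 * stride):
            a[i] += a[i + stride]           # "processor i" reads a cell and adds
        stride *= 2
    return a[0]

print(pram_parallel_sum([1, 2, 3, 4, 5, 6, 7, 8]))  # -> 36
```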
Advantages of the PRAM model
- The PRAM model is especially suitable for expressing, analyzing, and comparing parallel algorithms. It is simple to use, because many low-level details of parallel computers, such as inter-processor communication, storage management, and process synchronization, are hidden by the model.
- Algorithms are easy to design and can run on different parallel computer systems with only minor modification. Where needed, considerations such as synchronization and communication can be added to the PRAM model.
Disadvantages of the PRAM model
- The model uses a single global shared memory, and the local memory capacity is small, which is insufficient to describe the performance bottlenecks of distributed main-memory multiprocessors; the assumption of a single shared memory is clearly unsuitable for distributed-memory MIMD machines.
- The PRAM model is synchronous, meaning all instructions proceed in lock-step. The user does not perceive the synchronization, but it is time-consuming, and it fails to reflect the asynchrony of many real-world systems.
- The PRAM model assumes that every processor can access any cell of shared memory in unit time, which implies zero latency, unlimited bandwidth, and zero inter-processor communication overhead. Such assumptions are unrealistic, as they omit practical details such as resource contention and limited bandwidth.
- It fails to describe multithreading and pipelined prefetching, the two techniques most commonly used in today's parallel architectures.
LogP model
The LogP model, proposed by Culler et al. in 1993, is a multiprocessor model with distributed storage and point-to-point communication. Communication is described by a set of parameters, and synchronization is implicit.
The communication network of the LogP model is described by four main parameters:
- L (Latency): an upper bound on the latency a source processor incurs in communicating a message (of one or a few words) to a destination processor; it captures the delay a message experiences in the network.
- o (overhead): the time a processor spends preparing to send or receive a message (including operating-system kernel and network-software overhead); during this time, the processor can perform no other operations.
- g (gap): the minimum time interval between two consecutive message sends or receives on one processor; its reciprocal is the per-processor communication bandwidth.
- P: the number of processor/memory modules (see the cost sketch after this list).
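Two back-of-the-envelope costs that the LogP literature standardly derives from these parameters, assuming small fixed-size messages and a contention-free network:

```python
# Assumes small fixed-size messages and a contention-free network.
def one_message(L, o):
    # send overhead + network latency + receive overhead
    return 2 * o + L

def n_messages_pipelined(n, L, o, g):
    # A sender may issue a new message every max(g, o) cycles; the last
    # message still pays the full latency plus the receiver's overhead.
    return o + (n - 1) * max(g, o) + L + o

print(one_message(L=6, o=2))                    # -> 10
print(n_messages_pipelined(8, L=6, o=2, g=4))   # -> 2 + 7*4 + 6 + 2 = 38
```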
Characteristics of LogP model
- It captures the performance bottleneck between the network and the processors: g reflects the communication bandwidth, and at most ⌈L/g⌉ messages can be in transit from or to any one processor at a time.
- Processors work asynchronously; synchronization is accomplished through message passing between processors.
- It reflects multithreading to some degree: each physical processor can simulate several virtual processors (VPs), and computation does not stall when one VP issues a remote access; the number of VPs is limited, however, by communication bandwidth and context-switch overhead, with the network capacity allowing at most L/g VPs per processor.
- Message latency is uncertain but never greater than L: the waiting time a message experiences is unpredictable, but in the absence of blocking it is at most L.
- The LogP model encourages programmers to adopt some good strategies, such as job assignment, overlapping computing and communication, and balanced communication patterns.
- The actual running time of an algorithm can be estimated.
Shortcomings of LogP model
- The communication patterns in the network are not described deeply enough; for example, retransmitted messages occupying the full bandwidth and buffer saturation in intermediate routers are not modeled.
- The LogP model is mainly suited to designing message-passing algorithms. For shared-memory machines, a remote read is simply treated as the equivalent of two message passes, which ignores pipelined prefetching, Cache-induced data inconsistency, and the Cache hit ratio.
- The context-switching overhead of multithreading is not considered.
- The LogP model assumes communication goes through point-to-point message routers, which burdens the programmer with reasoning about the corresponding communication operations on the routers.
C3 model
The C3 model (its three C's standing for computation, communication, and congestion) assumes that a processor cannot send and receive messages at the same time. Its performance analysis of a superstep is divided into two parts (a toy cost sketch follows below):
- the computation unit (CU), which depends on the local computation;
- the communication unit (COU), which depends on how much data the processor sends and receives, the latency of the messages, and the congestion caused by the communication.
The model considers the influence on the COU of two kinds of routing (store-and-forward routing and wormhole routing) and of two kinds of send/receive primitives (blocking and non-blocking).
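For intuition only, the decomposition can be read as superstep time = CU + COU. The function below is hypothetical; its argument names and the shape of its congestion term are assumptions for illustration, not taken from the C3 model itself.

```python
# Hypothetical cost function, for intuition only; the argument names and the
# shape of the congestion term are assumptions, not from the C3 papers.
def c3_superstep_time(local_ops, bytes_sent, bytes_received,
                      latency, congestion):
    cu = local_ops                                            # computation unit
    cou = bytes_sent + bytes_received + latency + congestion  # communication unit
    return cu + cou

print(c3_superstep_time(local_ops=10_000, bytes_sent=512,
                        bytes_received=256, latency=100, congestion=50))
# -> 10918
```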
Characteristics of C3 model
- Cl and Cp are used to measure the effect of network congestion on algorithm performance.
- The influence of different routing schemes and of different send/receive primitives on communication is considered.
- The time complexity of a superstep can be evaluated without the user supplying scheduling details.
- Similar in hierarchical structure to the H-PRAM model, the C3 model offers the programmer the idea of a K-level routing algorithm: the system is divided into K levels of subsystems whose operation is mutually independent, with the superstep taking the place of the sub-PRAM used for partitioning in H-PRAM.
Shortcomings of C3 model
- The C measures presuppose that the two processors of a communication pair are located in different subnetworks of the bisected network.
- The model assumes that the network bandwidth equals the processor bandwidth, which hinders a correct description of scalable systems.
- In a K-level algorithm, the processors can be ordered in many different permutations, but the C3 model cannot distinguish how difficult different permutations are.
BDM model
In 1996, J. F. JaJa et al. proposed the Block Distributed Memory (BDM) model, a bridging model between the shared-memory programming style and distributed storage systems based on message passing.
It has four main parameters.
- p: the number of processors.
- τ: the maximum delay from a processor issuing a remote access request to its receiving the data, including the time to prepare the request, route the request packet through the network, have the destination processor receive the request, and return a packet containing m consecutive words to the requesting processor.
- m: the block size, i.e., a run of m consecutive words in local memory.
- σ: the per-word time for a processor to send data to, or receive data from, the network.
Characteristics of BDM model
- Using m to reflect spatial locality provides a way to evaluate the performance of shared main-memory algorithms, and it measures the inter-processor communication caused by remote accesses.
- BDM recognizes pipelining: the time required for K pipelined prefetches by a processor is τ + K·m·σ rather than K(τ + m·σ), as the sketch after this list works out. It also offers good programmability.
- Contention for shared main memory is taken into account.
- It can be used to analyze network routing.
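The pipelining claim above, in numbers. The symbols follow the parameter list; the concrete values are made up for illustration.

```python
# Symbols follow the BDM parameter list above; the numbers are made up.
def pipelined(K, tau, m, sigma):
    return tau + K * m * sigma        # pay the latency tau only once

def unpipelined(K, tau, m, sigma):
    return K * (tau + m * sigma)      # pay tau on every prefetch

print(pipelined(K=8, tau=100, m=16, sigma=2))    # -> 356
print(unpipelined(K=8, tau=100, m=16, sigma=2))  # -> 1056
```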
Shortcomings of BDM model
- The initial data is assumed to reside in local memory, so programmers of shared main-memory programs must add extra data-movement operations.
- Factors that affect network latency, such as processor placement and network congestion, are not considered.
- System overhead is not considered.