Introduction to Erasure Coding (EC)
Erasure coding (EC) is a coding technique that, before its use in HDFS, was most widely applied in Redundant Arrays of Inexpensive Disks (RAID) (see the earlier article: big data prerequisite knowledge – storage disks and RAID disk arrays). RAID implements EC through striping, a technique that balances I/O load across multiple physical disks. The idea is to divide a block of contiguous data into many small pieces and store them on different disks. This lets multiple processes access different parts of the data at the same time without disk contention (when several processes access the same disk simultaneously, disk conflicts can occur), and when such data needs to be read sequentially, the maximum degree of I/O parallelism can be exploited, giving very good performance.
In HDFS, the small pieces into which contiguous data is divided are called striping units (cells). For the original data cells of each stripe, a certain number of parity cells are computed and stored; this computation is called encoding. Any lost striping unit can then be recovered by decoding, a computation based on the remaining data cells and parity cells.
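To make the striping idea concrete, here is a minimal Python sketch (not HDFS code): it splits a contiguous byte stream into fixed-size cells, spreads them round-robin across several "disks", and computes one parity cell per stripe. The cell size and the simple XOR parity are illustrative assumptions, not the Reed-Solomon coding HDFS actually uses.

```python
# Illustrative sketch of striping + parity, not HDFS source code.
CELL = 4  # toy striping-unit (cell) size in bytes; real cell sizes are much larger

def stripe(data: bytes, num_disks: int = 3):
    """Split `data` into cells, place them round-robin on data disks,
    and add one XOR parity cell per stripe."""
    cells = [data[i:i + CELL] for i in range(0, len(data), CELL)]
    disks = [[] for _ in range(num_disks)]
    parity_disk = []
    for s in range(0, len(cells), num_disks):
        stripe_cells = cells[s:s + num_disks]
        for d, cell in enumerate(stripe_cells):
            disks[d].append(cell)
        # encoding: the parity cell is the XOR of the data cells in this stripe
        parity = bytearray(CELL)
        for cell in stripe_cells:
            for i, b in enumerate(cell):
                parity[i] ^= b
        parity_disk.append(bytes(parity))
    return disks, parity_disk

disks, parity = stripe(b"0123456789AB")
print(disks)   # data cells spread over 3 "disks"
print(parity)  # one parity cell per stripe; any single lost cell can be XOR-recovered
```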
HDFS data redundancy storage policy
HDFS's default storage strategy is the replica mechanism, which improves the safety of stored data but also brings extra overhead. The default HDFS three-replica scheme has 200% overhead in storage space and other resources such as network bandwidth. For data with relatively low I/O activity, the additional replicas are rarely accessed during normal operation, yet they consume the same amount of resources as the first copy. A major improvement in HDFS 3.x is therefore the option to replace the replica mechanism with erasure coding (EC), which provides the same level of fault tolerance while using much less storage space. In a typical EC setup, the storage overhead is no more than 50%.
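As a rough way to see where those percentages come from (my own summary of the arithmetic, not a quote from the HDFS documentation): storing D bytes with r-way replication costs r·D, while RS(k, m) stores D plus (m/k)·D of parity.

```latex
\mathrm{overhead}_{\mathrm{replication}}(r) = (r-1)\times 100\%
  \quad\Rightarrow\quad r = 3 \text{ gives } 200\%,
\qquad
\mathrm{overhead}_{\mathrm{RS}(k,m)} = \frac{m}{k}\times 100\%
  \quad\Rightarrow\quad \mathrm{RS}(6,3) \text{ gives } 50\%,\;\; \mathrm{RS}(10,4) \text{ gives } 40\%.
```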
EC algorithm implementation principle
There are many EC algorithms. A common one is Reed-Solomon (RS), which takes two parameters and is written RS(k, m): k is the number of data blocks and m is the number of parity blocks. The code can tolerate the loss of up to m blocks (data or parity). The principle is illustrated by the following example:
We use RS(3, 2), i.e. three data blocks and two parity blocks.
Example: given three original data values 7, 8 and 9, two parity values, 50 and 122, are computed through matrix multiplication with a generator matrix. We now have five values in total: 7, 8, 9, 50, 122. Any two of them can be thrown away and then recovered through decoding.
GT is the generator matrix; for RS(k, m) it has m rows and k columns.
Data refers to the original data; 7, 8 and 9 are the original data blocks.
Parity refers to the parity data; 50 and 122 are the parity blocks.
So for 3 data blocks with 2 parity blocks, EC occupies 5 blocks of disk space in total and can tolerate the loss of any 2 blocks; by comparison, a 2-replica scheme for the same 3 data blocks would occupy 6 blocks.
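The following is a minimal sketch of the idea using integer arithmetic with numpy. The generator matrix below is an illustrative assumption (the article's figure, which produced the parity values 50 and 122, is not reproduced here, so the parity values differ), and real RS implementations work over the finite field GF(2^8) rather than over the integers. The point is only to show that any k surviving values are enough to solve for the original data.

```python
import numpy as np

k, m = 3, 2                     # RS(3, 2): 3 data blocks, 2 parity blocks
d = np.array([7, 8, 9])         # the original data blocks from the example

# Full generator matrix: identity on top (keeps the data), parity rows below.
# The parity rows here are an illustrative choice, not the article's figure.
G = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1],
              [1, 1, 1],
              [1, 2, 3]])

codeword = G @ d                # 5 stored values: 3 data + 2 parity
print(codeword)                 # [ 7  8  9 24 50]

# "Lose" any two values, e.g. data block 0 and parity block 0 (rows 0 and 3),
# then solve the remaining 3x3 linear system to recover the data.
survivors = [1, 2, 4]
recovered = np.linalg.solve(G[survivors].astype(float),
                            codeword[survivors].astype(float))
print(np.rint(recovered).astype(int))   # [7 8 9] -- decoding recovers the data
```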
Application scenarios of EC
Integrating EC technology into HDFS can improve storage efficiency while still providing data durability comparable to traditional replica-based HDFS deployments. For example, a file with 6 blocks stored with 3 replicas consumes 6 * 3 = 18 blocks of disk space, but when stored with EC (6 data blocks, 3 parity blocks) it consumes only 9 blocks.
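A quick back-of-the-envelope check of those numbers (plain arithmetic, nothing HDFS-specific):

```python
def replication_blocks(data_blocks: int, replicas: int = 3) -> int:
    # every block is stored `replicas` times
    return data_blocks * replicas

def ec_blocks(data_blocks: int, k: int = 6, m: int = 3) -> int:
    # each group of up to k data blocks gets m parity blocks
    groups = -(-data_blocks // k)          # ceiling division
    return data_blocks + groups * m

print(replication_blocks(6))   # 18 blocks with 3-way replication
print(ec_blocks(6))            # 9 blocks with RS(6, 3)
```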
However, EC consumes considerable CPU during encoding and data reconstruction, and because most of the data is read remotely, it also incurs significant network overhead.
Therefore, when CPU resources are tight and storage is cheap, the replica mechanism is the better choice for storing data; when CPU resources are plentiful and storage is expensive, the EC mechanism can be used instead.
EC architecture in HDFS
HDFS uses online EC directly, writing data in erasure-coded form as it arrives, which avoids a separate conversion phase and immediately saves storage space. Online EC also enhances sequential I/O performance by leveraging multiple disk spindles in parallel; this is especially desirable in clusters with high-end networking. Second, it naturally distributes a small file across multiple DataNodes without having to bundle multiple files into a single coding group. This greatly simplifies file operations such as deletion, disk quota accounting and moves between namespaces.
In a typical HDFS cluster, small files can account for more than 3/4 of the total storage consumption. To better support small files, the first phase of the work adds support for the striped-layout EC scheme; support for the contiguous-layout scheme is still in progress. The two layouts compare as follows:
- Striped layout:
Advantages:
- The client caches less data
- It works regardless of file size, so it suits small files as well
Disadvantages:
- It hurts the performance of some data-locality-sensitive tasks, because blocks that previously resided on one node are now spread across many different nodes
- Conversion to and from the multi-replica storage strategy is more cumbersome
- Contiguous layout:
Advantages:
- Easy to implement
- Convenient to convert to and from the multi-replica storage strategy
Disadvantages:
- The client needs to cache enough blocks of data to compute parity
- Not suitable for storing small files
In traditional HDFS, the basic unit of a file is the block; in EC mode, the basic unit is the block group. Taking RS(3, 2) as an example, each block group contains 3 data blocks and 2 parity blocks.
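To illustrate the striped layout, here is a small Python sketch (illustrative only) that maps successive data cells of a file to the internal blocks of an RS(3, 2) block group: logical data goes round-robin over the 3 data blocks, and each stripe also gets parity cells on the 2 parity blocks. Under the contiguous layout, by contrast, the first block would be filled completely before the second one is started.

```python
K, M = 3, 2   # RS(3, 2): 3 data blocks + 2 parity blocks per block group

def striped_placement(num_cells: int):
    """Which internal data block of the block group does the i-th cell land on?"""
    layout = {b: [] for b in range(K)}      # internal data blocks 0..K-1
    for cell in range(num_cells):
        layout[cell % K].append(cell)       # round-robin over the data blocks
    return layout

print(striped_placement(8))
# {0: [0, 3, 6], 1: [1, 4, 7], 2: [2, 5]}
# Cells 0..2 form stripe 0, cells 3..5 stripe 1, and so on; each stripe also
# gets M parity cells, one on each of the 2 parity blocks of the block group.
```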
The main extensions HDFS makes to support EC mode are as follows:
- NameNode: HDFS files are logically composed of block groups, and each block group contains a certain number of internal blocks. To reduce the NameNode memory consumed by these internal blocks, HDFS introduces a new hierarchical block naming protocol: the ID of a block group can be inferred from the ID of any of its internal blocks, which allows management at the block-group level rather than the block level (a small sketch follows this list).
- Client: The Client read and write paths have been enhanced to process multiple internal blocks in a block group in parallel
- DataNode: The DataNode runs an additional ErasureCodingWorker (ECWorker) task for background recovery of failed erasure-coded blocks. The NameNode detects a failed EC block and selects a DataNode to perform the recovery, a process similar to how a failed replica is re-replicated. Reconstruction involves three key tasks:
- Read data from the source nodes: input data is read in parallel from the source nodes using a dedicated thread pool. Based on the EC policy, read requests are issued to all source targets, and only the minimum number of input blocks needed for reconstruction is read.
- Decode the data and generate output data: new data and parity blocks are decoded from the input data. All missing data and parity blocks are decoded together.
- Transfer the generated blocks to the target nodes: once decoding is complete, the recovered blocks are transferred to the target DataNodes.
- Erasure coding policies: To accommodate heterogeneous workloads, files and directories in an HDFS cluster are allowed to have different replication and erasure coding policies. An erasure coding policy encapsulates how a file is encoded and decoded. Each policy is defined by the following information:
- EC schema: this includes the numbers of data and parity blocks in an EC group (for example, 6 + 3) and the codec algorithm (for example, Reed-Solomon or XOR).
- The size of a striping cell: this determines the granularity of striped reads and writes, including buffer sizes and encoding work.
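A minimal sketch of the hierarchical naming idea mentioned in the NameNode item above: reserve a few low bits of the block ID for the index of the internal block inside its group, so the block group ID is obtained just by masking them off. The 4-bit mask below matches, to my knowledge, the general scheme HDFS uses for EC block IDs, but treat the exact bit layout as an implementation detail.

```python
INDEX_BITS = 4                      # low bits reserved for the index within the group
INDEX_MASK = (1 << INDEX_BITS) - 1  # 0b1111

def internal_block_id(group_id: int, index: int) -> int:
    # group IDs are allocated with the low bits zeroed, so the index just fills them
    return group_id | index

def block_group_id(block_id: int) -> int:
    return block_id & ~INDEX_MASK   # any internal block maps back to its group

def block_index(block_id: int) -> int:
    return block_id & INDEX_MASK

group = 0x1234560                                          # example group ID (low 4 bits zero)
blocks = [internal_block_id(group, i) for i in range(5)]   # RS(3,2): 5 internal blocks
assert all(block_group_id(b) == group for b in blocks)
print([hex(b) for b in blocks], block_index(blocks[4]))    # ..., 4
```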
We can define our own EC policies in an XML file, which must contain the following three parts:
- LayoutVersion: This represents the version of the EC policy XML file format.
- Schemas: This includes all user-defined EC schemas.
- Policies: This includes all user-defined EC policies, each consisting of a Schema ID and the size of the striped unit (cellSize).
The Hadoop conf directory contains a sample XML file for configuring EC policies, named user_ec_policies.xml.template, which you can refer to when writing your own.
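As a hedged sketch of what such a policy file looks like (the element names follow the three sections described above; check the bundled user_ec_policies.xml.template for the authoritative format):

```xml
<?xml version="1.0"?>
<configuration>
  <layoutversion>1</layoutversion>
  <schemas>
    <!-- a user-defined schema: Reed-Solomon with 10 data and 4 parity units -->
    <schema id="RSk10m4">
      <codec>rs</codec>
      <k>10</k>
      <m>4</m>
    </schema>
  </schemas>
  <policies>
    <!-- a policy = a schema plus the striping cell size in bytes -->
    <policy>
      <schema>RSk10m4</schema>
      <cellsize>1048576</cellsize>
    </policy>
  </policies>
</configuration>
```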
Hardware configuration for the cluster
Erasure codes impose additional requirements on the cluster in terms of CPU and network:
- The encoding and decoding work will consume additional CPU on the HDFS client and DataNode.
- Erasure-coded files are also spread across racks to achieve rack fault tolerance. This means that when reading and writing striped files, most operations are off-rack, so network bisection bandwidth is very important.
- For rack fault tolerance, it is also important to have at least as many racks as the configured EC stripe width. For the EC policy RS(6,3), this means a minimum of 9 racks, and ideally 10 or 11, to handle planned and unplanned outages. For clusters with fewer racks than the stripe width, HDFS cannot maintain rack fault tolerance, but it will still attempt to spread a striped file across multiple nodes to preserve node-level fault tolerance.
Finally
By default, all EC policies are disabled. We can enable a policy with the hdfs ec -enablePolicy -policy <policyName> command, choosing it according to the size of the cluster and the desired fault-tolerance properties. For example, for a cluster with 9 racks, a policy such as RS-10-4-1024k will not preserve rack-level fault tolerance, whereas RS-6-3-1024k or RS-3-2-1024k may be more appropriate.
Under the replica mechanism we can set the replication factor to specify the number of copies, but under an EC policy the replication factor is meaningless: it is always 1 and cannot be changed by the related commands.