Server SAN concept

A Server SAN is a storage resource pool made up of storage resources provided by multiple independent servers; it integrates storage and computing resources.

SDS (Software-Defined Storage)

Hard disk resources are pooled and defined through software programming.

Software-defined storage

  • Programmable, policy-driven
  • Storage virtualization and resource pooling
  • Heterogeneous storage resource management
  • APIs expose system capabilities

Storage Resource Layer

  • Traditional hardware and software integrated storage resources
  • Software-based/standard hardware storage

SDS versus traditional storage

Traditional storage system design

Bottom-up

Configure RAID groups from disk groups, create storage pools from the RAID groups, and carve LUNs out of the storage pools. Map the LUNs to the upper-layer service hosts, for example, create a 500 GB LUN for a host.

Design of SDS storage system

Top-down

The system creates one large storage pool, for example 500 TB, and connects it to the virtualization layer or cloud platform. When a VM is created, a vDisk is created in the storage pool on demand: for example, if a 500 GB hard disk is created on the cloud platform, a 500 GB virtual disk is automatically created in the storage pool. Instead of pre-creating a vDisk for each application, the entire storage pool is made available to applications. At the bottom of the pool, many hard disks from different servers are combined, and RAID is not configured on the servers. Distributed storage relies on two mechanisms to ensure data reliability: 1. a replica mechanism (for example, three copies, with a copy of the data on the hard disks of three different servers); 2. an EC (erasure coding) algorithm.
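As a rough illustration of the two reliability mechanisms, the sketch below compares usable capacity under three-way replication and under an assumed EC profile of 4 data + 2 parity chunks (the EC profile is an assumption for the example, not a value from the text):

```python
# Minimal sketch: usable capacity under replicas vs. erasure coding.
# Pool size and the 4+2 EC profile are illustrative assumptions.

def usable_with_replicas(raw_tb: float, copies: int = 3) -> float:
    """Each block is stored `copies` times, so usable space is raw / copies."""
    return raw_tb / copies

def usable_with_ec(raw_tb: float, data_chunks: int = 4, parity_chunks: int = 2) -> float:
    """Erasure coding stores k data chunks plus m parity chunks per stripe."""
    return raw_tb * data_chunks / (data_chunks + parity_chunks)

raw = 500.0  # the 500 TB pool from the example above
print(f"3 replicas : {usable_with_replicas(raw):.1f} TB usable")  # ~166.7 TB
print(f"EC 4+2     : {usable_with_ec(raw):.1f} TB usable")        # ~333.3 TB
```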

Data replicas

Data and its replicas should not be placed on the same server, in the same cabinet, or even in the same equipment room.

New services drive a new resource supply model

Traditional chimney (silo) architecture

SAN: provides a raw device by mapping space from the storage array to the host. What the host sees is a hard disk, which must be partitioned and formatted on the host. Chimney stack: each storage system serves only its own services (for example, a SAN array dedicated to databases) and cannot be flexibly expanded.

Disadvantages

  • Unbalanced resource utilization
In chimney-style storage, for example, one 500 GB LUN may be very busy, making the disks behind it a bottleneck, while another 500 GB LUN sits almost idle. In distributed storage, all data is evenly distributed across the disks, so this kind of silo bottleneck cannot occur.
  • Limited scalability
  • Multiple system platforms to manage

New architecture - cloud

Ceph

Ceph unified storage architecture

One storage pool is used for object storage, one for block storage, and one for file storage. For file storage, the pool is formatted with the Ceph file system, namely CephFS. A given storage pool can only serve one of these purposes.

Ceph components

The process of storing data

Suppose there are 5 servers, each with 12 hard disks of 1 TB; that gives 60 TB of raw disk space, and with 3 copies roughly 20 TB of usable space. Suppose a client needs to store a 100 MB file: Ceph uses a DHT-style hash algorithm to find a node, such as Node1, and writes the data there; the data on Node1 is then backed up to other nodes over the back-end network.
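A minimal sketch of this hash-based placement idea, assuming made-up node names and a plain MD5 hash standing in for Ceph's real placement logic:

```python
# Rough sketch of hash-based placement: hash the object name to pick a
# primary node, then place the backups on the next nodes in the list.
# This is an illustration only, not Ceph's actual algorithm.
import hashlib

NODES = ["node1", "node2", "node3", "node4", "node5"]  # the 5 servers above

def place(object_name: str, copies: int = 3) -> list[str]:
    digest = int(hashlib.md5(object_name.encode()).hexdigest(), 16)
    primary = digest % len(NODES)
    # the primary plus the next (copies - 1) nodes hold the backups
    return [NODES[(primary + i) % len(NODES)] for i in range(copies)]

print(place("file-100MB.part0"))  # e.g. ['node2', 'node3', 'node4']
```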

Mon (monitor)

The Monitor stores all the mapping-table relationships, that is, the metadata; this metadata also needs three copies. When the data is read later, the mapping table tells which node each piece of the file is stored on. For example, Ceph hashes slice D1 to Node1 and backs it up as D1' on Node2 over the back-end network, and hashes slice D2 to Node3 and backs it up as D2' on Node4. This mapping is then recorded in the Monitor.
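The mapping table the Monitor keeps can be pictured as a small lookup structure; the sketch below is a toy illustration using the slice and node names from the example, not Ceph's actual data structures:

```python
# Toy illustration of the mapping table: each slice of a file maps to the
# node holding it and the node holding its backup.
mapping_table = {
    "file.txt": {
        "D1": {"primary": "node1", "backup": "node2"},
        "D2": {"primary": "node3", "backup": "node4"},
    }
}

def locate(filename: str, slice_id: str) -> dict:
    """Look up where a given slice of a file is stored."""
    return mapping_table[filename][slice_id]

print(locate("file.txt", "D1"))  # {'primary': 'node1', 'backup': 'node2'}
```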

When the client wants to read this data, it gets the mapping from the Monitor and reassembles the fragments stored on each node.

A Ceph cluster has at least three nodes

A Ceph cluster requires at least three nodes. Even if there are 100 nodes, the cluster still needs Monitors on only three of them; all the other nodes are OSD nodes.

OSD (Object Storage Device)

Each hard disk corresponds to one OSD. An OSD is actually a process that manages one physical disk; to Ceph, an OSD is a logical disk.

Three nodes prevent split-brain

The cluster prevents split-brain through a voting mechanism in which each node casts one vote.

Consider a distributed cluster with four nodes. Suppose Node1 and Node2 end up on one network (network A) and Node3 and Node4 on another (network B), and the two networks become isolated from each other. The nodes on the two networks may then write to the same data in the shared storage pool at the same time, and because the nodes on network A are unaware of the nodes on network B, each side tries to grab the cluster resources for itself. To solve this, a voting mechanism is used: the quorum is SUM/2+1, that is, the total number of votes divided by 2, plus 1. In the figure, 4/2+1 = 3 votes, so the cluster keeps working only if at least three of the four nodes are alive. With a 3-node cluster, 3/2+1 = 2, so the cluster continues to work as long as two nodes survive. If a split-brain occurs where Node1 and Node2 are on network A and Node3 and Node4 are on network B, each side has only 2 nodes, below the quorum of 3, so both sides stop serving ("commit suicide"). Stopping is safer than carrying on: at worst the service goes down, whereas continuing to write would corrupt the data.
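A minimal sketch of the SUM/2+1 quorum rule described above:

```python
# Quorum rule: a partition keeps serving only if it can see a majority.
def quorum(total_nodes: int) -> int:
    """Votes required for a partition to keep serving."""
    return total_nodes // 2 + 1

def partition_survives(total_nodes: int, visible_nodes: int) -> bool:
    return visible_nodes >= quorum(total_nodes)

print(quorum(4))                 # 3 votes needed in a 4-node cluster
print(partition_survives(4, 2))  # False: a 2+2 split-brain stops both sides
print(partition_survives(3, 2))  # True: a 3-node cluster survives losing one node
```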

MDS (MetaData Server)

Only CephFS has this role. It stores file system metadata (the kind of metadata a local file system such as NTFS or EXT4 maintains when data is written); because CephFS is a distributed file system, a dedicated role is needed to hold this metadata. The MDS is therefore optional and only required when CephFS is used. The Monitor, by contrast, holds the metadata about the data slices.

Key concepts of Ceph

The MON calculates which OSDs will store the data, that is, which hard disks the data is written to. The MDS can also keep multiple copies of its data.

Ceph data storage procedures

A file is sliced into objects, and each slice corresponds to an object ID (OID). Hashing the OID and applying a mask yields a PGID (a PG is similar to a folder containing many small slices; the PGID is the folder number). The CRUSH algorithm then determines which OSD nodes the objects in that PG folder are stored on.
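A sketch of this addressing chain, object → PG → OSDs, assuming 128 PGs and 9 OSDs; the final placement step is a plain hash standing in for the real CRUSH algorithm:

```python
# Sketch of the addressing chain: OID -> PGID -> OSDs.
# The "crush" step here is a simple modulo for illustration only.
import hashlib

PG_NUM = 128                 # must be a power of two
MASK = PG_NUM - 1            # so that hash(oid) & MASK is a valid PG id
OSDS = [f"osd.{i}" for i in range(9)]

def oid_to_pgid(oid: str) -> int:
    h = int(hashlib.md5(oid.encode()).hexdigest(), 16)
    return h & MASK

def pgid_to_osds(pgid: int, replicas: int = 3) -> list[str]:
    start = pgid % len(OSDS)   # stand-in for CRUSH
    return [OSDS[(start + i) % len(OSDS)] for i in range(replicas)]

oid = "myfile.0000"            # one slice of a file
pgid = oid_to_pgid(oid)
print(pgid, pgid_to_osds(pgid))
```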

PG quantity planning

PG: Placement Group

No matter how the data is written in, it is routed to a PG; through the CRUSH map, the PG writes the data to OSDs, and each OSD writes to its hard disk.

CRUSH map algorithm

Suppose a storage pool has 10 shards of unique data

Counting the copies, there are 30 shards in total. The CRUSH map simply breaks the PG data up and writes it across the individual hard drives; the Monitor manages the CRUSH map.

Hash ring

When the storage pool is created, the hard disks are divided into many pieces and placed on a map (the hash ring). A file is cut into 10 pieces, so there are 10 OIDs; hashing each OID locates a position for it on the hash ring.

Disadvantage: when the amount of data is very large, every lookup is very slow

If there are 10 million objects, the Monitor has to record a huge amount of mapping data, and every query becomes slow. To solve this problem, the concept of the PG (placement group) is introduced.

hash(OID) & mask yields a PGID

Say there are 128 PGs: hash(OID) & mask gives a PGID, and the data is written to that PG. A PG is a logical concept, like a folder that holds many files. The PG's data is then written to a set of OSD nodes using the CRUSH algorithm.

Pg advantage

Narrowing the search scope speeds up lookups.

Without PG

With PG

Are more PGs always better?

The PG count should be a power of 2 (2^n). The formula is: (number of OSDs × number of PGs per OSD) / number of replicas / number of storage pools. The default number of PGs per OSD is 100. For example, with 9 OSD nodes, 3 replicas, and 2 storage pools: 9 × 100 / 3 / 2 = 150, which is rounded up to the next power of 2, so the PG count is 256. The number of PGs is planned when the pool is created.
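A small sketch of that planning formula, rounding the result up to the next power of two:

```python
# PG planning: (OSDs * PGs-per-OSD) / replicas / pools, rounded up to 2^n.
def plan_pg_count(osds: int, replicas: int, pools: int, pgs_per_osd: int = 100) -> int:
    raw = osds * pgs_per_osd / replicas / pools
    power = 1
    while power < raw:
        power *= 2
    return power

print(plan_pg_count(osds=9, replicas=3, pools=2))  # 9*100/3/2 = 150 -> 256
```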

OSD writing process

The client writes to the primary OSD's cache; the primary's cache writes to the caches of the two replica OSDs; the replicas then acknowledge the primary, and the primary acknowledges the client.
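A toy simulation of this write path, assuming simple in-memory OSD objects; it only illustrates the acknowledgement order, not real Ceph internals:

```python
# Toy write path: client -> primary OSD -> replica OSDs -> acks back.
class OSD:
    def __init__(self, name: str):
        self.name = name
        self.cache: list[bytes] = []

    def write(self, data: bytes) -> str:
        self.cache.append(data)        # the write lands in the OSD's cache first
        return f"ack from {self.name}"

def client_write(data: bytes, primary: OSD, replicas: list[OSD]) -> str:
    primary.write(data)
    acks = [r.write(data) for r in replicas]   # primary fans out to the replicas
    assert len(acks) == len(replicas)          # all replicas have acknowledged
    return "ack to client"                     # only now does the client hear back

osds = [OSD("osd.0"), OSD("osd.1"), OSD("osd.2")]
print(client_write(b"block-0", osds[0], osds[1:]))
```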

Ceph can be used as back-end storage for Oracle databases

Ceph can also provide object storage through RGW