1. OceanBase

1.1 OceanBase Basic Architecture

  • Multiple replicas: OceanBase is typically deployed across three zones, and each zone consists of multiple nodes/servers (OBServers)
  • Fully peer-to-peer nodes: each node has its own SQL engine and storage engine and manages its own data partitions
  • Shared-nothing: data is distributed across the nodes and does not depend on any shared storage

Data partitioning is the basic unit of OceanBase's data architecture; it is the distributed-system counterpart of partitioned tables in traditional databases. High availability and strong consistency come from multiple replicas plus the Paxos protocol, which guarantees that data (the redo log) is persisted on at least two machines, i.e. a majority (two or three copies of the baseline data are distributed on the ChunkServers in each zone, while incremental data is kept on an active/standby pair of UpdateServers).
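To make the majority rule concrete, here is a minimal toy sketch (not OceanBase code; the function and counts are illustrative) of why persisting the log on two out of three replicas is enough for a commit:

```python
# Toy illustration of majority (Paxos-style) commit; not OceanBase source code.
def is_committed(ack_count: int, replica_count: int) -> bool:
    """A log record is committed once a majority of replicas has persisted it."""
    return ack_count >= replica_count // 2 + 1

# With 3 replicas (one per zone), 2 acknowledgements are enough to commit,
# so the failure of any single zone does not lose committed data.
assert is_committed(ack_count=2, replica_count=3)
assert not is_committed(ack_count=1, replica_count=3)
```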

OBProxy: a stateless proxy capable of handling request volumes in the millions, responsible for routing and forwarding, with a lightweight SQL parser

  • Reverse proxy function
  • Performance requirements
  • Operational requirements

1.2 OceanBase features

  • High performance: the storage layer adopts a read/write-separated architecture, the compute engine is optimized along the full link, and the result is quasi-in-memory database performance.
  • Low cost: ordinary PC servers and commodity SSDs are used; a high storage compression ratio lowers storage cost and high performance lowers compute cost, while the multi-tenant hybrid storage system makes full use of system resources.
  • High availability: data is stored in multiple replicas, so the failure of a minority of replicas does not affect availability. In the "three regions, five data centers" deployment mode, automatic, lossless city-level disaster recovery (DR) is achieved.
  • Strong consistency: transaction logs are synchronized to multiple replicas over the Paxos protocol, and a transaction commits only after a majority of replicas succeed. By default, both reads and writes are served by the leader replica, which guarantees strong consistency.
  • Scalable: all nodes in the cluster are peers, and every node has both compute and storage capabilities, so there is no single point of bottleneck. The cluster can be scaled out and scaled in linearly and online.
  • Compatible: OceanBase is compatible with common MySQL/Oracle features and with the MySQL/Oracle front-end and back-end protocols, so services can be migrated from MySQL/Oracle to OceanBase with little or no modification.

1.3 Application Scenarios

  • Financial level data reliability requirements

Financial environments usually have higher requirements on data reliability. For every transaction committed in OceanBase, the corresponding log is synchronized to and persisted in multiple data centers in real time, so even in a data-center-level disaster, every completed transaction can be recovered in another data center, meeting true financial-grade reliability requirements.

  • A database that keeps up with rapidly growing business

Rapid business growth usually puts double pressure on the database. As a truly distributed relational database, OceanBase uses independent commodity servers as the nodes of the system. Data is automatically distributed across the nodes according to capacity and availability, and when the data volume keeps increasing, OceanBase can automatically expand the number of nodes to meet business requirements.

  • Uninterrupted service

Continuous service usually means giving customers the smoothest product experience. In a distributed OceanBase cluster, if a node becomes abnormal, it is automatically removed from service; the data on that node has multiple replicas, so the corresponding data service is taken over by other nodes. Even if an entire data center fails, OceanBase can switch service to nodes in other data centers within a short time to ensure service continuity.

1.4 Enterprises in use

1.5 Salary

1.6 Related articles or videos

Alibaba's open-source database OceanBase: from usage to architecture analysis

2. TiDB

2.1 Overall architecture of TiDB

TiDB Server

TiDB Server is responsible for receiving SQL requests and processing SQL-related logic; it locates, through PD, the TiKV addresses that store the data needed for the computation, interacts with TiKV to fetch the data, and finally returns the result. TiDB Server is stateless: it does not store data itself and only performs computation, so it can be scaled out without limit and can expose a unified access address externally through a load-balancing component (LVS, HAProxy, or F5).
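Because TiDB Server speaks the MySQL protocol and is stateless, a client can point any standard MySQL driver at the load balancer. A minimal sketch with PyMySQL, assuming a hypothetical load-balancer address and the default TiDB port 4000:

```python
import pymysql

# Hypothetical load balancer (LVS/HAProxy/F5) fronting several stateless
# TiDB Server instances; any of them can serve the query.
conn = pymysql.connect(host="tidb-lb.example.com", port=4000,
                       user="root", password="", database="test")
try:
    with conn.cursor() as cur:
        cur.execute("SELECT VERSION()")   # ordinary MySQL-protocol traffic
        print(cur.fetchone())
finally:
    conn.close()
```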

PD Server

Placement Driver (PD for short) is the management module of the whole cluster. It has three main jobs: first, it stores the cluster's metadata (which TiKV node a given Key lives on); second, it schedules and load-balances the TiKV cluster (data migration, Raft group leader transfer, and so on); third, it allocates globally unique, monotonically increasing transaction IDs. PD itself is a cluster and must be deployed with an odd number of nodes; at least three nodes are recommended in production. While an election is in progress, which takes roughly three seconds, PD cannot serve external requests.
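The timestamp allocation can be pictured as a hybrid of physical time and a logical counter. The sketch below is a simplified, single-process illustration of that idea only, not PD's actual implementation:

```python
import time
import threading

class ToyTSO:
    """Simplified timestamp oracle: physical milliseconds plus a logical counter.

    PD's real TSO service is distributed and fault tolerant; this toy version
    only shows why the allocated IDs are globally unique and monotonically increasing.
    """
    LOGICAL_BITS = 18  # room for many timestamps within the same millisecond

    def __init__(self):
        self._lock = threading.Lock()
        self._last_physical = 0
        self._logical = 0

    def get_ts(self) -> int:
        with self._lock:
            physical = int(time.time() * 1000)
            if physical <= self._last_physical:
                # Same millisecond (or clock went backwards): bump the counter.
                physical = self._last_physical
                self._logical += 1
            else:
                self._last_physical = physical
                self._logical = 0
            return (physical << self.LOGICAL_BITS) | self._logical

tso = ToyTSO()
assert tso.get_ts() < tso.get_ts()  # strictly increasing
```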

TiKV Server

TiKV Server is responsible for storing data. Externally, TiKV is a distributed, transactional key-value storage engine. The basic unit of data storage is the Region: each Region stores the data of one Key Range (from StartKey to EndKey), and each TiKV node is responsible for multiple Regions. TiKV uses the Raft protocol for replication to ensure data consistency and disaster recovery. Replicas are managed per Region: the replicas of a Region on different nodes form a Raft group and are copies of each other. The load balancing of data across TiKV nodes is scheduled by PD, also at Region granularity.
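The Region concept can be illustrated with a small routing table: given Regions sorted by StartKey, a key is routed to the Region whose [StartKey, EndKey) range contains it. A toy sketch with made-up Region boundaries (in TiDB this metadata lives in PD, not in the client):

```python
import bisect

# Hypothetical routing table: Regions sorted by StartKey, each covering [StartKey, EndKey).
regions = [
    {"id": 1, "start": b"",  "end": b"g"},
    {"id": 2, "start": b"g", "end": b"n"},
    {"id": 3, "start": b"n", "end": b""},   # empty end = +infinity
]
starts = [r["start"] for r in regions]

def locate_region(key: bytes) -> dict:
    """Find the Region whose key range contains `key`."""
    idx = bisect.bisect_right(starts, key) - 1
    return regions[idx]

assert locate_region(b"apple")["id"] == 1
assert locate_region(b"grape")["id"] == 2
assert locate_region(b"zebra")["id"] == 3
```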

2.2 TiDB features

  • One-click horizontal scale-out and scale-in: thanks to TiDB's architecture separating storage from compute, compute and storage capacity can be expanded or shrunk online as required, and the adjustment process is transparent to application O&M personnel.

  • Financial-grade high availability: data is stored in multiple replicas, and transaction logs are synchronized among the replicas with the Multi-Raft protocol. A transaction commits only when a majority of the replicas have written it successfully, which guarantees strong consistency and keeps data available when a minority of replicas fail. The location and number of replicas can be configured as required to meet different disaster recovery (DR) levels.

  • Real-time HTAP: TiDB provides two storage engines, the row store TiKV and the column store TiFlash. TiFlash replicates data from TiKV in real time through the Multi-Raft Learner protocol, ensuring strong data consistency between TiKV and TiFlash. TiKV and TiFlash can be deployed on different machines as required to solve the problem of HTAP resource isolation.

  • Cloud-native distributed database: TiDB is designed for the cloud and can be deployed in a tooling-driven, automated way on public, private, and hybrid clouds using TiDB Operator.

  • Compatible with the MySQL 5.7 protocol and the MySQL ecosystem: TiDB is compatible with the MySQL 5.7 protocol, common MySQL features, and the MySQL ecosystem, so applications can be migrated from MySQL to TiDB with no code changes or only small ones. A variety of data migration tools are also provided to ease migration.

2.3 Application Scenarios

  • Financial industry scenarios that require high data consistency, reliability, availability, scalability, and disaster recovery. As is well known, the financial industry has high requirements in all of these dimensions. In the traditional solution, two data centers in the same city provide services while a third in a remote city provides disaster recovery capability but no service. This solution has low resource utilization, high maintenance cost, and RTO (Recovery Time Objective) and RPO (Recovery Point Objective) values that fall short of what enterprises want. TiDB uses multiple replicas plus the Multi-Raft protocol to schedule data across machine rooms, racks, and machines; when some machines fail, the system fails over automatically, ensuring RTO <= 30 s and RPO = 0.

  • Massive data and high-concurrency OLTP scenarios that require high storage capacity, scalability, and concurrency.

As the business develops at high speed, data grows explosively, and a traditional single-machine database can no longer satisfy the capacity requirements that this growth places on it. Feasible alternatives are database/table sharding middleware, high-end storage hardware, or a NewSQL database, among which a NewSQL database such as TiDB is the most cost-effective. TiDB supports a maximum of 512 compute nodes, each node supports a maximum of 1,000 concurrent connections, and the cluster capacity scales to the PB level.

  • Real-time HTAP scenarios. With the rapid development of 5G, the Internet of Things, and artificial intelligence, enterprises produce more and more data, whose scale may reach hundreds of terabytes or even petabytes. The traditional solution is to handle online transactions with an OLTP database and synchronize the data through ETL tools into an OLAP database for analysis, which brings problems such as high storage cost and poor freshness. In version 4.0 TiDB introduces the columnar storage engine TiFlash alongside the row-based engine TiKV to build a true HTAP database: at the cost of a small amount of extra storage, online transaction processing and real-time analytics can be done in the same system, which greatly reduces cost for enterprises.

  • Data aggregation and secondary processing scenarios. Most enterprise business data today is scattered across different systems without a unified view. As the business develops, decision makers need to understand the state of the whole company in order to act in time, so the data scattered across systems must be gathered into one system and post-processed to generate T+0 or T+1 reports. The traditional solution is ETL plus Hadoop, but the Hadoop stack is complex and its operation, maintenance, and storage costs are too high to meet users' needs. Compared with Hadoop, TiDB is much simpler: the business synchronizes data into TiDB with ETL tools or TiDB's own synchronization tools, and reports can then be generated directly with SQL.

TiDB v4.0 brings a number of improvements in stability, ease of use, performance, security, and functionality.

2.4 Enterprises in Use

TiDB, with its leading technical capabilities and a complete commercial service and support system, helps users in finance, Internet, retail, logistics, manufacturing, public service, and other industries build future-oriented data service platforms. TiDB is currently used by more than 1,500 leading companies worldwide, including Bank of China, China Everbright Bank, Shanghai Pudong Development Bank, China Zheshang Bank, Bank of Beijing, China UnionPay, China Life, Ping An Life, Lufax, China Mobile, China Unicom, China Telecom, State Grid, Li Auto, XPeng Motors, vivo, OPPO, Yum China, China Post, SF Express, ZTO Express, Tencent, Meituan, JD.com, Pinduoduo, Xiaomi, Sina Weibo, 58.com, 360, Zhihu, iQIYI, Bilibili, Square (US), Dailymotion (France), Shopee (Singapore), ZaloPay (Vietnam), BookMyShow (India), and many others.

2.5 Salary

2.6 Relevant articles or videos

TiDB overview and overall architecture

3. Cassandra

Cassandra is an open-source, distributed, decentralized (no central node), elastically scalable, highly available, fault-tolerant, tunably consistent, column-oriented NoSQL database.

3.1 Cassandra internal architecture

3.1.1 Cassandra Cluster (Cluster)

  • Cluster
    • Data center(s)
      • Rack(s)
        • Server(s)
          • Node (more accurately, a vnode)
  • Node: An instance that runs Cassandra
  • Rack: A collection of nodes
  • DataCenter: A set of racks
  • Cluster: the full set of nodes that together own the complete token ring

3.1.2 Coordinator

When a client connects to a node and issues a read or write request, that node acts as a bridge between the client application and the nodes that own the requested data. This node is called the coordinator, and it determines, based on the cluster configuration, which nodes in the ring should receive the request. When connecting with CQL, the node you connect to is the coordinator node.

  • Any node in the cluster can be the coordinator
  • Each client request may be coordinated by a different node
  • The coordinator manages replication (the replication factor: how many nodes a new piece of data should be copied to)
  • The coordinator applies the consistency level (how many nodes in the cluster must acknowledge a read or write); a minimal driver example follows this list
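A minimal sketch with the DataStax Python driver; the contact points, keyspace, table, and column are hypothetical. The node the driver routes the request to acts as the coordinator, and the consistency level controls how many replicas must acknowledge:

```python
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel

# Hypothetical contact points; any node reached here can act as coordinator.
cluster = Cluster(["10.0.0.1", "10.0.0.2"])
session = cluster.connect("demo_keyspace")           # assumed keyspace

stmt = SimpleStatement(
    "SELECT * FROM users WHERE user_id = %s",        # assumed table/column
    consistency_level=ConsistencyLevel.QUORUM,       # a majority of replicas must answer
)
for row in session.execute(stmt, ("alice",)):
    print(row)

cluster.shutdown()
```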

3.1.3 Partitioner

The partitioner determines how data is distributed within the cluster. In Cassandra, each row of a table is identified by a unique primary key, and the partitioner is essentially a hash function that computes a token from the partition key; Cassandra places the row in the cluster based on this token value.

Cassandra ships with three partitioners (a toy sketch of token computation follows the list):

  • Murmur3Partitioner (default) – Distributes data evenly across the cluster based on the MurmurHash hash value
  • RandomPartitioner – Distributes data evenly across a cluster based on the MD5 hash value
  • ByteOrderedPartitioner – keeps data items ordered lexically by the raw bytes of the key
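A toy sketch of what a partitioner does: hash the partition key to a token. For simplicity it uses MD5 (the basis of RandomPartitioner) rather than Murmur3, and the token range shown is RandomPartitioner's 0..2^127-1:

```python
import hashlib

def toy_token(partition_key: bytes) -> int:
    """Stand-in partitioner: derive a token from the partition key.

    RandomPartitioner derives tokens from MD5 like this; Murmur3Partitioner
    (the default) uses MurmurHash3 and a signed 64-bit token range instead.
    """
    digest = hashlib.md5(partition_key).digest()
    return int.from_bytes(digest, "big") % 2**127

print(toy_token(b"user:42"))   # the same key always maps to the same token
```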

3.1.4 Virtual Node (Vnode)

Each virtual node corresponds to a token value, and each token determines that node's position in the ring and the contiguous range of data hash values the node is responsible for. Each node therefore owns a contiguous segment of tokens, and together these segments form a closed ring.

Without virtual nodes, the number of tokens in the ring equals the number of machines in the cluster. For example, with 6 nodes the token count is 6. With a replication factor of 3, a record must exist on three nodes of the cluster: the simple scheme is to compute the hash of the row key, see which token in the ring it falls on, place the first copy on that node, and place the remaining two copies on the next two nodes clockwise along the ring.

In the figure, A, B, C, D, E, and F are key ranges; the actual values are the hash ring space, for example 0 to 2^32 divided into 10 parts, each segment being 1/10 of 2^32. Node 1 contains A, F, and E, meaning that data whose keys fall in ranges A, F, and E is stored on node 1, and so on.

If virtual nodes are not used, a token must be calculated and assigned to each node in the cluster manually; each token determines the node's position in the ring and the range of consecutive data hash values that the node is responsible for. In the top half of the figure, each node is assigned a single token representing one position in the ring, and each node stores the rows whose keys hash into the single contiguous range that the node owns. Each node also holds copies of rows from other nodes.

With virtual nodes, each node owns multiple smaller, discrete hash ranges. In the lower part of the figure, the nodes in the cluster use virtual nodes; the vnode tokens are randomly selected and non-contiguous. The placement of data is still determined by the hash of the row key, but it falls into one of these smaller partition ranges. A toy sketch of such a ring follows.
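A toy sketch of the token ring with vnodes, assuming made-up node names: each node claims several random tokens, and a row is placed on the nodes owning the next tokens clockwise from the row's token (which also roughly mirrors SimpleStrategy's clockwise replica placement described below):

```python
import bisect
import hashlib
import random

random.seed(7)
RING_SIZE = 2**32
NODES = ["node1", "node2", "node3"]
VNODES_PER_NODE = 8                      # each node owns several small ranges

# Build the ring: (token, owning node), with several tokens per node (vnodes).
ring = sorted((random.randrange(RING_SIZE), n)
              for n in NODES for _ in range(VNODES_PER_NODE))
tokens = [t for t, _ in ring]

def replicas(row_key: bytes, rf: int = 3) -> list[str]:
    """Walk clockwise from the row's token and pick `rf` distinct nodes."""
    token = int.from_bytes(hashlib.md5(row_key).digest(), "big") % RING_SIZE
    idx = bisect.bisect_right(tokens, token)
    owners: list[str] = []
    while len(owners) < rf:
        _, node = ring[idx % len(ring)]
        if node not in owners:
            owners.append(node)
        idx += 1
    return owners

print(replicas(b"sensor-17"))   # e.g. ['node2', 'node1', 'node3']
```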

Benefits of using Virtual Nodes

  • There is no need to calculate and assign tokens to each node
  • After nodes are added and removed, there is no need to rebalance cluster load
  • It is faster to rebuild abnormal nodes

3.1.5 Replication

There are currently two replication strategies available:

  • SimpleStrategy: for a single data center only. The first replica is placed on the node determined by the partitioner, and the remaining replicas on the following nodes clockwise around the ring.

  • NetworkTopologyStrategy: Applies to complex multi-data centers. You can specify how many Replicas to store in each data center.

The replication policy is specified when the keyspace is created, for example

CREATE KEYSPACE Excelsior WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };
CREATE KEYSPACE Excalibur WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'dc1' : 3, 'dc2' : 2 };

The data center names, such as dc1 and dc2, must match the names configured in the snitch. The topology strategy above stores three replicas in dc1 and two replicas in dc2.

3.1.6 Gossip

The Gossip protocol is the internal communication technology that cluster nodes use to talk to each other. Gossip is an efficient, lightweight, and reliable broadcast protocol for exchanging data between nodes; it is a decentralized, fault-tolerant, peer-to-peer communication protocol. Cassandra uses gossiping for peer discovery and metadata propagation.

The gossip protocol is configured through the seeds entry in the configuration file. One caveat: all nodes in the same cluster should use the same seed list; if the seed lists are inconsistent, the cluster can sometimes split into two clusters. Seed nodes are usually started first so that the other nodes in the cluster can be discovered as early as possible. Each node exchanges information with other nodes, and thanks to randomness and probability, all nodes in the cluster are eventually covered. Each node also holds the state of all the other nodes in the cluster, so any node you connect to knows about every other node. For example, a CQL connection to any node in the cluster can obtain the status of all nodes; in other words, every node should hold the same information about the state of the cluster's nodes.

The gossip process runs once per second, exchanging information with up to three other nodes so that all nodes can quickly learn about other nodes in the cluster. Because the entire process is decentralized, there is no node to coordinate the gossip information of each node. Each node independently selects one to three nodes for gossip communication. It always selects active peers in the cluster, and probabilistically selects seed nodes and unreachable nodes.

A gossip exchange is similar to the TCP three-way handshake. With an ordinary broadcast protocol there is only one message per round and information spreads through the cluster gradually; with gossip, each exchange consists of three messages, which adds a degree of anti-entropy. This process allows nodes that exchange data with each other to converge quickly.

First, the system needs a few configured seed nodes, such as A and B. Each participating node maintains the state of all nodes as Node -> (Key, Value, Version), where a larger Version indicates newer data. A node P can only update its own state directly; it updates its locally maintained state of other nodes indirectly through the gossip protocol.

The general process is as follows (a toy version-merge sketch follows the list):

  • SYN: node A randomly selects some nodes and can choose to send only a digest, i.e. keys and versions without the values, to avoid sending a large message.
  • ACK: after receiving the message, node B merges it with its local state by comparing versions; a larger version means newer data. For example, node A sends C(key, valuea, 2) and node B holds C(key, valueb, 3) locally; since node B's version is newer, the merged result is node B's local data, which is then sent back to node A.
  • ACK2: node A receives the ACK message and applies it to the local data.
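A toy sketch of the version-based merge behind the SYN/ACK/ACK2 exchange, with made-up node state; each node keeps (value, version) per peer and the higher version wins. This illustrates the rule only, not Cassandra's actual implementation:

```python
# Toy gossip state: {peer: (value, version)} held by two nodes A and B.
state_a = {"nodeC": ("load=0.4", 2), "nodeD": ("up", 5)}
state_b = {"nodeC": ("load=0.9", 3), "nodeD": ("up", 4)}

def digest(state):
    """SYN: send only keys and versions, not the values, to keep messages small."""
    return {peer: version for peer, (_, version) in state.items()}

def merge(local, remote):
    """ACK/ACK2: for every peer, keep whichever side has the higher version."""
    merged = dict(local)
    for peer, (value, version) in remote.items():
        if peer not in merged or merged[peer][1] < version:
            merged[peer] = (value, version)
    return merged

print(digest(state_a))            # what A gossips in the SYN round
print(merge(state_a, state_b))    # nodeC comes from B (v3), nodeD from A (v5)
```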

3.1.7 Consistency Level

Consistency refers to whether all the replicas of a row of data are up to date and in sync. Cassandra extends the concept of eventual consistency with tunable consistency: for each read or write operation, the client issuing the request can specify the required consistency through the consistency level parameter. Write consistency note:

Even if consistency level ONE or LOCAL_QUORUM is specified, the write is still sent to all replica nodes, including replicas in other data centers; the consistency level only determines how many replicas must acknowledge the write before the client is told the request succeeded.

Read consistency works the same way: the consistency level determines how many replicas must respond to a read before the result is returned to the client.
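A small sketch of the arithmetic behind QUORUM and the usual strong-consistency rule (read replicas + write replicas > replication factor); the numbers are illustrative:

```python
def quorum(replication_factor: int) -> int:
    """QUORUM = a majority of the replicas for the keyspace."""
    return replication_factor // 2 + 1

def is_strongly_consistent(read_cl: int, write_cl: int, rf: int) -> bool:
    """Reads overlap writes when R + W > RF, so a read sees the latest write."""
    return read_cl + write_cl > rf

rf = 3
assert quorum(rf) == 2
# QUORUM writes + QUORUM reads overlap on at least one replica:
assert is_strongly_consistent(quorum(rf), quorum(rf), rf)
# ONE + ONE does not guarantee the latest value is read:
assert not is_strongly_consistent(1, 1, rf)
```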

3.2 Cassandra characteristics

  • High scalability: Cassandra is highly scalable, letting you add hardware on the fly to serve more customers and more data on demand.
  • Rock-solid architecture: Cassandra has no single point of failure and can be used for business-critical applications that cannot afford downtime.
  • Fast linear-scale performance: Cassandra scales linearly; throughput increases as you add nodes to the cluster, so response times stay fast.
  • Fault tolerance: Cassandra is fault tolerant. Suppose there are four nodes in a cluster and each node has a copy of the same data; if one node goes out of service, the other three nodes can still serve requests.
  • Flexible data storage: Cassandra supports all possible data formats such as structured, semi-structured, and unstructured. It helps you change data structures as needed.
  • Simple data distribution: Data distribution in Cassandra is very simple because it has the flexibility to distribute the required data by replicating data across multiple data centers.
  • Transaction support: Cassandra supports transactions with properties such as atomicity, consistency, isolation, and durability (ACID).
  • Quick writes: Cassandra is designed to run on inexpensive commodity hardware. It performs fast writes and can store hundreds of terabytes of data without sacrificing read efficiency.

3.3 Cassandra Application Scenarios

The ideal Cassandra application has the following characteristics:

  • Writes substantially exceed reads.
  • Data is rarely updated, and when it is, updates are idempotent.
  • Queries go through the primary key, not secondary indexes.
  • Data can be partitioned evenly by the partition key.
  • No joins or aggregations are required.

Some good scenarios to recommend using Cassandra are:

  • Transaction log: purchases, test scores, movies watched, etc.
  • Storing time-series data (you need to aggregate it yourself).
  • Track almost anything, including order status, packages, etc.
  • Store health tracking data.
  • History of weather services.
  • IoT status and event history.
  • IoT data from vehicles.
  • E-mail

3.4 Cassandra is being used by the enterprise

3.4.1 Cisco case

  • Use cases

  • Cisco Business Update Cloud

  • Business analysis and reporting

3.4.4 Other Usage

3.5 Relevant articles or videos


4. RocksDB

RocksDB is an embedded KV storage engine written in C++, developed by Facebook on the basis of LevelDB. It supports a variety of storage hardware and uses a log-structured engine (based on an LSM-tree) to store data.

4.1 RocksDB architecture

Figures: RocksDB basic architecture; RocksDB read/write flow.

4.2 RocksDB characteristics

One of RocksDB’s major improvements over traditional relational databases is the LSM tree storage engine.

4.2.1 LSM (Log-structured Merge Tree)

LSM-tree

An LSM tree also avoids the random-disk-write problem through batched, sequential storage. The design is very simple: a large tree is split into N small trees. Writes first go to an in-memory tree (memory has no seek penalty, so random-write performance improves greatly); as the in-memory tree grows, it is flushed to disk as an ordered small tree. The on-disk trees can be merged periodically into a larger tree to optimize read performance.
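A toy LSM sketch with invented names (compaction, which merges runs, is omitted): writes go into an in-memory sorted memtable, which is flushed to an immutable sorted run when full, and reads check the memtable first and then the runs from newest to oldest:

```python
# Toy LSM tree: memtable + sorted on-"disk" runs. Illustration only.
class ToyLSM:
    def __init__(self, memtable_limit: int = 4):
        self.memtable: dict[str, str] = {}
        self.runs: list[list[tuple[str, str]]] = []   # newest run first
        self.memtable_limit = memtable_limit

    def put(self, key: str, value: str) -> None:
        self.memtable[key] = value                    # random writes hit memory only
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self) -> None:
        # Flushing is one sequential write of a sorted run (an SST-like file).
        self.runs.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key: str):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:                         # newest to oldest
            for k, v in run:
                if k == key:
                    return v
        return None

db = ToyLSM()
for i in range(10):
    db.put(f"k{i}", f"v{i}")
assert db.get("k3") == "v3"
```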

4.2.2 LevelDB characteristics

LevelDB

  1. LevelDB is a persistent key-value store. Unlike Redis, which is memory-based, LevelDB does not consume large amounts of memory; it keeps most of its data on disk.
  2. LevelDB stores records sorted by key, so adjacent keys are stored next to each other in the storage files, and applications can supply a custom key comparison function.
  3. LevelDB supports the snapshot function, which ensures that read operations are not affected by write operations and that consistent data is always displayed during read operations.
  4. LevelDB also supports operations such as data compression, which directly help reduce storage space and increase I/O efficiency.

4.2.3 RocksDB optimization of LevelDB

  1. Column families were added so that multiple unrelated data sets can be stored in the same DB. Since different column families use separate memtables and SST files, they are isolated from each other to a certain extent.
  2. Compaction was made faster by running multiple compactions concurrently.
  3. A merge operator was added to make read-modify-write operations more efficient.
  4. Flush and compaction run in separate thread pools, which speeds up flushes and helps prevent write stalls.
  5. A dedicated management mechanism was added for write-ahead logs (WAL), making the WAL files, which play the role of a binlog, easier to manage. A minimal usage sketch follows this list.
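A minimal usage sketch assuming the third-party python-rocksdb binding is installed; the database path is hypothetical, and it exercises only the basic KV interface, not the column-family or merge-operator features described above:

```python
import rocksdb  # third-party Python binding for the RocksDB C++ library

# Open (or create) a database; writes land in the WAL and memtable, then SST files.
opts = rocksdb.Options(create_if_missing=True)
db = rocksdb.DB("/tmp/demo_rocksdb", opts)          # hypothetical path

db.put(b"user:1", b"alice")                         # goes to WAL + memtable first
print(db.get(b"user:1"))                            # b'alice'

# Batched writes are applied atomically through the shared WAL.
batch = rocksdb.WriteBatch()
batch.put(b"user:2", b"bob")
batch.put(b"user:3", b"carol")
db.write(batch)
```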

4.2.4 RocksDB architecture analysis

Advantage:

  1. All flush operations are append-only, which is very friendly to both disks and SSDs;
  2. A write returns as soon as the WAL and the memtable have been written, which is highly efficient;
  3. Since the final data sits in discrete SST files whose size can be configured freely according to the KV size, RocksDB is well suited for variable-length storage.

Disadvantage:

  1. The WAL is shared by multiple column families (CFs) to support batches, transactions, and crash recovery; as a result, the WAL's single-threaded write path cannot fully exploit the performance of high-speed devices (compared with structures such as B-trees).
  2. All reads and writes contend on the memtable, which easily becomes a bottleneck under multi-threaded concurrent writes and mixed read/write workloads.
  3. Level-0 files are ordered by time rather than by key range, so their key ranges overlap; combined with the top-down file merging, a read may have to search several Level-0 files as well as files in other levels, causing read amplification. In particular, under purely random writes almost every Level-0 file may be queried, making reads inefficient.
  4. To address the third problem, RocksDB throttles foreground writes and triggers background compaction based on the number of Level-0 files so as to balance read and write performance; this in turn causes performance jitter and keeps high-speed media from being fully utilized.
  5. The compaction process is hard to control and easily causes performance jitter and write amplification. Write amplification in particular often reached about 20x in the author's own tests, which is unacceptable; no good solution has been found so far, and for now large values are stored separately so that write amplification is confined to the small data as much as possible.

4.3 Application Scenarios

4.3.1 Applicable Scenarios

  1. Scenarios that require high write performance and have enough memory to cache SST blocks for fast reads;
  2. Scenarios where the SSD is sensitive to write amplification or the disk is sensitive to random writes;
  3. Scenarios requiring variable-length KV storage;
  4. Access to small-scale metadata;

4.3.2 Unsuitable Scenarios

  1. Workloads with large values, which require key/value separation;
  2. Access to large-scale data

4.4 Enterprises in Use

4.5 Relevant articles or videos

High-performance storage engine RocksDB: module details; RocksDB concepts and simple applications

5. MemDB

5.1 MemDB architecture

MemDB is a flexible in-memory database that provides transaction support and SQL read/write capabilities, and supports multi-center WAN cluster deployment. It is used to build and accelerate applications that require ultra-high-speed data interaction and high scalability.

5.2 MemDB characteristics

  1. MemDB is a persistent key/value storage platform positioned between Memcached/Redis and MySQL.
  2. Unlike Memcached and Redis, MemDB can efficiently store different kinds of data and business rules by extending its storage engine.
  3. Unistor exposes a consistent external access API across the different engines, while a storage engine can tailor and extend the interface to its own needs through the extension fields of the MemDB API.
  4. MemDB itself does not support sharding, but users can partition by key range (or hash). The system supports exporting data by key range; key comparison and hashing are defined by the user's storage engine.
  5. MemDB implements clustering through ZooKeeper to ensure high availability. Messages are forwarded within a cluster regardless of which host is contacted, and master and slave clusters can be created.
  6. MemDB provides a configurable read and write cache to keep reads and writes efficient.
  7. MemDB has its own binlog to ensure high data reliability; data synchronization uses multiple connections to avoid blocking and supports efficient synchronization across IDCs.
  8. MemDB exposes complete runtime information for operations and maintenance. It can be obtained either through the memcached STATS command on the monitoring port or through the GET/GETS interface with the i parameter set to 2 (to fetch system information); a toy sketch of the STATS query follows this list.
  9. MemDB provides unified O&M tools.
  10. MemDB storage engine development is very simple.
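Since the runtime information is exposed through the memcached text protocol on the monitoring port, it can be fetched with a plain socket. The host and port below are assumptions, and the availability of the stats command follows the description above:

```python
import socket

# Hypothetical MemDB monitoring endpoint speaking the memcached text protocol.
HOST, PORT = "memdb.example.com", 11211

with socket.create_connection((HOST, PORT), timeout=3) as sock:
    sock.sendall(b"stats\r\n")                 # the MC STATS directive
    data = b""
    while not data.endswith(b"END\r\n"):       # stats output terminates with END
        chunk = sock.recv(4096)
        if not chunk:
            break
        data += chunk

for line in data.decode().splitlines():
    print(line)                                # e.g. "STAT curr_items 42"
```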

5.3 Application Scenarios

Cloud Computing Insight is one of three product categories; the other two are the big data processing platform (Hadoop) and the distributed parallel database (MPP).

5.4 Enterprises in Use

5.5 Relevant articles or videos

Distributed transactional in-memory database: MemDB