4. Introduction to the NoSQL data model

We use an e-commerce model (customer, order, order line, address) to compare relational and non-relational databases.

Aggregate models

1 KV (key-value)

2 BSON (document)

3 Column family: As the name implies, data is stored by column. Its main strengths are that it stores structured and semi-structured data easily, compresses well, and has a very large I/O advantage for queries that touch only one or a few columns.

4 Graph
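To make the comparison concrete, here is a minimal sketch of the e-commerce model in plain Python. The table names, field names, and the `order:100` key are illustrative assumptions, not part of any particular database. It shows the same order as normalized relational rows, as a single aggregate stored under one key, and as a column-oriented layout.

```python
import json

# Relational view: four normalized "tables" linked by foreign keys.
customers = [{"id": 1, "name": "Alice"}]
addresses = [{"id": 10, "customer_id": 1, "city": "Beijing"}]
orders = [{"id": 100, "customer_id": 1, "shipping_address_id": 10}]
order_items = [
    {"order_id": 100, "product": "keyboard", "qty": 1, "price": 200},
    {"order_id": 100, "product": "mouse", "qty": 2, "price": 50},
]

def build_order_aggregate(order_id):
    """Join the normalized rows into one self-contained document."""
    order = next(o for o in orders if o["id"] == order_id)
    customer = next(c for c in customers if c["id"] == order["customer_id"])
    address = next(a for a in addresses if a["id"] == order["shipping_address_id"])
    items = [
        {"product": i["product"], "qty": i["qty"], "price": i["price"]}
        for i in order_items
        if i["order_id"] == order_id
    ]
    return {
        "_id": order["id"],
        "customer": {"id": customer["id"], "name": customer["name"]},
        "shipping_address": {"city": address["city"]},
        "items": items,
        "total": sum(i["qty"] * i["price"] for i in items),
    }

# Aggregate (KV / document) view: the whole order lives under a single key,
# so one lookup returns everything needed to display it, with no joins.
kv_store = {}
doc = build_order_aggregate(100)
kv_store["order:100"] = json.dumps(doc)
print(json.loads(kv_store["order:100"])["total"])

# Column-oriented view: values of one column are stored together, so a
# query on a single column does not have to read the other columns.
columns = {
    "product": [i["product"] for i in order_items],
    "qty": [i["qty"] for i in order_items],
    "price": [i["price"] for i in order_items],
}
print(sum(columns["qty"]))  # touches only the "qty" column
```

The aggregate trades update flexibility for read locality: the customer name is copied into the order, so changing it later means rewriting every document that embeds it.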

5. Four categories of NoSQL databases

5.1 Key-value (KV) stores: typical examples

  • Sina: BerkeleyDB + Redis
  • Meituan: Redis + Tair
  • Alibaba, Baidu: Memcached + Redis

5.2 Document databases (BSON format): typical examples

  • CouchDB
  • MongoDB

MongoDB is a database based on distributed file storage. It is written in C++ and is designed to provide a scalable, high-performance data storage solution for web applications.

MongoDB sits between relational and non-relational databases. Among non-relational databases, it has the richest feature set and most closely resembles a relational database.

5.3 Column-store databases

  • Cassandra, HBase
  • Distributed file system

5.4 Graph databases

  • They store relationships, not graphics: for example Moments feeds, social networks, and ad recommendation systems
  • The focus is on building a relationship graph
  • Neo4j, InfoGrid

5.5 Comparison of the four

6. CAP and BASE in distributed databases

6.1 Relational databases follow the ACID rules

A database transaction is similar to a transaction in the real world. It has the following four characteristics:

A (Atomicity): All operations in a transaction are either done or not done. A transaction succeeds only if every operation in it succeeds. For example, a bank transfer of 100 yuan from account A to account B has two steps: 1) withdraw 100 yuan from account A; 2) deposit 100 yuan into account B. These two steps either complete together or not at all. If the first step completed and the second failed, 100 yuan would inexplicably disappear.

C (Consistency): The database must always be in a consistent state. Running a transaction must not violate the database's existing consistency constraints.

I (Isolation): If a transaction accesses data that another transaction is modifying, the data it sees is not affected by the other transaction as long as that transaction has not committed. For example, while a transaction transferring 100 yuan from account A to account B is still uncommitted, B cannot see the new 100 yuan when checking the account.

D (Durability): Once a transaction is committed, its modifications are permanently saved in the database. They are not lost even in the event of an outage.
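The bank-transfer example can be sketched with Python's built-in `sqlite3` module. The `account` table, the `transfer` helper, and the starting balances are assumptions made for illustration. Using the connection as a context manager commits when the block succeeds and rolls back when it raises, so a failed deposit also undoes the withdrawal.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO account VALUES ('A', 200), ('B', 0)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both updates commit or neither does."""
    try:
        with conn:  # commit on success, roll back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            cur = conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            if cur.rowcount != 1:
                # Step 2 failed after step 1 succeeded: force a rollback.
                raise LookupError(f"no such account: {dst}")
        return True
    except (LookupError, sqlite3.IntegrityError):
        return False

def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM account ORDER BY name"))

print(transfer(conn, "A", "B", 100), balances(conn))  # both steps applied
print(transfer(conn, "A", "C", 100), balances(conn))  # deposit fails, withdrawal undone
```

The `CHECK (balance >= 0)` constraint is a small instance of consistency: an overdraft raises `sqlite3.IntegrityError`, and the transaction is rolled back rather than leaving the data in an invalid state.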

6.2 CAP

  • C: Consistency
  • A: Availability
  • P: Partition tolerance

6.3 CAP: choose two of three

CAP theory says that a distributed storage system can satisfy at most two of the three properties above. Because real network hardware will inevitably suffer from latency, packet loss, and similar problems, partition tolerance is something we must implement.

So the trade-off is between consistency and availability, and no NoSQL system can guarantee all three.

  • C: strong consistency
  • A: high availability
  • P: partition tolerance
  • CA: traditional Oracle databases
  • AP: the choice of most website architectures
  • CP: Redis, MongoDB

Note: Trade-offs must be made when working with distributed architectures: strike a balance between consistency and availability. Most web applications do not really need strong consistency, so they sacrifice C in exchange for P, which is the current direction of distributed database products.

Consistency versus availability

For Web 2.0 sites, many of the key features of relational databases often go unused.

Database transaction consistency requirements: many real-time web systems do not require strict database transactions. Their requirements for read consistency are low, and in some cases so are their requirements for write consistency. Eventual consistency is acceptable.

In a relational database, a row can certainly be read immediately after it is inserted, but many web applications do not need such immediacy. For example, after I post a message, it is perfectly acceptable for my subscribers to see the update a few seconds, or even more than ten seconds, later.

Complex SQL queries, especially multi-table joins: any web system with a large data volume avoids joins across multiple large tables as well as complex analytical report queries. SNS-type sites in particular rule them out at the requirements and product-design stage. Most queries are primary-key lookups on a single table or simple conditional paginated queries on a single table, so much of SQL's power goes unused.
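The two query shapes mentioned above, a primary-key lookup and a simple paginated query on one table, can be sketched with `sqlite3`. The `post` table, its columns, and the helper names are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE post (id INTEGER PRIMARY KEY, author TEXT NOT NULL, body TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO post (id, author, body) VALUES (?, ?, ?)",
    [(i, "alice" if i % 2 else "bob", f"post {i}") for i in range(1, 11)],
)
conn.commit()

def get_post(conn, post_id):
    """Single-table primary-key lookup."""
    return conn.execute(
        "SELECT id, author, body FROM post WHERE id = ?", (post_id,)
    ).fetchone()

def posts_by_author(conn, author, page, page_size=2):
    """Single-table conditional query with pagination, newest first."""
    offset = (page - 1) * page_size
    return conn.execute(
        "SELECT id, body FROM post WHERE author = ? ORDER BY id DESC LIMIT ? OFFSET ?",
        (author, page_size, offset),
    ).fetchall()

print(get_post(conn, 4))
print(posts_by_author(conn, "alice", page=1))
```

Neither query needs a join, which is why a key-value or document store can serve the same access pattern.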

6.4 Classic CAP diagram

The core of CAP theory is that a distributed system cannot satisfy consistency, availability, and partition tolerance all at the same time; it can satisfy at most two. Therefore, according to the CAP principle, NoSQL databases are divided into three categories: those that meet the CA principle, the CP principle, or the AP principle.

  • CA: single-site clusters that provide consistency and availability; usually not very scalable.
  • CP: systems that provide consistency and partition tolerance; usually not particularly high in performance.
  • AP: systems that provide availability and partition tolerance; usually with weaker consistency requirements.
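The CP versus AP choice can be illustrated with a toy model. This is a deliberately simplified sketch: `TinyCluster`, its two in-memory replicas, and the `partitioned` flag are hypothetical and do not describe how any real database works. During a partition, the CP cluster refuses writes to stay consistent, while the AP cluster accepts them and lets the replicas diverge.

```python
class TinyCluster:
    """Two in-memory replicas; `mode` is 'CP' or 'AP' (toy model only)."""

    def __init__(self, mode):
        self.mode = mode
        self.replicas = [{}, {}]
        self.partitioned = False  # True means the replicas cannot talk

    def write(self, node, key, value):
        if not self.partitioned:
            # Healthy network: update every replica.
            for replica in self.replicas:
                replica[key] = value
            return
        if self.mode == "CP":
            # Give up availability rather than let replicas disagree.
            raise RuntimeError("unavailable: cannot reach the other replica")
        # AP: stay available; the unreachable replica is now stale.
        self.replicas[node][key] = value

    def read(self, node, key):
        return self.replicas[node].get(key)


cp = TinyCluster("CP")
cp.write(0, "x", 1)
cp.partitioned = True
try:
    cp.write(0, "x", 2)
except RuntimeError as err:
    print("CP:", err, "| both replicas still read", cp.read(0, "x"), cp.read(1, "x"))

ap = TinyCluster("AP")
ap.write(0, "x", 1)
ap.partitioned = True
ap.write(0, "x", 2)
print("AP: replicas read", ap.read(0, "x"), "and", ap.read(1, "x"))
```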

6.5 BASE

BASE is a solution proposed to address the reduced availability caused by the strong consistency of relational databases.

BASE stands for Basically Available, Soft state, Eventually consistent.

The idea is to improve the overall scalability and performance of the system by relaxing its requirement for data consistency at any given moment. Why? Large systems are geographically distributed and have extremely high performance requirements, so they cannot use distributed transactions to meet these goals. To meet them, another approach is needed, and BASE is that approach.
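A minimal sketch of eventual consistency, assuming a single primary, a single replica, and an in-memory replication queue (all hypothetical names, not a real replication protocol): writes return immediately, and replica readers see stale data until the queue has been applied.

```python
from collections import deque

class EventuallyConsistentStore:
    """Primary plus one replica with asynchronous replication (toy model)."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.pending = deque()  # writes not yet applied to the replica (soft state)

    def write(self, key, value):
        # Acknowledge as soon as the primary has the value.
        self.primary[key] = value
        self.pending.append((key, value))

    def read_replica(self, key):
        # May return stale data while replication is still pending.
        return self.replica.get(key)

    def replicate(self):
        # In a real system this would run in the background.
        while self.pending:
            key, value = self.pending.popleft()
            self.replica[key] = value


store = EventuallyConsistentStore()
store.write("latest_post", "hello")
print(store.read_replica("latest_post"))  # subscriber does not see it yet
store.replicate()
print(store.read_replica("latest_post"))  # replica has caught up
```

The window between `write` and `replicate` corresponds to the few seconds during which a subscriber may not yet see a newly posted message.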

6.6 Distributed + Cluster Overview

Distributed system

A distributed system consists of multiple computers and communicating software components connected by a computer network (a local or wide area network). It is a software system built on top of a network. Because of the characteristics of software, distributed systems are highly cohesive and transparent. As a result, the difference between a network and a distributed system lies more in the high-level software (especially the operating system) than in the hardware. Distributed systems can run on different platforms such as PCs, workstations, LANs, and WANs.

In simple terms:

1 Distributed: different service modules (projects) are deployed on different servers. They communicate and call each other through RPC/RMI, providing services externally and collaborating internally.

2 Cluster: the same service module is deployed on multiple servers. Distributed scheduling software schedules the instances centrally and provides a single point of external access.
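The distinction can be sketched in a few lines. Everything here is a hypothetical stand-in: the service names, the host names, and the round-robin scheduler replace real RPC and scheduling software. Each `Cluster` is one service module replicated across several hosts, and the `system` dictionary is the distributed part, with different modules living on different hosts.

```python
from itertools import cycle

class Cluster:
    """Identical instances of one service behind a round-robin scheduler."""

    def __init__(self, service_name, hosts):
        self.service_name = service_name
        self._hosts = cycle(hosts)  # endless round-robin over the hosts

    def call(self, request):
        host = next(self._hosts)
        # A real system would make an RPC to `host` here.
        return f"{self.service_name}@{host} handled {request}"


# Distributed: different service modules are deployed on different servers.
# Cluster: each module may itself run on several servers.
system = {
    "order": Cluster("order", ["order-1", "order-2"]),
    "payment": Cluster("payment", ["payment-1"]),
}

for request in ["r1", "r2", "r3"]:
    print(system["order"].call(request))
print(system["payment"].call("r4"))
```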