This article was first published on the 51CTO Technology Stack public account. Author: Chen Caihua. For reprints or discussion, please contact caison@aliyun.com

With the arrival of the big data era, more and more websites and application systems need to support massive data storage, highly concurrent requests, high availability, and high scalability. Traditional relational databases struggle to meet these challenges and have exposed many problems that are hard to overcome. As a result, various NoSQL (Not Only SQL) databases have developed rapidly as a powerful supplement to traditional relational databases.

This article analyzes the problems of traditional databases and how several categories of NoSQL databases solve them, in the hope of providing a reference for choosing storage technology in different business scenarios.

1 Disadvantages of traditional databases

  • High I/O in big data scenarios. Because data is stored by row, a relational database reads the entire row from the storage device into memory even when an operation touches only one column, which results in high I/O.

  • Data is stored as row records, so data structures cannot be stored.

  • Schema changes are inconvenient. To modify the table structure, you must execute Data Definition Language (DDL) statements. During the modification the table is locked, which makes some services unavailable.

  • Full-text search is weak. Relational databases can only match substrings. As a table grows, LIKE queries become very slow even when indexes exist. Moreover, text fields in a relational database should generally not be indexed.

  • Relationships are hard to store and process. Many applications need to understand and navigate relationships between highly connected data to enable use cases such as social applications, recommendation engines, fraud detection, knowledge graphs, life sciences, and IT/networking. However, traditional relational databases are not good at dealing with the relationships between data points. Their tabular data models and strict schemas make it difficult to add new or different kinds of relationship information.

2 NoSQL solution

NoSQL, which generally refers to non-relational databases, can be understood as a powerful complement to SQL.

NoSQL databases outperform relational databases in many respects, but the gain usually comes at the cost of some features, most commonly transaction support. The four basic properties of database transactions are known as ACID:

  • A (Atomicity): All operations in a transaction either complete or do not complete; a transaction never ends somewhere in between. If a transaction fails during execution, it is rolled back to the state before it began, as if it had never been executed.
  • C (Consistency): The integrity of the database is not compromised before or after a transaction.
  • I (Isolation): The database allows multiple concurrent transactions to read, write, and modify data at the same time. Isolation prevents the data inconsistencies that interleaved execution of concurrent transactions could otherwise cause.
  • D (Durability): Once a transaction completes, its changes to the data are permanent and are not lost even if the system fails.
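To make atomicity concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `accounts` table, the `make_db` and `transfer` helpers, and the balances are hypothetical and exist only for illustration. A transfer that would overdraw an account violates a CHECK constraint, and the whole transaction is rolled back, including the credit that had already been applied.

```python
import sqlite3

def make_db():
    """Create a hypothetical in-memory `accounts` table for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "name TEXT PRIMARY KEY, "
        "balance INTEGER CHECK (balance >= 0))"
    )
    conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                     [("alice", 100), ("bob", 50)])
    conn.commit()
    return conn

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both updates apply together or not at all."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
        return True
    except sqlite3.IntegrityError:
        # The debit violated the CHECK constraint, so the credit was undone too.
        return False

def balances(conn):
    """Return {name: balance} for all accounts."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))
```

Calling `transfer(conn, "alice", "bob", 500)` on a fresh database returns False and leaves both balances untouched, which is the "as if it had never been executed" behavior described above.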

The following five categories of NoSQL databases address the disadvantages of traditional relational databases described above:

2.1 Column database

A columnar database stores data using a column-oriented storage architecture and is mainly suitable for batch data processing and ad hoc queries. Its counterpart, the row-oriented database, lays data out by row; it is mainly suitable for processing small batches of data and is often used for online transaction processing (OLTP).

Thanks to column-oriented storage, a columnar database can solve the high I/O problem of relational databases in certain scenarios.

2.1.1 Basic Principles

A traditional relational database stores data by row and is called a “row database”; a column database stores data by column.

There are two ways to lay a table out in a storage system, and most of us use row storage. Row storage places each row in contiguous physical locations, much like a traditional record and file system. Column storage instead places the values of each column in contiguous locations. The following is a graphical illustration of the two storage methods:
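The difference between the two layouts can also be sketched in a few lines of Python. This is a simplified in-memory model, not how any particular database arranges bytes on disk; the three-row table and the function names are made up for illustration.

```python
# Hypothetical three-row table used only for illustration.
rows = [
    {"id": 1, "name": "Ann", "age": 31},
    {"id": 2, "name": "Bob", "age": 25},
    {"id": 3, "name": "Cid", "age": 47},
]

def to_row_layout(rows):
    """Row storage: the values of one record sit next to each other."""
    return [value for row in rows for value in row.values()]

def to_column_layout(rows):
    """Column storage: the values of one column sit next to each other."""
    return {col: [row[col] for row in rows] for col in rows[0]}

row_store = to_row_layout(rows)
col_store = to_column_layout(rows)

# Averaging one column has to step through every value in the row layout,
# but reads a single contiguous list in the column layout.
avg_age = sum(col_store["age"]) / len(col_store["age"])
```

In the row layout the ages are interleaved with ids and names, which is why a query on one column still drags whole rows through I/O.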

2.1.2 Common Column Databases

  • HBase

HBase is an open source, non-relational, distributed database (NoSQL) modeled on Google BigTable and implemented in Java. It is part of the Apache Software Foundation’s Hadoop project and runs on top of the HDFS file system, providing BigTable-like capabilities for Hadoop. As a result, it can store huge amounts of sparse data in a fault-tolerant way.

  • BigTable

BigTable is a compressed, high-performance, and highly scalable data storage system built on the Google File System (GFS). It is used to store large-scale structured data and is suitable for cloud computing.

2.1.3 Related Features

The advantages are as follows:

  • Efficient storage utilization

Because different compression algorithms can be designed for the data characteristics of different columns, a column database usually achieves a much higher compression ratio than a row database. A typical row database compresses at roughly 3:1 to 5:1, while a column database generally reaches about 8:1 to 30:1. A common approach is dictionary compression. The following shows what such a table looks like: after dictionary compression, the strings in the table become numbers. Because each string appears only once in the dictionary, the data shrinks considerably (somewhat like normalization and denormalization).

  • High query efficiency

Reading one column across many records is efficient because the values of a column are stored together, so a single disk operation can read all the data of the requested column into memory. The following diagram illustrates the benefits of column storage (and data compression) through the execution of a query:

The query proceeds as follows:

    i. Look up the number that corresponds to the string in the dictionary table (only one string comparison is needed).
    ii. Match that number against the column and set each matching position to 1.
    iii. Combine the match results of the different columns with a bitwise operation to obtain the positions of the records that meet all conditions.
    iv. Use those positions to assemble the final result set.

  • Suitable for aggregation operations

  • Suitable for large amounts of data rather than small data
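The dictionary compression and the four query steps described above can be sketched as follows. This is an illustrative toy, not the implementation of any specific column database; the columns `city`, `product`, and `amount` and both helper functions are assumptions made for the example.

```python
def dictionary_encode(values):
    """Replace each string with a small integer code; each string is stored once."""
    dictionary = {}
    codes = []
    for v in values:
        codes.append(dictionary.setdefault(v, len(dictionary)))
    return dictionary, codes

def match_bitmap(dictionary, codes, wanted):
    """Steps i and ii: one dictionary lookup, then integer comparisons only."""
    code = dictionary.get(wanted)
    return [1 if c == code else 0 for c in codes]

# Hypothetical columns of a four-row table.
city = ["Paris", "Rome", "Paris", "Oslo"]
product = ["tea", "tea", "coffee", "tea"]
amount = [10, 20, 30, 40]

city_dict, city_codes = dictionary_encode(city)
prod_dict, prod_codes = dictionary_encode(product)

# Step iii: AND the per-column bitmaps to find rows that meet both conditions.
hits = [a & b for a, b in zip(match_bitmap(city_dict, city_codes, "Paris"),
                              match_bitmap(prod_dict, prod_codes, "tea"))]

# Step iv: use the matching positions to assemble the result.
result = [amount[i] for i, bit in enumerate(hits) if bit]
```

Only the dictionary lookup compares strings; everything after that works on small integers and bits, which is where the query speed-up comes from.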

Disadvantages are as follows:

  • Not suitable for scanning small amounts of data
  • Not suitable for random updates
  • Not suitable for real-time operations that involve deletions and updates
  • Limited transaction support: operations on a single row are ACID, but transactions spanning multiple rows cannot be rolled back normally. Isolation (I) and Durability (D) are supported; Atomicity (A) and Consistency (C) cannot be guaranteed

2.1.4 Application Scenarios

HBase is used as an example.

  • Large amounts of data (hundreds of TB) with fast random access requirements
  • Write-intensive applications with a large number of daily writes and relatively few reads, such as IM history messages and game logs
  • Applications that do not need complex query conditions. HBase supports only rowkey-based queries: fetching a single record or a small range is fine, but large range queries may perform poorly because the data is distributed. HBase is also a poor fit for data models with complex table relationships
  • Applications with high performance and reliability requirements. HBase has no single point of failure, which gives it high availability
  • Applications with a large amount of data whose growth cannot be estimated. HBase scales gracefully: even if the data volume surges over a period of time, horizontal scaling can meet the demand
  • Storing structured and semi-structured data
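The rowkey-only access pattern mentioned above can be illustrated with a toy table kept sorted by row key. This sketch is not the HBase client API; `RowKeyTable` and the `<user>#<date>` row-key format are hypothetical and only show why point lookups and small range scans are cheap while any other access path is missing.

```python
import bisect

class RowKeyTable:
    """Toy table kept sorted by row key, the only access path it offers."""

    def __init__(self):
        self._keys = []   # sorted row keys
        self._rows = {}   # row key -> {column: value}; rows may be sparse

    def put(self, key, columns):
        if key not in self._rows:
            bisect.insort(self._keys, key)
            self._rows[key] = {}
        self._rows[key].update(columns)

    def get(self, key):
        """Point lookup by row key."""
        return self._rows.get(key)

    def scan(self, start, stop):
        """Range scan over keys in [start, stop), returned in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]

# Hypothetical row-key design for chat history: "<user>#<date>".
table = RowKeyTable()
table.put("u1#20240101", {"msg": "hi"})
table.put("u2#20240101", {"msg": "yo", "img": "a.png"})
table.put("u1#20240102", {"msg": "bye"})

# "$" is the character right after "#", so this range covers all "u1#..." keys.
u1_history = table.scan("u1#", "u1$")
```

Because the key is the only index, anything you want to query by has to be encoded into the row key up front, which is why row-key design matters so much for this kind of store.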

2.2 K-V database

A K-V database stores data as key-value pairs: data is organized, indexed, and stored by key.

KV storage is well suited to data with few relationships, and it can effectively reduce the number of disk reads and writes. Compared with SQL database storage, KV storage offers better read and write performance and can solve the problem that relational databases cannot store data structures.

2.2.1 Common K-V Databases

  • Redis

Redis is an open source, networked, in-memory key-value store with optional persistence, written in ANSI C. Its development was sponsored by VMware until May 2013, by Pivotal from May 2013 to June 2015, and by Redis Labs from June 2015. According to a monthly ranking website, Redis is the most popular key-value store.

  • Cassandra

Apache Cassandra (commonly referred to as C* in the community) is an open source distributed NoSQL database system. Originally developed by Facebook to store data in simple formats, such as inboxes, it combines the data model of Google BigTable with the fully distributed architecture of Amazon Dynamo. Facebook open-sourced Cassandra in 2008. Since then, thanks to its scalability and performance, Cassandra has been adopted by Apple, Comcast, Instagram, Spotify, eBay, Rackspace, Netflix, and other well-known sites, and it has become a popular distributed structured data storage solution.

  • LevelDB

    LevelDB is an embedded key-value database library developed by Google and distributed under the open source BSD license.

2.2.2 Related Features

Take Redis as an example:

  • High performance: Redis supports more than 100,000 TPS
  • Rich data types: Redis supports String, Hash, List, Set, Sorted Set, Bitmap, and HyperLogLog
  • Rich features: Redis also supports publish/subscribe, notifications, key expiration, and more

The disadvantages are as follows: Redis transactions do not provide atomicity or durability (A and D); they provide only isolation and consistency (I and C). Individual Redis commands are atomic, but atomicity is not guaranteed for a Redis transaction as a whole.

Most businesses, such as live leaderboards or follower lists, do not need to follow ACID strictly, and even if some data fails to persist, the business impact is minimal. You therefore need to design the solution around the characteristics and requirements of the business.

2.2.3 Application Scenarios

  • Applicable scenarios: storing user-related information (such as sessions), configuration files, parameters, and shopping carts. This information is usually tied to an ID (the key)
  • Not applicable scenarios
    • You need to query by value rather than by key. A key-value database offers no way to query by value.
    • You need to store relationships between data. A key-value database cannot relate the data stored under two or more keys.
    • Transaction support is required. A key-value database cannot roll back when a fault occurs.
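The trade-off in the list above can be shown with a toy in-memory store. This is a sketch under simple assumptions, not the interface of Redis or any other product; `KeyValueStore`, `keys_with_value`, and the session keys are hypothetical.

```python
class KeyValueStore:
    """Toy in-memory key-value store; values are opaque to the store."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def keys_with_value(self, predicate):
        """Query by value: no index exists, so every entry must be inspected."""
        return [k for k, v in self._data.items() if predicate(v)]

store = KeyValueStore()
store.set("session:1001", {"user": "ann", "cart": ["book"]})
store.set("session:1002", {"user": "bob", "cart": []})

by_key = store.get("session:1001")                              # direct lookup
by_value = store.keys_with_value(lambda v: v["user"] == "bob")  # full scan
```

Lookup by key is a single hash access, whereas finding "the session that belongs to bob" has to walk every entry, which is why value-based queries are listed as a poor fit.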

2.3 Document Database

A document database (also known as a document-oriented database) is a database designed to store semi-structured data as documents. Document databases typically store data in JSON or XML format.

Because a document database has no fixed schema, arbitrary data can be stored and read.

The data format is JSON or BSON, which is self-describing, so fields do not need to be defined before they are used. Reading a field that does not exist in a JSON document does not cause an error the way referencing a missing column does in SQL, which solves the problem that the table schema of a relational database is inconvenient to extend.
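A minimal sketch of this schema-free behavior, using plain Python dictionaries and the standard json module rather than a real document database; the `users` documents and their fields are invented for the example.

```python
import json

# Hypothetical "users" collection: the second document has a field the first lacks.
raw = '[{"name": "ann"}, {"name": "bob", "email": "bob@example.com"}]'
users = json.loads(raw)

# Reading a field that an older document lacks yields None instead of an error.
emails = [doc.get("email") for doc in users]

# Adding a new field needs no DDL: just write it on the documents that have it.
users[0]["tags"] = ["admin"]
serialized = json.dumps(users)
```

The application code is responsible for handling the missing values, which is the compatibility handling that schema-free storage shifts from the database to the program.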

2.3.1 Common Document Databases

  • MongoDB

MongoDB is a document-oriented database management system written in C++ and designed to solve a number of real-world problems faced by the application development community. 10gen began developing MongoDB in October 2007, and it was first released in February 2009.

  • CouchDB

Apache CouchDB is an open source database focused on ease of use and on being a “fully web-embracing database”. It is a NoSQL database that uses JSON as its storage format, JavaScript as its query language, and MapReduce and HTTP as its APIs. One notable feature is multi-master replication. The first version of CouchDB was released in 2005, and it became an Apache project in 2008.

2.3.2 Related Features

The following uses MongoDB as an example

The advantages are as follows:

  • Adding fields is simple: there is no need to execute DDL statements to modify the table structure as in a relational database; the program code simply reads and writes the new field
  • Easy compatibility with historical data: documents that lack a new field do not cause an error; a null value is returned, and the code only needs to handle that case
  • Easy to store complex data: JSON is a powerful description language that can describe complex data structures

Compared with a traditional relational database, the main disadvantage of a document database is weak transaction support across multiple records, which shows up as follows:

  • Atomicity: only single-row/single-document atomicity is supported, not multi-row, multi-document, or multi-statement atomicity
  • Isolation: only the Read Committed isolation level is supported, which may cause non-repeatable reads and phantom reads
  • Complex queries such as JOIN are not supported. To perform a join-style query, you need to issue multiple operations against the database

MongoDB does support Consistency and Durability for multi-document transactions.

MongoDB officially announced multi-document ACID transaction support in version 4.0, but how well it works in practice remains to be seen.

2.3.3 Application Scenarios

Applicable scenarios:

  • The data volume is large or is expected to become large
  • The table structure is not well defined and fields keep being added, as in content management systems and information management systems

Not applicable scenarios:

  • Transactions need to span different documents. Document-oriented databases do not support transactions across documents
  • Complex queries across multiple documents, such as joins, are required

2.4 Full text search engine

Traditional relational databases rely mainly on indexes for fast queries. For full-text search, however, indexes are of little help, mainly because:

  • The conditions of a full-text search can be combined in arbitrary permutations; covering them all with indexes would require a very large number of indexes
  • The fuzzy matching used in full-text search cannot be served by indexes and has to fall back on LIKE queries, which scan the whole table and are very inefficient

Full-text search engines emerged to solve the weak full-text search capability of relational databases.

2.4.1 Basic Principles

The technical basis of a full-text search engine is the inverted index, an indexing method that maps words to the documents containing them. Its counterpart is the “forward index”, which maps documents to the words they contain.

There is now the following collection of documents:

The forward index yields the following index:

As you can see, a forward index is suitable for querying the content of a document by its name

A simple inverted index is as follows:

The inverted index with word frequency information is as follows:

As you can see, inverted indexes are suitable for querying document content by keyword
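The forward index and the inverted index with word frequency can be built in a few lines. The three documents below are a made-up collection for illustration only, and the tokenizer is a plain whitespace split; real search engines apply far more elaborate analysis.

```python
from collections import Counter, defaultdict

# Hypothetical document collection used only for illustration.
docs = {
    "doc1": "big data needs new storage",
    "doc2": "graph data and graph queries",
    "doc3": "new search engine",
}

# Forward index: document -> words it contains.
forward = {name: text.split() for name, text in docs.items()}

# Inverted index with word frequency: word -> {document: count}.
inverted = defaultdict(dict)
for name, words in forward.items():
    for word, count in Counter(words).items():
        inverted[word][name] = count

def search(word):
    """Return the documents containing `word`, most frequent first."""
    postings = inverted.get(word, {})
    return sorted(postings, key=lambda name: (-postings[name], name))
```

A keyword query is a single lookup in `inverted`, whereas answering the same question from `forward` would mean scanning every document.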

2.4.2 Common full-text search engines

  • Elasticsearch

    Elasticsearch is a search engine based on Lucene. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents. Elasticsearch is developed in Java and was released as open source under the terms of the Apache License. According to DB-Engines, Elasticsearch is the most popular enterprise search engine, followed by Apache Solr, which is also based on Lucene.

  • Solr

    Solr is the open source enterprise search platform of the Apache Lucene project. Its main features include full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (such as Word and PDF) handling. Solr is highly scalable and provides distributed search and index replication.

2.4.3 Related Features

Take Elasticsearch as an example. The advantages are as follows:

  • High query efficiency: near-real-time processing of massive data
  • Scalability: built for clusters, it scales out horizontally with ease and supports PB-level data
  • High availability: Elasticsearch clusters are resilient. They detect new or failed nodes and reorganize and rebalance data to keep it safe and accessible

Disadvantages are as follows:

  • Limited ACID support: operations on a single document are ACID, but transactions spanning multiple documents cannot be rolled back normally. Isolation (I, based on an optimistic locking mechanism) and Durability (D) are supported; Atomicity (A) and Consistency (C) are not
  • Weak support for complex multi-table association operations, such as those done through foreign keys in a relational database
  • Reads lag behind writes: newly written data becomes searchable after about 1 s at the earliest
  • Update performance is low, because an update is implemented underneath as deleting the old data and then inserting the new data
  • The memory footprint is large, because Lucene loads part of the index into memory

2.4.4 Application Scenarios

The application scenarios are as follows:

  • Distributed search engine and data analysis engine
  • Full text retrieval, structured retrieval, data analysis
  • Near-real-time processing of massive data, which can be distributed across multiple servers for storage and retrieval

Not applicable to the following scenarios:

  • Data needs to be updated frequently
  • Complex associated query is required

2.5 Graph Database

A graph database uses graph theory to store relationship information between entities. The most common example is the relationships between people in a social network. A relational database handles such “relationship” data poorly: the queries are complicated, slow, and fall short of expectations. The design of the graph database makes up for exactly this defect and solves the problem that relational databases are weak at storing and processing complex relationship data.

2.5.1 Common Graph Databases

  • Neo4j

Neo4j is a graph database management system developed by Neo4j, Inc. Described by its developers as an ACID-compliant transactional database with native graph storage and processing, Neo4j is the most popular graph database according to the DB-Engines rankings.

  • ArangoDB

ArangoDB is a native multi-model database system developed by triAGENS GmbH. It supports three important data models (key/value, document, and graph) with a single database core and the unified query language AQL (ArangoDB Query Language). The query language is declarative and allows different data access patterns to be combined in a single query. ArangoDB is a NoSQL database system, but AQL is similar to SQL in many ways.

  • Titan

Titan is a scalable graph database optimized for storing and querying graphs that contain tens of billions of vertices and edges distributed across a multi-machine cluster. Titan is a transactional database that lets thousands of concurrent users perform complex graph traversals in real time.

2.5.2 Related Features

Take Neo4j as an example:

Neo4j uses the concept of graph in data structure for modeling. The two most basic concepts in Neo4j are nodes and edges. Nodes represent entities, and edges represent relationships between entities. Nodes and edges can have their own attributes. Different entities are connected through various relationships to form a complex object graph.

For relationship data, the two kinds of database use different storage structures:

Neo4j stores nodes using “index-free adjacency”: each node holds pointers to its neighbor nodes, so a neighbor can be found in O(1) time. In addition, according to the official documentation, edges are “first-class entities” in Neo4j, that is, its most important entities, and they are stored separately, which speeds up graph traversal and makes it easy to traverse in either direction.
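The idea of index-free adjacency can be sketched with an adjacency list, where a Python dictionary stands in for the per-node pointers. This is a conceptual model only and says nothing about Neo4j's actual storage format; the `follows` graph and both functions are hypothetical.

```python
from collections import deque

# Hypothetical social graph: each node keeps direct references to its neighbours.
follows = {
    "ann": ["bob", "cid"],
    "bob": ["dan"],
    "cid": ["dan", "eve"],
    "dan": [],
    "eve": ["ann"],
}

def neighbours(node):
    """One hop: a direct lookup on the node, independent of total graph size."""
    return follows[node]

def within_hops(start, max_hops):
    """Breadth-first traversal that visits only the local neighbourhood."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the requested distance
        for nxt in neighbours(node):
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    del seen[start]
    return seen  # node -> number of hops from start
```

A "friends of friends" query such as `within_hops("ann", 2)` touches only the nodes reachable within two hops, whereas the relational equivalent typically needs one self-join per hop.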

The advantages are as follows:

  • High performance: graph traversal is an algorithm specific to the graph data structure. Starting from a node and following its relationships, it finds neighboring nodes quickly and easily. The cost of such a lookup is independent of the total data size, because a neighbor query always examines a limited, local part of the data and never searches the entire database

  • Design flexibility: the naturally extensible data structure and schema-free data format give graph database designs great scalability and flexibility. Nodes, relationships, and attributes can be added as requirements change without affecting normal use of the existing data

  • Agile development: the data model is intuitive and stays essentially the same from the initial requirements discussion, through development and implementation, to final storage in the database

  • Full transaction support: unlike other NoSQL databases, Neo4j fully supports ACID transaction management

Disadvantages are as follows:

  • There is a limit on the number of nodes, relationships, and attributes that can be supported
  • Sharding is not supported

2.5.3 Application Scenarios

The application scenarios are as follows:

  • Highly connected relationship data, such as social networks
  • Recommendation engines. Representing the data as a graph makes it much easier to derive recommendations

Not applicable to the following scenarios:

  • Recording large amounts of event-based data (such as log entries or sensor data)
  • Large-scale distributed data processing, similar to Hadoop
  • Structured data that is better suited to storage in a relational database
  • Binary data storage

3 Summary

Choosing between a relational database and a NoSQL database usually means weighing several factors:

  • Data volume
  • Concurrency
  • Real-time requirements
  • Consistency requirements
  • Read/write distribution and types
  • Security
  • Operational costs

Common software system database selection reference is as follows:

  • Internal management systems, such as operations systems: the data volume and concurrency are small, so a relational database is preferred
  • High-traffic systems, such as an e-commerce product detail page: consider a relational database for the back end and an in-memory store for the front end
  • Log systems: consider column storage for the raw data and an inverted index for log search
  • Search systems, such as site search (as opposed to general web search), for example product search: consider a relational database for the back end and an inverted index for the front end
  • Transactional systems, such as inventory, trading, and bookkeeping: consider a relational database plus a cache plus a consistency protocol
  • Offline computing, such as bulk data analysis: consider a column database or a relational database
  • Real-time computing, such as real-time monitoring: consider an in-memory or column database

In design practice, the architecture should be driven by requirements and by the business. Whether you choose RDB, NoSQL, or DRDB, the choice must be demand-oriented, and the final data storage scheme will inevitably be a comprehensive design that balances many trade-offs.

