The five pieces mentioned here refer to the relational database, the index database, the time-series database, the document database and the cache database.



The figure above shows an example of a set of read and write services built around these five types of databases:

1. This is just to show that these databases can be used in this way, not to say that every application needs all of them.

2. The synchronous write service is responsible for persisting important data and caching it immediately.

3. The asynchronous write service listens to the MQ to detect data changes, re-reads the latest data, and writes it to the secondary data sources, such as the document and index databases, writing a state flag back to the cache if necessary (a sketch follows this list).

4. A dedicated data query service routes each query to the appropriate data source according to the requirements and the performance characteristics involved.

5. The data aggregation service reads data from the secondary data sources, aggregates it along the time dimension as required, and writes the results into the time-series database for the monitoring query service to query.
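
As a concrete illustration of step 3, here is a minimal Python sketch of such an asynchronous write service; the MQ consumer and the data-access helpers (load_from_primary_db, write_to_document_db, and so on) are hypothetical placeholders, not a specific library's API.

```python
# A minimal sketch of the asynchronous write service, assuming a hypothetical
# MQ consumer and data-access helpers; all names are illustrative.

def handle_change_event(event):
    """Consume a change notification, re-read the source of truth,
    then fan out to the secondary data sources."""
    record = load_from_primary_db(event["record_id"])  # re-read the latest data (hypothetical helper)
    if record is None:
        return                                         # record was deleted; handle as needed

    write_to_document_db(record)                       # e.g. MongoDB for backend/monitoring use
    write_to_index_db(record)                          # e.g. Elasticsearch for complex queries
    mark_synced_in_cache(record["id"])                 # optionally write a state flag back to the cache

def run_consumer(mq_consumer):
    # mq_consumer is a hypothetical blocking iterator over MQ messages
    for event in mq_consumer:
        handle_change_event(event)
```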

Let’s talk more about these storage systems.

Relational database

There is no doubt that relational databases such as MySQL are the most reliable choice for writing strongly transactional data, and with SSDs a relational database can easily reach 10,000 QPS. For applications with large data volumes and high concurrency, however, a single table with over ten million rows taking tens of thousands of QPS is hard for a single database to support. We need to shard the data: for example, split the data into 128 tables along some dimension, and spread them across 8 or even 16 database clusters. Each MySQL instance then takes only 1/8 or 1/16 of the request load, and the data volume per table is much smaller. The downside is that we have to modify the application so that it queries the sharded tables only by the sharding criterion; without that criterion, we cannot know which table on which instance actually holds the data.
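
As a rough illustration, here is a minimal Python sketch of this kind of shard routing, assuming 128 tables spread over 8 clusters as in the example; the hash function, table prefix and key are illustrative choices, not a prescribed scheme.

```python
# A minimal sketch of routing by shard key; 128 tables over 8 clusters.
import zlib

TABLE_COUNT = 128
CLUSTER_COUNT = 8
TABLES_PER_CLUSTER = TABLE_COUNT // CLUSTER_COUNT  # 16 tables per MySQL cluster

def route(shard_key: str):
    """Map a shard key (e.g. a user id) to a physical table and cluster."""
    h = zlib.crc32(shard_key.encode("utf-8"))          # stable hash of the shard key
    table_index = h % TABLE_COUNT                      # which of the 128 tables
    cluster_index = table_index // TABLES_PER_CLUSTER  # which of the 8 clusters holds it
    return f"order_{table_index:03d}", cluster_index

table, cluster = route("user_42")
# The application must send the query to exactly that cluster and table,
# which is why a query without the shard key cannot be routed.
```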

This is indeed a troublesome spot: we may have a dozen or more query conditions, and being able to query by only one dimension cannot meet our needs. One compromise is to introduce what we call an index table: while writing the full record into the sharded tables, we also write the fields we need to query into a dedicated index table that is not sharded. This table stores almost no VARCHAR data; it holds only various BIGINT business ids, TINYINT states, and timestamps. Because the table is so narrow, it can almost entirely fit in the database cache despite the large number of rows, so query performance is much higher. Conditional queries with strong real-time requirements can be served from this table, while the sharded tables themselves can only be queried by the shard key.
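
A minimal sketch of this dual write might look like the following, reusing the routing idea above; db_execute and the column layout are hypothetical placeholders.

```python
# A minimal sketch of the dual write: the full row goes to its sharded table,
# while only the compact, queryable columns (BIGINT ids, TINYINT status, time)
# go to the single non-sharded index table. db_execute(target, sql, params) is a
# hypothetical helper that runs SQL against a named cluster or connection.

def save_order(order: dict):
    table, cluster = route(order["user_id"])   # shard routing, as sketched above

    # 1. Full record, including VARCHAR payloads, into the sharded table.
    db_execute(cluster,
               f"INSERT INTO {table} (id, user_id, status, detail, created_at) "
               "VALUES (%s, %s, %s, %s, %s)",
               (order["id"], order["user_id"], order["status"],
                order["detail"], order["created_at"]))

    # 2. Narrow row into the non-sharded index table: only ids, status and time,
    #    so the whole table can stay hot in the database cache.
    db_execute("index_db",
               "INSERT INTO order_index (id, user_id, status, created_at) "
               "VALUES (%s, %s, %s, %s)",
               (order["id"], order["user_id"], order["status"], order["created_at"]))
```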

Cache database

Redis is the most commonly used distributed cache solution and appears in almost every Internet application. Its characteristics are:

1. It can persist data, but in my opinion a cache database should be used only as a cache: you must be able to bear the risk of losing the data in it, otherwise things can end badly. There have also been incidents caused by RDB persistence or master-slave replication.

2. Its rich data structures should be exploited. Rich data structures mean you can rely on a rich set of APIs to perform complex operations on the server side, which is far faster than fetching the data, deserializing it, processing it, and then serializing and storing it again. Sometimes these data structures and APIs can even be combined into a brilliant solution that implements a piece of high-performance business logic very efficiently (see the sketch after this list). Check out Redis in Action.

3. Its performance is extremely high (especially with clustering solutions such as Codis), enough to absorb any business request directly. Most of the time, when a cache-based solution goes down, it is because requests penetrated the cache for various reasons, not because Redis itself failed.

4. Rich clustering and high-availability options, plus a variety of practical features (pipelines, transactions, Lua scripting). Version 5.0 also introduced the Stream feature as a replacement for the little-known Disque, which is worth paying attention to.
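
As a small example of the server-side data structures mentioned in point 2, here is a sketch using the redis-py client (assumed available); the key names are illustrative. The sorted-set commands run entirely inside Redis, so there is no fetch, deserialize, modify, serialize round trip.

```python
# A minimal sketch of server-side operations on a Redis sorted set via redis-py.
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def record_score(user_id: str, points: float):
    # ZINCRBY updates the member's score atomically on the Redis server.
    r.zincrby("leaderboard:daily", points, user_id)

def top_players(n: int = 10):
    # ZREVRANGE returns the highest-scoring members, again computed server side.
    return r.zrevrange("leaderboard:daily", 0, n - 1, withscores=True)
```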

So Redis is widely used for:

· Data cache

· Distributed lock

· Message queue

· Server-side computation

In the architecture above, the synchronous write service double-writes to the database and the cache. The purpose is to keep fresh hot data in the cache so that, whether from internal or external callers, single-record queries can be routed directly to the cache.
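
A minimal sketch of this synchronous double write and the corresponding cache-first read, assuming redis-py and a hypothetical save_to_mysql / load_from_mysql pair:

```python
# Persist to the relational database first, then refresh the cache so that
# single-record reads can be served from Redis. Key layout and TTL are illustrative.
import json
import redis

r = redis.Redis()

def sync_write(record: dict):
    save_to_mysql(record)                                          # primary source of truth (hypothetical helper)
    r.set(f"record:{record['id']}", json.dumps(record), ex=3600)   # fresh hot data, 1-hour TTL

def read_single(record_id: int):
    cached = r.get(f"record:{record_id}")
    if cached is not None:
        return json.loads(cached)                                  # served straight from the cache
    record = load_from_mysql(record_id)                            # fall back to the database (hypothetical helper)
    if record is not None:
        r.set(f"record:{record_id}", json.dumps(record), ex=3600)
    return record
```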

Document database

The representative document database is MongoDB, which has been around for many years. I have used MongoDB several times in some non-critical business scenarios. My evaluation is as follows (I have not touched MongoDB for over a year, so it may be biased):

1. Ultra-high write performance and very good read performance (it cannot be compared with Redis; their natures are different). As the data volume grows, performance may degrade significantly.

2. Because it stores documents, the structure is weak (schema-less). It is well suited to storing data whose shape cannot be determined in advance, and indexes can be added later for whatever queries turn out to be needed. One useful scenario is as a crawler's data store, where the data is variable and less important but write performance matters (see the sketch after this list).

3. It is not particularly reliable or stable and may lose data, so I strongly recommend not using it as the core data store but as a bypass database for non-critical services. For example, in the architecture diagram above we can pull the core data from elsewhere, process it appropriately, and save it into MongoDB as a monitoring or back-office database (think of the MEAN stack, which is quite comfortable for simple applications built with scripting languages). If it goes down, so be it; while it is up, it takes a lot of pressure off MySQL.

4. There are many ways to run it, such as sharding, replication, and clustering, but as the data volume grows, operations and maintenance can become a big pit. You may well run into a situation where the whole cluster breaks and cannot start, and data recovery takes a long time. Its memory usage is also quite aggressive, and it is not economical with hardware.
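
As a small example of the schema-less usage described in point 2, here is a sketch using pymongo (assumed available); the database, collection, and field names are illustrative.

```python
# Documents with varying fields can be inserted as-is, and indexes can be
# added later for whatever queries turn out to be needed.
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
pages = client["crawler"]["pages"]

# Documents do not need to share a schema.
pages.insert_one({"url": "https://example.com/a", "title": "A", "fetched_at": "2020-01-01"})
pages.insert_one({"url": "https://example.com/b", "status": 404, "retries": 3})

# Add an index after the fact, once we know we will query by url.
pages.create_index([("url", ASCENDING)], unique=True)
print(pages.find_one({"url": "https://example.com/a"}))
```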

Index database

Elasticsearch, the representative index database, has been a dark horse in recent years; ELK clusters are used by all the major Internet companies. As long as the cluster is properly configured, hundreds of thousands of writes per second is not a big problem; after all, its thoroughly distributed design means write capacity can scale out almost without limit. The characteristics of ES are as follows:

1. A very rich query API: beyond full-text search, the ordinary query APIs are rich and varied, and combined together they can implement all kinds of business logic on the server side. Basically, whatever SQL on MySQL can express, an ES query can express too, plus more powerful full-text search. Of course, the query syntax is somewhat obscure and certainly not as straightforward as SQL.

2. Similar to MongoDB's schema-free model, there is no need to define the table structure in advance.

3. Strong write and read performance, although if there are many indexes the efficiency of writing documents will certainly drop. This is why, in the figure, writes to ES are handled by a dedicated asynchronous process.

4. The naturally distributed design of ES means that, even after writing hundreds of millions or billions of records, it can complete a complex multi-condition query in a fairly acceptable time (say, 10 seconds). For MySQL, such a query might take 10 minutes or even 100 minutes to execute, which is totally unacceptable.

5. ES has good support for querying nested data. After testing, we tend to flatten the data associated with multiple labels directly into one large nested JSON document; the nested data can then be queried easily with the query API, as sketched below.
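
As an example of point 5, here is a sketch of querying such flattened, nested documents with a recent elasticsearch-py client; the index name and field names are illustrative assumptions.

```python
# A minimal sketch: combine an ordinary term filter with a nested query over
# the embedded "labels" field of a hypothetical "orders" index.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="orders",
    query={
        "bool": {
            "must": [
                {"term": {"status": "paid"}},
                {"nested": {
                    "path": "labels",
                    "query": {"term": {"labels.name": "vip"}},
                }},
            ]
        }
    },
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"])
```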

Because of these characteristics, in this architecture diagram we also use ES as a data source for the query service. Queries that meet the following conditions can go to ES:

· Insensitive to data latency, i.e. able to accept that fresh data may not be found for a while

· The query is very complex or a full-text search, so the post-sharding route key cannot be used and the index table cannot meet the requirements

· The result is not just the data of a single table but relatively rich data; a relational query would need to hit multiple tables multiple times

The underlying storage structures of the index database and the document database are completely different. Although many people use ES to completely replace MongoDB, I think ES is suited to storing larger volumes of data than MongoDB and only really pays off when used in a distributed way, while MongoDB remains suitable for non-sharded storage of medium-sized data sets.

Time-series database

InfluxDB is a representative time-series database. For GROUP BY queries over time buckets, ES, MySQL, and MongoDB all offer support at the API level, of course, but their query efficiency is terrible. Therefore, for requirements such as the following, a time-series database is worth considering:

· Monitoring charts

· Aggregation by the time dimension

· Queries whose time dimension can span a long range

· Regular archiving is required

With a traditional scheme, we tend to aggregate and store data at a fixed time granularity. If we want to query both a 1-hour window and a 1-year window, a 5-second aggregation granularity is obviously inappropriate, so we would have to aggregate the incoming data at several different granularities, which takes extra work; a time-series database spares us that trouble. Moreover, the write performance of databases like InfluxDB is very high, comparable to Redis: a single node can even withstand writing 100,000 metric points, which basically meets the needs of most application scenarios. For monitoring business metrics, tracking business events, and aggregating business data along the time dimension, it is well worth introducing a dedicated time-series database.
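
As a rough illustration, here is a sketch of writing business metrics to InfluxDB 2.x with the official influxdb-client package and aggregating them in 5-minute windows with Flux; the URL, token, org, bucket, and measurement names are placeholder values.

```python
# A minimal sketch of writing points and letting InfluxDB do the time bucketing.
from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")
write_api = client.write_api(write_options=SYNCHRONOUS)

# One point per business event; aggregation happens at query time.
point = Point("orders").tag("region", "eu").field("amount", 42.0)
write_api.write(bucket="business-metrics", record=point)

# Flux query aggregating in 5-minute windows over the last hour.
flux = '''
from(bucket: "business-metrics")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "orders")
  |> aggregateWindow(every: 5m, fn: sum)
'''
tables = client.query_api().query(flux)
```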

To sum up, the architecture diagram here reflects only a few important ideas:

1. Use dedicated services for writing and reading data, to make routing easier.

2. Plan the sharding scheme sensibly, and think through a complete set of query approaches for the RDBMS after sharding.

3. Distinguish synchronous writes to primary data sources from asynchronous writes to secondary data sources, keeping the main flow fast.

4. Make reasonable use of the characteristics of different data sources and use them in combination, playing to their strengths and avoiding their weaknesses.

5. Data processing can be layered, with specialized business middleware handling each layer.

6. Think twice before using anything other than an RDBMS as the primary core storage engine.

7. Using a rich set of data sources means higher maintenance costs, and some data inconsistency is inevitable; we need to consider what level of inconsistency we can accept.