Consider: Does NoSQL require data modeling?

Scenario 1: Suppose we need to store employee and department information for an enterprise. Should we embed the full department information in each employee record, or store only the department's primary key ID? Scenario 2: Suppose a batch of sensors reports temperature and humidity readings every minute, and the back end needs to support some aggregated queries. How should we model the data?

I. The Three Normal Forms

1. Definitions of the three normal forms

(1) First Normal Form (1NF): each column of a database table must be an indivisible atomic value.

For example, the address column below can be split into four fields: province, city, district, and detailed address.

| No. | Name | Gender | Age | Address |
| --- | ---- | ------ | --- | ------- |
| 1 | Zhang San | Male | 18 | Section 1, Wanshun Road, Shuangliu District, Chengdu City, Sichuan Province |
| 2 | Li Si | Female | 17 | XXX Road, Jingyang District, Deyang City, Sichuan Province |

The first normal form requires columns to be broken down to the smallest practical granularity: no column stores multiple values, and every column can be queried independently.

Apply this according to actual needs: if every query returns the entire address, there is no need to split it.

(2) Second Normal Form (2NF): the table must have a primary key, and every non-key attribute must depend on the whole primary key. Partial dependencies, where an attribute depends on only part of a composite primary key, are not allowed.

For example, consider an order details table. Because an order may contain multiple products, the order number and product ID together form a composite primary key, as shown in the following table.

| Order No. | Product ID | Product Name | Quantity | Unit | Price | Product Category | Logistics |
| --------- | ---------- | ------------ | -------- | ---- | ----- | ---------------- | --------- |
| 001 | 1 | Facial cleanser | 1 | piece | 20 | Personal care | ZTO Express |
| 001 | 2 | Toothpaste | 2 | piece | 15 | Personal care | Yunda Express |
| 002 | 3 | Electric fan | 1 | piece | 200 | Appliances | JD Logistics |

Consider the product category attribute: it depends only on the product ID, that is, on only part of the primary key, which violates the second normal form. The fix is to move the product category into a separate product information table.

(3) Third Normal Form (3NF): every non-key column must depend on the primary key directly, not transitively.

For example, in the table above, the product name, quantity, unit, and category depend directly on the product ID and only indirectly on the order number, so the design does not satisfy the third normal form. All of this information belongs in the product information table.

2. Strengths and weaknesses of normalization

(1) Normalized design eliminates redundancy and therefore uses less space. Normalized tables are also easier to update, which helps ensure data consistency. The drawback is that joined queries are slow: some queries require multiple lookups, and if we consider only disk operations, each extra lookup adds random disk I/O, which is expensive.

(2) Denormalized design generally optimizes read performance. MongoDB rarely relies on joins, so a nested (embedded) design reduces the number of round trips (network I/O) between the client and the database. Embedding also makes writes atomic: an update to a single document either completely succeeds or completely fails.

II. Model Design

When modeling data we should weigh many factors, such as the read/write ratio, how the data is queried, and the capabilities of the database itself, and trade them off to find the design that best fits our business scenario.

1. Embedded design

Embedding stores related data in a single structure or document, and MongoDB encourages it wherever possible. In MongoDB, writes are atomic at the level of a single document, even when the operation modifies multiple embedded documents within that document. When a single write operation modifies multiple documents (for example, db.collection.updateMany()), the change to each document is atomic, but the operation as a whole is not.

Typically, embedded data models are used when:

  • There is a "contains" relationship between entities (one-to-one).

  • There is a one-to-many relationship between entities, where the "many" (child) documents are always viewed in the context of the "one" (parent) document.

In general, embedding provides better read performance and lets a single database operation request and retrieve all of the related data. It also keeps updates to a single document atomic.
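
For instance, one statement can modify several embedded fields of a single document, and all of the changes either succeed or fail together (a minimal sketch; the collection and field names are assumptions):

db.users.updateOne(
   { "_id": "joe" },
   {
      "$set":  { "address.city": "Boston" },   // modify a field of an embedded document
      "$push": { "hobbies": "reading" }        // append to an embedded array
   }
)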

(1) Embedded documents

Embedded documents can express a one-to-many relationship, as follows:

{
   "_id": "joe",
   "name": "Joe Bookreader",
   "addresses": [
      {
         "street": "123 Fake Street",
         "city": "Faketon",
         "state": "MA",
         "zip": "12345"
      },
      {
         "street": "1 Some Other Street",
         "city": "Boston",
         "state": "MA",
         "zip": "12345"
      }
   ]
}

As shown, the embedded design lets a single query retrieve all of the needed information. But there are limits:

  • Embedded arrays cannot grow without bound

  • A single document cannot exceed 16 MB (a limit that protects memory and bandwidth)

See: docs.mongodb.com/manual/refe…

(2) Embedded reference

An embedded reference is a variant of the embedded document pattern: the parent document stores only the IDs of the subdocuments rather than the subdocuments themselves. As follows:

// User information
{
    "_id": "joe",
    "name": "Joe Bookreader",
    "addresses": [ 1, 2 ]
}

// Address information
[
    {
        "id": 1,
        "street": "123 Fake Street",
        "city": "Faketon",
        "state": "MA",
        "zip": "12345"
    },
    {
        "id": 2,
        "street": "1 Some Other Street",
        "city": "Boston",
        "state": "MA",
        "zip": "12345"
    }
]

The first query fetches the set of address IDs; a second query then retrieves all of the address details with $in, so query performance remains good.
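
A minimal sketch of this two-step lookup (the collection names users and addresses are assumptions):

// 1. get the address ID set from the user document
var user = db.users.findOne({ "_id": "joe" }, { "addresses": 1 });
// 2. resolve all referenced addresses in one query
db.addresses.find({ "id": { $in: user.addresses } });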

The main reasons for using embedded references instead of full embedding:

  • The subdocuments are large, can grow without bound, or may push the parent past the 16 MB limit

  • The subdocuments are updated frequently; with references, updating a subdocument does not require rewriting every parent document that embeds it

(3) Reference pattern

The reference pattern is similar to a foreign key: a field in one collection refers to a document in another collection.

The following expresses a one-to-one relationship with a reference:

// Student table
{
  "id": 1,
  "name": "Zhang San",
  "class": 1
}

// Class table
{
  "id": 1,
  "name": "Class 1, Grade 1",
  "teacher": "Miss Zhang"
}

A many-to-many relationship is expressed as follows:

// Teacher table
[
  { "id": 1, "name": "Miss Zhang" },
  { "id": 2, "name": "Miss Li" }
]

// Subject table
[
  { "id": 1, "name": "Chinese" },
  { "id": 2, "name": "Mathematics" },
  { "id": 3, "name": "English" }
]

// Teacher-subject relationship table
[
  { "teacher_id": 1, "subject_id": 1 },
  { "teacher_id": 1, "subject_id": 2 }
]

Reasons for choosing the reference pattern:

  • Associated documents are very large or grow without limit. For example, a list of comments on Weibo.
  • The hierarchy of business entity relationships is complex
  • The relationship is many-to-many
  • Data consistency requirements are high, so redundant data should be avoided

(4) Subset pattern

Most of the time, queries may not need all of the data embedded in a document; carrying that unneeded data puts extra load on the server and slows down reads. The subset pattern keeps only the most frequently accessed subset of the data in the main document, so that common requests are served by a single database call.

The following application displays movie information:

{
  "_id": 1,
  "title": "The Arrival of a Train",
  "year": 1896,
  "runtime": 1,
  "released": ISODate("1896-01-25"),
  "poster": "http://ia.media-imdb.com/images/M/MV5BMjEyNDk5MDYzOV5BMl5BanBnXkFtZTgwNjIxMTEwMzE@._V1_SX300.jpg",
  "plot": "A group of people are standing in a straight line along the platform of a railway station, waiting for a train, which is seen coming at some distance. When the train stops at the platform, ...",
  "fullplot": "A group of people are standing in a straight line along the platform of a railway station, waiting for a train, which is seen coming at some distance. When the train stops at the platform, the line dissolves. The doors of the railway-cars open, and people on the platform help passengers to get off.",
  "lastupdated": ISODate("2015-08-15T10:06:53"),
  "type": "movie",
  "directors": [ "Auguste Lumière", "Louis Lumière" ],
  "imdb": {
    "rating": 7.3,
    "votes": 5043,
    "id": 12
  },
  "countries": [ "France" ],
  "genres": [ "Documentary", "Short" ],
  "tomatoes": {
    "viewer": {
      "rating": 3.7,
      "numReviews": 59
    },
    "lastUpdated": ISODate("2020-01-09T00:02:53")
  }
}

Currently the movie collection contains several fields that the application does not need when displaying a simple overview of a movie, such as fullplot and the rating information. Instead of storing all of the movie data in one collection, you can split it into two collections:

The movie collection contains basic information about the movie. This is the data the application loads by default:

// movie collection
{
  "_id": 1,
  "title": "The Arrival of a Train",
  "year": 1896,
  "runtime": 1,
  "released": ISODate("1896-01-25"),
  "type": "movie",
  "directors": [ "Auguste Lumière", "Louis Lumière" ],
  "countries": [ "France" ],
  "genres": [ "Documentary", "Short" ]
}

The movie_details collection contains additional, less frequently accessed data for each movie:

// movie_details collection
{
  "_id": 156,
  "movie_id": 1, // reference to the movie collection
  "poster": "http://ia.media-imdb.com/images/M/MV5BMjEyNDk5MDYzOV5BMl5BanBnXkFtZTgwNjIxMTEwMzE@._V1_SX300.jpg",
  "plot": "A group of people are standing in a straight line along the platform of a railway station, waiting for a train, which is seen coming at some distance. When the train stops at the platform, ...",
  "fullplot": "A group of people are standing in a straight line along the platform of a railway station, waiting for a train, which is seen coming at some distance. When the train stops at the platform, the line dissolves. The doors of the railway-cars open, and people on the platform help passengers to get off.",
  "lastupdated": ISODate("2015-08-15T10:06:53"),
  "imdb": {
    "rating": 7.3,
    "votes": 5043,
    "id": 12
  },
  "tomatoes": {
    "viewer": {
      "rating": 3.7,
      "numReviews": 59
    },
    "lastUpdated": ISODate("2020-01-29T00:02:53")
  }
}

This approach improves read performance because the application reads less data to satisfy its most common request. When needed, it issues an additional query for the less frequently accessed data.
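
A minimal sketch of the two-step access (the collection names movies and movie_details follow the example above):

// load the summary shown by default; fetch details only when the user drills in
var movie = db.movies.findOne({ "_id": 1 });
var details = db.movie_details.findOne({ "movie_id": movie._id });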

Tip:

At design time, put the most frequently accessed data in the main document. If every request needs all of the data, however, this pattern does not fit; as always, it depends on the usage scenario.

2. Tree structure design

(1) Model tree structure with parent references

The parent reference pattern stores each tree node in a document. In addition to the tree node, the document stores the parent ID of that node.

Look at the following example:

db.categories.insertMany( [
   { _id: "MongoDB", parent: "Databases" },
   { _id: "dbm", parent: "Databases" },
   { _id: "Databases", parent: "Programming" },
   { _id: "Languages", parent: "Programming" },
   { _id: "Programming", parent: "Books" },
   { _id: "Books", parent: null }
] )

Query the parent node of a node:

db.categories.findOne( { _id: "MongoDB" } ).parent

Create an index on the parent field:

db.categories.createIndex( { parent: 1 } )

Query the children of a node:

db.categories.find( { parent: "Databases" } )
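
Walking up the whole ancestor chain with parent references takes one query per level; since MongoDB 3.4, a single $graphLookup aggregation can collect all ancestors in one call (a sketch against the categories collection above):

db.categories.aggregate( [
   { $match: { _id: "MongoDB" } },
   { $graphLookup: {
        from: "categories",
        startWith: "$parent",          // begin with this node's parent
        connectFromField: "parent",    // then keep following parent links upward
        connectToField: "_id",
        as: "ancestors"
   } }
] )
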
(2) Model tree structure with child references

The child reference pattern stores each tree node in a document. In addition to the node itself, the document holds an array with the IDs of the node's children.

Take an example:

db.categories.insertMany( [
   { _id: "MongoDB", children: [] },
   { _id: "dbm", children: [] },
   { _id: "Databases", children: [ "MongoDB", "dbm" ] },
   { _id: "Languages", children: [] },
   { _id: "Programming", children: [ "Databases", "Languages" ] },
   { _id: "Books", children: [ "Programming" ] }
] )

Query the children of a node:

db.categories.findOne( { _id: "Databases" } ).children

Create an index on the children field:

db.categories.createIndex( { children: 1 } )

Query the parent of a node:

db.categories.find( { children: "MongoDB" } )

The child reference pattern works well as long as you do not need to operate on entire subtrees. It also suits graph-like structures in which a node may have multiple parents.

(3) Model tree structure with ancestor array

The ancestor array pattern stores each tree node in a document. In addition to the node itself, the document stores the node's ancestors (its path) as an array of IDs.

The following example models a tree using ancestor arrays. In addition to the ancestors field, these documents store a reference to the immediate parent in the parent field:

db.categories.insertMany( [
  { _id: "MongoDB", ancestors: [ "Books", "Programming", "Databases" ], parent: "Databases" },
  { _id: "dbm", ancestors: [ "Books", "Programming", "Databases" ], parent: "Databases" },
  { _id: "Databases", ancestors: [ "Books", "Programming" ], parent: "Programming" },
  { _id: "Languages", ancestors: [ "Books", "Programming" ], parent: "Programming" },
  { _id: "Programming", ancestors: [ "Books" ], parent: "Books" },
  { _id: "Books", ancestors: [ ], parent: null }
] )

Query a node's ancestors (its path):

db.categories.findOne( { _id: "MongoDB" } ).ancestors

Create an index on the ancestors field:

db.categories.createIndex( { ancestors: 1 } )

Find all descendants of a node:

db.categories.find( { ancestors: "Programming" } )

By indexing the elements of the ancestors field, the ancestor array pattern can find a node's ancestors and descendants quickly, which makes subtree lookups fast.

(4) Model tree structure with materialized paths

The materialized path pattern stores each tree node in a document. In addition to the node itself, the document stores the node's ancestors (its path) as a string. Although this pattern requires extra work with strings and regular expressions, it offers more flexibility, such as finding nodes by partial paths.

Example:

db.categories.insertMany( [
   { _id: "Books", path: null },
   { _id: "Programming", path: ",Books," },
   { _id: "Databases", path: ",Books,Programming," },
   { _id: "Languages", path: ",Books,Programming," },
   { _id: "MongoDB", path: ",Books,Programming,Databases," },
   { _id: "dbm", path: ",Books,Programming,Databases," }
] )

Retrieve the entire tree, sorted by the path field:

db.categories.find().sort( { path: 1 } )

Query the descendants of Programming using a regular expression:

db.categories.find( { path: /,Programming,/ } )

Query Books’ descendants:

db.categories.find( { path: /^,Books,/ } )

Create index on path field:

db.categories.createIndex( { path: 1 } )

This index is only used efficiently for left-anchored expressions. Queries anchored at the root, such as /^,Books,/ or /^,Books,Programming,/, can use the index to improve performance; unanchored patterns such as /,Programming,/ cannot take full advantage of it.

(5) Model tree structure with nested sets

This pattern is less common; see the MongoDB documentation for details.

3. The bucket pattern

The bucket pattern groups multiple related documents into a single document along some dimension, usually time. It is implemented with MongoDB's embedded documents or arrays.

The bucket pattern is ideal for Internet of Things (IoT), real-time analytics, and time-series data scenarios.

Example: suppose we need to collect data from sensors that report temperature and humidity once a minute. The back end stores the readings and provides queries and some statistics.

(1) Traditional way

The obvious first design stores one document per reading:

> db.sensor.find().pretty()
{
	"_id" : ObjectId("60e319d4803e0e77e67b5e9e"),
	"sensor_id" : "SENSOR-1",
	"temperature" : 20.36,
	"humidity" : 0.67,
	"created_time" : "2021-07-05 10:01:00"
}
{
	"_id" : ObjectId("60e319d4803e0e77e67b5e9f"),
	"sensor_id" : "SENSOR-2",
	"temperature" : 21.36,
	"humidity" : 0.87,
	"created_time" : "2021-07-05 10:01:00"
}

Each report inserts a new document:

> db.sensor.insertOne({ "sensor_id": "SENSOR-3", "temperature": 21.36, "humidity": 0.67, "created_time": "2021-07-05 10:01:00" })

To query one sensor's temperature and humidity over a period of time:

> db.sensor.find({ "sensor_id": "SENSOR-1", "created_time": { $gte: "2021-07-05 10:00:00", $lt: "2021-07-05 11:00:00" } })
{ "_id" : ObjectId("60e319d4803e0e77e67b5e9e"), "sensor_id" : "SENSOR-1", "temperature" : 20.36, "humidity" : 0.67, "created_time" : "2021-07-05 10:01:00" }
{ "_id" : ObjectId("60e31b15803e0e77e67b5ea1"), "sensor_id" : "SENSOR-1", "temperature" : 23.36, "humidity" : 0.57, "created_time" : "2021-07-05 10:02:00" }
{ "_id" : ObjectId("60e31b20803e0e77e67b5ea2"), "sensor_id" : "SENSOR-1", "temperature" : 25.36, "humidity" : 0.67, "created_time" : "2021-07-05 10:03:00" }

To speed up the query, create a compound index:

db.sensor.createIndex({ "sensor_id": 1, "created_time": 1 })

This design is easy to read and write, but the document count grows quickly. With 100 sensors reporting once a minute, a month produces 100 × 60 × 24 × 30 = 4.32 million documents, and 10,000 sensors would exceed 400 million.

(2) Bucketing by time

On reflection, we rarely look at a single reported value; we care far more about averages and trends over a period of time.

We now bucket the data by day, subdivided by hour and minute:

  • Within a day, each hour is keyed "0", "1", ... "23", for 24 slots in total.

  • Within an hour, each minute is keyed "0", "1", ... "59", for 60 slots in total.

The final document looks like this:

{
    "sensor_id": "SENSOR-1",
    "data": {
        // hour 00
        "0": {
            // 00:00
            "0": { "temperature": 21.36, "humidity": 0.67 },
            // 00:01
            "1": { "temperature": 22.36, "humidity": 0.57 },
            // 00:02
            "2": { "temperature": 23.36, "humidity": 0.37 },
            // 00:59
            "59": { "temperature": 25.36, "humidity": 0.37 }
        },
        // hour 01
        "1": {
            "0": { "temperature": 21.36, "humidity": 0.67 },
            "1": { "temperature": 22.36, "humidity": 0.57 },
            "2": { "temperature": 23.36, "humidity": 0.37 }
        },
        // hour 23
        "23": {
            "0": { "temperature": 21.36, "humidity": 0.67 },
            "1": { "temperature": 22.36, "humidity": 0.57 },
            "2": { "temperature": 23.36, "humidity": 0.37 }
        }
    },
    // the timestamp is rounded down to the day
    "created_time": "2021-07-05 00:00:00"
}

One sensor's readings for a whole day now live in a single document, so querying a day's data looks like this:

db.sensor_bucket.find({ "sensor_id": "SENSOR-1", "created_time": "2021-07-05 00:00:00" })

To query the reading at a specific minute of the day:

> db.sensor_bucket.find(
    { "sensor_id": "SENSOR-1", "created_time": "2021-07-05 00:00:00" },
    { "sensor_id": 1, "created_time": 1, "data.23.1": 1 }
  ).pretty()
{
	"_id" : ObjectId("60e31fd6803e0e77e67b5ea3"),
	"sensor_id" : "SENSOR-1",
	"data" : {
		"23" : {
			"1" : {
				"temperature" : 22.36,
				"humidity" : 0.57
			}
		}
	},
	"created_time" : "2021-07-05 00:00:00"
}

If you want to query data for a period of time within a day:

> db.sensor_bucket.find(
    { "sensor_id": "SENSOR-1", "created_time": "2021-07-05 00:00:00" },
    { "sensor_id": 1, "created_time": 1, "data.0.0": 1, "data.0.1": 1, "data.0.2": 1 }
  ).pretty()
{
	"_id" : ObjectId("60e31fd6803e0e77e67b5ea3"),
	"sensor_id" : "SENSOR-1",
	"data" : {
		"0" : {
			"0" : {
				"temperature" : 21.36,
				"humidity" : 0.67
			},
			"1" : {
				"temperature" : 22.36,
				"humidity" : 0.57
			},
			"2" : {
				"temperature" : 23.36,
				"humidity" : 0.37
			}
		}
	},
	"created_time" : "2021-07-05 00:00:00"
}

Writing data becomes a document update:

db.sensor_bucket.updateOne(
	{
		"sensor_id": "SENSOR-1",
		"created_time": "2021-07-05 00:00:00"
	},
	{
		"$set": {
			"data.0.3": {
				"temperature": 23.56,
				"humidity": 0.67
			}
		}
	},
	{ "upsert": true }
)

(3) Pre-aggregation

As the storage layout changes, so does the way we read. If we need aggregates over certain periods, such as the average temperature between 15:00 and 16:00, there are several options:

  • Query the period's data from the document and compute the statistics in the application.

  • Use pre-aggregation: write the expected results ahead of time, either to a separate statistics collection or directly into the current document.

Pre-aggregation is ideal when statistics are requested frequently: the data is computed once at write time and then serves all subsequent queries.
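
As a minimal sketch of the second option, the write can maintain a running sum and count per hour inside the bucket document (the stats field and its layout are illustrative assumptions); the hourly average is then stats.15.temperature_sum divided by stats.15.count, with no scan of the raw minute slots:

db.sensor_bucket.updateOne(
	{ "sensor_id": "SENSOR-1", "created_time": "2021-07-05 00:00:00" },
	{
		// store the raw reading for 15:30 ...
		"$set": { "data.15.30": { "temperature": 23.5, "humidity": 0.62 } },
		// ... and pre-aggregate the hourly totals in the same atomic update
		"$inc": { "stats.15.temperature_sum": 23.5, "stats.15.count": 1 }
	},
	{ "upsert": true }
)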

(4) Comparison

To compare the two designs, we can write a test program that inserts a month's worth of data for 100 sensors:

| Comparison item | Traditional design | Bucketed design |
| --------------- | ------------------ | --------------- |
| Document count | 4.32 million | 3,000 |
| Total document size | 432 MB | 198 MB |
| Average document size | 104 KB | 65 KB |
| Index size | 172 MB | 91 MB |

We can see that bucketing reduces the document count and index size considerably. Two caveats, though: a single document must not exceed 16 MB, and the update/upsert writes are slower than plain inserts.

4. Data pagination

(1) Traditional pagination

Take an article list as an example (the UI shows previous page, next page, 1, 2, 3, 4, ..., and jump to page N). The traditional approach passes a page number (the current page) and a page size (entries per page). To query the second page with 20 entries per page, the statement is:

db.test.find().limit(20).skip(20)

As the page number increases, the skip value becomes extremely large. Let's look at the execution statistics for such a statement:

db.test.find().limit(20).skip(2500000).explain("executionStats")

The results are as follows:

{
	...
	"executionStats" : {
		"executionSuccess" : true,
		"nReturned" : 20,
		"executionTimeMillis" : 733,
		"totalKeysExamined" : 0,
		"totalDocsExamined" : 2500020,
		"executionStages" : {
			"stage" : "LIMIT",
			"nReturned" : 20,
			"executionTimeMillisEstimate" : 8,
			"works" : 2500022,
			"advanced" : 20,
			"needTime" : 2500001,
			"needYield" : 0,
			"saveState" : 19531,
			"restoreState" : 19531,
			"isEOF" : 1,
			"limitAmount" : 20,
			"inputStage" : {
				"stage" : "SKIP",
				"nReturned" : 20,
				"executionTimeMillisEstimate" : 7,
				"works" : 2500021,
				"advanced" : 20,
				"needTime" : 2500001,
				"needYield" : 0,
				"saveState" : 19531,
				"restoreState" : 19531,
				"isEOF" : 0,
				"skipAmount" : 0,
				"inputStage" : {
					"stage" : "COLLSCAN",
					"nReturned" : 2500020,
					"executionTimeMillisEstimate" : 6,
					"works" : 2500021,
					"advanced" : 2500020,
					"needTime" : 1,
					"needYield" : 0,
					"saveState" : 19531,
					"restoreState" : 19531,
					"isEOF" : 0,
					"direction" : "forward",
					"docsExamined" : 2500020
				}
			}
		}
	},
	...
}

Although only 20 records are returned, MongoDB still scans the 2.5 million skipped records. The skip is implemented by iterating a cursor, so once the data reaches tens of millions of records, CPU consumption is high and responses are slow.

(2) Cursor-based pagination

This scheme is practical for apps that only need previous-page and next-page navigation.

The client passes two parameters, a cursor (the boundary value of the current page) and a direction (forward or backward), and the query looks like this:

db.test.find({"id":{$gt:2999981}}).limit(20)

The cursor is usually the primary key, so the query is very efficient: the index locates the cursor value directly and no unneeded data is traversed.
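
A sketch of both directions, assuming results are ordered by _id and that lastId / firstId are the boundary _id values of the page currently displayed:

// next page: the 20 records after the last one currently shown
db.test.find({ "_id": { $gt: lastId } }).sort({ "_id": 1 }).limit(20)

// previous page: the 20 records before the first one currently shown
// (returned in descending order; reverse client-side for display)
db.test.find({ "_id": { $lt: firstId } }).sort({ "_id": -1 }).limit(20)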

(3) A compromise

Similar to Google search: provide page numbers and previous/next navigation, but do not allow jumping to an arbitrary page. Because a large dataset has too many pages to display them all, group the page numbers (for example, in groups of 10) and always show one group on the interface.

Suppose we show 10 items per page, we are currently on page 10, and we want to view page 13:

db.test.find({"_id":{$gt:100}}).skip(10*2).limit(10)

Here the value after $gt is the _id (cursor) of the last record on page 10, and skip is computed as page_size × pages skipped (jumping from page 10 to page 13 skips 2 full pages of 10). Although skip is still used, the _id bound fixes the starting point, and with page groups of 10 we skip at most 10 × 10 records, which remains fast.