This is Chen Shi's award-winning article from the MongoDB Chinese community essay competition. Enjoy.



preface





By chance I saw that the MongoDB Chinese community was holding an essay contest, which looked quite interesting. Although I am still on my way to becoming an expert, I wanted to take part. Hence this article.

After thinking it over, I found that "from 0 to 1+" fits topic 1 well: how to learn the core technology of MongoDB, from shallow to deep. Why 1+ instead of 1? Because 0 is the starting point, the origin; 1 is the journey; and 1+ is continuous progress in that direction. After all, there is no end to learning.

How should we approach this? I work professionally on databases, but I have not been using MongoDB for long. Luckily my project deals with the kernel, so I have studied it for a while and gained a little from it; this article will therefore certainly talk about how to learn MongoDB. Since it is a distributed database, we also need to cover the basic concepts and principles of distributed technology. That gives us plenty to talk about. Of course, a single article cannot describe the whole picture, but we can try to grasp the fundamental, abstract ideas from a high-level perspective; that pays off no matter which database you end up working with.



Getting started






Without a doubt, the most useful material is the official documentation, which is rich enough to take a while to work through. But if you just stare at the docs for a few weeks, you probably won't get much out of them; they are best used by reading and doing at the same time.

Official documentation link:

docs.mongodb.com

There is a website called DB-Engines Ranking that ranks all kinds of databases, and MongoDB always scores very high, which shows its appeal. Mongo is a document-oriented NoSQL database system. Note that "document" here is not a Word/Excel-style document; it specifically means a JSON document, something like {k1: v1, k2: v2, ...}. As we all know, databases can be categorized along different dimensions to make them easier to understand and compare. Besides the document type, there are KV, column-family, graph, and other models. Are there other document databases in the same category as Mongo? Sure; I have worked with Couchbase, CouchDB (written in Erlang), and DocumentDB before, and in performance comparisons I think they hold up well, although people prefer to compare MySQL with Mongo.

MongoDB in Action is a book that explains concepts in a way the official documentation cannot. Although it covers the older 3.0 version, I think usage does not differ much from the current 4.2, so it is fine to move on to newer material once you are familiar with basic usage and principles.

I am not going to walk through the various commands here; you can get those from the official documentation. Instead, here are some insights.



schema-free


Schema-free means there is no schema, or that the schema is loose, relative to a relational schema. New database users may have only a vague notion of what a schema is; it is one of those words that never translates well and feels native to English. Here is a definition I copied [1]:

A database schema is the skeleton structure that represents the logical view of the entire database. It defines how the data is organized and how the relations among them are associated. It formulates all the constraints that are to be applied on the data.

It is a skeleton, the skeleton of a database, that defines its logical view: what it looks like from the outside, including how the data is organized, how it is related, and what constraints it carries. So it is descriptive detail, specified at database design time to help developers build a mental view.

Why focus on schema here? Because Mongo's schema does look different, and it feels a little awkward to people used to the relational world. In fact you could call it schema-less, as if there were no schema at all. It is very free-form: you can add fields and attributes as you wish. This obviously suits businesses that evolve dynamically: at the beginning you do not know what fields your business will need, and you just want to add them later as they come up. Mongo fits that level of flexibility very well, whereas a relational schema provides the same capability only at a much higher cost.
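The flexibility described above can be sketched with plain Python dicts standing in for documents: documents in one collection need not share fields, and adding a field later is just writing it. The `find` helper is a toy stand-in for a collection query, not a real MongoDB API.

```python
# A minimal sketch of schema-free documents, modeled as plain dicts.
# Each document in a collection may carry its own set of fields;
# no ALTER TABLE is needed to add one.

users = [
    {"_id": 1, "name": "alice"},                  # minimal document
    {"_id": 2, "name": "bob", "age": 30},         # extra field: age
    {"_id": 3, "name": "carol", "age": 25,
     "tags": ["admin", "beta"]},                  # nested array field
]

def find(collection, query):
    """Return documents whose fields match every (key, value) in query."""
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in query.items())]

# Adding a new field later is just writing it into one document:
users[0]["age"] = 41
```

Notice that documents missing the queried field simply fail the match; nothing in the "schema" has to be declared up front.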



Applicable scenario


Speaking of which, here is a question database people have to answer all the time. What do you say to a user who asks whether his business can move onto MongoDB? It is easy for him to ask but not easy for you to answer; at the very least you have to ask follow-up questions, because he has not given you enough information. Such as:

What kind of business is it? How much data?

What is the read-write ratio? How big is the read/write QPS?

What are the characteristics of reading and writing, such as low peaks at night?

Access Pattern?

and so on

Generally speaking, the business should provide this information so that we can judge whether it is a good fit. After all, there are so many databases on the market; if every one suited every situation, why would we have so many? One size does not fit all.

However, in some cases, such as the public cloud, the business context may be confidential: the user does not want to tell you, or cannot estimate it well himself. Can you still answer? I can't; it is better to run the workload in a test environment. A few days ago I saw KeyViz, shared by TiDB, and it gave me some ideas; if anyone is interested in this kind of observability tool, let's discuss it together.

According to MongoDB in Action, mongo can be used in the following scenarios:

  • web app

That is too broad a category. But it is true that MongoDB is widely used in web applications, which typically require high scalability, flexible and rich queries, and dynamically added fields.

  • Agile development

The main point here is that the absence of a fixed schema makes it well suited to agile development methodologies.

  • Analytical and logging

A capped collection is ideal for logging data, but I do not see Mongo used much for heavy analytics, where a dedicated OLAP database is a better fit.

  • caching

  • Variable schema
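The capped collection mentioned above (created with something like db.createCollection("log", {capped: true, size: 1048576})) keeps insertion order and silently evicts the oldest documents when full. A rough Python sketch of that eviction behavior, capping by entry count with deque(maxlen=N) rather than by bytes as the real thing does:

```python
from collections import deque

# Toy model of a capped collection: fixed capacity, FIFO eviction,
# insertion order preserved. Real capped collections cap by byte size;
# here we cap by document count for simplicity.
log = deque(maxlen=3)

for i in range(5):
    log.append({"seq": i, "msg": f"event {i}"})

# Only the 3 newest entries survive, still in insertion order.
```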



Script building


My personal advice is to write a script that builds a cluster for you, so the whole thing can be generated with one command rather than typed one command at a time. My script is available for reference [7].


I’ll talk later about debugging the kernel with GDB on the basis of quickly creating custom clusters.




Distributed concepts and principles






That’s too big a field!


MongoDB is a distributed database, and the network distance between nodes is far greater than within a single-machine database, so all sorts of things can go wrong (Google "the 8 fallacies of distributed computing"). For deeper background relevant to Mongo, be sure to read DDIA [6], deservedly the most popular book on the subject so far.


I am going to describe, very simply and in my own words, the background of each topic, why we need it, and how Mongo handles it; I suggest you Google for more.

Consensus protocols



background


Simply put, a consensus protocol lets nodes reach agreement. The best-known example is Raft, the work of Stanford professor John Ousterhout and his PhD student Diego Ongaro, which has been adopted by a variety of distributed databases such as TiDB and PolarDB.

There are other protocols, of course: Lamport's Paxos (used in Chubby), ZooKeeper's ZAB, and MongoDB's PV1.



Why we need it


Simply put: when multiple nodes make decisions together, how do you decide anything if you say yours and I say mine? It is like a room full of people talking over each other; nothing gets done. So in a distributed system we need a set of rules that lets the nodes agree on events and outcomes. This maps well onto the real world.



What did Mongo do


Mongo uses PV1 (protocol version 1), a Raft-style protocol with extensive extensions. For example, rs.conf() can configure priority, hidden, votes and other attributes for each node, which gives great flexibility; it also adds PreVote, dry-run elections, and more. For details, see the relevant documentation.
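The election arithmetic behind these settings can be sketched in a few lines. This is a toy model, not the real PV1 code: members are dicts shaped like rs.conf().members entries, a candidate needs votes from a majority of voting members, and members with votes: 0 (e.g. some hidden nodes) are excluded from the count. Priority influences who stands for election, not the majority threshold.

```python
# Toy sketch of replica-set election counting (not the real PV1 code).
def election_won(members, votes_received):
    """members: list of dicts like rs.conf().members entries.
    votes_received: set of member _ids that voted for the candidate."""
    voting = [m["_id"] for m in members if m.get("votes", 1) > 0]
    needed = len(voting) // 2 + 1          # strict majority of voters
    got = sum(1 for m in votes_received if m in voting)
    return got >= needed

members = [
    {"_id": 0, "priority": 2},              # preferred primary
    {"_id": 1, "priority": 1},
    {"_id": 2, "priority": 0, "votes": 0},  # non-voting hidden node
    {"_id": 3, "priority": 1},
    {"_id": 4, "priority": 1},
]
```

With four voting members, three votes are needed; a ballot from the votes: 0 member changes nothing.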


Isolation levels and consistency (ACID / CAP)



background


These concepts are related, so I group them here. Note that "consistency" is heavily overloaded: the C in ACID (each transaction moves the database from one valid state to another) is not the same thing as the C in CAP (agreement between replicas).


CAP was coined by Brewer around 2000, and many papers today recommend against using it because it is ambiguous;


Many papers also use a whole family of consistency terms, such as:


– Causal consistency, which Mongo supports

– Linearizability: for a single object, a read always returns the latest write

– Serializability: concerns multiple transactions over multiple objects; the strongest isolation level in relational DBs

– Strict serializability: linearizability + serializability, mentioned in the Google Spanner paper

– Sequential consistency: weaker than linearizability; e.g. the default `std::memory_order_seq_cst` in the C++ memory model


On the data-safety side, durability is usually ensured by periodic checkpoints plus a write-ahead log; both are provided natively by the WiredTiger engine layer.
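The checkpoint-plus-WAL recipe can be sketched in a few lines. TinyStore below is a toy, not WiredTiger: every mutation is appended to the log before being applied in memory, and recovery replays the log entries written after the last checkpoint.

```python
# Minimal sketch of durability via checkpoint + write-ahead log (WAL).
class TinyStore:
    def __init__(self):
        self.data = {}          # in-memory state (lost on crash)
        self.wal = []           # write-ahead log (assumed durable)
        self.checkpoint = {}    # last checkpointed snapshot (durable)
        self.ckpt_lsn = 0       # log position covered by the checkpoint

    def put(self, k, v):
        self.wal.append((k, v))  # log first...
        self.data[k] = v         # ...then apply in memory

    def take_checkpoint(self):
        self.checkpoint = dict(self.data)
        self.ckpt_lsn = len(self.wal)

    def recover(self):
        """Simulate restart: rebuild state from checkpoint + log tail."""
        self.data = dict(self.checkpoint)
        for k, v in self.wal[self.ckpt_lsn:]:
            self.data[k] = v

store = TinyStore()
store.put("a", 1)
store.take_checkpoint()
store.put("a", 2)
store.put("b", 3)
store.data.clear()   # "crash": in-memory state is lost
store.recover()      # state comes back from checkpoint + WAL tail
```

The checkpoint also bounds recovery time: only the log tail after ckpt_lsn needs replaying, and older log can be discarded.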



Why we need it


Wherever there are replicas there are reads and writes, and inevitably the question of whether the latest data can be read; that is the consistency problem. Some services require strong consistency, always reading the latest data, while others can relax it. Eventual consistency means all replicas will converge to the same data within some period of time; such implementations are much less complex than strong consistency.





What did Mongo do


On consistency, I have to confess a long-standing misunderstanding of my own: it turns out that the "majority" in Mongo is not the quorum we usually mean!

Cassandra and its C++ counterpart ScyllaDB are based on the Amazon Dynamo paper, which describes the quorum model: with N replicas, if writes go to W nodes and reads query R nodes such that W + R > N (for example, write majority and read majority), a read is guaranteed to see the latest written data. Mongo also has a "majority", but its meaning is completely different.


In Mongo, a client can write only to the master (primary), never to a slave (secondary); this differs from leader-less systems, where all nodes are peers. Slaves pull data from the master. Both master and slaves maintain a majority-committed point in time, which advances as writes are committed on a majority of nodes;


If a client specifies readConcern: majority, it reads data at the majority-committed point, i.e. data that a majority of nodes have already committed and that therefore cannot be rolled back.
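The majority-committed point can be modeled as simple arithmetic over replication progress. This is a toy, not Mongo's implementation: given the log position each member has applied, the commit point is the highest position that a majority of members have reached.

```python
# Toy model of the majority commit point in a replica set.
def majority_commit_point(replicated_upto, n_members):
    """replicated_upto: log position each member (incl. primary) has applied.
    Returns the highest position replicated on a strict majority."""
    needed = n_members // 2 + 1
    positions = sorted(replicated_upto, reverse=True)
    # The needed-th highest position is reached by at least `needed` members.
    return positions[needed - 1]

# 5 members: primary at position 10, secondaries lagging behind.
upto = [10, 9, 7, 3, 2]
```

With three of five members at position 7 or beyond, a readConcern: majority read in this model would see everything up to position 7 and nothing newer.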


Mongo transactions support snapshot isolation, meaning a transaction reads at the last stable point in time. The data may be slightly old, but it is mutually consistent, and reads do not conflict with writes.
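Snapshot reads are easy to picture with a tiny multi-version sketch (a toy, not WT's MVCC): each key keeps (timestamp, value) versions, and a transaction reads as of its snapshot timestamp, ignoring later writes, so it sees a consistent if slightly stale view without blocking writers.

```python
# Minimal multi-version sketch of a snapshot read.
versions = {
    "x": [(1, "x@1"), (5, "x@5")],   # (commit_timestamp, value) pairs
    "y": [(1, "y@1"), (7, "y@7")],
}

def read_at(versions, key, snapshot_ts):
    """Return the newest value of key with timestamp <= snapshot_ts."""
    visible = [v for ts, v in versions[key] if ts <= snapshot_ts]
    return visible[-1] if visible else None

# A transaction snapshotted at ts=5 sees x@5, but y@7 is invisible to it.
```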



Replication and Fault Tolerance



background


In distributed systems, replication is an important and common way to improve availability. In a complex distributed environment some component will crash, hang, or stop responding; to avoid affecting user requests, those requests must be forwarded to healthy nodes, and that requires multiple copies of the data; otherwise how would you access the data held by a failed node?


Fault tolerance is a classic topic; distributed failures come in all flavors: software, hardware, human. In a typical single-master system, losing the master node blocks user reads and writes, so a new master must be elected within a short time after the old one disappears, ideally without users noticing at all.



Why we need it


As mentioned earlier, ensure system availability and data security.



What did Mongo do


Mongo is a single-master system; writes can go only to the master node, so it has an election mechanism that relies on the Raft-like protocol described earlier. This provides fault tolerance;


For replication, the slave node pulls the oplog from the master. The oplog can be understood as the log in Raft: it reflects the mutations on the master, and the slave applies it locally to reach the same state as the master.


Very detailed instructions are available in the official source code [12].




The kernel





I have not worked on the kernel for long; consider what follows rough notes offered to start the discussion.


The kernel splits into a Server layer and a Storage Engine layer. Because my exposure to the Server layer is incomplete, I will talk only about the engine layer.



The storage engine



Here’s a document [11] generated by Doxygen that’s worth checking out.


Engine-layer technology is the core of a database system; it is where the fundamental principles of a database are realized. First, understand that data can be organized in many ways, and which way is best can hardly be said before the code is written and measured.


Obviously we want pluggability here: the database layer (the part that handles SQL, CQL, query optimization, execution plans, and so on) should be able to plug into multiple storage engines flexibly, so that in the end we can compare which is better. The engine layer must therefore be very independent and expose a primitive interface for the upper layer to call; this is layering, a core idea in computing, applied to databases.

MongoDB's default engine has been WiredTiger since the 3.x series, and there seems to have been no official plan to ship RocksDB compatibility code, so MongoRocks remains a third-party project; there is also an in-memory engine.



WiredTiger


This is abbreviated WT [8]. WT was started by Michael Cahill, was acquired by MongoDB in 2014, and has been Mongo's default storage engine ever since. A basic introduction to WT is here [2].

WT is primarily a KV storage engine, the same category as RocksDB, but far less well known, probably because it is relatively niche; it seems to be used only by Mongo, and the code is honestly not very readable.


The engine's index implementation is a B-tree, not a B+ tree. There is plenty of discussion online about why:


1. Mongo prioritizes point-query performance over range-query performance; unlike a B+ tree, which must descend to a leaf on every lookup, a B-tree can stop at an internal node, so the average path is shorter;

2. It optimizes for read-heavy, write-light workloads;

3. Others.



Use of WT API



Mongo's use of WT boils down to a few basic calls:


1. Create a connection to CONN


wiredtiger_open(home, NULL, "create,cache_size=**,transaction_sync=**,checkpoint_sync=**,...", &conn)


This is called at startup, producing a WT_CONNECTION handle for the DB that is kept as a private member of WiredTigerKVEngine.


2. Create the session


Every operation in Mongo runs in a session context; the session at the document level corresponds to a WT_SESSION at the engine layer. To use sessions efficiently there is a session cache, instead of opening a new session every time.

conn->open_session(conn, NULL, "isolation=**", &session)


3. Create a table or index


When the mongo layer implements createCollection/createIndex, namely:

session->create(session, "table:access", "key_format=S,value_format=S")

4. Create a cursor on the session


session->open_cursor(session, "table:mytable", NULL, NULL, &cursor)

5. To support transactions, enable transactions on the session

session->begin_transaction(session, "isolation=**,read_timestamp=**,sync=**,...")

6. Use cursor set/get key/value


The JSON the user sees, and the BSON the Mongo server layer sees, are ultimately turned into (key, value) pairs at the bottom:

cursor->set_key(cursor, "key")

cursor->set_value(cursor, "value")

cursor->update(cursor);

7. Commit/rollback the transaction


session->commit_transaction(session, "commit_timestamp=**,durable_timestamp=**,sync=**,...")

session->rollback_transaction(session, NULL);

A few clarifications for the above steps:


· WT API calls all follow this style; note the char* config argument, in which a=b pairs specify the various configuration parameters. It is a fairly primitive mechanism;

· Timestamp-related parameters are complicated and need an in-depth read of the documentation;

· For the meaning of the parameters, refer to [2].
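To make the seven-step call sequence above concrete without linking libwiredtiger, here is a purely hypothetical in-memory mock in Python: the class and method names mirror the WT call shape (open connection, open session, create table, open cursor, begin/commit transaction, set key/value), but none of this is the real WT API.

```python
# Toy in-memory mock of the WT call sequence; illustrates the API shape only.
class Cursor:
    def __init__(self, table):
        self.table, self.key = table, None
    def set_key(self, k):   self.key = k
    def set_value(self, v): self.table[self.key] = v   # like cursor->update
    def get_value(self):    return self.table[self.key]

class Session:
    def __init__(self, conn):
        self.conn, self.txn = conn, None
    def create(self, name):          # cf. session->create("table:...")
        self.conn.tables.setdefault(name, {})
    def open_cursor(self, name):     # cf. session->open_cursor
        return Cursor(self.conn.tables[name])
    def begin_transaction(self):  self.txn = "running"
    def commit_transaction(self): self.txn = "committed"

class Connection:                    # cf. wiredtiger_open
    def __init__(self):
        self.tables = {}
    def open_session(self):          # cf. conn->open_session
        return Session(self)

# The full sequence from the steps above:
conn = Connection()
session = conn.open_session()
session.create("table:mytable")
cursor = session.open_cursor("table:mytable")
session.begin_transaction()
cursor.set_key("key")
cursor.set_value("value")
session.commit_transaction()
```

The point to notice is the ownership chain: connection owns tables, sessions belong to a connection, cursors belong to a session, and transactions are scoped to a session; that is exactly the structure the real calls walk through.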



Timestamp mechanism



According to official documents and talks [14], the introduction of logical sessions in 3.6 and the addition of a timestamp field to WT's update structure paved the way for transactions and distributed transactions.

I picked up some WT timestamp concepts while working on MongoRocks transaction support, but I cannot yet describe systematically how the various timestamps interact. Refer to [2] for the details; I will not cover it here.



MongoRocks



The name obviously relates to RocksDB, and the idea is natural: since the bottom layer plugs in a KV engine and RocksDB is a KV store, it can be wired in, just like MyRocks. The project [3] has 300+ stars and was originally implemented for MongoRocks 3.2 and 3.4 by Igor Canadi and others. It was then on hold for a while until, a few months ago, Igor Canadi accepted wolfkdy's MR [16] for MongoRocks 4.0, in which I participated with related PR submissions such as [4].

The 4.0 mongo-rocks driver layer work is focused mainly on the transaction part. As Igor said, after 3.6.x Mongo's internal transaction machinery took a big leap, and doing 4.0 properly requires a lot of effort [5].

MongoRocks 4.0 has just come out, so it still needs time to stabilize. For example, I earlier found a problem with gaps when reading the oplog [13], which the author has since fixed [15]. I am really looking forward to RocksDB being wired into Mongo; I believe it will shine in places WT does not! I plan to spend more time on this and hope more developers will join!



Kernel GDB debugging



For a large codebase, single-stepping through GDB is no way to learn; stepping is only suited to debugging bugs. So why talk about GDB here? To get the runtime call path!

I have always believed that when you pick up a large C++ project, besides staring at the code for a long time, GDB's bt is a great tool for understanding control flow: add a breakpoint on the server side, send a command from the client, then run bt, and you immediately see the core path the server takes. Very convenient!

Highlight: please use GDB >= 8.x. The upside is that bt output is colorized and much more comfortable to read than before.



Here’s how to use it in general.


Start a replica set or sharded cluster (whichever you care about) and configure the master as follows:


cfg=rs.conf(); cfg.settings.heartbeatTimeoutSecs=3600; cfg.settings.electionTimeoutMillis=3600000; rs.reconfig(cfg)


Say we want to debug the master. To prevent failover during debugging, increase the heartbeat and election timeouts so the master stays master (of course, if you want to debug the election code itself, do not do this).


Suppose we want to trace the insert request path:


Look at the code and search for insert; you will find CmdInsert. A closer look shows it inherits a base class with a run method. An experienced developer can now guess: run is probably called when the server receives an insert request!


So we can add a breakpoint at run, or grep for insertRecords and conclude that documents are probably inserted there. With breakpoints in place we capture a backtrace:



The backtrace from start_thread in libc down through run to insertRecords is long enough to show how the request travels.

Similarly for find, Update, and delete.

For transactions, grep for transaction and pick functions to break on: begin_transaction, commit_transaction, rollback_transaction are familiar names and good breakpoint targets.




conclusion






MongoDB's body of knowledge is huge; one article cannot cover it. To me the subject itself is fascinating, because it is a database, and a distributed system, with plenty of hard problems. Although the official license change means some cloud vendors can no longer offer the newer versions, I think that as long as it is open source and the code is there, it is still a good thing for engineers. From the shallow to the deep, starting now!



Author: Chen Shi


I am a technologist passionate about databases, distributed systems, and storage, with an interest in the Linux kernel and microprocessor architecture. Early in my career I worked with NoSQL databases such as Redis, Couchbase, and ScyllaDB; I now work on Mongo cloud database development at Tencent. In my spare time I enjoy climbing mountains, reading papers, and studying the humanities.


=== This is not an advertisement

The Tencent Cloud CMongo team is committed to building a polished MongoDB cloud service. Interested folks are welcome to join, or just reach out to chat, haha.

Email: [email protected]

References


[1]https://www.tutorialspoint.com/dbms/dbms_data_schemas.htm

[2]http://source.wiredtiger.com/

[3]https://github.com/mongodb-partners/mongo-rocks

[4] https://github.com/mongodb-partners/mongo-rocks/pull/153

[5] https://github.com/mongodb-partners/mongo-rocks/issues/145

[6] MartinKleppmann DDIA: Designing data-intensive Applications

[7]https://gitee.com/cshi/codes/dbzrmhvy4s87lnt60uc2f46

[8] https://github.com/wiredtiger/wiredtiger

[9]https://www.cnblogs.com/williamjie/p/10416294.html

[10] 4.0 transactions analyses: https://mongoing.com/archives/6102

[11] storage engine API: https://mongodbsource.github.io/doxygen/index.html

[12] Source code for replication: https://github.com/mongodb/mongo/blob/d8caa7410ce2642d1c67f31c330f6d82ba384495/src/mongo/db/repl/README.md

[13]https://github.com/mongodb-partners/mongo-rocks/issues/154

[14]https://www.youtube.com/watch?v=mUbM29tB6d8

[15]https://github.com/wolfkdy/rocksdb/commit/5eb8a67f955b4035d6c034e00f1bb7c6bb6f47d4

[16]https://github.com/mongodb-partners/mongo-rocks/pull/149


Thanks to Shanghai Jinmu Information Technology Co., Ltd., a leading domestic MongoDB and CDN (Akamai) service provider, for its strong support of this essay contest!


Mongoing Chinese Community (mongoing.com), founded in 2014, is the officially recognized MongoDB Chinese community in Greater China. Thanks to the continuous efforts of community volunteers, it now has more than 20,000 online and offline members. The community consists of blogs, offline activities, technical Q&A, forums, official documentation translation, and other sections. As of 2020, the community has held dozens of offline events with 100+ attendees each, published more than 100 high-quality articles on MongoDB applications, and works with more than 20 partner organizations.

The Chinese community’s vision is: to create an active mutual assistance platform for the majority of MongoDB Chinese enthusiasts; Promote MongoDB to become the preferred solution of enterprise database application; Gather MongoDB development, database, operation and maintenance experts to build the most authoritative technical community.

Mongoing Chinese community public account: Mongoing – Mongoing

Mongoing Chinese community mongoing.com/



Shanghai Jinmu Information Technology Co., Ltd. is a leading domestic MongoDB database service provider and an official partner of MongoDB vendors.

Jinmu Information has always pursued solid, down-to-earth progress in the field of data technology, becoming an emerging technical force in the domestic MongoDB space. Our clients span finance, telecom, retail, aviation, and other industries, and we help users complete the smooth transition from traditional IT architecture to Internet architecture.

Since 2018, Jinmu Information has established a good cooperative relationship with the MongoDB Chinese community, and is committed to jointly creating a prosperous MongoDB ecological environment.

Shanghai Jinmu Information Technology Co., LTD www.jinmuinfo.com/