Why high availability?

When we used MySQL in the past, we often had one server doing everything. For learning purposes that is fine, but in a production environment the risk is very high: if that single machine fails, the service becomes unavailable and data may even be lost.

So what should be done? Since one machine is not enough, we can use more, say two, have them act as primary and backup for each other and synchronize data between them. In one word: steady.

In fact, Redis, MongoDB, Kafka, and other distributed applications are all built on much the same idea.

MongoDB takes the same approach. A MongoDB replica set consists of a group of mongod processes, including one Primary node and multiple Secondary nodes. The MongoDB Driver (client) writes all data to the Primary, and the Secondaries synchronize the data written to the Primary, ensuring that all members of the replica set store the same data set and providing high availability.

To become the primary node, a member must be approved by a majority of the nodes, and a majority means more than half of the members of the replica set.

Total members    Majority    Failure tolerance
1                1           0
2                2           0
3                2           1
4                3           1
5                3           2
6                4           2
7                4           3
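In code form, the majority of a set of n voting members is floor(n/2) + 1; here is a small JavaScript sketch (the helper names are mine, not MongoDB's):

function majority(total) {
    // a majority is more than half of the voting members
    return Math.floor(total / 2) + 1;
}

function failureTolerance(total) {
    // how many members can fail while a majority can still be formed
    return total - majority(total);
}

// majority(5) returns 3 and failureTolerance(5) returns 2, matching the table above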

Why require a majority? To avoid ending up with two primary nodes. For example, in a five-node replica set, suppose three members become unavailable while the remaining two keep working. Those two working nodes cannot satisfy the majority requirement of the replica set (at least three nodes are needed in this case), so they cannot elect a primary node; even if one of them was the primary, it will step down and become a backup node once it notices that it cannot reach a majority.

If those two nodes were allowed to elect a primary, the problem is that the other three nodes may not actually be down, but merely unreachable. Those three nodes, forming a majority, would certainly be able to elect a primary as well, and there would then be two primary nodes. Requiring a majority avoids producing two primary nodes.

If a replica set could have multiple primary nodes, it would face write conflicts. Systems that allow conflicting writes (think of multithreaded programming) generally resolve them in one of two ways: manual resolution, or letting the system pick a winner. Neither approach is easy to implement, and neither guarantees that data written on one node will not be modified by another. So MongoDB supports only a single primary node, which makes development easier.

When a backup node cannot communicate with the primary node, it contacts the other members of the replica set and asks them to elect it as primary. The other members perform several sanity checks: Can they themselves still communicate with the primary? Is the data of the candidate up to date? Is there another member with a higher priority that should be elected instead? If the candidate gets votes from a majority of the members, it becomes the primary node. But if even one member vetoes, the election is cancelled.

Large negative numbers can show up in the logs because one veto counts as 10,000 no votes. If there are two yes votes and two vetoes, the election result is 2 - 2 * 10000 = -19998, and so on.

Configuration options

We typically deploy a replica set with at least 3 nodes (because that tolerates 1 failure), which means the data is stored in three copies.
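For instance, a minimal three-member replica set could be initiated in the mongo shell like this (the replica set name and hostnames are placeholders):

rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "mongo1.example.com:27017" },
    { _id: 1, host: "mongo2.example.com:27017" },
    { _id: 2, host: "mongo3.example.com:27017" }
  ]
});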

Arbiters (arbiter)

Many people do not want to keep three copies of their data; two are enough for them, and a third feels like a waste. MongoDB supports this kind of deployment too. It has a special member called an arbiter whose only role is to take part in elections: it stores no data and serves no clients, it simply helps a replica set with only two data members satisfy the majority condition.
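An arbiter can be added to an existing replica set with rs.addArb (the host below is a placeholder):

rs.addArb("arbiter.example.com:27017");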

Arbiters have drawbacks. If one data node really fails (and its data cannot be recovered), the other member becomes the primary node. For data safety, a new backup node is then needed, and the primary's data must be copied to it. Copying the full data set puts a lot of strain on the server and can slow down the application. By contrast, with three data members, even if one fails there is still a primary node and a backup node, so normal operation is unaffected, and the remaining backup node can be used to initialize a new backup server without involving the primary. So, if possible, use an odd number of data members in the replica set instead of an arbiter.

Priority (priority)

If I want a node to have a better chance of becoming the primary, I need to set its priority. For example, here I add a member whose priority is 2 (the default is 1).

rs.add({"_id": 4, "host": "10.17.28.190:27017", "priority": 2});

Assuming all other members keep the default priority, as long as 10.17.28.190 has the latest data, the current primary node automatically steps down and 10.17.28.190 is elected as the new primary. If its data is not up to date enough, the current primary remains unchanged.

If priority is set to 0, the node will never be elected primary.

Votes (votes)

A replica set can have up to 50 members but at most 7 voting members, so the votes of the remaining members must be set to 0 (and their priority must also be 0). Although non-voting members cannot vote in elections, they still hold copies of the replica set's data and can serve reads from client applications.
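A sketch of stripping a member's vote via reconfig (the member index 8 is hypothetical):

cfg = rs.conf();
cfg.members[8].votes = 0;     // this member no longer votes
cfg.members[8].priority = 0;  // a non-voting member must also have priority 0
rs.reconfig(cfg);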

Hidden Members

Clients do not send requests to hidden members, and hidden members are not used as replication sources (unless no other replication source is available). So many people hide their less powerful servers or backup servers. You hide a member by setting hidden: true; only members whose priority is 0 can be hidden. A hidden node can be used for data backups or offline computation without affecting the replica set's service.
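A sketch of hiding a member (the member index 2 is hypothetical):

cfg = rs.conf();
cfg.members[2].priority = 0;  // only priority-0 members can be hidden
cfg.members[2].hidden = true;
rs.reconfig(cfg);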

Delayed Backup node (slaveDelay)

Data can be irreversibly damaged by human error. To guard against such problems, you can use slaveDelay to set up a delayed backup node.

A delayed backup node applies writes later than the primary node by a configured number of seconds. slaveDelay requires a priority of 0. If your application routes read requests to backup nodes, the delayed backup node should be hidden as well, so that read requests are never routed to it.

The Delayed node's data lags behind the Primary's by that interval, so when wrong or invalid data has been written to the Primary, the Delayed node's data can be used to restore a previous point in time.
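A sketch of configuring a one-hour delayed member (the member index 3 and the delay value are hypothetical):

cfg = rs.conf();
cfg.members[3].priority = 0;       // required for a delayed member
cfg.members[3].hidden = true;      // keep application reads away from stale data
cfg.members[3].slaveDelay = 3600;  // lag one hour behind the Primary, in seconds
rs.reconfig(cfg);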

Modify the replica set configuration

For example, if I have a replica set named rs0 and I want to add or remove members, or change member configuration (votes, hidden, priority, etc.), I can use the reconfig command.

Docs.mongodb.com/manual/refe…

cfg = rs.conf();
cfg.members[1].priority = 2;
rs.reconfig(cfg);

Synchronization

Data is synchronized between the Primary and Secondaries through the oplog. After a write operation completes on the Primary, an oplog entry is written to the special collection local.oplog.rs, and the Secondaries continuously fetch new oplog entries from the Primary and apply them.

Since the oplog would otherwise grow without bound, local.oplog.rs is set up as a capped collection: when it reaches the configured maximum size, the oldest entries are deleted. During replication, a member applies the data first and then writes the oplog entry, so the same entry may be applied more than once; the oplog must therefore be idempotent, that is, applying an entry repeatedly produces the same result.
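For example, an operation such as $inc is not idempotent, so the oplog records the resulting value as a $set instead. The entry below is illustrative only (the exact oplog layout varies by MongoDB version):

db.coll.update({ _id: 1 }, { $inc: { count: 1 } });
// the oplog records the result, not the increment, roughly:
// { "op": "u", "ns": "test.coll", "o2": { "_id": 1 }, "o": { "$set": { "count": 2 } } }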

After I insert a document into the coll collection of the test database (db.coll.insert({count: 1})), I call db.isMaster() to see the last write timestamp of the current node.

> db.isMaster()
{
  "ismaster" : true,
  "secondary" : false,
  "lastWrite" : {
    "opTime" : {
            "ts" : Timestamp(1572509087, 2),
            "t" : NumberLong(1)
    },
    "lastWriteDate" : ISODate("2019-10-31T08:04:47Z"),
    "majorityOpTime" : {
            "ts" : Timestamp(1572509087, 2),
            "t" : NumberLong(1)
    },
    "majorityWriteDate" : ISODate("2019-10-31T08:04:47Z")
  }
}

The command returns a lot of data; I have only listed part of it here. If the current node is not the primary, the primary field tells you which node the current primary is. The timestamp of the last write here is 1572509087.

Now we log in to a secondary node, switch to the local database, and execute db.oplog.rs.find(); many documents are returned. Let's look at the last one.

{
  "ts" : Timestamp(1572509087, 2),
  "t" : NumberLong(1),
  "h" : NumberLong("6139682004250579847"),
  "v" : 2,
  "op" : "i",
  "ns" : "test.coll",
  "ui" : UUID("1be7f8d0-fde2-4d68-89ea-808f14b326da"),
  "wall" : ISODate("2019-10-31T08:04:47.925Z"),
  "o" : {
    "_id" : ObjectId("5dba959fcf287dfd8727a1bf"),
    "count" : 1
  }
}

We can see that the oplog entry's ts matches the lastWrite.opTime.ts returned by isMaster(), proving that our data is up to date. You can check the oplog on the other nodes at this point in the same way. Let's explain the field meanings (a query example follows the list):

  • ts: the operation time, a current timestamp plus a counter; the counter is reset every second
  • h: a globally unique identifier of the operation
  • v: the oplog version
  • op: the operation type
    • i: insert
    • u: update
    • d: delete
    • c: a command, such as createDatabase or dropDatabase
    • n: a no-op, for special uses
  • ns: the collection the operation acts on
  • o: the operation content; here you can see the inserted field count with value 1
  • o2: the query condition of the operation; only update operations contain this field
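For example, to fetch the most recent insert into test.coll from the oplog (run against the local database):

use local
db.oplog.rs.find({ op: "i", ns: "test.coll" }).sort({ $natural: -1 }).limit(1);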

Initial synchronization

When a member of a replica set starts up, it checks its own state to determine whether it can sync from another member. If it cannot, it attempts a full copy of the data from another member of the replica set. This process is known as initial syncing.

The init sync process consists of the following steps

  1. Preparation: delete all existing databases and start the sync from a clean state
  2. Clone all records from the source to the local node (everything except the local database)
  3. First oplog sync: apply all the operations that happened during the clone. If a document was moved during the clone it may have been skipped and not cloned; in that case it has to be cloned again
  4. Second oplog sync: apply the operations that happened during the first oplog sync
  5. Create indexes
  6. If the node's data is still far behind the source, a final oplog sync applies the operations that happened during index creation, before the member is allowed to become a backup node
  7. When initial sync finishes, the member switches to the normal sync state and becomes a backup node

After initial sync completes, the Secondary continuously fetches the newest entries from the Primary's local.oplog.rs collection and applies them.

Capped collections are queried with a tailable cursor (docs.mongodb.com/manual/core…).
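A rough sketch of tailing the oplog with a tailable cursor in the legacy mongo shell (run against the local database; this is for illustration only, the real sync protocol is more involved):

// start from the newest entry currently in the oplog
var last = db.oplog.rs.find().sort({ $natural: -1 }).limit(1).next().ts;

// a tailable, await-data cursor keeps returning new entries as they are written
var cursor = db.oplog.rs.find({ ts: { $gt: last } })
                        .addOption(DBQuery.Option.tailable)
                        .addOption(DBQuery.Option.awaitData);

while (cursor.hasNext()) {
    printjson(cursor.next());  // apply each new oplog entry here
}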

Primary election

In addition to the Primary election that occurs when the replica set is initialized, elections also occur in the following scenarios:

  1. Replica set reconfig
  2. The Secondary node detects that the Primary is down and triggers the election of a new Primary
  3. When a Primary node actively stepDown (actively demoted to Secondary), a new Primary election will also be triggered

The Primary election is affected by various factors, such as the heartbeat between nodes, priority, and latest Oplog time.

Internode heartbeat

By default, replica set members send each other a heartbeat every 2s. If no heartbeat is received from a node for 10s, that node is considered down. If the node is the Primary, a Secondary (one eligible to become Primary) initiates a new Primary election.

The purpose of the heartbeat is to know the status of the other members: which one is the primary node, which ones can be used as a synchronization source, which ones are down, and so on.
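The 10s timeout is adjustable through the replica set settings; a sketch:

cfg = rs.conf();
cfg.settings.heartbeatTimeoutSecs = 10;  // seconds without a heartbeat before a member is considered down
rs.reconfig(cfg);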

Member statuses (see the rs.status() sketch after this list):

  1. STARTUP: after the replica set configuration is loaded successfully, the member enters STARTUP2
  2. STARTUP2: the whole initial sync happens in this state; MongoDB creates a few threads to handle replication and elections, then switches to RECOVERING
  3. RECOVERING: the member is operating correctly but temporarily cannot serve read requests; a member may also drop into this state under mild system overload
  4. ARBITER: the state of an arbiter
  5. DOWN: a member that was reachable but has become unreachable is reported as DOWN; this may simply be a network problem
  6. UNKNOWN: if a member has never been reachable, the other members cannot tell what state it is in and report UNKNOWN; it may mean the member is down, or that there is a network problem between the two members
  7. REMOVED: the state of a member that has been removed from the replica set; once added back, it returns to its normal state
  8. ROLLBACK: the member is rolling back data; when the rollback completes it switches to RECOVERING and then becomes a backup node
  9. FATAL: an irrecoverable error occurred and the member has stopped trying to recover; it is usually time to restart the server
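You can see each member's current state with rs.status(); a quick sketch:

rs.status().members.forEach(function (m) {
    print(m.name + " -> " + m.stateStr);  // e.g. PRIMARY, SECONDARY, ARBITER
});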

Node priority

  1. Each node tends to vote for the node with the highest priority
  2. A node whose priority is 0 does not initiate a Primary election
  3. When the Primary finds a Secondary with a higher priority whose data lags behind by no more than 10s, the Primary actively steps down, so that the higher-priority Secondary has a chance to become the Primary.

OpTime

Only the node with the latest optime (the timestamp of the most recent oplog entry) can be elected primary. See the analysis of oplog.rs above.

Network partition

A node can be elected Primary only when it can connect to a majority of the voting nodes. If the Primary becomes disconnected from the majority, it actively steps down to Secondary. When a network partition occurs, multiple Primaries may exist for a short time, so it is recommended that the Driver use a majority write concern: even if there are multiple Primaries, only one of them can successfully write to the majority.

Read and write settings for a replica set

Read Preference

By default, all Read requests to a replica set are sent to the Primary; the Driver can route reads to other nodes by setting a Read Preference (a shell example follows the list).

  1. primary: the default rule; all read requests go to the Primary
  2. primaryPreferred: prefer the Primary; if it is unreachable, read from a Secondary
  3. secondary: all read requests go to Secondaries
  4. secondaryPreferred: prefer Secondaries; if none is reachable, read from the Primary
  5. nearest: send reads to the nearest reachable node (measured by ping time)
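For example, in the mongo shell the read preference can be set for the connection or per query (the collection name is a placeholder):

db.getMongo().setReadPref("secondaryPreferred");  // connection-wide
db.coll.find().readPref("nearest");               // per query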

Write Concern

By default, a write returns as soon as it completes on the Primary; the Driver can define what counts as a successful write by setting a Write Concern.

The following Write Concern requires the write to succeed on a majority of the nodes, with a timeout of 5s.

db.products.insert(
  { item: "envelopes", qty: 100, type: "Clasp" },
  { writeConcern: { w: "majority", wtimeout: 5000 } }
)

The settings above apply to a single request. You can also change the replica set's default write concern, so it does not have to be set on every request.

cfg = rs.conf()
cfg.settings = {}
cfg.settings.getLastErrorDefaults = { w: "majority", wtimeout: 5000 }
rs.reconfig(cfg)

Rollback

Suppose the Primary executes a write request and then goes down before any backup node has had time to replicate the operation. The newly elected primary node will be missing that write, and when the old Primary comes back up, some of its operations must be rolled back.

For example, suppose a replica set spans two data centers: DC1 contains two nodes, A (the Primary) and B, and DC2 contains three nodes, C, D, and E. Now DC1 loses connectivity. The last operation in DC1 has optime 126, but 126 was never replicated to the other data center, so the latest operation on the DC2 servers is 125.

DC2's nodes still satisfy the replica set's majority requirement (3 of the 5 members), so one of them is elected as the new primary node and continues to handle subsequent writes. When the network recovers, the servers in DC1 try to sync the operations after 126 from the other servers but cannot find them. At that point, A and B in DC1 roll back.

The rollback undoes operations that were not replicated before the failure. The server holding operation 126 searches the oplogs of the DC2 servers for a common point and locates 125, the last operation on which the two data centers agree.

At this point, the server examines the operations that were never replicated and writes the documents affected by them to .bson files in the rollback directory under the data directory.

If 126 was an update operation, the server writes the document updated by operation 126 to a collectionName.bson file. To restore the rolled-back operations, use the mongorestore command.
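A restore might look like the following; the file path and name are hypothetical (rollback files are named after the collection plus a timestamp):

mongorestore --db test --collection coll /data/db/rollback/test.coll.2019-11-01T02-04-47.0.bson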