Source: Hang Seng LIGHT Cloud Community

As a system grows, it is usually deployed in a distributed way and a lot of middleware is introduced along the way, such as Redis, message queues, and big-data storage. The core data, however, is still kept in databases, which raises the problem of synchronizing data between multiple databases in real time. Real-time data synchronization middleware is needed to solve this problem.

Introduction to Canal

Canal is an open-source incremental subscription and consumption component from Alibaba, based on the MySQL binlog. Through it, you can subscribe to the database binlog and consume the changes for purposes such as database mirroring, data heterogeneity, search indexing, and cache updates. Compared with a plain message queue, this mechanism preserves the ordering and consistency of the data.

Design background

In the early days, Alibaba's B2B business was deployed across two machine rooms, in Hangzhou and the United States, so there was a business need for cross-machine-room synchronization. Early database synchronization mainly relied on triggers to capture incremental changes. Starting in 2010, Alibaba gradually moved to parsing database logs to obtain incremental changes for synchronization, which gave rise to the incremental subscription & consumption business and opened a new era.

At present, internal synchronization supports log parsing for MySQL 5.x and some versions of Oracle.

Businesses supported by log-based incremental subscription & consumption:

  1. Database mirroring
  2. Real-time database backup
  3. Multi-level indexes (separate indexes for sellers and buyers)
  4. Search index building
  5. Business cache refresh
  6. Notification of price changes and other important business events

How it works

Implementation of MySQL master/slave replication

At a high level, replication consists of three steps:

  1. The master records changes in its binary log (these records are called binary log events and can be viewed with show binlog events; see the sketch after this list);
  2. The slave copies the master's binary log events to its own relay log;
  3. The slave replays the events in the relay log so that the changes are reflected in its own data.
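
A quick way to see these binary log events for yourself is to query the master directly over JDBC. The sketch below is a minimal example, assuming MySQL Connector/J is on the classpath, a local MySQL instance with the binlog enabled, and an account with the necessary replication privileges; the host, user, and password are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class BinlogInspector {
    public static void main(String[] args) throws Exception {
        // Placeholder connection settings; adjust host, user, and password.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://127.0.0.1:3306/?useSSL=false", "root", "password");
             Statement stmt = conn.createStatement()) {

            // Current binlog file and position on the master.
            try (ResultSet rs = stmt.executeQuery("SHOW MASTER STATUS")) {
                while (rs.next()) {
                    System.out.println(rs.getString("File") + " @ " + rs.getLong("Position"));
                }
            }

            // The binary log events themselves.
            try (ResultSet rs = stmt.executeQuery("SHOW BINLOG EVENTS LIMIT 10")) {
                while (rs.next()) {
                    System.out.println(rs.getString("Log_name") + " "
                            + rs.getString("Event_type") + " "
                            + rs.getString("Info"));
                }
            }
        }
    }
}
```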

How Canal works

The principle is relatively simple:

  1. Canal simulates the MySQL slave interaction protocol: it pretends to be a MySQL slave and sends a dump request to the MySQL master
  2. The MySQL master receives the dump request and starts pushing binary log events to the slave (i.e. Canal)
  3. Canal parses the binary log (originally a byte stream) into structured entry objects; the sketch after this list shows how a client consumes them
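
From the application's point of view, the result of this parsing is a stream of structured entries that can be pulled through Canal's Java client. The loop below is a minimal consumption sketch, assuming the canal client dependency is on the classpath and a canal server is listening on 127.0.0.1:11111 with the default "example" destination; all connection parameters are placeholders.

```java
import java.net.InetSocketAddress;
import java.util.List;

import com.alibaba.otter.canal.client.CanalConnector;
import com.alibaba.otter.canal.client.CanalConnectors;
import com.alibaba.otter.canal.protocol.CanalEntry;
import com.alibaba.otter.canal.protocol.Message;

public class SimpleCanalClient {
    public static void main(String[] args) throws Exception {
        // Placeholder address, destination, and credentials.
        CanalConnector connector = CanalConnectors.newSingleConnector(
                new InetSocketAddress("127.0.0.1", 11111), "example", "", "");
        try {
            connector.connect();
            connector.subscribe(".*\\..*");   // subscribe to all schemas and tables
            connector.rollback();             // start from the last acked position
            while (true) {
                Message message = connector.getWithoutAck(100); // fetch up to 100 entries
                long batchId = message.getId();
                List<CanalEntry.Entry> entries = message.getEntries();
                if (batchId == -1 || entries.isEmpty()) {
                    Thread.sleep(1000);                 // nothing new yet
                } else {
                    for (CanalEntry.Entry entry : entries) {
                        if (entry.getEntryType() != CanalEntry.EntryType.ROWDATA) {
                            continue;                   // skip transaction begin/end markers
                        }
                        CanalEntry.RowChange rowChange =
                                CanalEntry.RowChange.parseFrom(entry.getStoreValue());
                        System.out.println(entry.getHeader().getSchemaName() + "."
                                + entry.getHeader().getTableName() + " : "
                                + rowChange.getEventType());
                    }
                }
                connector.ack(batchId);                 // confirm consumption so the position advances
            }
        } finally {
            connector.disconnect();
        }
    }
}
```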

The underlying architecture

Description:

  • server represents one running canal instance and corresponds to one JVM
  • instance corresponds to one data queue (one server hosts 1..n instances)

Modules inside an instance:

  • eventParser (data source access; simulates the slave protocol to interact with the master and parses the protocol)
  • eventSink (the linker between parser and store; performs data filtering, processing, and distribution)
  • eventStore (data storage)
  • metaManager (incremental subscription & consumption information manager)

EventParser design

The entire Parser process can be divided into several steps:

  1. Connection obtains the position of the last successful parse (or, on first startup, the initially configured position or the database's current binlog position)
  2. Connection establishes a connection and sends the BINLOG_DUMP command (write 4 bytes of binlog position to start at, write 2 bytes of binlog flags, write 4 bytes of the slave's server id, write the binlog file name)
  3. MySQL starts pushing the binary log
  4. The received binary log is parsed by the binlog parser, which supplements field names, field types, primary key information, and unsigned type handling
  5. The data is passed to the EventSink module for storage; this is a blocking operation until storage succeeds
  6. After storage succeeds, the binary log position is recorded periodically
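
The steps above form a loop. The code below is only an illustrative sketch of that loop, built from hypothetical MetaManager, BinlogConnection, BinlogParser, and EventSink types; it is not Canal's actual implementation.

```java
// Illustrative only: hypothetical types sketching the EventParser loop described above.
public class EventParserSketch {

    interface MetaManager { LogPosition lastPosition(); void record(LogPosition p); }
    interface BinlogConnection { void dump(LogPosition from); byte[] nextPacket(); }
    interface BinlogParser { Event parse(byte[] rawPacket); }    // adds field names/types, PK info
    interface EventSink { void sink(Event e); }                  // blocks until the store accepts it
    static class LogPosition { String file; long offset; }
    static class Event { LogPosition position; }

    void run(MetaManager meta, BinlogConnection conn, BinlogParser parser, EventSink sink) {
        // 1. start from the last successfully parsed position (or an initial one)
        LogPosition start = meta.lastPosition();
        // 2. send BINLOG_DUMP; 3. MySQL starts pushing the binary log
        conn.dump(start);
        while (true) {
            byte[] raw = conn.nextPacket();
            // 4. parse the raw byte stream into a structured event
            Event event = parser.parse(raw);
            // 5. hand it to the sink; this blocks until storage succeeds
            sink.sink(event);
            // 6. record the binlog position after successful storage
            meta.record(event.position);
        }
    }
}
```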

EventSink design

Description:

  • Data filtering: supports wildcard filtering by schema name, table name, and field content
  • Data routing/distribution: solves 1:n (one parser feeding multiple stores)
  • Data merging: solves n:1 (multiple parsers feeding one store)
  • Data processing: performs extra processing, such as joins, before data enters the store
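
To make the filtering and 1:n routing responsibilities concrete, here is a purely illustrative sketch; the EventStore interface, the regex-based table filter, and the hash-based routing rule are all assumptions, not Canal's actual EventSink code.

```java
import java.util.List;
import java.util.regex.Pattern;

// Illustrative only: a hypothetical sink that filters by table name and routes to several stores.
public class EventSinkSketch {

    interface EventStore { void put(Object event); }

    private final Pattern tableFilter;        // e.g. "shop\\.order_.*" for wildcard-style filtering
    private final List<EventStore> stores;    // 1:n routing: one parser feeding several stores

    EventSinkSketch(String filterRegex, List<EventStore> stores) {
        this.tableFilter = Pattern.compile(filterRegex);
        this.stores = stores;
    }

    void sink(String schema, String table, Object event) {
        if (!tableFilter.matcher(schema + "." + table).matches()) {
            return;                           // data filtering: drop events we are not subscribed to
        }
        // Simple routing rule: pick a store by hashing the table name,
        // so changes to the same table always land in the same store (keeps per-table ordering).
        int index = Math.floorMod(table.hashCode(), stores.size());
        stores.get(index).put(event);
    }
}
```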

EventStore design

  • Only the in-memory (memory) mode has been implemented so far; a local file-based mode and a mixed mode will be added later
  • The implementation borrows the RingBuffer idea from Disruptor

RingBuffer design

Three cursors are defined:

  • Put: the position at which the Sink module last wrote data into the store
  • Get: the position up to which data has last been fetched by subscribers
  • Ack: the position of the last successfully consumed data
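
A minimal sketch of a ring buffer driven by these three cursors is shown below. It is illustrative only and far simpler than Canal's actual memory store (single-threaded, no batching, no concurrency control).

```java
// Illustrative only: a single-threaded ring buffer with Put/Get/Ack cursors.
public class RingBufferSketch<T> {

    private final Object[] slots;
    private long put = 0;   // last position written by the Sink module
    private long get = 0;   // last position handed out to a subscriber
    private long ack = 0;   // last position successfully consumed (acknowledged)

    public RingBufferSketch(int size) {
        this.slots = new Object[size];
    }

    // Sink side: returns false when the buffer is full, i.e. unacked data would be overwritten.
    public boolean put(T event) {
        if (put - ack >= slots.length) {
            return false;                       // wait until consumers ack more data
        }
        slots[(int) (put % slots.length)] = event;
        put++;
        return true;
    }

    // Subscriber side: fetch the next event without acknowledging it.
    @SuppressWarnings("unchecked")
    public T get() {
        if (get >= put) {
            return null;                        // nothing new to read
        }
        T event = (T) slots[(int) (get % slots.length)];
        get++;
        return event;
    }

    // Acknowledge everything up to the current get cursor; frees the slots for reuse.
    public void ack() {
        ack = get;
    }

    // Roll back un-acked reads, e.g. when a client reconnects; they will be re-delivered.
    public void rollback() {
        get = ack;
    }
}
```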

Instance design

An instance represents an actual running data queue and includes the EventParser, EventSink, EventStore, and other components.

A CanalInstanceGenerator abstraction is provided to accommodate different configuration management modes:

  • manager mode: connects to Alibaba's internal web console/manager system (currently used mainly inside the company)
  • spring mode: builds the Spring configuration from a Spring XML + properties definition

Server design

A server represents a running Canal instance; for componentization, two implementations are abstracted: Embedded and Netty.

  • Embedded: for users with high requirements on latency and availability who can handle the related distributed technology themselves (e.g. failover)
  • Netty: wraps a network protocol layer on top of netty, with availability guaranteed by the canal server. It uses a pull model, so latency takes a slight hit, but the impact is small. (Alibaba's notify and metaq are typical push/pull systems that are also gradually converging on the pull model; push has some problems when the data volume is large.)

HA mechanism

Canal's HA is divided into two parts: the canal server and the canal client each have a corresponding HA implementation.

  • canal server: to reduce the number of MySQL dump requests, only one instance across the different servers can be running at any time; the others stay in standby state.
  • canal client: to guarantee ordering, only one canal client at a time may perform get/ack/rollback operations on a given instance; otherwise clients cannot be guaranteed to receive data in order.

Control of the entire HA mechanism relies mainly on two ZooKeeper features: watchers and EPHEMERAL nodes (whose lifecycle is bound to the client session).

Canal Server:

General steps:

  1. When a canal server wants to start a canal instance, it first makes a try-start attempt against ZooKeeper (implementation: create an EPHEMERAL node; whichever server creates it successfully is allowed to start; a minimal sketch follows below).
  2. After creating the ZooKeeper node successfully, the corresponding canal server starts the canal instance; the servers that did not create the node stay in standby state.
  3. Once ZooKeeper detects that the node created by canal server A has disappeared, it immediately notifies the other canal servers to repeat step 1, and a new canal server is selected to start the instance.
  4. Every time the canal client connects, it first asks ZooKeeper which canal server is currently running the instance and then connects to it; if the connection fails, it tries again.

Similar to the canal server, the canal client achieves HA by preempting an EPHEMERAL node in ZooKeeper.
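
Below is a minimal sketch of the try-start logic from step 1 using the plain ZooKeeper client API: every server tries to create the same EPHEMERAL node, the winner starts the instance, and the losers watch the node so they can retry once the winner's session expires. The connection string and node path are placeholders, and the parent path is assumed to already exist.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class RunningNodeSketch implements Watcher {

    private static final String RUNNING_PATH = "/canal-ha/example/running"; // placeholder path

    private final ZooKeeper zk;
    private final String serverId;

    public RunningNodeSketch(String connectString, String serverId) throws Exception {
        this.serverId = serverId;
        this.zk = new ZooKeeper(connectString, 30000, this); // placeholder connect string/timeout
    }

    // Step 1: try to create the EPHEMERAL node; whoever succeeds starts the instance.
    public void tryStart() throws Exception {
        try {
            zk.create(RUNNING_PATH, serverId.getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            System.out.println(serverId + " is running, starting the canal instance");
            // ... start the instance here ...
        } catch (KeeperException.NodeExistsException e) {
            // Another server won: stay in standby and watch the node for NodeDeleted.
            zk.exists(RUNNING_PATH, true);
            System.out.println(serverId + " is in standby");
        }
    }

    // Step 3: when the running server's session expires, its EPHEMERAL node disappears,
    // the standby servers are notified, and they repeat the try-start attempt.
    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDeleted
                && RUNNING_PATH.equals(event.getPath())) {
            try {
                tryStart();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}
```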

Application scenarios

Data synchronization

  • In a microservice environment, NoSQL databases such as Redis and MongoDB, as well as full-text search services such as Solr and Elasticsearch, are used heavily to improve the efficiency and accuracy of searches. This brings a problem we need to think about and solve: data synchronization!
  • Canal can synchronize data from the database to Redis, MongoDB, or Solr/Elasticsearch in real time as it changes, as shown in the sketch below.
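
As a sketch of cache synchronization, the handler below builds on the client loop shown in the "How Canal works" section: for every row change on a hypothetical user table it invalidates the corresponding Redis key with Jedis. The table name, key format, primary key column, and Redis address are assumptions.

```java
import com.alibaba.otter.canal.protocol.CanalEntry;
import redis.clients.jedis.Jedis;

public class RedisCacheSync {

    private final Jedis jedis = new Jedis("127.0.0.1", 6379); // placeholder Redis address

    // Called for every parsed entry pulled from the canal server
    // (see the client loop in the "How Canal works" section).
    public void handle(CanalEntry.Entry entry) throws Exception {
        if (entry.getEntryType() != CanalEntry.EntryType.ROWDATA) {
            return;
        }
        // Hypothetical table of interest.
        if (!"user".equals(entry.getHeader().getTableName())) {
            return;
        }
        CanalEntry.RowChange rowChange = CanalEntry.RowChange.parseFrom(entry.getStoreValue());
        for (CanalEntry.RowData row : rowChange.getRowDatasList()) {
            String id = columnValue(row, rowChange.getEventType(), "id"); // hypothetical primary key
            if (id != null) {
                // Simplest strategy: delete the cached entry and let the next read repopulate it.
                jedis.del("user:" + id);
            }
        }
    }

    private String columnValue(CanalEntry.RowData row, CanalEntry.EventType type, String name) {
        // For DELETE the old values are in beforeColumns; otherwise use afterColumns.
        for (CanalEntry.Column col : type == CanalEntry.EventType.DELETE
                ? row.getBeforeColumnsList() : row.getAfterColumnsList()) {
            if (name.equals(col.getName())) {
                return col.getValue();
            }
        }
        return null;
    }
}
```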

Heterogeneous data

In large site architectures, the database layer solves capacity and performance problems by sharding into separate databases and tables, but this creates new problems: queries across different dimensions or aggregate queries, for example, become tricky. This is generally solved with a data heterogeneity mechanism: the tables that need to be joined for a query are aggregated, along a chosen dimension, into a single database dedicated to querying. Canal is one way to implement data heterogeneity.