Source: Hang Seng LIGHT Cloud Community
As a system grows, it is usually deployed in a distributed way and picks up a lot of middleware, such as Redis, message queues, and big-data storage. The core data, however, still lives in the database, which raises the problem of keeping multiple data stores synchronized in real time. Solving it requires real-time data synchronization middleware.
Introduction to Canal
Canal is Alibaba's open source component for incremental subscription and consumption based on the MySQL binlog. With it you can subscribe to a database's binlog and consume the changes for purposes such as data mirroring, heterogeneous data stores, index building, and cache updates. Compared with message queues, this mechanism preserves the ordering and consistency of the data.
Design background
In its early days, Alibaba's B2B business was deployed in two data centers, one in Hangzhou and one in the United States, so it needed cross-data-center synchronization. Early database synchronization obtained incremental changes mainly through triggers. Starting in 2010, Alibaba gradually moved to parsing database logs to obtain incremental changes for synchronization, and the incremental subscription & consumption business grew out of that work.
At present, internal synchronization supports log parsing for MySQL 5.x and some versions of Oracle.
Businesses supported by log-based incremental subscription & consumption:
- Database mirroring
- Real-time database backup
- Multi-level indexes (separate seller and buyer indexes)
- Search index building
- Service cache refresh
- Important business messages such as price changes
How it works
Implementation of MySQL master/slave replication
At a high level, replication is divided into three steps:
- The master records changes to its binary log (these records are called binary log events and can be viewed with show binlog events);
- The slave copies the master's binary log events to its relay log;
- The slave replays the events in the relay log, applying the changes to its own data.
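The three steps above can be sketched as a toy model. This is only an illustration of the data flow, not MySQL code: the Master and Slave classes, their fields, and the ("set", key, value) event shape are all hypothetical.

```python
# Toy model of the three replication steps; not MySQL code.
from dataclasses import dataclass, field


@dataclass
class Master:
    data: dict = field(default_factory=dict)
    binlog: list = field(default_factory=list)   # ordered binary log events

    def write(self, key, value):
        self.data[key] = value
        self.binlog.append(("set", key, value))  # step 1: record the change


@dataclass
class Slave:
    data: dict = field(default_factory=dict)
    relay_log: list = field(default_factory=list)
    copied: int = 0    # number of master events already copied
    applied: int = 0   # number of relay-log events already replayed

    def copy_from(self, master):
        # step 2: copy new binlog events into the relay log
        self.relay_log.extend(master.binlog[self.copied:])
        self.copied = len(master.binlog)

    def replay(self):
        # step 3: replay relay-log events in order
        for op, key, value in self.relay_log[self.applied:]:
            if op == "set":
                self.data[key] = value
        self.applied = len(self.relay_log)


master, slave = Master(), Slave()
master.write("price:1", 100)
master.write("price:1", 120)
slave.copy_from(master)
slave.replay()
```

Because the slave replays events in binlog order, it ends up with the same final value as the master even though the key was written twice.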
How Canal works
The principle is relatively simple:
- canal simulates the MySQL slave interaction protocol, pretends to be a MySQL slave, and sends the dump protocol to the MySQL master
- The MySQL master receives the dump request and starts pushing the binary log to the slave (i.e. canal)
- canal parses the binary log objects (originally a byte stream)
Underlying architecture
Description:
- A server represents one running canal instance and corresponds to one JVM
- An instance corresponds to one data queue (1 server corresponds to 1..n instances)
Instance modules:
- eventParser (data source access; simulates the slave protocol to interact with the master; protocol parsing)
- eventSink (the link between parser and store; handles data filtering, processing, and distribution)
- eventStore (data storage)
- metaManager (manager for incremental subscription & consumption information)
EventParser design
The whole parser process can be divided into several steps:
1. Connection gets the position where the last successful parse ended (on first start, the initially specified position or the database's current binlog position).
2. Connection establishes a link and sends the BINLOG_DUMP command:
   // 1. write 4 bytes bin-log position to start at
   // 2. write 2 bytes bin-log flags
   // 3. write 4 bytes server id of the slave
   // 4. write bin-log file name
3. MySQL starts pushing the Binary Log.
4. The received Binary Log is parsed by the Binlog parser, which supplements some information:
   // add field names, field types, primary key information, unsigned type handling
5. The result is passed to the EventSink module for storage. This is a blocking operation that returns only when storage succeeds.
6. After storage succeeds, the Binary Log position is recorded periodically.
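The four fields listed for BINLOG_DUMP in step 2 can be packed into a byte string as in the minimal sketch below. It assumes the layout of the MySQL client/server protocol, where the payload starts with the one-byte command code 0x12 (COM_BINLOG_DUMP), which the list above does not show, and where integers are little-endian. The function name, file name, position, and server id are hypothetical, and the sketch covers neither the packet header nor the handshake that precedes the command.

```python
import struct

COM_BINLOG_DUMP = 0x12  # command byte from the MySQL client/server protocol


def build_binlog_dump_payload(binlog_file, position, server_id, flags=0):
    """Pack the command byte plus the four fields listed above (little-endian)."""
    header = struct.pack(
        "<BIHI",
        COM_BINLOG_DUMP,  # 1 byte: command number
        position,         # 4 bytes: bin-log position to start at
        flags,            # 2 bytes: bin-log flags
        server_id,        # 4 bytes: server id of the (simulated) slave
    )
    return header + binlog_file.encode("ascii")  # remaining bytes: bin-log file name


# Hypothetical values, for illustration only.
payload = build_binlog_dump_payload("mysql-bin.000001", 4, 1234)
```

The file name goes last and has no length prefix, so it runs to the end of the payload.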
EventSink design
Description:
- Data filtering: supports wildcard filter patterns on table names, field contents, and so on
- Data routing/distribution: handles 1:n (one parser feeding multiple stores)
- Data merging: handles n:1 (multiple parsers feeding one store)
- Data processing: extra processing before data enters the store, such as a join
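Filtering and 1:n distribution can be illustrated with a small sketch. This is not Canal's implementation: make_sink, the event dictionaries, and the use of Python's fnmatch wildcard syntax are assumptions made for the example, and Canal's own filter syntax may differ.

```python
import fnmatch


def make_sink(patterns, stores):
    """Return a sink that filters events by 'schema.table' wildcard patterns
    and fans each accepted event out to every store (the 1:n case)."""
    def sink(event):
        name = f"{event['schema']}.{event['table']}"
        if not any(fnmatch.fnmatchcase(name, p) for p in patterns):
            return False           # data filtering: event is dropped
        for store in stores:
            store.append(event)    # data routing/distribution
        return True
    return sink


store_a, store_b = [], []
sink = make_sink(["shop.order_*"], [store_a, store_b])
kept = sink({"schema": "shop", "table": "order_01", "type": "INSERT"})
dropped = sink({"schema": "shop", "table": "user", "type": "UPDATE"})
```

The n:1 merge case is the mirror image: several parsers would call the same sink, which writes into a single store.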
EventStore design
- 1. Only the Memory (in-memory) mode is implemented so far. Local file storage and a mixed mode are planned for later.
- 2. The implementation borrows ideas from the RingBuffer in Disruptor.
RingBuffer design:
Three cursors are defined:
- Put: the position where the Sink module last wrote data into the store
- Get: the position where a data subscriber last fetched data
- Ack: the position of the last successfully consumed data
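The interplay of the three cursors can be sketched with a small fixed-size ring buffer. This is a simplified illustration, not Canal's code: the class and method names are hypothetical, there is no locking or blocking, and each cursor holds the next sequence number rather than the last one.

```python
class MemoryEventStore:
    """Fixed-size ring buffer with put/get/ack cursors (illustrative only)."""

    def __init__(self, size):
        self.size = size
        self.slots = [None] * size
        self.put_seq = 0   # next sequence number to write
        self.get_seq = 0   # next sequence number to hand to a subscriber
        self.ack_seq = 0   # everything below this is consumed and reusable

    def put(self, event):
        if self.put_seq - self.ack_seq >= self.size:
            return False   # full: writing would overwrite unacknowledged data
        self.slots[self.put_seq % self.size] = event
        self.put_seq += 1
        return True

    def get(self, batch_size):
        end = min(self.get_seq + batch_size, self.put_seq)
        batch = [self.slots[i % self.size] for i in range(self.get_seq, end)]
        self.get_seq = end
        return batch

    def ack(self):
        self.ack_seq = self.get_seq   # confirm everything fetched so far

    def rollback(self):
        self.get_seq = self.ack_seq   # re-deliver unacknowledged events


store = MemoryEventStore(4)
accepted = [store.put(f"e{i}") for i in range(5)]  # the fifth put is rejected
first = store.get(2)
store.rollback()           # consumer failed: Get moves back to Ack
again = store.get(2)       # the same two events are delivered again
store.ack()                # frees two slots
reused = store.put("e4")   # now succeeds, reusing slot 0
rest = store.get(10)
```

Keeping Ack separate from Get is what makes rollback possible: a slot is only reused once its event has been acknowledged.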
Instance design
An instance represents one actually running data queue, including the EventParser, EventSink, EventStore, and other components.
CanalInstanceGenerator is abstracted mainly to accommodate different ways of managing configuration:
- manager mode: connects to your own internal web console/manager system (currently used mainly inside the company)
- spring mode: defined with spring XML + properties, building the spring configuration
Server design
A server represents one running instance of canal. To make it easy to use as a component, two implementations are abstracted: Embedded and Netty (network access).
- Embedded: for users with high requirements on both latency and availability who can handle the distributed-systems concerns (such as failover) themselves.
- Netty: wraps a network protocol layer on top of netty, with the canal server guaranteeing availability. It uses a pull model, so latency is somewhat higher, though how much depends on the situation. (Alibaba's notify and metaq, typical push and pull models, are also gradually moving toward the pull model, because push has problems with large data volumes.)
HA mechanism
Canal's HA is divided into two parts, with separate HA implementations for the canal server and the canal client:
- canal server: to reduce the number of dump requests sent to MySQL, only one of the instances on different servers may be running at a time; the others stay in standby.
- canal client: to preserve ordering, only one canal client at a time may perform get/ack/rollback operations on an instance; otherwise the order in which clients receive data cannot be guaranteed.
Control of the whole HA mechanism relies mainly on two ZooKeeper features: watchers and EPHEMERAL nodes (which are bound to the session lifecycle).
Canal Server:
General steps:
1. Whenever a canal server wants to start a canal instance, it first makes a startup attempt against ZooKeeper (implementation: create an EPHEMERAL node; whoever creates it successfully is allowed to start).
2. After the ZooKeeper node is created successfully, the corresponding canal server starts the corresponding canal instance; a canal instance whose node was not created stays in standby.
3. Once ZooKeeper finds that the node created by canal server A has disappeared, it immediately notifies the other canal servers to repeat step 1 and elect a new canal server to start the instance.
4. Each time a canal client connects, it first asks ZooKeeper which server is currently running the canal instance and then connects to it. If the connection becomes unavailable, it retries the connect.
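The election steps above can be sketched with an in-memory stand-in for ZooKeeper. Everything here is hypothetical: FakeZooKeeper only imitates EPHEMERAL nodes and watchers and is not a real ZooKeeper client, and the node path and server names are made up for the example.

```python
class FakeZooKeeper:
    """In-memory stand-in for EPHEMERAL nodes and watchers (not a real client)."""

    def __init__(self):
        self.nodes = {}      # path -> owning session
        self.watchers = {}   # path -> callbacks fired when the node disappears

    def create_ephemeral(self, path, session):
        if path in self.nodes:
            return False     # someone else already holds the node
        self.nodes[path] = session
        return True

    def watch(self, path, callback):
        self.watchers.setdefault(path, []).append(callback)

    def get_owner(self, path):
        return self.nodes.get(path)

    def close_session(self, session):
        # EPHEMERAL nodes vanish with their session; watchers are then notified.
        for path in [p for p, s in self.nodes.items() if s == session]:
            del self.nodes[path]
            for callback in self.watchers.pop(path, []):
                callback()


class CanalServer:
    def __init__(self, name, zk, path):
        self.name, self.zk, self.path = name, zk, path
        self.running = False

    def try_start(self):
        # Step 1: whoever creates the EPHEMERAL node may start the instance.
        self.running = self.zk.create_ephemeral(self.path, self.name)
        if not self.running:
            # Step 2: stay in standby and wait for the node to disappear.
            self.zk.watch(self.path, self.try_start)


zk = FakeZooKeeper()
path = "/demo/instance-1/running"   # hypothetical node path
a, b, c = (CanalServer(n, zk, path) for n in "ABC")
for server in (a, b, c):
    server.try_start()
owner_before = zk.get_owner(path)   # step 4: a client asks who is running
zk.close_session("A")               # step 3: A's node disappears
owner_after = zk.get_owner(path)
```

After server A's session closes, the standby servers retry in the order they registered their watchers, so B takes over and C goes back to standby.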
Similar to the canal server, the canal client is also controlled by competing for a ZooKeeper EPHEMERAL node.
Application scenarios
Data synchronization
- In a microservice environment, NoSQL databases such as Redis and MongoDB, and full-text search services such as Solr and Elasticsearch, are used heavily to improve search efficiency and accuracy. This raises a problem that has to be solved: data synchronization.
- Canal can synchronize data from a changing database in real time to Redis, MongoDB, or Solr/Elasticsearch.
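As a minimal sketch of what a consumer might do with the changes it receives, the function below applies row changes to a cache. A plain dict stands in for Redis, and the change dictionaries (table, pk, type, row) are a hypothetical shape chosen for the example, not Canal's actual message format.

```python
def apply_row_change(cache, change):
    """Apply one row change to a key-value cache (a dict stands in for Redis)."""
    key = f"{change['table']}:{change['pk']}"
    if change["type"] == "DELETE":
        cache.pop(key, None)         # row removed: evict the cached copy
    else:                            # INSERT or UPDATE
        cache[key] = change["row"]   # overwrite with the latest row image


cache = {}
for change in [
    {"table": "item", "pk": 1, "type": "INSERT", "row": {"id": 1, "price": 100}},
    {"table": "item", "pk": 1, "type": "UPDATE", "row": {"id": 1, "price": 120}},
    {"table": "item", "pk": 2, "type": "INSERT", "row": {"id": 2, "price": 50}},
    {"table": "item", "pk": 2, "type": "DELETE", "row": None},
]:
    apply_row_change(cache, change)
```

Applying the changes strictly in order is what keeps the cache consistent with the database, which is why the ordering guarantee matters here.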
Heterogeneous data
In large site architectures, the DB is split into multiple databases and tables to solve capacity and performance problems, but this brings new problems: queries along a different dimension, or aggregate queries, become tricky. This is usually solved with a data heterogeneity mechanism: the tables that would otherwise have to be joined are aggregated along a chosen dimension into a single DB that can be queried directly. Canal is one way to implement data heterogeneity.
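As an illustration, suppose orders are sharded by buyer, so a query by seller would have to touch every shard. The sketch below aggregates changes from all shards into a single seller-keyed view. The shard names, row fields, and change format are hypothetical, and a dict stands in for the target DB.

```python
from collections import defaultdict


def build_seller_view(changes):
    """Aggregate order rows from all shards into one seller-keyed view."""
    view = defaultdict(dict)          # seller_id -> {order_id: row}
    for change in changes:
        row = change["row"]
        orders = view[row["seller_id"]]
        if change["type"] == "DELETE":
            orders.pop(row["order_id"], None)
        else:                         # INSERT or UPDATE
            orders[row["order_id"]] = row
    return view


# Changes as they might arrive from two buyer-sharded order tables.
changes = [
    {"shard": "orders_0", "type": "INSERT",
     "row": {"order_id": 1, "buyer_id": 10, "seller_id": 7, "amount": 30}},
    {"shard": "orders_1", "type": "INSERT",
     "row": {"order_id": 2, "buyer_id": 11, "seller_id": 7, "amount": 45}},
    {"shard": "orders_0", "type": "INSERT",
     "row": {"order_id": 3, "buyer_id": 10, "seller_id": 8, "amount": 12}},
    {"shard": "orders_1", "type": "DELETE",
     "row": {"order_id": 2, "buyer_id": 11, "seller_id": 7, "amount": 45}},
]
seller_view = build_seller_view(changes)
```

A seller's orders can then be read from one place, regardless of which buyer shard each order lives in.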