Project background

Alibaba B2B's sellers are mainly in China while its buyers are mainly overseas, which created the need for geographically separate data centers in Hangzhou and the United States. To improve user experience, the two data centers run in a dual-A (active-active) architecture in which both sides accept writes. Otter was created to meet this need.

The first version of otter dates back to 2004-2005. The open-source release is the fourth version, otter4, which has been under development since July 2011. At present, essentially all local and cross-data-center synchronization requirements within Alibaba B2B run on otter4.

Current synchronization scale:

  1. The amount of synchronized data is 600 million
  2. 1.5 TB of files synchronized (20 million images)
  3. Synchronization across 200+ database instances
  4. Cluster size of 80+ machines

 

Project introduction

Name: otter ['ɒtə(r)]

Meaning: the otter, a data porter

Language: pure Java

Positioning: based on parsing incremental database logs, otter synchronizes data in near real time to MySQL/Oracle databases in the same data center or in a remote one.

How it works

Principle description:

  1. Incremental database log data is obtained through the open-source product Canal.

  2. The architecture is that of a typical management system: manager (web management) + node (worker node).

A. At runtime, the manager pushes synchronization configuration to the nodes.

B. Each node reports its synchronization status back to the manager.

  3. ZooKeeper handles distributed state scheduling, which allows multiple nodes to work cooperatively.
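The manager/node relationship described above can be sketched in a few lines of Python. This is a minimal illustration only: otter itself is written in Java, and the class and method names here (Manager, Node, push_config, report) are hypothetical, not otter's API. ZooKeeper-based coordination is deliberately left out.

```python
class Manager:
    """Hypothetical manager: pushes configuration and receives status reports."""

    def __init__(self):
        self.nodes = []
        self.status = {}

    def push_config(self, config):
        # Push the same synchronization configuration to every registered node.
        for node in self.nodes:
            node.apply_config(config)

    def receive_status(self, node_id, state):
        self.status[node_id] = state


class Node:
    """Hypothetical worker node: applies configuration and reports its state."""

    def __init__(self, node_id, manager):
        self.node_id = node_id
        self.manager = manager
        self.config = None
        manager.nodes.append(self)

    def apply_config(self, config):
        # Keep a private copy so later manager-side edits do not leak in.
        self.config = dict(config)

    def report(self):
        state = "running" if self.config else "idle"
        self.manager.receive_status(self.node_id, state)


manager = Manager()
node1, node2 = Node(1, manager), Node(2, manager)
manager.push_config({"source": "mysql", "target": "oracle"})
node1.report()
node2.report()
```

After the push, both nodes hold their own copy of the configuration and the manager's status table contains one entry per node.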

What is Canal?

Canal is a subproject of otter.

What does otter solve?

  1. Heterogeneous database synchronization

A. MySQL -> MySQL/Oracle. (The current open-source version only supports MySQL as the incremental source, which depends on Canal's capabilities; the target database can be MySQL or Oracle.)

  2. Single-data-center synchronization (RTT < 1 ms between databases)

A. Database version upgrades

B. Data table migration

C. Asynchronous secondary indexes

  3. Cross-data-center synchronization (for example, Alibaba's international site synchronizes databases between the Hangzhou and US data centers with RTT > 200 ms; a highlight)

A. Data center disaster recovery

  4. Two-way synchronization

A. Loopback-avoidance algorithm (a generic solution that supports most relational databases)

B. Data consistency algorithm (guarantees eventual consistency of data in dual-A data center mode; a highlight)

  5. File synchronization

A. Site mirroring (when data is replicated, the associated images are copied too, for example product data together with product images)
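To make the loopback problem in two-way synchronization concrete, here is a minimal Python sketch. The article does not describe otter's actual algorithm, so this uses a generic origin-marker technique as an assumption: every write performed by the sync process is flagged, and flagged changes are never selected for replication back. All names (Change, Site, sync) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    key: str
    value: str
    by_sync: bool = False  # True when the sync process itself wrote the row


class Site:
    """Hypothetical database site with a simple incremental change log."""

    def __init__(self, name):
        self.name = name
        self.rows = {}
        self.log = []

    def write(self, key, value, by_sync=False):
        self.rows[key] = value
        self.log.append(Change(key, value, by_sync))

    def pending_for_peer(self):
        # Only business writes are replicated; sync-originated writes are dropped.
        changes = [c for c in self.log if not c.by_sync]
        self.log.clear()
        return changes


def sync(source, target):
    changes = source.pending_for_peer()
    for change in changes:
        target.write(change.key, change.value, by_sync=True)
    return len(changes)


hangzhou, us = Site("hangzhou"), Site("us")
hangzhou.write("product:1", "red")
forward = sync(hangzhou, us)   # the business write travels to the US site
backward = sync(us, hangzhou)  # the replayed write is filtered, so nothing loops back
```

Without the by_sync flag, the change applied at the US site would be picked up again and sent back to Hangzhou, and the two sites would replay it to each other indefinitely.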

 

Schematic diagram of single-data-center replication:

Description:

A. When the node load-balancing algorithm is enabled, the S and E/T/L stages may land on different nodes, in which case data is transferred between them over the network.

B. Nodes support failover and load balancing.
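As a rough illustration of node failover and load balancing, the following Python sketch rotates work across live nodes and skips nodes marked as down. Round-robin is an assumed policy chosen for simplicity; the article does not say which algorithm otter uses, and NodePool is a hypothetical name.

```python
from itertools import count


class NodePool:
    """Hypothetical pool that balances work across live nodes."""

    def __init__(self, node_ids):
        self.node_ids = list(node_ids)
        self.alive = set(self.node_ids)
        self._counter = count()

    def mark_down(self, node_id):
        # Failover: a failed node is simply excluded from future picks.
        self.alive.discard(node_id)

    def pick(self):
        candidates = [n for n in self.node_ids if n in self.alive]
        if not candidates:
            raise RuntimeError("no live node available")
        # Load balancing: rotate through the live nodes in order.
        return candidates[next(self._counter) % len(candidates)]


pool = NodePool(["node-a", "node-b"])
first_two = [pool.pick(), pool.pick()]
pool.mark_down("node-a")
after_failure = [pool.pick(), pool.pick()]
```

While both nodes are alive, picks alternate between them; once node-a is marked down, every pick falls to node-b.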

 

Schematic diagram of cross-data-center replication:

Description:

A. Data is transferred over the network, and the S/E/T/L stages are spread across two or more nodes that cooperate through ZooKeeper. (Typically, Select and Extract run on a node in one data center, and Transform and Load run on a node in the other.)

B. Nodes support failover and load balancing. (The nodes in each data center can form a cluster of one or more machines.)

 

Terminology

Otter core model diagram

 

Term definitions

  • Pipeline: the entire flow from a source to a target, made up of a set of synchronization mappings
  • Channel: a synchronization channel, consisting of one Pipeline for one-way synchronization or two Pipelines for two-way synchronization
  • DataMediaPair: a mapping defined on business tables, such as source table and target table, field mappings, field groups, and so on
  • DataMedia: an abstract data medium, which can be understood as the definition of a data table or an MQ queue
  • DataMediaSource: the source information for an abstract data medium, supplementing the description of a DataMedia
  • ColumnPair: defines a field mapping
  • ColumnGroup: defines a group of field mappings
  • Node: a worker node that handles synchronization and corresponds to one JVM
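The relationships between these terms can be sketched as plain Python dataclasses. This is an illustrative model built only from the definitions above: the field names are assumptions, ColumnGroup and Node are omitted for brevity, and otter's real classes are Java and differ in detail.

```python
from dataclasses import dataclass, field


@dataclass
class DataMediaSource:
    name: str
    kind: str  # for example "mysql" or "oracle"


@dataclass
class DataMedia:
    source: DataMediaSource
    table: str


@dataclass
class ColumnPair:
    source_column: str
    target_column: str


@dataclass
class DataMediaPair:
    source: DataMedia
    target: DataMedia
    columns: list = field(default_factory=list)


@dataclass
class Pipeline:
    name: str
    pairs: list = field(default_factory=list)


@dataclass
class Channel:
    name: str
    pipelines: list = field(default_factory=list)

    def is_bidirectional(self):
        # One Pipeline means one-way sync; two Pipelines mean two-way sync.
        return len(self.pipelines) == 2


mysql = DataMediaSource("hz-mysql", "mysql")
oracle = DataMediaSource("us-oracle", "oracle")
pair = DataMediaPair(
    source=DataMedia(mysql, "product"),
    target=DataMedia(oracle, "product"),
    columns=[ColumnPair("title", "subject")],
)
one_way = Channel("demo", [Pipeline("hz->us", [pair])])
```

Adding a second Pipeline in the opposite direction to the same Channel would turn it into a two-way synchronization channel.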

Otter's S/E/T/L stage model

Note: to make the system more extensible and flexible, the whole synchronization process is abstracted into four stages: Select, Extract, Transform, and Load.

Select stage: handles the differences between data sources, for example by connecting to Canal to obtain incremental data, or to other systems to obtain other data.

Extract/Transform/Load stages: similar to the ETL model of a data warehouse; they correspond to data join, data transformation, and data loading.
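The four stages can be pictured as a chain of small functions. The Python sketch below is a simplified illustration under stated assumptions: the source log is an in-memory list rather than Canal, the join in Extract is a dictionary lookup, and the function names and record layout are hypothetical.

```python
def select(log):
    """Select: pull raw incremental events from the source (a list here)."""
    return list(log)


def extract(events, images):
    """Extract: join each event with related data, here an image lookup."""
    return [{**event, "image": images.get(event["id"])} for event in events]


def transform(events, column_map):
    """Transform: rename source columns to their target column names."""
    return [
        {column_map.get(name, name): value for name, value in event.items()}
        for event in events
    ]


def load(events, target):
    """Load: write each row into the target store, keyed by id."""
    for event in events:
        target[event["id"]] = event
    return len(events)


source_log = [{"id": 1, "title": "lamp"}]
images = {1: "lamp.jpg"}
target_table = {}

events = select(source_log)
events = extract(events, images)
events = transform(events, {"title": "subject"})
loaded = load(events, target_table)
```

Because each stage only consumes the output of the previous one, the stages can run on different nodes, which is what makes it possible to place Select and Extract in one data center and Transform and Load in another.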