This article has been documented

(Friends in need can follow my wechat public number)

1. Concept of data segmentation

Data Sharding can be divided into two Sharding modes according to the types of Sharding rules. One is to shard data to different databases (hosts) according to different tables (or schemas), which can be called vertical (vertical) data shard. The other is to split the data in the same table to multiple databases (hosts) according to certain conditions according to the logical relationship of the data in the table, which is called horizontal (horizontal) segmentation of data. The biggest characteristic of vertical segmentation is simple rules and more convenient implementation. It is especially suitable for the system with very low coupling degree, little mutual influence and very clear business logic between different businesses. In such a system, it is easy to separate the tables used by different business modules into different databases. Split by different tables has less impact on the application and the split rules are simpler and clearer. Horizontal sharding is a little more complicated than vertical sharding. Because different data from the same table is split into different databases, the splitting rules themselves are more complex for the application than splitting by table name, and later data maintenance is also more complex.

Vertical segmentation

A database consists of many tables, and each table corresponds to different business. Vertical segmentation refers to the classification of tables according to business and distribution to different databases, so as to share data or pressure to different libraries, as shown in the following figure:

Generally speaking, it is difficult to segment businesses with complex join scenarios, and it is easy to segment businesses independently. How and to what extent to shard is a challenge to the technical architecture. The advantages and disadvantages of vertical segmentation are analyzed as follows: After the split, the services are clear and the rules are clear. ? Easy integration or extension between systems. ? Data maintenance is simple. Disadvantages:? Some service tables cannot be joined and can only be solved through interfaces, which increases system complexity. ? Due to the different limitations of each service, the single library performance bottleneck is difficult to expand data and improve performance. ? Transaction processing is complex. Because vertical sharding is to disperse tables to different libraries according to the classification of business, some business tables are too large, and there is a bottleneck of single library read/write and storage, so horizontal split is needed to solve the problem.

Generally speaking, it is difficult to segment businesses with complex join scenarios, and it is easy to segment businesses independently. How and to what extent to shard is a challenge to the technical architecture. The advantages and disadvantages of vertical segmentation are analyzed below:

Advantages:

  • After the split, the services are clear and the rules are clear.
  • Easy integration or extension between systems.
  • Data maintenance is simple.

Disadvantages:

  • Some service tables cannot be joined and can only be solved through interfaces, which increases system complexity.
  • Due to the different limitations of each service, the single library performance bottleneck is difficult to expand data and improve performance.
  • Transaction processing is complex. Because vertical sharding is to disperse tables into different libraries according to the classification of services, some business tables are too large and there are bottlenecks in reading, writing and storing single libraries. Therefore, horizontal split is required to solve the problem.

The level of segmentation

In contrast to vertical splitting, horizontal splitting does not classify tables, but distributes them into multiple libraries according to certain rules of a certain field, with each table containing a portion of the data. To put it simply, we can understand the horizontal partitioning of data as the partitioning by data rows, that is, some rows in the table are segmented to one database and some other rows are segmented to other databases, as shown in the figure below:

To split data, you need to define sharding rules. Relational databases are two-dimensional models of rows and columns. The first rule of resolution is to find the resolution dimensions. Several typical sharding rules include:

  • Modular according to user ID, the data is dispersed to different databases, and the data of users with the same data are dispersed to a library.
  • According to the date, the data of different month or even day is scattered into different libraries.
  • Search by a specific field, or split into different libraries according to a specific range.

Advantages:

  • Split rule abstraction is good, join operation can be done by the database.
  • There is no single database big data, high concurrency performance bottleneck.
  • Application side modification is less.
  • Improve the system stability and load capacity.

Disadvantages:

  • The split rules are difficult to abstract.
  • Shard transaction consistency is difficult to resolve.
  • It is very difficult to expand data for many times and maintain it.
  • Cross-library join performance is poor.

Both vertical and horizontal segmentation have disadvantages, but the common disadvantages are:

  • Introduce the problem of distributed transactions.
  • Problems with cross-node joins.
  • Merge sort paging problems across nodes.
  • Multiple data source management issues.

For data source management, there are two main ideas:

A. Client mode, configure and manage one (or more) data sources it needs in each application module, directly access each data library and complete data integration in the module;

B. All data sources are managed uniformly through the intermediary agent layer, and the back-end database cluster is transparent to the front-end application;

Centralized proxy Mycat

Embedded application decentralized architecture Sharding-JDBC

Decentralized architecture sharding-Sidecar Service Mesh based on host process

Principles of data segmentation

First principle: can not cut as far as possible do not cut.

Second principle: if you want to cut, you must choose the appropriate rules for cutting, and plan well in advance.

Third principle: data sharding tries to reduce the possibility of cross-library Join through data redundancy or table grouping.

Fourth principle: As it is difficult for database middleware to grasp the advantages and disadvantages of data Join implementation, and it is extremely difficult to achieve high performance, multi-table Join should be used as little as possible in business reading.

2. What is Mycat

define

  • A completely open source, large database cluster for enterprise application development
  • Support transaction, ACID, can replace MySQL enhanced database
  • An enterprise-class database that can be viewed as a MySQL cluster instead of an expensive Oracle cluster
  • A fusion of memory caching technology, NoSQL technology, HDFS big data of the new SQL Server
  • A new generation of enterprise database products combining traditional database and new distributed data warehouse
  • A novel database middleware product

The website address

Cross-shard data merge

Application scenarios

Up to now, the application scenarios of Mycat have been very rich, and new users are constantly providing new and innovative solutions. The following are some typical application scenarios:

  • Simple read/write separation: supports read/write separation and primary/secondary switchover
  • Table and library: Sharding more than 10 million tables, supporting a maximum of 100 billion single table sharding
  • Multi-tenant applications, one library per application, but the application only connects to Mycat, so that the application itself is not modified and multi-tenant is achieved
  • The report system, with the help of Mycat’s ability to divide tables, handles the statistics of large-scale reports
  • Analyze big data instead of Hbase
  • As a simple and effective solution for real-time query of massive data, for example, 10 billion frequently queried records need to be queried within 3 seconds. In addition to queries based on primary keys, there may also be scope queries or other attribute queries. In this case, Mycat may be the simplest and effective choice

3. Core concepts and configurations in Mycat

The core concept

  • Database middleware

Mycat is database middleware, which is an intermediate service between database and application for data processing and interaction.

  • Logic library (schema)

Usually for practical applications, there is no need to know the existence of middleware, business developers only need to know the concept of database, so database middleware can be regarded as a logical library composed of one or more database clusters

  • Logical table

A logical table can be divided into one or more shard libraries after data is sharded, or it can be composed of only one table without data sharding.

1. Shard table A shard table refers to a table with large data that needs to be shard into multiple databasesCopy the code

Each slice has a piece of data, and all the slices make up the complete data. For example, t_node in myCAT configuration belongs to the partition table. Data is divided into two datanodes (DN1 and DN2) according to rules.

2. Non-sharded tableCopy the code

Not all tables in a database are large, and some tables may not need to be shard. Non-shard refers to tables that do not need to be shard. In the following configuration, T_node exists only on the fragment node (dataNode) DN1.

Installation and Configuration

Based on the source code

  • Mycat-server = mycat-server

  • MyCAT is currently used to define logical libraries and related configurations in configuration files:

MYCAT_HOME/conf/schema. XML defines logical libraries, tables, and shard nodes.

The sharding rules are defined in MYCAT_HOME/conf/ rule-xml.

MYCAT_HOME/conf/server.xml defines users and system related variables such as ports.

  • Source run MyCAT entrance program is org. Opencloudb. MycatStartup. Java, you need to set up MYCAT_HOME mesh

Directory (SRC /main) for your project. After setting MYCAT home directory, you can run MYCAT service normally.

Linux installation

#downloadWget http://dl.mycat.io/1.6.7.3/20190927161129/Mycat-server-1.6.7.3-release- 20190927161129 - Linux. Tar. Gz#Unzip to the mycat directory
#Start the mycat
./bin/mycat start
#stop
./bin/mycat stop
#Restart the service
./bin/mycat restart
#Viewing the Startup Status
./bin/mycat status
Copy the code

Server. XML is the Mycat server parameter adjustment and user authorization configuration file, schema. XML is the logical library definition and table and fragment definition configuration file, rule-xml is the fragment rule configuration file. Specific parameters of the sharding rule are stored in separate files

4. MyCat configuration

  • Bin Boot directory
  • Conf Configuration directory for storing configuration files:
  • Server.xml: is the Mycat server parameter adjustment and user authorization configuration file.
  • Schema. XML: is a configuration file for logical library definitions and table and shard definitions.
  • Rule-xml: is the configuration file of the sharding rule. Parameter information about the sharding rule is stored in this directory separately

Under, configuration file modification needs to restart MyCAT.

  • Log4j. XML: Logs are stored in logs/log, one file per day. The log configuration is in conf/log4j. XML, according to

You can adjust the output level to DEBUG. The debug level outputs more information for troubleshooting.

  • Autopartition – long. TXT, partition – hash – int. TXT, sequence_conf. Properties,

Sequence_db_conf. properties Id of the sharding rule configuration file

  • Lib MyCAT jar package or dependent jar package directory.
  • Logs MyCAT Directory for storing logs. Logs are stored in logs/log, one file per day

The following image describes the three most important configuration files for Mycat:

Configure Mycat environment parameters


      
<! DOCTYPEmycat:server SYSTEM "server.dtd">
<mycat:server xmlns:mycat="http://org.opencloudb/">
 <system>
 <property name="defaultSqlParser">druidparser</property>
  </system>
</mycat:server>
Copy the code

For example, all Mycat parameter variables are configured in the server. XML file under the system tag. If you need to configure a variable, you can add the corresponding configuration, such as adding the boot port 8066.

<property name="serverPort">8066</property>
Copy the code

All other variables are similar.

Configure Mycat logical library and users


      
<! DOCTYPEmycat:server SYSTEM "server.dtd">
<mycat:server xmlns:mycat="http://org.opencloudb/">
<user name="mycat">
<property name="password">mycat</property>
<property name="schemas">TESTDB</property>
</user>
</mycat:server>
Copy the code

For example, all Mycat connection user and logical library mappings are configured in the server.xml file under the user tag. For example, in the example, a Mycat user is configured to connect to Mycat. At the same time, mycat has configured a logical library TESTDB in schema. XML, and the mapping relationship between the logical library and users has been configured.

Configure logical Libraries (Schema)

As a middleware, Mycat implements mysql protocol, so it is a database for the front-end application connection, which also has the database configuration. Mycat database configuration is configured in schema. XML, and mapped to the user in server.xml after configuration.


      
<! DOCTYPEmycat:schema SYSTEM "schema.dtd">
<mycat:schema  xmlns:mycat="http://org.opencloudb/">
 <schema name="TESTDB" checkSQLschema="true" sqlMaxLimit="100" dataNode="dn1">
   <table name="t_user" dataNode="dn1,dn2" rule="sharding-by-mod2"/>
   <table name="ht_jy_login_log" primaryKey="ID" dataNode="dn1,dn2"
rule="sharding-by-date_jylog"/>
 </schema>
 <dataNode name="dn1" dataHost="localhost1" database="mycat_node1"/>
 <dataNode name="dn2" dataHost="localhost1" database="mycat_node2"/>
 <dataHost name="localhost1" writeType="0" switchType="1" slaveThreshold="100"
balance="1" dbType="mysql" maxCon="10" minCon="1" dbDriver="native">
  <heartbeat>show status like 'wsrep%'</heartbeat>
  <writeHost host="hostM1" url="127.0.0.1:3306" user="root" password="root" >
  </writeHost> 
 </dataHost>
</mycat:schema >
Copy the code

In the preceding example, a logical library TESTDB is configured, and two shard tables t_user and ht_jy_login_log are configured.

Logical table Configuration

<table name="t_user" dataNode="dn1,dn2" rule="sharding-by-mod2"/>
Copy the code

The table label is the configuration of the logical table, where name represents the name of the table, dataNode represents the corresponding shard of the table, Mycat adopts the database partition mode by default, that is, a table is mapped to different libraries, rule represents the data partition mode to be adopted by the table, and the name corresponds to the corresponding configuration in rule-.xml. If you want sharding, you have to configure it.

Configuring Fragments (Datanodes)

<dataNode name="dn1" dataHost="localhost1" database="mycat_node1"/>
 <dataNode name="dn2" dataHost="localhost1" database="mycat_node2"/>
Copy the code

The fragment of Mycat is actually the alias of the library. For example, in the above example, two fragments Dn1 and DN2 are configured respectively to the two libraries mapped to dataHost localhost1 of the physical machine.

Configuring physical library Sharding mapping (dataHost)

<dataHost name="localhost1" writeType="0" switchType="1" slaveThreshold="100"
balance="1" dbType="mysql" maxCon="10" minCon="1" dbDriver="native">
  <heartbeat>show status like 'wsrep%'</heartbeat>
  <writeHost host="hostM1" url="127.0.0.1:3306" user="root" password="root" >
  </writeHost> 
 </dataHost>
Copy the code

As a database agent, Mycat needs logical libraries and logical users. After table partitioning, it needs to configure sharding, and the sharding needs to be mapped to the real physical host. Mycat does not care whether it is mapped to one or multiple instances of one, just need to configure the mapping. A physical host (dataHost) mapping named localHost1 is configured.

The heartbeat tag represents the statement that Mycat needs to check the heartbeat of the physical library. Under normal circumstances, the production case may be configured with master/slave, multiple writes, or single library. In either case, Mycat needs to maintain a connection to the data source of the database, so it needs to periodically check the availability of the back-end connection.

WriteHost This tag represents the physical host mapping on the backend of a logical host (dataHost), such as hostM1 mapped to 127.0.0.1:3306 in the example. If the back-end needs to perform read/write separation, multi-write, or primary/secondary operations, configure multiple writeHosts or readHosts.

Tags such as writeType Balance in dataHost are different policies. For details, see the guide.

Table sharding rule configuration


      
<! DOCTYPEmycat:rule SYSTEM "rule.dtd">
<mycat:rule  xmlns:mycat="http://org.opencloudb/">
 <tableRule name="sharding-by-hour">
  <rule>
   <columns>createTime</columns>
   <algorithm>sharding-by-hour</algorithm>
  </rule>
 </tableRule>
 <function name="sharding-by-hour"
class="org.opencloudb.route.function.LatestMonthPartion">
  <property name="splitOneDay">24</property>
 </function>
</mycat:rule >
Copy the code

As the most important configuration in the table sharding rule, the table sharding mode determines the performance after data sharding. Therefore, it is also the most important configuration.

For example, a sharding rule named sharding-by-hour is configured in the above example. Function is sharded by date. In this configuration:

tableRule

Name is the rule=”sharding-by-hour” of the table in schema. XML. Columns are the columns of the table. CreateTime Specifies the creation date. Algorithm is the sharding rule corresponding to a rule: the name mapped to function.

function

The function configuration is the configuration of sharding rules.

Name indicates the name of the sharding rule. The name can be arbitrary, but must match the name in tableRule.

Cut class is segmentation rules corresponding classification, write to death, which rules the configuration which needs, such as this example is shard by the hour: org. Opencloudb. The route. The function. The LatestMonthPartion

The property label is a different attribute corresponding to the sharding rule. Different sharding rules have different configurations.

5.Mycat reading and writing separation actual combat

Mysql Synchronization Principle

Mysql synchronization

Docker configuration

# masterdocker run --name mysql_master -v /root/mysql-cluster/master/:/etc/mysql/conf.d/ -e MYSQL_ROOT_PASSWORD=root -p 3316-3306 - d mysql: 5.7# slave1docker run --name mysql_slave1 -v /root/mysql-cluster/slave1/:/etc/mysql/conf.d/ -e MYSQL_ROOT_PASSWORD=root -p 3326-3306 - d mysql: 5.7# slave2docker run --name mysql_slave2 -v /root/mysql-cluster/slave2/:/etc/mysql/conf.d/ -e MYSQL_ROOT_PASSWORD=root -p 3336-3306 - d mysql: 5.7Copy the code

Mycat is not responsible for any data synchronization problems

<dataHost name="localhost1" maxCon="1000" minCon="10" balance="1"
writeType="0" dbType="mysql" dbDriver="native" >
<heartbeat>show slave status</heartbeat>
<! -- can have multi write hosts -->
<writeHost host="hostM1" url="localhost:3306" user="root" password="123456">
<! -- can have multi read hosts -->
<readHost host="hostS1" url="localhost2:3306" user="root" password="123456"
weight="1" />
</writeHost>
</dataHost>
Copy the code
Forced to follow:/ *! mycat:db_type=slave*/ select * fromTravelrecord forces you to write:/ *! mycat:db_type=master*/ select * from travelrecord
Copy the code

6.Mycat Global serial number

  • Local file mode

Configuration schema. The XML

<table name="travelrecord" dataNode="dn1" autoIncrement="true"
primaryKey="id" />
Copy the code

Editing server. XML

# Change primary key generation policy<property name="sequnceHandlerType">0</property>
Copy the code

Edit sequence_conf. The properties

#MINID Minimum ID #MAXID Maximum ID #CURID Current ID TravelRecord.hisids=
TRAVELRECORD.MINID=10001
TRAVELRECORD.MAXID=20000
TRAVELRECORD.CURID=10000
Copy the code
  • Local timestamp

ID= 64-bit binary (42(ms)+5(machine ID)+5(business code)+12(repeated accumulation) configuration schema.xml

<table name="travelrecord" dataNode="dn1" autoIncrement="true" primaryKey="id"/>
Copy the code

Configure the server. The XML

<property name="sequnceHandlerType">2</property>
Copy the code

Sequence_time_conf.properties under mycat

DATAACENTERID=0-31 # Any integerCopy the code

The WORKID and DATAACENTERID configured for each myCAT under multiple MyCAT nodes are different, which constitute a unique identifier. A total of 32*32=1024 combinations are supported.

The WORKID and DATAACENTERID configured for each myCAT under multiple MyCAT nodes are different, which constitute a unique identifier. A total of 32*32=1024 combinations are supported.