Synchronizing data from MySQL to ES is essentially a form of data de-normalization. In this section, we survey the technical solutions for “MySQL to ES data migration and synchronization”, comparing their advantages, disadvantages, and application scenarios to give readers some ideas.

Understanding normalization

Database normalization is a strategy in relational databases for reducing redundancy and improving data consistency through a series of normal forms.

So why de-normalize?

While normalization brings tangible benefits (better operational performance, lower storage costs), its disadvantages gradually show as data scale, concurrency, and complexity grow.

In this situation, de-normalization can meet these challenges to some extent. The general idea is to trade some write performance for better read performance through operations such as increasing data redundancy and precomputing data.

When to de-normalize

The motivations for de-normalization vary, but it is typically triggered when complex data operations affect system stability, or when business response-time and concurrency requirements are no longer met.

Service stability problems: consumer-facing Internet applications are characterized by high concurrency, and their SQL tends toward simple point reads and point writes. However, operating on accumulated data often takes on the characteristics of traditional enterprise applications: large-scale deletion, table joins, sorting, and other real-time operations, as well as more complex database workloads such as reporting/BI. De-normalization combined with load separation is a reasonable choice here.

Complex query performance issues: enterprise-level applications often involve table joins, aggregation, multi-dimensional filtering, and sorting, which frequently cause performance problems. De-normalization can bring significant performance improvements in several ways, such as data redundancy and precomputation, as described below.

Several implementations of de-normalization

Column level: redundant computed fields in the primary table

By redundantly storing precomputed data in the primary table, you avoid recomputing it frequently. The following scenarios are suitable for this kind of redundancy:

  • Applications often need to obtain calculated data
  • The raw data behind the redundant field does not change often
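As a minimal sketch (table and field names hypothetical, prices in integer cents), an order row can store its precomputed total redundantly so reads never re-aggregate the items:

```python
# Sketch: redundantly store a precomputed field in the primary table.
# Table and field names are hypothetical; prices are integer cents.

def build_order_row(order_id, items):
    """items: list of (unit_price_cents, quantity) tuples."""
    # Precompute the total once at write time and store it redundantly,
    # so reads use row["total"] directly instead of re-summing items.
    total = sum(price * qty for price, qty in items)
    return {"order_id": order_id, "items": items, "total": total}

row = build_order_row(1001, [(990, 2), (500, 1)])
```

The trade-off named above is visible here: every write path that touches `items` must also keep `total` in sync, which is the business-logic intrusion listed under the disadvantages.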

Advantages: The method is simple and easy to implement

Disadvantages: intrudes into business logic, slows down business code, and changes accumulated over long iterations may pose stability risks

Table level: wide-table/Cube prebuilding

The main operation is to build a wide table or a data cube that contains all the dimension and metric information users need to query.

Common table-level approaches include: application double-write, materialized views provided by the database itself, and data migration/synchronization.

Application double-write

A wide table is created in the same database as the master data and written at the same time as the master tables (with transactions ensuring consistency); complex queries then run against this table.
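A minimal sketch of double-write, using SQLite in place of MySQL and a hypothetical schema: one transaction covers both the source table and the wide table, so a failure rolls back both writes.

```python
# Sketch of application double-write; SQLite stands in for MySQL,
# and the orders/users/orders_wide schema is hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, amount INTEGER)")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE orders_wide (order_id INTEGER PRIMARY KEY, user_name TEXT, amount INTEGER)")
conn.execute("INSERT INTO users VALUES (1, 'alice')")

def create_order(order_id, user_id, amount):
    with conn:  # one transaction covering both writes
        conn.execute("INSERT INTO orders VALUES (?, ?, ?)", (order_id, user_id, amount))
        (name,) = conn.execute("SELECT name FROM users WHERE id = ?", (user_id,)).fetchone()
        conn.execute("INSERT INTO orders_wide VALUES (?, ?, ?)", (order_id, name, amount))

create_order(100, 1, 250)
# complex queries now read orders_wide without joining users
```

This also makes the listed disadvantage concrete: every order write now performs three statements against the same database, adding read and write pressure on the master.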

Advantages: Simple implementation and low cost

Disadvantages: extra read and write pressure on the master database, plus the cost of modifying the business code

RDBMS materialized views

A materialized view (snapshot) is a read-only database object that contains the result of a query; it may be a local copy of remote data, or a summary table aggregated from base tables. Through data redundancy and precomputation it reduces joins and aggregation, improving query performance.
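SQLite (used here for a self-contained sketch) has no materialized views, so the following simulates the idea: an aggregated result is materialized into a table and refreshed on demand, after which queries read the precomputed table instead of re-aggregating.

```python
# Sketch simulating a materialized view: a precomputed summary table
# refreshed on demand. SQLite stands in for an RDBMS with real
# materialized-view support; the sales schema is hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100), ("east", 50), ("west", 70)])

def refresh_mv():
    # Re-materialize the aggregated result; queries then hit the
    # precomputed table instead of aggregating sales every time.
    conn.execute("DROP TABLE IF EXISTS mv_sales_by_region")
    conn.execute("""CREATE TABLE mv_sales_by_region AS
                    SELECT region, SUM(amount) AS total FROM sales GROUP BY region""")

refresh_mv()
rows = dict(conn.execute("SELECT region, total FROM mv_sales_by_region"))
```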

Advantages: supported by the database engine itself, low cost to adopt

Disadvantages: RDBMS implementations have limitations; for example, materialized views that require business-specific transformations may be impossible to express, and some databases do not implement materialized views fully

Data migration synchronization

With the help of a data synchronization tool, data from the master tables is organized and transformed (including transformations based on business logic) into common tables or wide tables in near real time, and written to a third-party storage engine. Complex queries then run directly against the prebuilt tables or cubes with good performance. There are many data migration tools; by emphasis, they can be divided into the following categories:

  • Big data tools
  • Stream computing tools
  • Messaging tools
  • Database tools
  • Cloud vendor tools
  • Professional data migration/synchronization tools

Advantages:

A more stable master database: complex queries are decoupled asynchronously from the business system's transactional queries, avoiding their impact on the master database.

Easy O&M and a stable link: the data migration/synchronization link is supported by standardized products and decoupled from the reads and writes of the main business system and the master database. The overall architecture has clear responsibilities and is easy to maintain and troubleshoot.

Disadvantages: requires selecting among a variety of data migration/synchronization tools, as well as choosing a database product to carry the complex queries.

Technical architecture for real-time MySQL-to-ES data synchronization

This article focuses on the technical solution of “MySQL to ES data migration and synchronization”.

Why MySQL?

High concurrency: MySQL's kernel is particularly well suited to high-concurrency simple SQL operations: lightweight connections (thread model) and a relatively simple optimizer, executor, and transaction engine.

Good stability: the most important requirement for a master database is to be stable and not lose data. MySQL's primary/secondary replication allows fast failover after a crash, and the InnoDB storage engine ensures data is durably persisted to disk.

Easy to operate: a good, convenient user experience, easy to use, with a low learning cost.

Open source ecosystem: MySQL is open source, which makes it relatively simple for upstream and downstream vendors to build tools around it. HA proxies and sharding (sub-database/sub-table) middleware greatly enhance its practicality. Being open source has also given it a large user base.

Why ES?

ES has several notable features that effectively compensate for MySQL's shortcomings in enterprise-level data operation scenarios.

Text search capability: ES is a search system based on inverted indexes. With its variety of analyzers, it performs well at fuzzy text matching and covers a wide range of business scenarios.

Good multi-dimensional filtering performance: at the scale of billions of rows, wide-table prebuilding (eliminating joins) combined with indexes on every field gives ES an overwhelming advantage in multi-dimensional filtering; together with its text search capability, this combination is unique.

Open source and commercial in parallel: ES has a very active open source ecosystem with a large user base. It is also backed by an independent commercial company, giving users more diverse and progressive choices according to their own needs.

Why data migration and synchronization?

Good stability: migration/synchronization mainly reads data and logs sequentially from the master database with low concurrency, so it has little impact on the master database's stability (many downstream subscriptions may have network-level impact, which is generally solved with messaging). In addition, the replayability of logs (binlog/WAL/redo, etc.) greatly reduces the chance of downstream data loss (given good idempotent handling).

Business decoupling: in general, the master database carries more transactional operations, while downstream data systems carry operational and other layers of the business.

Low intrusion into services: data migration/synchronization requires no changes to the business code, and when both ends are standard databases (as source and target) it is easy to find mature open source, commercial, or cloud solutions.

Good business fit: some data migration/synchronization products can embed business logic, enabling downstream consumers to get data closer to the business, making data services more efficient and convenient.
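The idempotent handling and log replayability mentioned under "Good stability" can be sketched as follows (the event shape is hypothetical): upserts and deletes keyed by primary key mean that replaying the same log segment twice leaves the target unchanged.

```python
# Sketch: idempotent application of replayed binlog-like change events.
# The event dict shape is hypothetical, not any real tool's format.
target = {}  # downstream store: primary key -> row

def apply_event(ev):
    if ev["op"] in ("insert", "update"):
        target[ev["pk"]] = ev["row"]   # upsert: safe to apply twice
    elif ev["op"] == "delete":
        target.pop(ev["pk"], None)     # deleting a missing pk is a no-op

events = [
    {"op": "insert", "pk": 1, "row": {"name": "a"}},
    {"op": "update", "pk": 1, "row": {"name": "b"}},
]
for ev in events + events:  # replay the whole segment twice
    apply_event(ev)
```

Because every operation converges to the same final state under replay, a synchronization tool can safely re-read a log range after a crash instead of risking data loss.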

Data migration synchronization model selection

Subscribe/consume model

Advantages:

Buffering capability: because a message queue is introduced, the whole link can buffer change data. If change data is consumed slowly and older MySQL binlog files are deleted locally due to disk space pressure, the data still exists in the message queue and synchronization can continue.

Data distribution capability: introducing a message queue enables multi-party subscription. If multiple downstream applications depend on the source's change data, they can simply subscribe to the same topic.

Data processing capability: since change data is consumed by downstream subscribers, it is flexible to process the data after subscription, for example calling external microservice interfaces or looking up additional data.

Disadvantages:

Relatively high O&M cost: the system contains many components and applications, making operational guarantees relatively complex.

Higher stability risk: if any one link fails, the stability of the entire synchronization pipeline is affected, and troubleshooting and locating problems can be difficult.
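The multi-party subscription described above can be sketched with an in-process broker (names hypothetical; a real setup would use Kafka or similar): several downstream consumers subscribe to the same topic and each receives every change event.

```python
# Sketch of the subscribe/consume model: change events fan out through
# a topic to multiple downstream consumers. The Broker class and topic
# name are hypothetical stand-ins for a real message queue.
from collections import defaultdict

class Broker:
    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, event):
        for cb in self.subscribers[topic]:  # deliver to every subscriber
            cb(event)

broker = Broker()
search_index, audit_log = [], []          # two downstream consumers
broker.subscribe("orders.binlog", search_index.append)
broker.subscribe("orders.binlog", audit_log.append)
broker.publish("orders.binlog", {"op": "insert", "pk": 1})
```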

End-to-end direct connection

Advantages:

Low latency: Direct end-to-end synchronization with short links and low latency

Good stability: fewer link components, low probability of failure, easy to locate and troubleshoot. Suitable for stringent scenarios with high data accuracy.

Strong functional extensibility: by writing to a messaging system on the target end, the subscription model can be emulated, giving strong extensibility

Easy O&M and deployment: with fewer link components, deployment and O&M are simpler

Summary of data migration synchronization model selection

If there are not many downstream subscribers to the data, the direct-connection mode is recommended. Data synchronization links are often deployed alongside online services, and as the scale and importance of the business grow, link stability becomes ever more important.

In addition, as long as the target data source supports acting as message middleware, the end-to-end mode can immediately provide a subscription model, and data processing needs can be met by uploading business code to run on some data migration/synchronization products.

Beyond meeting business requirements, a simple and clear data architecture makes the system easier to maintain and troubleshoot. When large volumes of data flow through the links every day, and a few rows are occasionally lost and need repair, or a link gets stuck and stops synchronizing, the end-to-end mode offers considerable advantages.

Designing MySQL table associations in ES

Table join relationships from a relational database can be expressed in ES with several data types: object, nested, and join.

object

The object type can store nested structures.

Advantages: represents a one-to-many relationship between main-document fields and the fields inside the object, and queries can span them within a single document. Because all the associated data a query depends on lives in one document, joins within ES are avoided and query efficiency is high.

Disadvantages: a one-to-many relationship can preserve only one level; deeper structures are flattened, losing the relationships inside the nested fields.
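The flattening disadvantage can be made concrete (field names hypothetical): ES stores an array of objects as parallel per-field arrays, so the pairing between fields within each element disappears, which is roughly:

```python
# Sketch of why the object type loses inner relationships: ES flattens
# an array of objects into parallel arrays per field, so the pairing of
# "name" and "qty" within each element is lost. Field names hypothetical.
doc = {"items": [{"name": "pen", "qty": 2}, {"name": "book", "qty": 5}]}

def flatten(items):
    # roughly what ES does internally for object arrays
    return {
        "items.name": [it["name"] for it in items],
        "items.qty": [it["qty"] for it in items],
    }

flat = flatten(doc["items"])
# a query for name == "pen" AND qty == 5 now matches this document,
# even though no single item has both values; the nested type avoids this
```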

nested

The nested type can store nested structures and represent one-to-many relationships; it is an extension of the object type.

Advantages:

It avoids the object type's shortcoming: the entire nesting relationship is fully preserved, and the associations inside sub-documents remain intact

Associated data is naturally co-located with the main document by the implementation, so it performs better when searching (compared with join)

Disadvantages:

A nested field can belong to only one master document

With the nested type, sub-documents are strongly bound to the main document: updating the main document forces the sub-documents to be updated as well. This costs more performance in write-heavy, read-light scenarios.

A query for a child document must go through the parent document to find the child document

Frequent changes to sub-documents affect other sub-documents and the parent document, because Lucene's implementation essentially stores them redundantly under the parent document

join

The join type configures parent and child documents to provide one-to-many capability within a single index, and is more flexible than the nested type. Parent and child documents are related through a parentId, so in practice they are independent documents.

Advantages:

Child document updates do not affect the parent document or other child documents

A subdocument can be searched separately

As a child, a document can choose which parent it belongs to; the relations definition allows different join relationships to be specified

Disadvantages:

Global ordinals are required to maintain the parent-child relationship, which can affect search performance

Comparison of the join and nested types

join suits write-heavy, read-light scenarios and scenarios that prioritize indexing performance: updates take effect quickly, but searches are relatively expensive

nested suits read-heavy, write-light scenarios that prioritize search performance
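For reference, the three association styles sketched as ES mapping fragments expressed as Python dicts (field and relation names hypothetical):

```python
# ES mapping fragments for the three association styles, as plain dicts.
# Field names (comments, author, doc_relation, article/comment) are
# hypothetical examples, not taken from any real index.
object_mapping = {"properties": {
    "comments": {"type": "object",   # flattened; inner pairing not kept
                 "properties": {"author": {"type": "keyword"},
                                "text": {"type": "text"}}}}}

nested_mapping = {"properties": {
    "comments": {"type": "nested",   # preserves per-element pairing
                 "properties": {"author": {"type": "keyword"},
                                "text": {"type": "text"}}}}}

join_mapping = {"properties": {
    "doc_relation": {"type": "join",  # parent/child within one index
                     "relations": {"article": "comment"}}}}
```

The only structural difference between the first two is the `type` value, which is why migrating from object to nested is a common fix once flattening causes false matches.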

De-normalization implementations in synchronization

Here are some ways to de-normalize during data synchronization:

Redundant data in the primary table

The business side redundantly stores, in advance, some of the relational data needed for queries in a field of the source table.

Advantages: this approach can handle one-to-many associations, places low demands on the data synchronization tool, and is simple to configure; only single-table synchronization to ES needs to be supported.

Disadvantages:

Sub-optimal indexing and search performance: ES is not given prebuilt wide-table data, so this implementation carries extra overhead in indexing and search and is not the best-performing option.

Business system intrusion: the business system must write the extra information whenever it writes master data.

Redundancy in the master database tables: a relational table carrying too much redundant information about other tables incurs storage and performance overhead.

Conclusion: This method is not recommended

Multi-table subscription with prebuilt wide-table data

The data synchronization tool subscribes to all the tables the search depends on at the same time. Data that arrives first is written to ES first, and fields whose data has not yet arrived are left empty.
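The merge-as-data-arrives behavior can be sketched like this (table and field names hypothetical): changes from each source table are folded into one wide document by a shared key, and fields whose source rows have not arrived yet simply stay absent.

```python
# Sketch of multi-table subscription: change events from several source
# tables merge into one wide document keyed by a shared id. Field names
# (amount, user_name) and the key scheme are hypothetical.
wide = {}  # shared key (e.g. order_id) -> wide document

def on_change(key, fields):
    # whichever table's data arrives first is written first;
    # missing fields remain absent until their source row arrives
    wide.setdefault(key, {}).update(fields)

on_change(1, {"amount": 250})             # user fields still missing here
on_change(1, {"user_name": "alice"})      # now the wide row is complete
```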

Advantages:

Configuring synchronization tasks in the data synchronization tool is simple, with no service intrusion and no coupling to business system logic.

It places low demands on data synchronization tools; no features beyond plain synchronization are required.

As a prebuilt-wide-table approach, it also gives better indexing and query performance in ES.

Missing columns in some wide-table rows do not block synchronization of the whole data link.

Disadvantages: the wide table is built in a mode actively triggered by the fact tables. For a table subscribed to at the source, if it rarely or never updates (producing little or no binlog), the corresponding column values in the target document may remain incomplete for a long time, giving poor timeliness and missing columns at search time.

Conclusion:

It suits scenarios where the fact-table data that makes up the wide table is written to disk together under a transaction guarantee, so that searches against the target ES avoid returning incomplete data.

This method suits scenarios where no business processing is needed to build the wide table: the wide table is formed simply by joining the columns of multiple tables together.

Reverse lookup during synchronization to prebuild the wide table

The tables the data synchronization tool subscribes to are called master tables; the tables queried via reverse lookup during synchronization are called slave tables. Using the tool's own capabilities, while subscribing to the master table it automatically fills the wide table's columns via reverse lookup, forming complete wide-table rows.
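A minimal sketch of reverse lookup (names hypothetical): on each master-table change event, the tool queries the slave table to fill in the remaining columns before writing a complete wide row downstream.

```python
# Sketch of reverse lookup during synchronization: a master-table change
# event triggers a query against the slave table to complete the wide
# row. The orders/users names and event shape are hypothetical.
users = {1: {"user_name": "alice"}}   # slave table, queried on demand
wide_rows = []                        # target wide table

def on_master_change(order_row):
    user = users.get(order_row["user_id"], {})  # reverse lookup
    wide_rows.append({**order_row, **user})     # emit a complete wide row

on_master_change({"order_id": 100, "user_id": 1, "amount": 250})
```

The disadvantage noted below also shows up here: if the slave row has not been written yet when the lookup runs, `users.get` returns nothing and the wide row is emitted incomplete.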

Advantages:

Building the wide table via reverse lookup is flexible: before generating a wide-table row, light data processing can be applied to the slave-table data based on the master-table data.

One row of master-table data can generate a wide-table row via reverse lookup, and combined with data processing can generate multiple wide-table rows.

Cross-instance joins can be easily implemented based on backward lookup to generate wide table rows.

The method based on wide table pre-build has better index and query performance in ES.

Disadvantages:

The data may not be ready at the time of the reverse lookup, resulting in incomplete rows

The data synchronization tool must support reverse lookup and data processing

Conclusion:

Suitable for scenarios where building wide tables involves data processing
